[ 
https://issues.apache.org/jira/browse/SPARK-11438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14986377#comment-14986377
 ] 

Antonio Piccolboni commented on SPARK-11438:
--------------------------------------------

The problem with nondeterminism is that it combines poorly with a computing 
model where retries are allowed. In fact, it allows programs that should fail 
to return incorrect results. Imagine a UDF rnorm(mu, sigma) that returns 
samples from a normal distribution. Imagine that the larger program containing 
it fails whenever the sample returned falls in the top 10% of the 
distribution. If enough fault tolerance is built in, the program will 
terminate successfully, but rnorm will effectively sample from a new 
distribution: a normal truncated at its 90th percentile and renormalized. If 
thinking about a continuous distribution hampers intuition, imagine a 
dice-simulating UDF and a program that returns the average of many throws. 
Imagine the program or the UDF itself fails when the sampled value is, or 
should be, 1. Every failed throw is retried until it yields a value from 2 to 
6, so the returned average will be approximately 4 instead of 3.5. In light of 
this, I don't think this feature should be added.
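
The dice case can be checked with a small simulation. This is only a sketch in 
plain Python, not Spark code: roll_die stands in for the nondeterministic UDF, 
and fragile_task, run_with_retries, TaskFailure and the retry limit are 
hypothetical names chosen for illustration, assuming a failed task is simply 
re-run with a fresh sample.

```python
import random


class TaskFailure(Exception):
    """Stands in for any task failure that the scheduler would retry."""


def roll_die(rng):
    """Hypothetical nondeterministic UDF: one throw of a fair six-sided die."""
    return rng.randint(1, 6)


def fragile_task(rng):
    """A task that fails whenever the sampled value is 1."""
    value = roll_die(rng)
    if value == 1:
        raise TaskFailure("sampled a 1")
    return value


def run_with_retries(task, rng, max_attempts=50):
    """Re-run a failed task, drawing a fresh sample on every attempt."""
    for _ in range(max_attempts):
        try:
            return task(rng)
        except TaskFailure:
            continue
    raise TaskFailure("retries exhausted")


def average(task, rng, throws=200_000):
    """Average of many throws, each one executed under the retry policy."""
    return sum(run_with_retries(task, rng) for _ in range(throws)) / throws


rng = random.Random(0)  # fixed seed so the run is repeatable
plain_average = average(roll_die, rng)        # task never fails
retried_average = average(fragile_task, rng)  # failures hidden by retries
print(f"no failures:  {plain_average:.3f}")
print(f"with retries: {retried_average:.3f}")
```

The second average settles near 4 rather than 3.5 because the retries silently 
discard every 1, leaving a uniform distribution over 2 to 6. The job reports 
success in both cases, so nothing signals that the result is biased.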

> Allow users to define nondeterministic UDFs
> -------------------------------------------
>
>                 Key: SPARK-11438
>                 URL: https://issues.apache.org/jira/browse/SPARK-11438
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>
> Right now, all UDFs are deterministic. It will be great if we allow users to 
> define nondeterministic UDFs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

