[
https://issues.apache.org/jira/browse/SPARK-11438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14986377#comment-14986377
]
Antonio Piccolboni commented on SPARK-11438:
--------------------------------------------
The problem with nondeterminism is that it combines poorly with a computing
model where retries are allowed. In fact, it allows programs that should fail
to return incorrect results instead. Imagine a UDF rnorm(mu, sigma) that
returns samples from a normal distribution. Imagine that the larger program
containing it fails whenever the sample returned is in the top 10 percent of
the distribution. If enough fault tolerance is built in, the program will
terminate normally, but rnorm will effectively sample from a new distribution:
a normal truncated at the 90th percentile and renormalized. If thinking about
a continuous distribution hampers intuition, imagine a dice-simulating UDF and
a program that returns the average of many throws. Imagine the program or the
UDF itself fails whenever the sampled value is 1. Each failed throw is retried
until it yields a value from 2 to 6, so the returned average will be
approximately 4 instead of 3.5. In light of this, I don't think this feature
should be added.
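
The dice example can be checked with a small simulation. This is only a
sketch in plain Python, not Spark code: the names TaskFailure, throw_die,
throw_with_retries and average_of_throws are hypothetical, and the retry loop
is an assumed stand-in for whatever fault tolerance the engine provides.

```python
import random


class TaskFailure(Exception):
    """Simulated failure of the task that evaluates the UDF."""


def throw_die(rng):
    # Nondeterministic "UDF": a fair six-sided die.
    value = rng.randint(1, 6)
    if value == 1:
        # Assumption for illustration: the task fails whenever a 1 is sampled.
        raise TaskFailure("task failed on a sampled 1")
    return value


def throw_with_retries(rng, max_retries=100):
    # Stand-in for fault tolerance: re-run the failed task, which re-samples.
    for _ in range(max_retries):
        try:
            return throw_die(rng)
        except TaskFailure:
            continue
    raise TaskFailure("retries exhausted")


def average_of_throws(n, seed=0):
    # Average of n throws, each silently retried on failure.
    rng = random.Random(seed)
    return sum(throw_with_retries(rng) for _ in range(n)) / n
```

With a large n, average_of_throws(n) settles near 4 (the mean of 2 through 6)
rather than 3.5, even though every individual throw comes from a fair die and
the program never reports a failure.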
> Allow users to define nondeterministic UDFs
> -------------------------------------------
>
> Key: SPARK-11438
> URL: https://issues.apache.org/jira/browse/SPARK-11438
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Yin Huai
> Assignee: Yin Huai
>
> Right now, all UDFs are deterministic. It would be great if we allowed users
> to define nondeterministic UDFs.