GitHub user mgaido91 opened a pull request:

    https://github.com/apache/spark/pull/19929

    [SPARK-22629][PYTHON] Add deterministic flag to pyspark UDF

    ## What changes were proposed in this pull request?
    
    In SPARK-20586 the `deterministic` flag was added to Scala UDFs, but it is
not available for Python UDFs. This flag is useful when the UDF's code can
return different results for the same input. Due to optimization, duplicate
invocations may be eliminated, or the function may even be invoked more times
than it appears in the query. This can lead to unexpected behavior.
    
    This PR adds the deterministic flag, exposed via the `asNondeterministic`
method, which lets the user mark the function as non-deterministic and thereby
avoid the optimizations that might lead to unexpected behavior.
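
To make the failure mode concrete, here is a minimal pure-Python sketch of the problem described above. It is a toy model, not Spark's optimizer: `ToyUdf`, `as_nondeterministic`, `evaluate_row`, and `make_udf` are hypothetical names invented for this illustration, and the "inlining" branch only imitates the kind of rewrite that can duplicate a UDF call.

```python
import itertools


class ToyUdf:
    """Hypothetical stand-in for a UDF (not a PySpark class)."""

    def __init__(self, func, deterministic=True):
        self.func = func
        self.deterministic = deterministic
        self.calls = 0  # how many times the wrapped function actually ran

    def as_nondeterministic(self):
        # Mirrors the idea of the flag: mark the function, return it for chaining.
        self.deterministic = False
        return self

    def __call__(self, *args):
        self.calls += 1
        return self.func(*args)


def evaluate_row(source_udf, add_ten):
    """Compute (RAND, RAND_PLUS_TEN) for one row under a toy 'optimizer'."""
    if source_udf.deterministic:
        # Assumed-safe rewrite: the expression behind RAND is inlined into
        # RAND_PLUS_TEN, so the UDF runs once per reference.
        rand = source_udf()
        rand_plus_ten = add_ten(source_udf())
    else:
        # Flagged as non-deterministic: evaluate once and reuse the value.
        rand = source_udf()
        rand_plus_ten = add_ten(rand)
    return rand, rand_plus_ten


def make_udf(deterministic):
    # A counter stands in for randomness so the outcome is reproducible:
    # every call returns a different value for the same (empty) input.
    ticks = itertools.count()
    udf = ToyUdf(lambda: next(ticks))
    return udf if deterministic else udf.as_nondeterministic()


inlined = make_udf(deterministic=True)
inlined_row = evaluate_row(inlined, lambda x: x + 10)
print(inlined_row, inlined.calls)  # (0, 11) 2

guarded = make_udf(deterministic=False)
guarded_row = evaluate_row(guarded, lambda x: x + 10)
print(guarded_row, guarded.calls)  # (0, 10) 1
```

In the first case the two columns disagree (`RAND_PLUS_TEN` is not `RAND + 10`) because the function ran twice. Marking the function as non-deterministic keeps the columns consistent, at the cost of forgoing the rewrite.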
    
    ## How was this patch tested?
    
    Manual tests:
    ```
    >>> from pyspark.sql.functions import *
    >>> from pyspark.sql.types import *
    >>> df_br = spark.createDataFrame([{'name': 'hello'}])
    >>> import random
    >>> udf_random_col = udf(lambda: int(100*random.random()), IntegerType()).asNondeterministic()
    >>> df_br = df_br.withColumn('RAND', udf_random_col())
    >>> random.seed(1234)
    >>> udf_add_ten = udf(lambda rand: rand + 10, IntegerType())
    >>> df_br.withColumn('RAND_PLUS_TEN', udf_add_ten('RAND')).show()
    +-----+----+-------------+
    | name|RAND|RAND_PLUS_TEN|
    +-----+----+-------------+
    |hello|   3|           13|
    +-----+----+-------------+
    
    ```


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-22629

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19929.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19929
    
----
commit 6187d5a0df7c409a49cd636eb74dea9323044c6b
Author: Marco Gaido <[email protected]>
Date:   2017-12-08T20:20:25Z

    [SPARK-22629][PYTHON] Add deterministic flag to pyspark UDF

----


---
