Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19929
  
    @gatorsmile, yes, the reason the seed doesn't work lies in the way Python 
UDFs are executed: a new Python process is created for each partition to 
evaluate the UDF. The seed is therefore set only on the driver, not in the 
worker processes where the UDF actually runs. This is easy to confirm:
    ```
    >>> from pyspark.sql.functions import udf
    >>> import os
    >>> pid_udf = udf(lambda: str(os.getpid()))
    >>> spark.range(2).select(pid_udf()).show()
    +----------+
    |<lambda>()|
    +----------+
    |      4132|
    |      4130|
    +----------+
    >>> os.getpid()
    4070
    ```
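    The same behavior can be reproduced without Spark. In this sketch a fresh 
interpreter launched via `subprocess` stands in for a PySpark worker process 
(that substitution is my assumption; the variable names are illustrative): 
the seed set in the parent has no effect on the child's random state.

    ```python
    import random
    import subprocess
    import sys

    random.seed(42)                  # seed set in the "driver" process
    first_seeded = random.random()   # deterministic: always 0.6394267984578837

    # A brand-new interpreter, like a PySpark worker: it does not inherit
    # the parent's random state, so the parent's seed is invisible to it.
    child_out = subprocess.run(
        [sys.executable, "-c", "import random; print(random.random())"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    child_value = float(child_out)

    print(first_seeded, child_value)
    ```

    The child's value comes from its own OS-entropy seed, so it changes on 
every run, exactly as the PIDs above show for the UDF workers.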
    So there is no easy way to set the seed: if I set it inside the UDF, the 
UDF would become deterministic, since every call would reseed and replay the 
same sequence. I therefore think the current test is the best option.
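    For reference, a plain-Python sketch of why seeding inside the UDF body 
defeats the purpose (function names are mine, purely illustrative):

    ```python
    import random

    def rand_udf_seeded():
        # Reseeding on every call resets the generator each time, so each
        # invocation replays the same sequence: the "random" UDF is constant.
        random.seed(42)
        return random.random()

    def rand_udf():
        # Without the per-call seed the values vary, but they cannot be made
        # reproducible from the driver, since workers are separate processes.
        return random.random()

    print(rand_udf_seeded() == rand_udf_seeded())  # always the same value
    ```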


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
