Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/19929

@gatorsmile, yes, the reason the seed doesn't work lies in the way Python UDFs are executed: a new Python process is created for each partition to evaluate the UDF. The seed is therefore set only on the driver, not in the processes where the UDF actually runs. This is easy to confirm:

```
>>> from pyspark.sql.functions import udf
>>> import os
>>> pid_udf = udf(lambda: str(os.getpid()))
>>> spark.range(2).select(pid_udf()).show()
+----------+
|<lambda>()|
+----------+
|      4132|
|      4130|
+----------+
>>> os.getpid()
4070
```

So there is no easy way to set the seed: if I set it inside the UDF, the UDF would become deterministic. Therefore I think the current test is the best option.
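The same driver-vs-worker seeding gap can be reproduced without Spark at all. Below is a minimal, hypothetical sketch using a plain `subprocess` child interpreter as a stand-in for the per-partition Python worker Spark launches: seeding the parent ("driver") process has no effect on a freshly started child.

```python
import random
import subprocess
import sys

def worker_draw():
    # Launch a brand-new Python interpreter, analogous to the Python worker
    # Spark starts per partition: it never sees the parent's random.seed().
    out = subprocess.run(
        [sys.executable, "-c", "import random; print(random.random())"],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout)

random.seed(42)   # seed set only in this ("driver") process
a = worker_draw()
random.seed(42)   # re-seed identically before the second "partition"
b = worker_draw()
# a and b come from unseeded child interpreters, so the identical parent
# seed does not make them reproducible.
```

In-process draws after `random.seed(42)` are of course reproducible; only the values produced in the child interpreters escape the seed, which mirrors why seeding on the driver cannot make a non-deterministic Python UDF reproducible.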