Github user mgaido91 commented on the issue:
https://github.com/apache/spark/pull/19929
@gatorsmile, yes, the reason the seed doesn't work lies in the way Python
UDFs are executed: a new Python process is created for each partition to
evaluate the Python UDF. The seed is therefore set only on the driver, not in
the processes where the UDF actually runs. This can easily be confirmed:
```
>>> from pyspark.sql.functions import udf
>>> import os
>>> pid_udf = udf(lambda: str(os.getpid()))
>>> spark.range(2).select(pid_udf()).show()
+----------+
|<lambda>()|
+----------+
| 4132|
| 4130|
+----------+
>>> os.getpid()
4070
```
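The same process isolation can be reproduced with plain Python, without Spark at all (a minimal sketch; the variable names are illustrative): a seed set in the parent process has no effect on a freshly started child process, just as a seed set on the driver has no effect on the worker processes that evaluate the UDF.

```python
import random
import subprocess
import sys

random.seed(42)                 # seed set in the parent ("driver") only
seeded_first = random.random()  # first value of the seeded sequence

# A separate Python process (like a UDF worker) starts with its own
# random state; the parent's seed does not carry over.
out = subprocess.run(
    [sys.executable, "-c", "import random; print(random.random())"],
    capture_output=True, text=True, check=True,
)
child_value = float(out.stdout)

# The child does not reproduce the parent's seeded sequence.
print(child_value != seeded_first)  # almost certainly True
```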
So there is no easy way to set the seed: if I set it inside the UDF, the
UDF would become deterministic. I therefore think the current test is the
best option.
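To see why seeding inside the UDF body defeats the purpose, here is a tiny sketch (plain Python, hypothetical function name): re-seeding on every call pins the generator to the same state, so every invocation returns the same value.

```python
import random

def udf_body():
    # Seeding inside the UDF body resets the generator on every call...
    random.seed(0)
    return random.random()

# ...so the "random" UDF becomes deterministic:
print(udf_body() == udf_body())  # True
```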