GitHub user ThomasLau opened a pull request:
https://github.com/apache/spark/pull/13350
fix: a forked process random extends parent's random state
## Add random.seed() to daemon.py so each forked worker reseeds before it runs
Here is test code that reproduces the problem:
```python
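from random import random
from operator import add

def funcx(x):
    # Print the sampled point, then report whether it falls inside the unit circle.
    print x[0], x[1]
    return 1 if x[0]**2 + x[1]**2 < 1 else 0

def genRnd(ind):
    # Sample a point uniformly from the square [-1, 1) x [-1, 1).
    x = random() * 2 - 1
    y = random() * 2 - 1
    return (x, y)

def runsp(total):
    # Monte Carlo estimate of pi, computed in a single partition.
    ret = sc.parallelize(xrange(total), 1).map(genRnd).map(funcx).reduce(
        lambda x, y: x + y) / float(total) * 4
    print ret

runsp(3)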
```
Once I start the pyspark shell, no matter how many times I run "runsp(N)"
afterwards, this code always prints the same output:
```
0.896083541418 -0.635625854075
-0.0423532645466 -0.526910255885
0.498518696049 -0.872983895832
1.3333333333333333
>>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) * 4
0.896083541418 -0.635625854075
-0.0423532645466 -0.526910255885
0.498518696049 -0.872983895832
1.3333333333333333
>>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) * 4
0.896083541418 -0.635625854075
-0.0423532645466 -0.526910255885
0.498518696049 -0.872983895832
1.3333333333333333
```
I think this happens because daemon.py imports pyspark.worker, which imports
shuffle.py, which in turn imports random. The random module is therefore
seeded once, in the daemon process. Each worker forked by "pid = os.fork()"
inherits a copy of the parent's random state, so every forked worker returns
the same sequence of values from random().
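The inheritance described above can be reproduced outside Spark. The following is a minimal sketch, not code from daemon.py: `rng`, its seed value, and `first_draw_in_child` are hypothetical names used only for illustration. It uses an explicit `random.Random` instance because newer Python 3 releases reseed the module-level generator in the child after `os.fork()`, which would hide the effect.

```python
import os
import random

# Hypothetical stand-in for the generator state held by the parent (daemon) process.
rng = random.Random(12345)


def first_draw_in_child():
    """Fork a child and return the first number it draws from rng."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: operates on a copy of the parent's generator state.
        os.close(read_fd)
        os.write(write_fd, repr(rng.random()).encode())
        os._exit(0)
    # Parent: never draws from rng itself, so its state does not advance.
    os.close(write_fd)
    with os.fdopen(read_fd) as f:
        value = float(f.read())
    os.waitpid(pid, 0)
    return value


# Every child starts from the same inherited state, so the draws are identical.
draws = [first_draw_in_child() for _ in range(3)]
```

Since the parent never advances `rng`, each child's first draw comes from the same state, which mirrors the repeated output shown above.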
## We need to reseed the generator with random.seed() in the forked worker, which
solves the problem, but I think this PR may not be the proper fix.
Thanks.
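As a rough illustration of the reseeding idea, here is a minimal sketch that is not the actual daemon.py change. `rng`, its seed value, and `first_draw_in_child` are hypothetical names, and an explicit `random.Random` instance stands in for the module-level generator so the behaviour is the same on any Python version.

```python
import os
import random

# Hypothetical stand-in for the generator state inherited from the parent process.
rng = random.Random(12345)


def first_draw_in_child(reseed):
    """Fork a child, optionally reseed rng there, and return its first draw."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(read_fd)
        if reseed:
            # With no argument, seed() draws fresh entropy from the OS
            # (or falls back to the current time).
            rng.seed()
        os.write(write_fd, repr(rng.random()).encode())
        os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd) as f:
        value = float(f.read())
    os.waitpid(pid, 0)
    return value


same = [first_draw_in_child(reseed=False) for _ in range(3)]
fresh = [first_draw_in_child(reseed=True) for _ in range(3)]
```

Without reseeding, the three children repeat one value; with reseeding in the child, each draw should differ.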
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ThomasLau/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13350.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #13350
----
commit 595abab29fb9dd5889885dd4cfd4676caa161601
Author: Thomas <[email protected]>
Date: 2016-05-27T03:52:59Z
fix: a forked process random extends parent's random state
----