GitHub user ThomasLau opened a pull request:

    https://github.com/apache/spark/pull/13350

    fix: a forked process random extends parent's random state

    ## Add random.seed() before the forked worker runs in daemon.py
    Here is some test code:
    
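    ```python
    from random import random
    from operator import add

    def funcx(x):
      # Print the point, then return 1 if it lies inside the unit circle.
      print x[0], x[1]
      return 1 if x[0]**2 + x[1]**2 < 1 else 0

    def genRnd(ind):
      # Draw a random point in the square [-1, 1) x [-1, 1).
      x = random() * 2 - 1
      y = random() * 2 - 1
      return (x, y)

    def runsp(total):
      # Monte Carlo estimate of pi; sc is the SparkContext of the pyspark shell.
      ret = sc.parallelize(xrange(total), 1).map(genRnd).map(funcx).reduce(add) / float(total) * 4
      print ret

    runsp(3)
    ```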
    
    Once I start the pyspark shell, no matter how many times I run "runsp(N)"
    afterwards, this code always prints the same values:
    
    ```
    0.896083541418 -0.635625854075
    -0.0423532645466 -0.526910255885
    0.498518696049 -0.872983895832
    1.3333333333333333
    >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) * 4
    0.896083541418 -0.635625854075
    -0.0423532645466 -0.526910255885
    0.498518696049 -0.872983895832
    1.3333333333333333
    >>> sc.parallelize(xrange(total),1).map(genRnd).map(funcx).reduce(add)/float(total) * 4
    0.896083541418 -0.635625854075
    -0.0423532645466 -0.526910255885
    0.498518696049 -0.872983895832
    1.3333333333333333
    ```
    
    I think this happens because importing pyspark.worker in daemon.py also
    imports the random module (through shuffle.py, which pyspark.worker
    imports). Each worker is forked with "pid = os.fork()" and therefore
    inherits the state of the parent's random generator, so every forked
    worker produces the same sequence of random values.
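The inheritance described above can be reproduced outside Spark. The following is a minimal sketch, not code from daemon.py: the names `rng` and `first_draw_in_child` are hypothetical, and it assumes a POSIX system where `os.fork()` is available. It uses an explicit `random.Random` instance because Python 3.7 and later reseed the module-level generator automatically in forked children, which would hide the effect.

```python
import os
import random

# Hypothetical stand-in for generator state created before the daemon forks.
rng = random.Random(42)

def first_draw_in_child():
    """Fork a child and return the first value it draws from rng."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: works on a copy of the parent's generator state.
        os.close(read_fd)
        os.write(write_fd, repr(rng.random()).encode("ascii"))
        os._exit(0)
    # Parent: read the child's value, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        value = float(pipe.read())
    os.waitpid(pid, 0)
    return value

draws = [first_draw_in_child() for _ in range(3)]
print(draws)  # identical values: the parent never advances rng between forks
```

Every child starts from the same copied state, which matches the repeated output shown earlier.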
    
    ## We need to reseed the generator with random.seed(), which solves the problem, but I think this PR may not be the proper fix.
    Thanks.
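As a rough illustration of the reseeding idea, the sketch below calls `seed()` in the child right after the fork. It is an assumption about where such a call could go, not the actual change to daemon.py; `rng` and `first_draw_in_reseeded_child` are hypothetical names, and an explicit `random.Random` instance stands in for the inherited state.

```python
import os
import random

# Hypothetical stand-in for generator state created before the daemon forks.
rng = random.Random(42)

def first_draw_in_reseeded_child():
    """Fork a child, reseed rng there, and return the child's first draw."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(read_fd)
        # No argument: seed from OS entropy, or the system time as a fallback.
        rng.seed()
        os.write(write_fd, repr(rng.random()).encode("ascii"))
        os._exit(0)
    # Parent: read the child's value, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        value = float(pipe.read())
    os.waitpid(pid, 0)
    return value

reseeded_draws = [first_draw_in_reseeded_child() for _ in range(3)]
print(reseeded_draws)  # values now differ from child to child
```

Reseeding only in the child leaves the parent's state untouched, so the parent's own sequence stays reproducible if it was seeded deliberately.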
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ThomasLau/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13350.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13350
    
----
commit 595abab29fb9dd5889885dd4cfd4676caa161601
Author: Thomas <[email protected]>
Date:   2016-05-27T03:52:59Z

    fix: a forked process random extends parent's random state

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
