Hi there, I have been using the multiprocessing module a lot for statistical work such as Monte Carlo simulations and resampling tests, and I have just discovered something that makes me wonder whether I have been accumulating false results. Given two files:
=== test.py ===
from test_helper import task
from multiprocessing import Pool

p = Pool(4)
jobs = list()
for i in range(4):
    jobs.append(p.apply_async(task, (4, )))
print [j.get() for j in jobs]
p.close()
p.join()

=== test_helper.py ===
import numpy as np

def task(x):
    return np.random.random(x)
=======

If I run test.py, I get:

[array([ 0.35773964,  0.63945684,  0.50855196,  0.08631373]),
 array([ 0.35773964,  0.63945684,  0.50855196,  0.08631373]),
 array([ 0.35773964,  0.63945684,  0.50855196,  0.08631373]),
 array([ 0.65357725,  0.35649382,  0.02203999,  0.7591353 ])]

In other words, the worker processes give me exactly the same results (presumably the fourth array differs only because one of the four workers picked up two jobs, so its generator state had already advanced). Now I understand why this is the case: the different instances of the random number generator were created by forking from the same process, so each worker starts with an identical copy of the very same generator state. This is, however, a fairly nasty trap, and I suspect other people will fall into it. The take-home message is: **call 'numpy.random.seed()' in each process when you are using multiprocessing**.

I wonder if we can find a way to make this more user-friendly. Would it be easy, in the C code, to check whether the PID has changed and, if so, to reseed the random number generator? I can open a ticket for this if people think it is desirable (I think it is).

On a side note, a score of functions in numpy.random have __module__ set to None. This makes them inconvenient to use with multiprocessing, since they cannot be pickled by name (for instance, it is what forced the creation of the 'test_helper' file here).

Gaël
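
PS: in the meantime, here is a minimal workaround sketch for the example above (the file name 'test_fixed.py' is made up, and it assumes a fork-based Pool, i.e. Unix): pass numpy.random.seed as the Pool's initializer, so that each worker draws a fresh seed from the OS right after it is created:

=== test_fixed.py ===
from test_helper import task
from multiprocessing import Pool
import numpy as np

# Reseed once per worker, right after the fork: without this, every
# child inherits a byte-for-byte copy of the parent's generator state.
# numpy.random.seed() called with no argument pulls a fresh seed from
# the operating system.
p = Pool(4, initializer=np.random.seed)

jobs = [p.apply_async(task, (4, )) for i in range(4)]
print [j.get() for j in jobs]
p.close()
p.join()
=======

Calling numpy.random.seed() at the top of task itself would have the same effect, at the cost of reseeding on every call.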