On Fri, Mar 02, 2012 at 10:36:14AM +0100, Karl Forner wrote:
[...]
> Hello,
> I would also be in favor of using multiple seeds based on (seed,
> task_number) for convenience (i.e. avoiding storing the seeds),
> and with the possibility of having a dynamic number of tasks, but I am not
> sure it is theoretically correct.
> But I can refer you to this article:
> http://www.agner.org/random/ran-instructions.pdf , section 6.1,
> where the author states:
> > For example, if we make 100 streams of 10^10 random numbers each from an
> > SFMT generator with cycle length ρ = 2^11213, we have a probability of
> > overlap p ≈ 10^3362.
>
> What do you think? I am very concerned about the correctness of this
> approach, so I would appreciate any advice on that matter.
Hi.

First, let us consider a theoretical scenario in which the starting points are chosen independently and from the uniform distribution. Then the above should be OK. SFMT supports several different periods, one of them being 2^11213 - 1. If we choose 100 random starting points in a cycle of this length and consider intervals of length 10^10 each, then two given intervals overlap with probability (2 * 10^10 - 1) / (2^11213 - 1), which is approximately 10^-3365. An upper bound on the probability that any of the 100 intervals overlap is choose(100, 2) * (2 * 10^10 - 1) / (2^11213 - 1), which is still negligible and approximately 10^-3362, as you say, apart from a typo in the sign.

The crucial assumption is that we choose the starting points from the uniform distribution over the period. Every point on the period cycle is represented by a different internal state of the generator, so the uniform distribution on the period is equivalent to the uniform distribution on admissible states. An admissible state is represented by 11213 bits which are not all 0, and the probability of all 0 is extremely small. So, in order to generate one such state, it is sufficient to fill the state array with uniform i.i.d. bits. This is suitable only as a theoretical model, since it can guarantee reproducibility only if we store the whole state.

In order to guarantee reproducibility by simpler means, we cannot fill the initial state with uniform i.i.d. random bits. So we do not have a uniform distribution over the period, but we can have different approximations to it. A cryptographically strong hash function is such an approximation in the sense of computational indistinguishability. It does not, and indeed cannot, approximate the uniform distribution in the statistical sense, since the set of possible outputs is much smaller than the period. However, if someone gives us two initial states, one from the hash function and the other created from uniform i.i.d. bits, then there is no known computational method to distinguish the two in moderate CPU time. So you can safely use the output of the hash function for simulations.

In fact, the requirements of simulations are weaker. For example, MD5 can no longer be used for cryptography, since there are known algorithms to break it. However, if you use it for a simulation, the simulation will be biased only if it contains an algorithm which breaks MD5. The probability that this happens just by chance is small.

Petr Savicky.

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
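For the record, the overlap bound discussed above can be checked numerically. The following sketch (Python is used here purely for its exact big-integer arithmetic; the stream count, stream length, and period are the ones from the thread) computes both probabilities on a log10 scale:

```python
import math

# n streams, each consuming L numbers from a cycle of period 2^11213 - 1.
n = 100
L = 10**10
period = 2**11213 - 1  # one of the supported SFMT periods

# Probability that two specific intervals of length L overlap,
# assuming independent uniform starting points on the cycle.
pair_num = 2 * L - 1
log10_pair = math.log10(pair_num) - math.log10(period)

# Union bound over all choose(n, 2) pairs of streams.
pairs = math.comb(n, 2)
log10_any = math.log10(pairs * pair_num) - math.log10(period)

print(f"log10 P(two fixed streams overlap)   ~ {log10_pair:.1f}")  # ~ -3365
print(f"log10 P(any of {n} streams overlap) <= {log10_any:.1f}")   # ~ -3361
```

Note that math.log10 accepts arbitrarily large Python integers, so the 2^11213 - 1 period needs no special handling.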
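The theoretical model (fill the state array with uniform i.i.d. bits, rejecting the all-zero state) can be sketched in a few lines; the function name and the use of Python's secrets module are my own choices for illustration, not part of any SFMT implementation:

```python
import secrets

STATE_BITS = 11213  # state size of the SFMT variant with period 2^11213 - 1

def random_admissible_state() -> int:
    """Draw a uniform non-zero 11213-bit state (the theoretical model).

    The all-zero state is inadmissible; it is hit with probability
    2^-11213, so the rejection loop essentially never iterates twice.
    """
    while True:
        state = secrets.randbits(STATE_BITS)
        if state != 0:
            return state
```

As noted above, this guarantees a uniform starting point on the period, but reproducing a run later requires storing the whole state.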
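Finally, the (seed, task_number) hashing idea itself can be sketched as follows. This is a minimal illustration, not anything R- or SFMT-specific: SHA-256, the function name, and the use of Python's stdlib generator are all my own choices. Each (master_seed, task_number) pair is hashed into an initial state, so only the master seed and the task number need to be stored to reproduce a stream:

```python
import hashlib
import random

def stream_rng(master_seed: int, task_number: int) -> random.Random:
    """Derive a per-task generator from (master_seed, task_number).

    A cryptographic hash maps the pair to a 256-bit integer used as
    the seed, so the per-task streams are computationally
    indistinguishable from independently seeded ones.
    """
    msg = f"{master_seed}:{task_number}".encode()
    digest = hashlib.sha256(msg).digest()
    return random.Random(int.from_bytes(digest, "big"))

# Reproducible: the same pair always yields the same stream ...
print(stream_rng(42, 0).random() == stream_rng(42, 0).random())  # True
# ... while different task numbers yield unrelated-looking streams.
print(stream_rng(42, 0).random() != stream_rng(42, 1).random())  # True
```

The number of tasks can grow dynamically, since task_number enters only through the hash and no per-task seed table is kept.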