On Apr 15, 2011, at 5:51 PM, Gregory Magoon wrote:
> For what it's worth, I value reproducibility for conformer generation,
> and I use the same random seed every time I run it. Out of curiosity,
> what exactly is the issue with minstd_rand?

First, it's a linear congruential algorithm. See

http://en.wikipedia.org/wiki/Linear_congruential_generator

     LCGs should not be used for applications where high-quality
     randomness is critical. For example, it is not suitable for
     a Monte Carlo simulation because of the serial correlation
     (among other things).

     If an LCG is used to choose points in an n-dimensional space,
     the points will lie on, at most, m**(1/n) hyperplanes.

In other words, it's unlikely to be a good source of randomness for  
conformation generation.
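
For reference, here is a minimal Python sketch of the recurrence behind
minstd_rand. It assumes the parameters of the C++ standard's definition
(multiplier 48271, increment 0, modulus 2**31-1); the function name is
made up for illustration, and this is not the code that Boost or RDKit
actually runs. Note that the entire state is the single previous output.

```python
MODULUS = 2**31 - 1    # the "m" in the quote above
MULTIPLIER = 48271     # assumed: the C++ standard's minstd_rand multiplier

def minstd_rand(seed=1):
    """Yield successive values of x -> (48271 * x) mod (2**31 - 1)."""
    x = seed % MODULUS
    if x == 0:
        x = 1  # 0 is a fixed point of the recurrence, so avoid it
    while True:
        x = (MULTIPLIER * x) % MODULUS
        yield x

gen = minstd_rand(seed=1)
first_values = [next(gen) for _ in range(3)]
print(first_values)
```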

Second, it has a cycle length of 2**31-2. This means that after about
2 billion calls it returns to the start of the same sequence. That is
not many: a Mersenne Twister (a better but more computationally
expensive algorithm) takes 0.0371 usec per call on my laptop, so it
can run through 2**31 values in a bit over 1 minute.
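
The arithmetic behind that estimate, as a quick Python check. The
per-call time is the figure quoted above; the variable names are only
for illustration.

```python
SECONDS_PER_CALL = 0.0371e-6   # 0.0371 usec per call, as quoted above
N_VALUES = 2**31               # roughly one full minstd_rand cycle

seconds_to_exhaust = N_VALUES * SECONDS_PER_CALL
print(f"{seconds_to_exhaust:.1f} seconds")   # a bit under 80 seconds
```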

You almost certainly don't generate 2 billion conformations of the
same structure, so you probably think that isn't a problem. However,
that depends on how the random numbers are used. For example, are 3
random numbers used for each atom? If so, a structure with N atoms
consumes 3*N values per conformation, so a 100-atom structure gets
only about 7 million conformations before the sequence repeats. If
you are unlucky, the cycle length is an exact multiple of the number
of values used per calculation, so that after one pass through the
cycle every later calculation exactly repeats an earlier one and
gives no new information.
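
A rough Python sketch of that reasoning. It assumes, purely for
illustration, that each conformation draws a fixed number of values per
atom (3 here); how the random numbers are really consumed may well
differ. The function names are hypothetical.

```python
PERIOD = 2**31 - 2   # cycle length of minstd_rand

def conformations_per_cycle(n_atoms, draws_per_atom=3):
    """How many conformations fit in one pass through the cycle."""
    return PERIOD // (n_atoms * draws_per_atom)

def repeats_exactly(n_atoms, draws_per_atom=3):
    """True in the unlucky case: the cycle length is an exact multiple
    of the draws per conformation, so after one pass through the cycle
    the same conformations come back in the same order."""
    return PERIOD % (n_atoms * draws_per_atom) == 0

print(conformations_per_cycle(100))   # about 7 million
print(repeats_exactly(6))             # 18 draws divides 2**31-2 evenly
```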


Third, it takes a seed of size only 2**31. If you use random seeds,
then by the time you have done about 55K generations there is a 50%
chance that two of them got the same seed and so produced an exact
duplicate conformation. This is the so-called birthday "paradox" or
birthday problem.

   http://en.wikipedia.org/wiki/Birthday_problem

It's almost certain that you neither expect nor want a duplicate, and
that your analysis statistics don't take that possibility into account.
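
The 55K figure can be checked with the usual birthday-problem
approximation, p = 1 - exp(-n*(n-1)/(2*N)), where N is the number of
possible seeds. This is a sketch that assumes seeds are drawn uniformly
at random; the function names are made up for illustration.

```python
import math

SEED_SPACE = 2**31   # number of distinct seeds

def collision_probability(n_runs, space=SEED_SPACE):
    """Approximate chance that at least two of n_runs share a seed."""
    return 1.0 - math.exp(-n_runs * (n_runs - 1) / (2.0 * space))

def runs_for_even_odds(space=SEED_SPACE):
    """Approximate number of runs at which a collision is 50% likely."""
    return math.sqrt(2.0 * space * math.log(2.0))

print(round(runs_for_even_odds()))               # a little under 55K
print(round(collision_probability(55000), 3))    # just over 0.5
```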

It's therefore better to use an RNG which can (and by default does)  
work from a larger initialization vector.
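
As an illustration only, in Python rather than the C++ that RDKit uses:
Python's random module is a Mersenne Twister and accepts an arbitrarily
large integer seed. The sketch below draws a 256-bit seed from the
operating system and returns it so it can be logged, which keeps runs
reproducible. The helper name and the 32-byte default are assumptions
made for the example.

```python
import os
import random

def make_rng(seed=None, seed_bytes=32):
    """Return (rng, seed). With no seed given, draw seed_bytes bytes
    from the OS so the seed space is far larger than 2**31."""
    if seed is None:
        seed = int.from_bytes(os.urandom(seed_bytes), "big")
    return random.Random(seed), seed

rng, seed = make_rng()        # fresh run: record `seed` with the results
replay, _ = make_rng(seed)    # later: reproduce the run from the record
```

Re-using the recorded seed gives back exactly the same sequence, so a
large seed does not cost anything in reproducibility.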


                                Andrew
                                da...@dalkescientific.com



_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
