Re: [Scikit-learn-general] Discrepancy in LogisticRegression on Windows vs. Linux with fixed random_state

Robert Kern Thu, 24 Apr 2014 02:54:07 -0700

On Thu, Apr 24, 2014 at 10:37 AM, Sturla Molden <sturla.mol...@gmail.com> wrote:
> Lars Buitinck <larsm...@gmail.com> wrote:
>
>>> - If you provide each thread with its own PRNG, you must make sure the
>>> sequences don't overlap. Just using a different seed for each thread is not
>>> safe either.
>>
>> I'm not sure what you mean by that;
>
> A PRNG will generate a sequence of pseudo-random numbers. Presumably you
> don't want overlapping sequences in your different threads, as it would
> constitute pseudo-sampling.
>
>> my rand(3) manpage says "In order
>> to get reproducible behavior in a threaded application, this state
>> must be made explicit; this can be done using the reentrant function
>> rand_r()." But you're saying that's not enough?
>
> This manpage is plain wrong!
>
> The non-deterministic scheduling of threads means that multithreaded use of
> rand_r() will never be reproducible.


No, that's not true. If your threads don't interact with each other
(which is typical for many applications!), and each thread is using
its own rand_r() state (which is the point of using rand_r() as
opposed to rand()), then the multithreaded use of rand_r() certainly
can be reproducible. Just like using rand_r() in different processes
with their own seeded state can be reproducible. If the threads
interact non-trivially, then yeah, of course things won't be
reproducible because it wouldn't be reproducible even if a PRNG were
not involved. But that's not what the man page is saying.
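
A minimal sketch of this argument, in Python rather than C. It assumes
that a private random.Random instance per thread is a fair stand-in for
a private rand_r() state; the names worker and run are illustrative
only. The threads never touch each other's state, so two runs give the
same per-thread sequences however the threads are scheduled.

```python
import random
import threading


def worker(seed, n, out, idx):
    # Each thread owns its generator, the analogue of a private rand_r() state.
    rng = random.Random(seed)
    out[idx] = [rng.random() for _ in range(n)]


def run(seeds, n=1000):
    # Start one thread per seed and collect each thread's draws by index.
    out = [None] * len(seeds)
    threads = [
        threading.Thread(target=worker, args=(seed, n, out, i))
        for i, seed in enumerate(seeds)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


first = run([1, 2, 3, 4])
second = run([1, 2, 3, 4])
print(first == second)  # True: scheduling order does not affect any thread's draws
```

This only demonstrates reproducibility. Whether streams started from
different seeds are statistically independent is a separate question.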

> You cannot be sure that the kernel
> will make the threads call rand_r in the same pattern twice. In practice it
> will never happen. However, rand_r is reentrant, which is something very
> different from reproducible. However, in most other cases reentrant and
> threadsafe are equivalent, which might be why they think making a PRNG
> reentrant also makes it reproducible in a threaded application. It does
> not.
>
> In order for a PRNG to be reproducible in a threaded application, it must
> always deliver the same sequence to the n-th thread. That is a very hard
> requirement to satisfy.
>
> The DC Mersenne Twister solves this

No, the DC Mersenne Twister solves the *independence* problem by the
following scheme, not the reproducibility problem ("this").

> by encoding thread identifiers into
> the characteristic polynomials in such a way that they are "relatively prime
> to each other". That means that each thread gets an independent stream of
> random numbers. Since there is one Mersenne Twister object per thread the
> kernel's thread scheduling is eliminated as an additional source of
> randomness.

If the thread scheduling is a problem for rand_r(), it will be a
problem for the DC Mersenne Twister.
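
A rough illustration of why, again as a Python sketch. Thread scheduling
is simulated as a fixed list of thread ids so that the result is
deterministic; shared_state and private_state are hypothetical helpers,
and random.Random is the ordinary Mersenne Twister, not the DC variant.
What protects a thread's sequence from the scheduler is having one
generator object per thread, which is the same for any per-thread
generator.

```python
import random


def shared_state(schedule, seed=0):
    # One generator shared by all threads: which thread receives which
    # draw depends on the order in which the threads happen to run.
    rng = random.Random(seed)
    per_thread = {}
    for tid in schedule:
        per_thread.setdefault(tid, []).append(rng.random())
    return per_thread


def private_state(schedule, seeds):
    # One generator per thread: each thread's draws depend only on its own seed.
    rngs = {tid: random.Random(seed) for tid, seed in seeds.items()}
    per_thread = {}
    for tid in schedule:
        per_thread.setdefault(tid, []).append(rngs[tid].random())
    return per_thread


# Two possible interleavings of threads 0 and 1, three draws each.
schedule_a = [0, 1, 0, 1, 0, 1]
schedule_b = [0, 0, 1, 1, 1, 0]
seeds = {0: 10, 1: 20}

print(shared_state(schedule_a) == shared_state(schedule_b))  # False
print(private_state(schedule_a, seeds) == private_state(schedule_b, seeds))  # True
```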

-- 
Robert Kern
