Hello MXNet devs,

I wanted to see what people thought about the following section of code,
which I think has some subtle pros and cons:
https://github.com/apache/incubator-mxnet/blob/d2a856a3a2abb4e72edc301b8b821f0b75f30722/src/resource.cc#L188

Tobi (tdomhan) from Sockeye pointed it out to me after he spent some time
debugging non-determinism in his model training.

This functionality is well documented here:
https://mxnet.incubator.apache.org/api/python/ndarray.html#mxnet.random.seed
but I don't think the current API meets all use cases, due to this section:

"Random number generators in MXNet are device specific. Therefore, random
numbers generated from two devices can be different even if they are seeded
using the same seed."
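
For concreteness, here's a minimal sketch of the behaviour (assuming a
machine with at least two GPUs; the seed and shape are arbitrary):

    import mxnet as mx

    mx.random.seed(128)
    a = mx.nd.random.uniform(shape=(3,), ctx=mx.gpu(0))

    mx.random.seed(128)
    b = mx.nd.random.uniform(shape=(3,), ctx=mx.gpu(1))

    # Despite the identical seed, the two draws generally differ,
    # because each device keeps its own generator state.
    print(a.asnumpy())
    print(b.asnumpy())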

I'm guessing this is a feature that makes distributed training easier in
MXNet: you wouldn't want to train the same model on each GPU.  However, the
downside is that if you run unit tests on a multi-GPU system, or in a
training environment where you don't control which GPU you get, you can't
count on deterministic behaviour to assert results against.  I suspect
there are also non-unit-test use cases where you'd want deterministic
behaviour independent of which GPU your code happens to be scheduled on.

How do others feel about this?  Would it make sense to add an optional
argument to the seed call that turns off the seed-per-device functionality?
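
As a straw man (the argument name below is purely hypothetical, not an
existing parameter):

    # Hypothetical extension; 'device_independent' does not exist today.
    # Default (False) would keep the current per-device behaviour, so
    # distributed training is unaffected; True would make every device
    # produce the same sequence for a given seed.
    mx.random.seed(128, device_independent=True)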

-Kellen
