Github user mattf commented on the pull request:
https://github.com/apache/spark/pull/2313#issuecomment-56129891
that's a very good point, especially about how it's an unsolved problem in
general, at least on our existing operating systems. iirc, systems like plan9
tried to address complete reproducibility, but i may be misremembering the
specifics.
the four stated cases are:
a) driver w/ numpy, worker w/ numpy - numpy used, no message emitted
b) driver w/ numpy, worker w/o numpy - numpy used on the driver, not on the workers, warning emitted
c) driver w/o numpy, worker w/ numpy - numpy not used on driver or workers, no message emitted
d) driver w/o numpy, worker w/o numpy - numpy not used on driver or workers, no message emitted
case (a) is not a concern because numpy is used consistently throughout
case (b) is not a concern because python's random module is used consistently across the workers
cases (c) and (d) are not a concern because python's random module is used throughout
however, there's a fifth case:
e) driver w/ numpy, some workers w/ numpy, some workers w/o numpy
there's actually a sixth case, but it's intractable for spark and shouldn't
be considered: different implementations of python random or numpy's random
across workers. this is something that should be managed outside of spark.
in (e), some workers will use numpy and others will use python's random. previously,
all workers w/o numpy would error out, potentially terminating the computation.
now, a warning will be emitted (though it'll be emitted to /dev/null) and
execution will complete.
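for concreteness, here's a minimal sketch of the behavior described above - not the actual RDDSampler code, and the class/method names (SamplerSketch, keep) are made up for illustration:
```python
import random
import sys


class SamplerSketch(object):
    """Illustrative sketch, not the actual RDDSampler: the driver decides
    whether to attempt numpy, and a worker that can't honor that decision
    falls back to python's random with a warning instead of raising."""

    def __init__(self, seed):
        self._seed = seed
        self._draw = None  # callable returning a float in [0, 1)
        try:
            import numpy  # runs on the driver when the sampler is built
            self._try_numpy = True
        except ImportError:
            self._try_numpy = False  # cases (c)/(d): python random everywhere

    def _init_draw(self):
        # runs lazily on the worker, after the sampler object was shipped there
        if self._try_numpy:
            try:
                import numpy
                self._draw = numpy.random.RandomState(self._seed).random_sample
                return
            except ImportError:
                # cases (b)/(e): warn and keep going instead of failing the task
                # (worker stderr is often discarded, hence "/dev/null" above)
                sys.stderr.write("NumPy not installed on this worker; falling "
                                 "back to python's random generator\n")
        self._draw = random.Random(self._seed).random

    def keep(self, fraction):
        if self._draw is None:
            self._init_draw()
        return self._draw() < fraction
```
the point is just that the numpy check happens in two places (driver at construction, worker at first use), which is what opens the door to case (e).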
i'd solve this with a systems approach: either remove the python random code and
require numpy to be present, or remove the numpy code. i'd lean toward keeping
the faster code (numpy). however, that might not be palatable for the
project. if it is, i'm more than happy to scrap this ticket and create
another to simplify the RDDSampler.
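for comparison, the numpy-only route is basically this (again, made-up names, just a sketch): import numpy unconditionally so a missing numpy fails fast instead of silently switching generators per worker.
```python
import numpy


class NumpyOnlySamplerSketch(object):
    """Sketch of the simplified route: require numpy everywhere and drop the
    python-random fallback entirely. Illustrative names, not a real patch."""

    def __init__(self, seed):
        self._seed = seed
        self._draw = None

    def keep(self, fraction):
        if self._draw is None:
            # fails immediately on any worker without numpy - no silent divergence
            self._draw = numpy.random.RandomState(self._seed).random_sample
        return self._draw() < fraction
```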
as i see it, to proceed we evaluate -
```
if acceptable to require numpy:
    matt.dothat()
else:
    if acceptable to potentially compromise re-computability w/ warning:
        commit this
    else:
        scrap this
```
(i've left out the case where we decide to simplify by always using the slower
python code, because i'd rather not trade performance to avoid an error
message, and i think adding a numpy dependency is straightforward)