Github user mattf commented on the pull request:

    https://github.com/apache/spark/pull/2313#issuecomment-56129891
  
    that's a very good point, especially about how it's an unsolved problem in 
general, at least on our existing operating systems. iirc, systems like plan9 
tried to address complete reproducibility, but i may be misremembering the 
specifics.
    
    the four stated cases are:
     a) driver w/ numpy, worker w/ numpy - numpy used, no message emitted
     b) driver w/ numpy, worker w/o numpy - numpy used on driver, not used on 
workers, warning emitted
     c) driver w/o numpy, worker w/ numpy - numpy used on neither driver nor workers, no message emitted
     d) driver w/o numpy, worker w/o numpy - numpy used on neither driver nor workers, no message emitted
    
    case (a) is not a concern because numpy is used consistently throughout
    case (b) is not a concern because python's random module is used consistently throughout the workers
    cases (c) and (d) are not a concern because python's random module is used throughout
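
    for concreteness, the detection pattern under discussion looks roughly like 
this (a minimal sketch, not the actual RDDSampler code; the class and 
attribute names are illustrative):

    ```
    import random
    import warnings

    class Sampler(object):
        """sketch of the numpy-or-fallback pattern (names illustrative)"""

        def __init__(self, seed):
            try:
                import numpy
                # numpy present: use the faster generator
                self._rng = numpy.random.RandomState(seed)
                self._use_numpy = True
            except ImportError:
                # numpy absent: warn and fall back to the stdlib generator
                warnings.warn("numpy not installed; falling back to the "
                              "slower built-in random module for sampling")
                self._rng = random.Random(seed)
                self._use_numpy = False
    ```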
    
    however, there's a fifth case:
     e) driver w/ numpy, some workers w/ numpy, some workers w/o numpy
    
    there's actually a sixth case, but it's intractable for spark and shouldn't 
be considered: different implementations of python random or numpy's random 
across workers. this is something that should be managed outside of spark.
    
    in (e), some workers will use numpy and others will use random. previously, 
all workers w/o numpy would error out, potentially terminating the computation. 
now, a warning will be emitted (though it'll be emitted to /dev/null) and 
execution will complete.
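
    to make the re-computability risk concrete: given the same seed, numpy's 
generator and python's built-in generator produce different streams, so a task 
retried on a w/o-numpy worker will sample different rows than the original 
attempt. a quick illustration (stdlib and numpy only; the printed values follow 
from each library's default mersenne twister seeding):

    ```
    import random
    import numpy

    seed = 42
    py_rng = random.Random(seed)
    np_rng = numpy.random.RandomState(seed)

    # same seed, different algorithms -> different streams, so a retried
    # task that falls back from numpy to random resamples different rows
    print(py_rng.random())         # 0.6394267984578837
    print(np_rng.random_sample())  # 0.3745401188473625
    ```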
    
    i'd solve this with a systems approach: remove the python random code and 
require numpy to be present, or remove the numpy code. i'd lean toward keeping 
the faster code (numpy). however, that might not be palatable for the 
project. if it is, i'm more than happy to scrap this ticket and create 
another to simplify the RDDSampler.
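
    if requiring numpy turns out to be acceptable, the enforcement is small: 
fail fast at import time instead of warning per worker. a sketch (the error 
text is mine):

    ```
    # a hard dependency makes every run use the same generator everywhere
    try:
        import numpy
    except ImportError:
        raise ImportError("sampling requires numpy on the driver and on "
                          "every worker; please install numpy")
    ```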
    
    as i see it, to proceed we evaluate:
    
    ```
    if acceptable to require numpy:
       matt.dothat()
    else:
       if acceptable to potentially compromise re-computability w/ warning:
          commit this
       else:
          scrap this
    ```
    
    (i've left out the case where we decide to simplify by always using the slower 
python code, because i'd rather not trade off performance just to avoid an error 
message, and i think adding a numpy dep is straightforward)

