[GitHub] spark pull request: [SPARK-10116] [core] XORShiftRandom.hashSeed i...

mengxr Fri, 28 Aug 2015 13:30:23 -0700

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/8314#issuecomment-135880899
  
    Some unit tests and Python doctests do depend on the seed, with varying 
sensitivity. I don't think requiring exact output is that bad, because it at 
least notifies us of changes in behavior. In Python, doctests are also used to 
generate documentation, so it is useful to show actual output rather than 
check bounds, e.g., 
https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L459.
    
    There is a trade-off between having meaningful probabilistic bounds and 
keeping unit tests small. For example, in Word2Vec, we can increase the 
training dataset size to reduce the variance of the model output and hence 
make the test robust to the random seed, but that increases the test time too.
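
    To make the trade-off concrete, here is a minimal sketch of the two test 
styles. It uses only Python's standard `random` module; `sample_fraction` is a 
hypothetical helper written for illustration and is not Spark's sampling API. 
The 6-standard-deviation tolerance is likewise an assumption chosen for the 
sketch, not a value taken from Spark's tests.

```python
import random


def sample_fraction(data, fraction, seed):
    """Keep each element independently with probability `fraction`.

    Hypothetical stand-in for a seeded sampling routine; not Spark code.
    """
    rng = random.Random(seed)
    return [x for x in data if rng.random() < fraction]


data = list(range(10000))
p = 0.1

# Style 1: seed-sensitive check. The same seed must reproduce the same
# sample, so any change to the RNG or the seed hashing shows up as a diff.
assert sample_fraction(data, p, seed=42) == sample_fraction(data, p, seed=42)

# Style 2: seed-robust check. The sample size is Binomial(n, p), so it
# should stay within a few standard deviations of n * p for any seed.
n = len(data)
mean = n * p
std = (n * p * (1 - p)) ** 0.5  # 30.0 for n = 10000, p = 0.1
for seed in range(20):
    size = len(sample_fraction(data, p, seed))
    assert abs(size - mean) < 6 * std
```

    With a much smaller `data` list the relative spread of the sample size 
grows, so the bound in the second check has to be loosened until it says very 
little. That is the same tension as in the Word2Vec case: tight bounds need 
more data and therefore more test time.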
    
    That being said, I can help make those tests less sensitive. Do you mind 
making JIRAs for each of them?
    
    Regarding @srowen's question: if adding the commons-math3 dependency is not 
an issue and its RNG performs similarly to the one here, I think we shouldn't 
maintain our own. However, I'm still a little worried about compatibility 
issues between commons-math3 releases.




