GitHub user mengxr commented on the pull request:
https://github.com/apache/spark/pull/8314#issuecomment-135880899
Some unit tests and Python doctests do depend on the seed, some more
sensitively than others. I don't think requiring exact output is that bad,
because it at least notifies us of changes in behavior. In Python, doctests are
also used to generate documentation, so it is useful to show actual output
rather than check bounds, e.g.,
https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L459.
There is a trade-off between having meaningful probabilistic bounds and
keeping unit tests small. For example, in Word2Vec, we can increase the
training dataset size to reduce the variance of the model output and hence
make the test robust to the random seed, but that increases the test time too.
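
To illustrate the trade-off, here is a minimal sketch in plain Python (not
Spark or Word2Vec code). The helper names `sample_mean` and
`spread_across_seeds`, the sample sizes, and the 0.05 tolerance are all
assumptions chosen for illustration. It shows that a larger sample makes the
result vary less across seeds, so a bound-based check becomes safe, at the
cost of more work per test.

```python
import random
import statistics


def sample_mean(n, seed):
    """Mean of n uniform(0, 1) draws from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(n)) / n


def spread_across_seeds(n, seeds):
    """Standard deviation of the sample mean when only the seed changes."""
    return statistics.pstdev([sample_mean(n, s) for s in seeds])


seeds = list(range(20))
small = spread_across_seeds(50, seeds)    # cheap, but seed-sensitive
large = spread_across_seeds(5000, seeds)  # 100x the work, far less spread

# Exact-output style: reproducible only while the seed and RNG stay the same.
assert sample_mean(50, 42) == sample_mean(50, 42)

# Bound style: holds for essentially any seed once the sample is large enough.
assert abs(sample_mean(5000, 42) - 0.5) < 0.05

print("spread with n=50:  ", small)
print("spread with n=5000:", large)
```

The exact-output check detects any change in the RNG or the algorithm, while
the bound check survives such changes but needs the larger sample to be
reliable.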
That being said, I can help make those tests less sensitive. Do you mind
making JIRAs for each of them?
Regarding @srowen's question: if adding a commons-math3 dependency is not an
issue and its RNG performs similarly to the one here, I think we shouldn't
maintain our own. However, I'm still a little worried about compatibility
issues between commons-math3 releases.