Github user josepablocam commented on the pull request:
https://github.com/apache/spark/pull/6994#issuecomment-117897128
@mengxr While testing yesterday I ran into an issue. I hadn't realized that
the pom.xml for Hadoop sets the commons-math3 library to 3.1.1 rather than
3.4.1. The KolmogorovSmirnovTest class from math3 3.4.1 is currently used to
calculate the p-value of the statistic. The KolmogorovSmirnovDistribution,
which is what is available in 3.1.1, is not a workable substitute, as it hangs
for any non-trivial value of n (sample size). I tried to
find a reasonable alternative yesterday/today, but the best I could come up
with was doing away with the p-value and instead returning approximate critical
values (calculated using an approximation formula for various significance
levels at a given sample size, as in
http://onlinelibrary.wiley.com/store/10.1002/9781119961260.app3/asset/app3.pdf?v=1&t=iblna6p5&s=72caab7fe494f67a317737f14dc045055aaae2bf).
Another alternative might be to copy the 3.4.1 code for the 1-sample
distribution, but there is a fair amount of it. The 2-sample distribution code
is short, so I feel like there might be a way to take advantage of that to
approximate the 1-sample distribution, but I am not familiar enough with it to
say for sure. Any thoughts?
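
To make the critical-value option concrete, here is a minimal sketch in Python (for brevity; it is not the Scala that would go into the PR). It assumes the standard large-sample approximation for the two-sided 1-sample test, D_crit ~ sqrt(-ln(alpha / 2) / 2) / sqrt(n), which may differ from the exact formula in the appendix linked above. The function names `ks_critical_value` and `rejects_null` are hypothetical and exist only for illustration.

```python
import math


def ks_critical_value(n, alpha):
    """Approximate critical value of the 1-sample, two-sided KS statistic.

    Assumption: uses the asymptotic formula
    c(alpha) / sqrt(n) with c(alpha) = sqrt(-ln(alpha / 2) / 2),
    which is only reliable for reasonably large n.
    """
    if n <= 0:
        raise ValueError("n must be a positive sample size")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c_alpha / math.sqrt(n)


def rejects_null(d_stat, n, alpha=0.05):
    """Return True when the observed statistic exceeds the critical value."""
    return d_stat > ks_critical_value(n, alpha)


if __name__ == "__main__":
    # Print the approximate critical values at a few common significance levels.
    for alpha in (0.10, 0.05, 0.01):
        print(alpha, round(ks_critical_value(1000, alpha), 4))
```

With this approach the caller gets a reject / do-not-reject decision at a chosen significance level instead of a p-value, which is the trade-off described above.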