[GitHub] spark pull request: [SPARK-8598] [MLlib] Implementation of 1-sampl...

josepablocam Wed, 01 Jul 2015 20:45:38 -0700

Github user josepablocam commented on the pull request:

    https://github.com/apache/spark/pull/6994#issuecomment-117897128
  
    @mengxr While testing yesterday I ran into an issue. I didn't realize that
the pom.xml for Hadoop sets the commons-math3 library to 3.1.1 rather than
3.4.1. The code currently uses KolmogorovSmirnovTest from commons-math3 3.4.1
to calculate the p-value of the statistic. KolmogorovSmirnovDistribution, which
is what 3.1.1 provides, is not a workable substitute, as it hangs for any
non-trivial sample size n. I tried to find a reasonable alternative yesterday
and today, but the best I could come up with was dropping the p-value and
instead returning approximate critical values (calculated using an
approximation formula for various significance levels at a given sample size,
as in
http://onlinelibrary.wiley.com/store/10.1002/9781119961260.app3/asset/app3.pdf?v=1&t=iblna6p5&s=72caab7fe494f67a317737f14dc045055aaae2bf).
Another alternative might be to copy the 3.4.1 code for the 1-sample
distribution, but there is a fair amount of it. The 2-sample distribution code
is short, so I feel like there might be a way to use it to approximate the
1-sample distribution, but I am not familiar enough with it to say for sure.
Any thoughts?
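
For illustration, the critical-value idea could look roughly like the sketch
below. It assumes the standard large-sample approximation for the two-sided
1-sample test, D_alpha ~ sqrt(-ln(alpha / 2) / 2) / sqrt(n), which may differ
from the formula in the linked appendix and is only reasonable for larger n.
The function names are hypothetical and not part of the PR or of commons-math3.

```python
import math


def ks_critical_value(n: int, alpha: float) -> float:
    """Approximate two-sided 1-sample KS critical value for sample size n.

    Assumption: uses the asymptotic formula
    sqrt(-0.5 * ln(alpha / 2)) / sqrt(n), which is a large-n approximation.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c_alpha / math.sqrt(n)


def rejects_null(statistic: float, n: int, alpha: float = 0.05) -> bool:
    """Hypothetical helper: reject when the KS statistic exceeds the critical value."""
    return statistic > ks_critical_value(n, alpha)
```

With this approach a caller would compare the computed KS statistic against
the critical value for a chosen significance level rather than inspect a
p-value, so the result is a reject / do-not-reject decision per level.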




