Github user sryza commented on a diff in the pull request:

    https://github.com/apache/spark/pull/6994#discussion_r33191664
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala ---
    @@ -158,4 +158,44 @@ object Statistics {
       def chiSqTest(data: RDD[LabeledPoint]): Array[ChiSqTestResult] = {
         ChiSqTest.chiSquaredFeatures(data)
       }
    +
    +  /**
    +   * Conduct a one-sample, two sided Kolmogorov Smirnov test for probability distribution equality
    +   * @param data an `RDD[Double]` containing the sample of data to test
    +   * @param cdf a `Double => Double` function to calculate the theoretical CDF at a given value
    +   * @return KSTestResult object containing test statistic, p-value, and null hypothesis.
    +   */
    +  def ksTest(data: RDD[Double], cdf: Double => Double): KSTestResult = {
    +    KSTest.testOneSample(data, cdf)
    +  }
    +
    +  /**
    +   * Conduct a one-sample, two sided Kolmogorov Smirnov test for probability distribution equality,
    +   * which creates only 1 distribution object per partition (useful in conjunction with Apache
    +   * Commons Math distributions)
    +   * @param data an `RDD[Double]` containing the sample of data to test
    +   * @param distCalc a `Iterator[(Double, Double, Double)] => Iterator[Double]` function, to
    +   *                 calculate the distance between empirical values and theoretical values of
    +   *                 a distribution. The first element corresponds to the value x, the second
    +   *                 element is the lower bound of the empirical CDF, while the third element is
    +   *                 the upper bound. Thus if we call triple associated with an observation T, the
    +   *                 KS distance at that point is max(Pr[X <= T._1] - T._2, T._3 - Pr[X <= T._1])
    +   * @return KSTestResult object containing test statistic, p-value, and null hypothesis.
    +   */
    +  def ksTestOpt(data: RDD[Double],
    --- End diff --
    
    I'd leave this API out of the first pass. There are definitely situations
    where it's useful, but it's likely to confuse users. We can always add it
    later if there's demand.
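
For context, here is a minimal local (non-Spark) sketch of the statistic that the simpler `ksTest(data, cdf)` entry point is documented to compute, using the same lower/upper empirical CDF bounds that the `distCalc` Scaladoc describes. The names `KsSketch` and `ksStatistic` are hypothetical and are not part of the PR; this is an illustration under those assumptions, not the PR's implementation.

```scala
// Local, non-distributed sketch of the one-sample, two-sided KS statistic.
// `KsSketch` and `ksStatistic` are hypothetical names, not Spark APIs.
object KsSketch {
  def ksStatistic(sample: Seq[Double], cdf: Double => Double): Double = {
    require(sample.nonEmpty, "sample must be non-empty")
    val n = sample.size.toDouble
    sample.sorted.zipWithIndex.map { case (x, i) =>
      val lower = i / n          // empirical CDF just below x (T._2 in the Scaladoc)
      val upper = (i + 1) / n    // empirical CDF at x (T._3 in the Scaladoc)
      val theoretical = cdf(x)   // Pr[X <= x] under the null hypothesis
      // KS distance at this point, as given in the Scaladoc formula
      math.max(theoretical - lower, upper - theoretical)
    }.max                        // the test statistic is the largest distance
  }
}
```

With a uniform CDF on [0, 1], for example `x => math.min(1.0, math.max(0.0, x))`, calling `KsSketch.ksStatistic(Seq(0.7, 0.1, 0.4), cdf)` gives a statistic of roughly 0.3. A plain `Double => Double` like this is the only thing a caller has to supply, which is the case the simpler API covers.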


