[GitHub] spark pull request: [SPARK-8884] [MLlib] 1-sample Anderson-Darling...

josepablocam Wed, 08 Jul 2015 10:47:48 -0700

Github user josepablocam commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7278#discussion_r34176727
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala ---
    @@ -158,4 +158,32 @@ object Statistics {
       def chiSqTest(data: RDD[LabeledPoint]): Array[ChiSqTestResult] = {
         ChiSqTest.chiSquaredFeatures(data)
       }
    +
    +  /**
    +   * Conduct a 1-sample Anderson-Darling test for the null hypothesis that the data
    +   * comes from a given theoretical distribution. The Anderson-Darling test is an alternative
    +   * to the Kolmogorov-Smirnov test, and is better suited to identifying departures from the
    +   * theoretical distribution at the tails. The implementation returns an `ADTestResult`, which
    +   * includes the AD statistic, the critical values at varying significance levels, and
    +   * the null hypothesis. Note that the critical values are calculated assuming the parameters
    +   * have been estimated from the data sample. If the parameters for the theoretical distribution
    +   * are not in a valid domain, throws an exception.
    +   * @param data `RDD[Double]` the data to be tested
    +   * @param distName `String` name of the theoretical distribution to test against. Currently
    +   *                supports Normal ("norm"), Exponential ("exp"), Gumbel ("gumbel"),
    +   *                Logistic ("logistic"), and Weibull ("weibull") distributions
    +   * @param params An optional series of parameters for the theoretical
    +   *               distribution. Only the Normal and Exponential distributions support
    +   *               direct estimation of parameters; all others require that the user provide them.
    +   *               The order of parameters is as follows:
    +   *               Normal -> [mu, sigma] (location, scale)
    +   *               Exponential -> [1 / lambda] (scale)
    +   *               Gumbel -> [mu, beta] (location, scale)
    +   *               Logistic -> [mu, s] (location, scale)
    +   *               Weibull -> [lambda, k]  (scale, shape)
    +   * @return `ADTestResult`
    +   */
    +  def adTest(data: RDD[Double], distName: String, params: Double*): ADTestResult = {
    --- End diff --
    
    Sure. I've changed that now. I initially went with adTest to parallel
    R's ad.test in the nortest package. SciPy names its AD test
    scipy.stats.anderson, which seems a bit mean to Darling.
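
For readers unfamiliar with the statistic the doc comment in the diff refers to, here is a minimal, Spark-free sketch of the Anderson-Darling statistic for a fully specified distribution. It is not the code from this pull request: the object and method names (AndersonDarlingSketch, exponentialCdf, adStatistic) are hypothetical, it runs on a local Seq rather than an RDD, and it computes only the statistic, not the critical values. The exponential CDF is used because it has a closed form, with the scale parameterization (1 / lambda) listed in the doc comment.

```scala
// Hypothetical illustration only; names and structure are assumptions,
// not the implementation proposed in the pull request.
object AndersonDarlingSketch {

  /** CDF of an exponential distribution parameterized by scale = 1 / lambda. */
  def exponentialCdf(scale: Double)(x: Double): Double =
    if (x <= 0.0) 0.0 else 1.0 - math.exp(-x / scale)

  /**
   * Anderson-Darling statistic for a sample against a theoretical CDF F:
   *   A^2 = -n - (1/n) * sum_{i=1..n} (2i - 1) * [ln F(x_(i)) + ln(1 - F(x_(n+1-i)))]
   * where x_(1) <= ... <= x_(n) is the sorted sample.
   */
  def adStatistic(sample: Seq[Double], cdf: Double => Double): Double = {
    require(sample.nonEmpty, "sample must not be empty")
    val sorted = sample.sorted.toArray
    val n = sorted.length
    val weightedSum = (0 until n).map { i =>
      // i is zero-based here, so the one-based weight (2i - 1) becomes (2i + 1)
      // and the mirrored order statistic x_(n+1-i) sits at index n - 1 - i.
      val weight = 2 * i + 1
      weight * (math.log(cdf(sorted(i))) + math.log(1.0 - cdf(sorted(n - 1 - i))))
    }.sum
    -n - weightedSum / n
  }
}
```

A call such as AndersonDarlingSketch.adStatistic(Seq(0.3, 1.2, 0.7), AndersonDarlingSketch.exponentialCdf(1.0)) returns the statistic for a unit-scale exponential. The log terms weight both ends of the sorted sample, which is why the test is more sensitive than Kolmogorov-Smirnov to departures in the tails.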



