Github user josepablocam commented on a diff in the pull request:
https://github.com/apache/spark/pull/7278#discussion_r34176727
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala ---
@@ -158,4 +158,32 @@ object Statistics {
def chiSqTest(data: RDD[LabeledPoint]): Array[ChiSqTestResult] = {
ChiSqTest.chiSquaredFeatures(data)
}
+
+ /**
+ * Conduct a 1-sample Anderson-Darling test for the null hypothesis that the data
+ * comes from a given theoretical distribution. The Anderson-Darling test is an alternative
+ * to the Kolmogorov-Smirnov test, and is more adequate at identifying departures from the
+ * theoretical distribution at the tails. The implementation returns an `ADTestResult`, which
+ * includes the AD statistic, the critical values at varying significance levels, and
+ * the null hypothesis. Note that the critical values are calculated assuming the parameters
+ * have been calculated from the data sample. If the parameters for the theoretical distribution
+ * are not in a valid domain, throws an exception.
+ * @param data `RDD[Double]` the data to be tested
+ * @param distName `String` name of the theoretical distribution to test against. Currently
+ * supports Normal ("norm"), Exponential ("exp"), Gumbel ("gumbel"),
+ * Logistic ("logistic"), and Weibull ("weibull") distributions
+ * @param params A series of optional parameters providing the parameters for the theoretical
+ * distribution. Only the Normal and Exponential distributions support
+ * direct estimation of parameters, all others require that the user provide them.
+ * The order of parameters is as follows:
+ * Normal -> [mu, sigma] (location, scale)
+ * Exponential -> [1 / lambda] (scale)
+ * Gumbel -> [mu, beta] (location, scale)
+ * Logistic -> [mu, s] (location, scale)
+ * Weibull -> [lambda, k] (scale, shape)
+ * @return `ADTestResult`
+ */
+ def adTest(data: RDD[Double], distName: String, params: Double*): ADTestResult = {
--- End diff --
Sure, I've changed that now. I initially went with adTest to parallel R's
ad.test in the nortest package. SciPy's stats module calls its AD test
anderson, which seems a bit mean to Darling.
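
For readers unfamiliar with the statistic the doc comment above describes,
here is a minimal sketch of the textbook Anderson-Darling statistic for the
Exponential case with a user-supplied scale (1 / lambda). It is not the
implementation in this PR: it works on a local Seq rather than an RDD, it
does not compute critical values, and the names AndersonDarlingSketch,
expCdf and adStatistic are hypothetical, chosen only for illustration.

```scala
// Hypothetical helper, NOT the PR's code: the Anderson-Darling statistic
// for a local sample against a fully specified theoretical CDF.
object AndersonDarlingSketch {

  // CDF of the Exponential distribution parameterized by scale = 1 / lambda.
  def expCdf(scale: Double)(x: Double): Double = 1.0 - math.exp(-x / scale)

  // A^2 = -n - (1/n) * sum_{i=1..n} (2i - 1) * [ln F(x_(i)) + ln(1 - F(x_(n+1-i)))]
  // where x_(1) <= ... <= x_(n) is the sorted sample and F is the theoretical CDF.
  def adStatistic(sample: Seq[Double], cdf: Double => Double): Double = {
    require(sample.nonEmpty, "sample must not be empty")
    val sorted = sample.sorted.toIndexedSeq
    val n = sorted.length
    val weightedSum = (1 to n).map { i =>
      val lower = math.log(cdf(sorted(i - 1)))       // ln F(x_(i))
      val upper = math.log(1.0 - cdf(sorted(n - i))) // ln(1 - F(x_(n+1-i)))
      (2 * i - 1) * (lower + upper)
    }.sum
    -n - weightedSum / n
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample values, used only to exercise the function.
    val sample = Seq(0.3, 1.2, 0.7, 2.5, 0.1, 1.9)
    println(adStatistic(sample, expCdf(scale = 1.0)))
  }
}
```

The (2i - 1) weights pair the smallest observations with the largest ones,
so both log terms grow in magnitude near the ends of the sample. That is why
the statistic is more sensitive than Kolmogorov-Smirnov to departures at the
tails.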