[
https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621491#comment-14621491
]
Jose Cambronero commented on SPARK-8884:
----------------------------------------
You can find it in the R nortest library and in SciPy's stats library. The use
cases are the same as for KS, with the advantage that it is better suited to
detecting deviations in the tails of the distribution. It gives users an
alternative to KS, a la "more than one way to skin a cat".
The statistic is computed as a sum, so the algorithm just decomposes that sum
into 2 portions: one that we can calculate on a per-partition basis, and a
remaining portion that we scale by a factor and add in at the end. I can
write up a clear step-by-step breakdown from the original formula to this one,
if that is something people might find useful.
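As a rough illustration of that kind of decomposition (a sketch in plain
Python, not the code in the patch), the textbook statistic
A^2 = -n - (1/n) * sum_i (2i - 1) * [ln F(x_i) + ln(1 - F(x_{n+1-i}))]
can be rearranged so that each partition of the sorted data only needs its
local ranks, and the global offset is applied as a scaling factor at the end.
All function names below are hypothetical, and the standard normal CDF is
assumed purely for the example.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    # CDF of a normal distribution (assumed here only for illustration).
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ad_direct(sorted_xs, cdf):
    # Textbook formula evaluated on the full, globally sorted sample.
    n = len(sorted_xs)
    s = 0.0
    for i, x in enumerate(sorted_xs, start=1):
        # sorted_xs[n - i] is x_{n+1-i} in 1-based notation.
        s += (2 * i - 1) * (math.log(cdf(x)) + math.log(1.0 - cdf(sorted_xs[n - i])))
    return -n - s / n

def partition_summary(part, cdf):
    # Sums that depend only on the rank k of each value within its partition.
    a = b = c = d = 0.0
    for k, x in enumerate(part, start=1):
        log_f = math.log(cdf(x))
        log_sf = math.log(1.0 - cdf(x))
        w = 2 * k - 1
        a += w * log_f   # local-rank-weighted ln F
        b += log_f       # unweighted ln F
        c += w * log_sf  # local-rank-weighted ln(1 - F)
        d += log_sf      # unweighted ln(1 - F)
    return (len(part), a, b, c, d)

def ad_from_summaries(summaries):
    # Combine one summary per partition, in partition order.
    n = sum(m for m, *_ in summaries)
    offset = 0  # number of observations in all earlier partitions
    s = 0.0
    for m, a, b, c, d in summaries:
        # Global rank i = offset + k, so (2i - 1) = (2k - 1) + 2 * offset
        # and (2n - 2i + 1) = 2 * (n - offset) - (2k - 1).
        s += a + 2 * offset * b + 2 * (n - offset) * d - c
        offset += m
    return -n - s / n

data = sorted([-1.3, -0.7, -0.2, 0.1, 0.4, 0.9, 1.1, 1.8, 2.4])
parts = [data[0:3], data[3:4], data[4:9]]  # contiguous ranges of sorted data
direct = ad_direct(data, normal_cdf)
combined = ad_from_summaries([partition_summary(p, normal_cdf) for p in parts])
```

Here `direct` and `combined` should agree up to floating-point rounding. The
only cross-partition information needed is the size of each partition, which
is what lets the per-partition sums be computed independently.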
> 1-sample Anderson-Darling Goodness-of-Fit test
> ----------------------------------------------
>
> Key: SPARK-8884
> URL: https://issues.apache.org/jira/browse/SPARK-8884
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Jose Cambronero
> Priority: Minor
>
> We have implemented a 1-sample Anderson-Darling goodness-of-fit test to add
> to the current hypothesis testing functionality. The current implementation
> supports various distributions (normal, exponential, Gumbel, logistic, and
> Weibull). However, users must provide distribution parameters for all except
> normal/exponential (in which case they are estimated from the data). In
> contrast to other tests, such as the Kolmogorov-Smirnov test, we only support
> specific distributions, because the critical values depend on the distribution
> being tested.
> The distributed implementation of AD takes advantage of the fact that we can
> calculate a portion of the statistic within each partition of a sorted data
> set, independent of the global order of those observations. We can then carry
> some additional information that allows us to adjust the final amounts once
> we have collected 1 result per partition.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)