GitHub user josepablocam commented on a diff in the pull request:
https://github.com/apache/spark/pull/7430#discussion_r34832557
--- Diff: python/pyspark/mllib/stat/_statistics.py ---
@@ -238,6 +242,54 @@ def chiSqTest(observed, expected=None):
jmodel = callMLlibFunc("chiSqTest",
_convert_to_vector(observed), expected)
return ChiSqTestResult(jmodel)
+ @staticmethod
+ @ignore_unicode_prefix
+ def kolmogorovSmirnovTest(data, distName="norm", *params):
+ """
+ .. note:: Experimental
+
+ Performs the Kolmogorov Smirnov (KS) test for data sampled from a continuous
+ distribution. It tests the null hypothesis that the data is generated from a
+ particular distribution.
+
+ The given data is sorted, the Empirical Cumulative Distribution Function (ECDF)
+ is calculated which is the number of points having a CDF value lesser than a given point
+ divided by the total number of points. Since the data is sorted, this is a step function
+ that rises by (1 / length of data) for every ordered point.
+
+ The KS statistic gives us the maximum distance between the ECDF and the CDF. Intuitively
+ if this value is large, the probabilty that the null hypothesis is true becomes small.
+ For specific details of the implementation, please have a look at the Scala documentation.
+
+ :param data: RDD, samples from the data
+ :param distName: string, currently only "norm" is suuported. (Normal distribution)
--- End diff --
suuported -> supported