Github user sryza commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7430#discussion_r34910046
  
    --- Diff: python/pyspark/mllib/stat/_statistics.py ---
    @@ -238,6 +242,60 @@ def chiSqTest(observed, expected=None):
                 jmodel = callMLlibFunc("chiSqTest", _convert_to_vector(observed), expected)
             return ChiSqTestResult(jmodel)
     
    +    @staticmethod
    +    @ignore_unicode_prefix
    +    def kolmogorovSmirnovTest(data, distName="norm", *params):
    +        """
    +        .. note:: Experimental
    +
    +        Performs the Kolmogorov Smirnov (KS) test for data sampled from
    +        a continuous distribution. It tests the null hypothesis that
    +        the data is generated from a particular distribution.
    +
    +        The given data is sorted and the Empirical Cumulative
    +        Distribution Function (ECDF) is calculated
    +        which for a given point is the number of points having a CDF
    +        value lesser than it divided by the total number of points.
    +
    +        Since the data is sorted, this is a step function
    +        that rises by (1 / length of data) for every ordered point.
    +
    +        The KS statistic gives us the maximum distance between the
    +        ECDF and the CDF. Intuitively if this statistic is large, the
    +        probability that the null hypothesis is true becomes small.
    +        For specific details of the implementation, please have a look
    +        at the Scala documentation.
    +
    +        :param data: RDD, samples from the data
    +        :param distName: string, currently only "norm" is supported.
    +                         (Normal distribution) to calculate the
    +                         theoretical distribution of the data.
    +        :param params: additional values which need to be provided for
    +                       a certain distribution.
    +                       If not provided, the default values are used.
    +        :return: KolmogorovSmirnovTestResult object containing the test
    +                 statistic, degrees of freedom, p-value,
    +                 the method used, and the null hypothesis.
    +
    +        >>> kstest = Statistics.kolmogorovSmirnovTest
    --- End diff --
    
    Small thing: can we include an example that passes parameters in?
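
To make the request concrete, here is a rough pure-Python sketch of the statistic the docstring describes, with the distribution parameters passed explicitly. It is not the PR's implementation (which delegates to Scala); `normal_cdf` and `ks_statistic` are hypothetical helper names used only for illustration, and `mean` / `std` stand in for the values that would go through `*params`.

```python
import math


def normal_cdf(x, mean=0.0, std=1.0):
    """CDF of a normal distribution with the given mean and standard deviation."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def ks_statistic(samples, mean=0.0, std=1.0):
    """One-sample KS statistic of `samples` against N(mean, std**2).

    Hypothetical helper: sorts the data and takes the largest gap between
    the empirical CDF (a step function rising by 1/n per point) and the
    theoretical CDF.
    """
    xs = sorted(samples)
    n = float(len(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = normal_cdf(x, mean, std)
        # The ECDF jumps from i/n to (i+1)/n at x, so check both sides of the step.
        d = max(d, abs(cdf - i / n), abs((i + 1) / n - cdf))
    return d


# Default parameters (standard normal).
default_stat = ks_statistic([-1.0, 0.0, 1.0])

# Same shape of data shifted by 3, tested against N(3, 1): parameters passed in.
shifted_stat = ks_statistic([2.0, 3.0, 4.0], 3.0, 1.0)

print(round(default_stat, 3), round(shifted_stat, 3))  # both round to 0.175
```

A doctest that passes parameters could follow the same pattern: shift the data and pass the matching mean and standard deviation after `"norm"`, so the statistic stays the same as in the default case.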


