[GitHub] spark pull request: [SPARK-8996] [MLlib] [PySpark] Python API for ...

josepablocam Thu, 16 Jul 2015 13:13:49 -0700

Github user josepablocam commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7430#discussion_r34832557
  
    --- Diff: python/pyspark/mllib/stat/_statistics.py ---
    @@ -238,6 +242,54 @@ def chiSqTest(observed, expected=None):
                 jmodel = callMLlibFunc("chiSqTest", _convert_to_vector(observed), expected)
             return ChiSqTestResult(jmodel)
     
    +    @staticmethod
    +    @ignore_unicode_prefix
    +    def kolmogorovSmirnovTest(data, distName="norm", *params):
    +        """
    +        .. note:: Experimental
    +
    +        Performs the Kolmogorov Smirnov (KS) test for data sampled from a continuous
    +        distribution. It tests the null hypothesis that the data is generated from a
    +        particular distribution.
    +
    +        The given data is sorted, the Empirical Cumulative Distribution Function (ECDF)
    +        is calculated which is the number of points having a CDF value lesser than a given point
    +        divided by the total number of points. Since the data is sorted, this is a step function
    +        that rises by (1 / length of data) for every ordered point.
    +
    +        The KS statistic gives us the maximum distance between the ECDF and the CDF. Intuitively
    +        if this value is large, the probabilty that the null hypothesis is true becomes small.
    +        For specific details of the implementation, please have a look at the Scala documentation.
    +
    +        :param data: RDD, samples from the data
    +        :param distName: string, currently only "norm" is suuported. (Normal distribution)
    --- End diff --
    
    suuported -> supported
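
    For readers following the docstring above, here is a minimal local sketch of the
    statistic it describes: the largest gap between the step-function ECDF and a
    reference normal CDF. It is plain Python rather than the code under review, which
    delegates to the Scala implementation through callMLlibFunc. The names normal_cdf
    and ks_statistic are hypothetical and used only for illustration.

```python
import math


def normal_cdf(x, mean=0.0, std=1.0):
    """CDF of a normal distribution, computed with the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def ks_statistic(samples, cdf=normal_cdf):
    """Maximum distance between the ECDF of `samples` and `cdf`.

    Illustrative local version only; it does not use Spark or RDDs.
    """
    xs = sorted(samples)
    n = len(xs)
    if n == 0:
        raise ValueError("samples must not be empty")
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The ECDF steps from i/n up to (i+1)/n at the i-th ordered point,
        # so the gap is checked on both sides of the step.
        d = max(d, f - i / float(n), (i + 1) / float(n) - f)
    return d
```

    A large value means the sample sits far from the reference distribution, which
    is the intuition the docstring gives for rejecting the null hypothesis.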


