[GitHub] spark pull request: [SPARK-3694] [MLlib] [PySpark] add Hypothesis ...

davies Tue, 04 Nov 2014 09:35:34 -0800

GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/3091


    [SPARK-3694] [MLlib] [PySpark] add Hypothesis test Python API

    ```
    pyspark.mllib.stat.Statistics.chiSqTest(observed, expected=None)
        :: Experimental ::
    
        If `observed` is a Vector, conduct Pearson's chi-squared goodness
        of fit test of the observed data against the expected distribution,
        or against the uniform distribution (by default), with each category
        having an expected frequency of `1 / len(observed)`.
        (Note: `observed` cannot contain negative values)
    
        If `observed` is a matrix, conduct Pearson's independence test on the
        input contingency matrix, which cannot contain negative entries or
        columns or rows that sum up to 0.
    
        If `observed` is an RDD of LabeledPoint, conduct Pearson's independence
        test for every feature against the label across the input RDD.
        For each feature, the (feature, label) pairs are converted into a
        contingency matrix for which the chi-squared statistic is computed.
        All label and feature values must be categorical.
    
        :param observed: it could be a vector containing the observed
                         categorical counts/relative frequencies, or the
                         contingency matrix (containing either counts or
                         relative frequencies), or an RDD of LabeledPoint
                         containing the labeled dataset with categorical
                         features. Real-valued features will be treated as
                         categorical for each distinct value.
        :param expected: Vector containing the expected categorical
                         counts/relative frequencies. `expected` is rescaled
                         if the `expected` sum differs from the `observed` sum.
        :return: ChiSquaredTest object containing the test statistic, degrees
                 of freedom, p-value, the method used, and the null hypothesis.
    ```
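
The statistics described in the docstring above can be sketched in plain Python. This is only an illustration of the arithmetic under stated assumptions, not the MLlib implementation: the function names `chi_sq_goodness_of_fit`, `chi_sq_independence`, and `contingency_matrix` are hypothetical, inputs are plain lists rather than Vector, Matrix, or RDD objects, and the p-value is left out because it needs a chi-squared CDF.

```python
from collections import Counter


def chi_sq_goodness_of_fit(observed, expected=None):
    """Return (statistic, degrees_of_freedom) for a goodness-of-fit test."""
    if any(o < 0 for o in observed):
        raise ValueError("observed cannot contain negative values")
    n = len(observed)
    total = float(sum(observed))
    if expected is None:
        # Uniform default: every category gets a 1 / len(observed) share.
        expected = [total / n] * n
    else:
        if len(expected) != n:
            raise ValueError("observed and expected must have the same size")
        # Rescale expected so that its sum matches the observed sum.
        scale = total / sum(expected)
        expected = [e * scale for e in expected]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, n - 1


def chi_sq_independence(matrix):
    """Return (statistic, degrees_of_freedom) for a contingency matrix.

    `matrix` is a list of rows; this layout is an assumption of the sketch.
    """
    if any(v < 0 for row in matrix for v in row):
        raise ValueError("matrix cannot contain negative entries")
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    if 0 in row_sums or 0 in col_sums:
        raise ValueError("rows and columns must not sum to 0")
    total = float(sum(row_sums))
    statistic = 0.0
    for i, row in enumerate(matrix):
        for j, o in enumerate(row):
            # Expected count under independence of rows and columns.
            e = row_sums[i] * col_sums[j] / total
            statistic += (o - e) ** 2 / e
    return statistic, (len(row_sums) - 1) * (len(col_sums) - 1)


def contingency_matrix(pairs):
    """Count (feature, label) pairs; each distinct value is one category."""
    counts = Counter(pairs)
    features = sorted({f for f, _ in pairs})
    labels = sorted({l for _, l in pairs})
    return [[counts[(f, l)] for l in labels] for f in features]
```

For the LabeledPoint case, a sketch along these lines would call `contingency_matrix` once per feature on that feature's (feature, label) pairs and pass each result to `chi_sq_independence`.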

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark his

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3091.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3091
    
----
commit 5097d54d228032a1b49d51df1e5381e418055f5b
Author: Davies Liu <[email protected]>
Date:   2014-11-04T17:32:41Z

    add Hypothesis test Python API

----


