GitHub user josepablocam opened a pull request:

    https://github.com/apache/spark/pull/7075

    [SPARK-8674] [MLlib] Implementation of a 2 sample Kolmogorov Smirnov Test

    This patch implements a 2-sample, 2-sided Kolmogorov-Smirnov test.
As in the 1-sample implementation, we seek to reduce the number of shuffles
needed for the computation. The user provides two RDD[Double] samples, and the
Statistics.ksTest function tests the null hypothesis that both
samples came from the same probability distribution.
    
    This patch also includes the 1-sample test so that reviewers can see the
broader context of the change. That portion (and its tests) is
being reviewed at https://issues.apache.org/jira/browse/SPARK-8598.
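
    For readers unfamiliar with the statistic, here is a minimal local sketch
of the 2-sample, 2-sided Kolmogorov-Smirnov statistic. It is an illustration
only and not the patch's code: it works on plain Python lists rather than
RDD[Double], it makes no attempt to reproduce the shuffle-reducing distributed
approach, and it does not compute a p-value. The function name
two_sample_ks_statistic is hypothetical.

```python
from bisect import bisect_right


def two_sample_ks_statistic(xs, ys):
    """Return D = max over v of |F_x(v) - F_y(v)| for two non-empty samples.

    F_x and F_y are the empirical CDFs of xs and ys. This is a local,
    single-machine sketch; the name is hypothetical, not part of any API.
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    sx, sy = sorted(xs), sorted(ys)
    n, m = len(sx), len(sy)
    d = 0.0
    # Both ECDFs are step functions that only jump at observed values,
    # so it is enough to compare them at every observed value.
    for v in sx + sy:
        fx = bisect_right(sx, v) / n  # fraction of xs that are <= v
        fy = bisect_right(sy, v) / m  # fraction of ys that are <= v
        d = max(d, abs(fx - fy))
    return d
```

    Identical samples give a statistic of 0.0, and fully non-overlapping
samples such as [1, 2, 3] and [4, 5, 6] give 1.0. Small values are consistent
with the null hypothesis that both samples came from the same distribution.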

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/josepablocam/spark twoSampleKSTest

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/7075.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #7075
    
----
commit 13dfe4dfac5bc520ce7df4dafca3f6bacb1977e3
Author: jose.cambronero <[email protected]>
Date:   2015-06-24T17:28:02Z

    created test result class for ks test

commit c659ea16a06ed5e1d84b80c83582bac56bf283ce
Author: jose.cambronero <[email protected]>
Date:   2015-06-24T18:00:14Z

    created KS test class

commit 4da189b77d18b8440d77753d6e30d4100d1f53ff
Author: jose.cambronero <[email protected]>
Date:   2015-06-24T18:00:32Z

    added user facing ks test functions

commit ce8e9a1ae1f37aa8f4cfcf7c6352006a93aa8ee4
Author: jose.cambronero <[email protected]>
Date:   2015-06-24T18:45:28Z

    added kstest testing in HypothesisTestSuite

commit b9cff3ad2322a6aeb67eefa5b8e67f6fedc6d715
Author: jose.cambronero <[email protected]>
Date:   2015-06-24T19:44:16Z

    made small changes to pass style check

commit f6951b60457dc749da9214833841690491f434df
Author: jose.cambronero <[email protected]>
Date:   2015-06-24T20:24:43Z

    changed style and some comments based on feedback from pull request

commit c18dc661fc915e648500c4cf04805f787a8a9c40
Author: jose.cambronero <[email protected]>
Date:   2015-06-24T20:29:49Z

    removed ksTestOpt from API and changed comments in HypothesisTestSuite 
accordingly

commit 16b5c4cfc4cc81164e3a5e2e8adaca017b7e4514
Author: jose.cambronero <[email protected]>
Date:   2015-06-25T17:13:13Z

    renamed dat to data and eliminated recalc of RDD size by sharing as 
argument between empirical and evalOneSampleP

commit 0b5e8ecbbf30ae54e55f4ba0e3c90542a4b4222e
Author: jose.cambronero <[email protected]>
Date:   2015-06-25T22:43:59Z

    changed KS one sample test to perform just 1 distributed pass (in addition 
to the sorting pass), operates on each partition separately. Implementation of 
Sandy Ryza's algorithm

commit 4b8ba611de8dd5abe12e4067a234aab026fdc05d
Author: jose.cambronero <[email protected]>
Date:   2015-06-25T23:33:04Z

    fixed off by 1/N in cases when post-constant adjustment ecdf is above cdf, 
but prior to adj it was below

commit 356c9864612099d5024f257265e958fbfcbbd6ae
Author: jose.cambronero <[email protected]>
Date:   2015-06-26T01:03:28Z

    first working version of 1 pass 2 sample ks test

commit 6a4784f41993a658b45ccb54113d480b067dc3ea
Author: jose.cambronero <[email protected]>
Date:   2015-06-26T01:07:30Z

    specified what distributions are available for the convenience method 
ksTest(data, name) (solely standard normal)

commit 992293b8a8fb0dcfe1599171d33dd8038b1c4112
Author: jose.cambronero <[email protected]>
Date:   2015-06-26T01:24:12Z

    Style changes as per comments and added implementation note explaining the 
distributed approach.

commit c85f74d7f079372548c65e1f4827a0286179d472
Author: jose.cambronero <[email protected]>
Date:   2015-06-26T18:35:39Z

    changed searchTwoSampleCandidates to use a case class as an accumulator, 
over a tuple, makes for easier reading

commit 9f976217c405218a8d88d9188156cc20d4ee43a8
Author: jose.cambronero <[email protected]>
Date:   2015-06-26T18:50:10Z

    cleaned up ks2 and small change in type declaration to make sure output 
matches apache math commons result in precision

commit 5a6bc91f9fb7055d15406742a9f7cba010de2ac2
Author: jose.cambronero <[email protected]>
Date:   2015-06-26T19:13:51Z

    added API function for 2 sample ks test

commit 52ca38ab013e7dd87fe49df1fa81367caed35910
Author: jose.cambronero <[email protected]>
Date:   2015-06-26T19:28:44Z

    added tests for ks 2 sample

commit 4af0f8159e0003799694e75e408e11db61cfd4b9
Author: jose.cambronero <[email protected]>
Date:   2015-06-26T19:32:15Z

    Merge branch 'master' into onePassTwoSample
    
    to account for changes in ks 1 sample

commit 3f81ad25505b38a3d929314842078ac9730a4142
Author: jose.cambronero <[email protected]>
Date:   2015-06-26T19:34:23Z

    renamed ks1 sample test for clarity

commit c5c8e0082593a8d1e3eeeab390e003c7c8b6d4b0
Author: jose.cambronero <[email protected]>
Date:   2015-06-26T19:34:46Z

    Merge branch 'master' into onePassTwoSample

commit 9c0f1af882c930cafe55fe828c0c2d0fbe2d23f1
Author: jose.cambronero <[email protected]>
Date:   2015-06-26T21:55:18Z

    additional style changes incorporated and added documentation to mllib 
statistics docs

commit ea5dcf5c54ccfabc4a534401bf635da57f9de953
Author: jose.cambronero <[email protected]>
Date:   2015-06-26T23:16:31Z

    added 2 sample ks scaladocs comments

commit ff1c5269940aaf6c64cecb6658279777902b4a7b
Author: jose.cambronero <[email protected]>
Date:   2015-06-26T23:42:32Z

    minor style change and added test for ks2test when there are 
non-overlapping samples, i.e. when some partitions have only data from 1 of the 
2 samples

commit 2d83da71953cf9fcef53eb6d91b5f9e0a305b7ff
Author: Jose Cambronero <[email protected]>
Date:   2015-06-27T20:45:30Z

    added small explanation of approach for 2-sample, relative to 1-sample test

commit 168134d66bd417084fe28f5084eef555f7991a56
Author: Jose Cambronero <[email protected]>
Date:   2015-06-27T20:48:53Z

    small style changes

commit 3cb6bc8cc5191d735d165adaff1ef0f0fec62ee2
Author: Jose Cambronero <[email protected]>
Date:   2015-06-27T20:50:08Z

    Merge remote-tracking branch 'origin/onePassTwoSample' into onePassTwoSample

commit 3795d4acd177acbac7ca37b5ee41a944fc5093d8
Author: Jose Cambronero <[email protected]>
Date:   2015-06-28T20:27:55Z

    Merge branch 'master' into twoSampleKSTest
    so that new patch includes all parts associated with 1 sample ks test
    so reviewers can see in context

----

