GitHub user josepablocam opened a pull request:
https://github.com/apache/spark/pull/7075
[SPARK-8674] [MLlib] Implementation of a 2 sample Kolmogorov Smirnov Test
The current patch implements a 2-sample, 2-sided Kolmogorov Smirnov test.
As in the 1-sample implementation, we seek to reduce the shuffles
necessary for computation. The user provides two RDD[Double] samples, and the
Statistics.ksTest function tests the null hypothesis that both
samples came from the same probability distribution.
This patch includes the 1-sample test so that reviewers can see the
broader context of the change. However, that portion (and its relevant tests) is
being reviewed at https://issues.apache.org/jira/browse/SPARK-8598.
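For readers unfamiliar with the statistic itself, here is a minimal local
sketch of the 2-sample, 2-sided KS statistic described above. It is not the
distributed implementation in this patch: it works on plain Python lists
rather than RDD[Double], it computes only the statistic D (no p-value), and
the function name ks_two_sample_statistic is hypothetical, chosen purely for
illustration.

```python
from bisect import bisect_right


def ks_two_sample_statistic(xs, ys):
    """Return D = max |F1(v) - F2(v)|, where F1 and F2 are the empirical CDFs.

    Hypothetical helper for illustration only; a single-machine sketch,
    not the partition-wise approach used in the patch.
    """
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(xs), sorted(ys)
    n, m = len(a), len(b)
    d = 0.0
    # Both ECDFs are right-continuous step functions, so the largest gap
    # between them is attained at one of the observed values.
    for v in a + b:
        f1 = bisect_right(a, v) / n  # fraction of sample 1 that is <= v
        f2 = bisect_right(b, v) / m  # fraction of sample 2 that is <= v
        d = max(d, abs(f1 - f2))
    return d
```

Two samples with no overlap give the largest possible statistic, while
identical samples give zero. A larger D is stronger evidence against the
null hypothesis that both samples came from the same distribution.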
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/josepablocam/spark twoSampleKSTest
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/7075.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #7075
----
commit 13dfe4dfac5bc520ce7df4dafca3f6bacb1977e3
Author: jose.cambronero <[email protected]>
Date: 2015-06-24T17:28:02Z
created test result class for ks test
commit c659ea16a06ed5e1d84b80c83582bac56bf283ce
Author: jose.cambronero <[email protected]>
Date: 2015-06-24T18:00:14Z
created KS test class
commit 4da189b77d18b8440d77753d6e30d4100d1f53ff
Author: jose.cambronero <[email protected]>
Date: 2015-06-24T18:00:32Z
added user facing ks test functions
commit ce8e9a1ae1f37aa8f4cfcf7c6352006a93aa8ee4
Author: jose.cambronero <[email protected]>
Date: 2015-06-24T18:45:28Z
added kstest testing in HypothesisTestSuite
commit b9cff3ad2322a6aeb67eefa5b8e67f6fedc6d715
Author: jose.cambronero <[email protected]>
Date: 2015-06-24T19:44:16Z
made small changes to pass style check
commit f6951b60457dc749da9214833841690491f434df
Author: jose.cambronero <[email protected]>
Date: 2015-06-24T20:24:43Z
changed style and some comments based on feedback from pull request
commit c18dc661fc915e648500c4cf04805f787a8a9c40
Author: jose.cambronero <[email protected]>
Date: 2015-06-24T20:29:49Z
removed ksTestOpt from API and changed comments in HypothesisTestSuite
accordingly
commit 16b5c4cfc4cc81164e3a5e2e8adaca017b7e4514
Author: jose.cambronero <[email protected]>
Date: 2015-06-25T17:13:13Z
renamed dat to data and eliminated recalc of RDD size by sharing as
argument between empirical and evalOneSampleP
commit 0b5e8ecbbf30ae54e55f4ba0e3c90542a4b4222e
Author: jose.cambronero <[email protected]>
Date: 2015-06-25T22:43:59Z
changed KS one sample test to perform just 1 distributed pass (in addition
to the sorting pass), operates on each partition separately. Implementation of
Sandy Ryza's algorithm
commit 4b8ba611de8dd5abe12e4067a234aab026fdc05d
Author: jose.cambronero <[email protected]>
Date: 2015-06-25T23:33:04Z
fixed off by 1/N in cases when post-constant adjustment ecdf is above cdf,
but prior to adj it was below
commit 356c9864612099d5024f257265e958fbfcbbd6ae
Author: jose.cambronero <[email protected]>
Date: 2015-06-26T01:03:28Z
first working version of 1 pass 2 sample ks test
commit 6a4784f41993a658b45ccb54113d480b067dc3ea
Author: jose.cambronero <[email protected]>
Date: 2015-06-26T01:07:30Z
specified what distributions are available for the convenience method
ksTest(data, name) (solely standard normal)
commit 992293b8a8fb0dcfe1599171d33dd8038b1c4112
Author: jose.cambronero <[email protected]>
Date: 2015-06-26T01:24:12Z
Style changes as per comments and added implementation note explaining the
distributed approach.
commit c85f74d7f079372548c65e1f4827a0286179d472
Author: jose.cambronero <[email protected]>
Date: 2015-06-26T18:35:39Z
changed searchTwoSampleCandidates to use a case class as an accumulator,
over a tuple, makes for easier reading
commit 9f976217c405218a8d88d9188156cc20d4ee43a8
Author: jose.cambronero <[email protected]>
Date: 2015-06-26T18:50:10Z
cleaned up ks2 and small change in type declaration to make sure output
matches apache math commons result in precision
commit 5a6bc91f9fb7055d15406742a9f7cba010de2ac2
Author: jose.cambronero <[email protected]>
Date: 2015-06-26T19:13:51Z
added API function for 2 sample ks test
commit 52ca38ab013e7dd87fe49df1fa81367caed35910
Author: jose.cambronero <[email protected]>
Date: 2015-06-26T19:28:44Z
added tests for ks 2 sample
commit 4af0f8159e0003799694e75e408e11db61cfd4b9
Author: jose.cambronero <[email protected]>
Date: 2015-06-26T19:32:15Z
Merge branch 'master' into onePassTwoSample
to account for changes in ks 1 sample
commit 3f81ad25505b38a3d929314842078ac9730a4142
Author: jose.cambronero <[email protected]>
Date: 2015-06-26T19:34:23Z
renamed ks1 sample test for clarity
commit c5c8e0082593a8d1e3eeeab390e003c7c8b6d4b0
Author: jose.cambronero <[email protected]>
Date: 2015-06-26T19:34:46Z
Merge branch 'master' into onePassTwoSample
commit 9c0f1af882c930cafe55fe828c0c2d0fbe2d23f1
Author: jose.cambronero <[email protected]>
Date: 2015-06-26T21:55:18Z
additional style changes incorporated and added documentation to mllib
statistics docs
commit ea5dcf5c54ccfabc4a534401bf635da57f9de953
Author: jose.cambronero <[email protected]>
Date: 2015-06-26T23:16:31Z
added 2 sample ks scaladocs comments
commit ff1c5269940aaf6c64cecb6658279777902b4a7b
Author: jose.cambronero <[email protected]>
Date: 2015-06-26T23:42:32Z
minor style change and added test for ks2test when there are
non-overlapping samples, i.e. when some partitions have only data from 1 of the
2 samples
commit 2d83da71953cf9fcef53eb6d91b5f9e0a305b7ff
Author: Jose Cambronero <[email protected]>
Date: 2015-06-27T20:45:30Z
added small explanation of approach for 2-sample, relative to 1-sample test
commit 168134d66bd417084fe28f5084eef555f7991a56
Author: Jose Cambronero <[email protected]>
Date: 2015-06-27T20:48:53Z
small style changes
commit 3cb6bc8cc5191d735d165adaff1ef0f0fec62ee2
Author: Jose Cambronero <[email protected]>
Date: 2015-06-27T20:50:08Z
Merge remote-tracking branch 'origin/onePassTwoSample' into onePassTwoSample
commit 3795d4acd177acbac7ca37b5ee41a944fc5093d8
Author: Jose Cambronero <[email protected]>
Date: 2015-06-28T20:27:55Z
Merge branch 'master' into twoSampleKSTest
so that new patch includes all parts associated with 1 sample ks test
so reviewers can see in context
----