GitHub user avulanov opened a pull request:
https://github.com/apache/spark/pull/1484
[MLLIB] [WIP] SPARK-1473: Feature selection for high dimensional datasets
The following is implemented:
1) generic traits for feature selection and filtering
2) trait for feature selection of LabeledPoint with discrete data
3) traits for calculation of contingency table and chi squared
4) class for chi-squared feature selection
5) tests for the above
The matrix operations still need some optimization.
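For readers unfamiliar with items 3 and 4, the following is a minimal plain-Scala sketch of how a chi-squared score can be derived from a contingency table of (feature value, label) counts. It is not the code in this pull request: the names `ChiSquaredSketch`, `contingencyTable` and `chiSquared` are hypothetical, and it works on a local `Seq` rather than an RDD of LabeledPoint, purely for illustration.

```scala
// Illustrative sketch only: NOT the implementation from this pull request.
// All names (ChiSquaredSketch, contingencyTable, chiSquared) are hypothetical.
object ChiSquaredSketch {

  /** Counts co-occurrences of (feature value, label) pairs.
    * Rows index the distinct feature values, columns the distinct labels. */
  def contingencyTable(pairs: Seq[(Double, Double)]): Array[Array[Long]] = {
    val featureValues = pairs.map(_._1).distinct.sorted
    val labels = pairs.map(_._2).distinct.sorted
    val table = Array.ofDim[Long](featureValues.size, labels.size)
    for ((feature, label) <- pairs) {
      table(featureValues.indexOf(feature))(labels.indexOf(label)) += 1
    }
    table
  }

  /** Pearson's chi-squared statistic:
    * the sum over all cells of (observed - expected)^2 / expected,
    * where expected = rowSum * colSum / total. */
  def chiSquared(table: Array[Array[Long]]): Double = {
    val rowSums = table.map(_.sum.toDouble)
    val colSums = table.transpose.map(_.sum.toDouble)
    val total = rowSums.sum
    val terms = for {
      i <- table.indices
      j <- table(i).indices
      expected = rowSums(i) * colSums(j) / total
      if expected > 0 // skip cells whose row or column is empty
    } yield {
      val diff = table(i)(j) - expected
      diff * diff / expected
    }
    terms.sum
  }
}
```

A feature whose values are independent of the label scores close to zero, while a feature that separates the labels well scores high. A selector in this style would compute the score for each feature and keep the highest-scoring ones.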
This request is an attempt to implement feature selection for MLlib; the
previous work by the issue author, @izendejas, was not finished
(https://issues.apache.org/jira/browse/SPARK-1473). This request is also
related to the data discretization issues
https://issues.apache.org/jira/browse/SPARK-1303 and
https://issues.apache.org/jira/browse/SPARK-1216, neither of which was merged.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/avulanov/spark featureselection
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1484.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1484
----
commit 560dc08d7e2cbc191016a3ebbec1eb8146630bc7
Author: Alexander Ulanov <[email protected]>
Date: 2014-07-08T08:25:57Z
Chi Squared feature selection: initial version
commit 6a35bcf64ff9c71445dc48f8299f8f78a5e324d5
Author: Alexander Ulanov <[email protected]>
Date: 2014-07-08T09:43:27Z
Code style
commit dfb09fbf2732682d0b86afcbe02eb097e7d9c09e
Author: Alexander Ulanov <[email protected]>
Date: 2014-07-09T10:06:54Z
Feature selection filter
commit fa5fd1119c6cc0e2c48a74baed89d32c5a1b5a58
Author: Alexander Ulanov <[email protected]>
Date: 2014-07-09T15:55:07Z
Traits for FeatureSelection, CombinationsCalculator and FeatureFilter
commit 9a8f968ef07ee9a3cfc372d6e4d335d45ef5c065
Author: Alexander Ulanov <[email protected]>
Date: 2014-07-11T09:14:29Z
Feature selection redesign with vigdorchik
commit 099fb135e159407ae9acf0a1dcbaf23fbc5e781a
Author: Alexander Ulanov <[email protected]>
Date: 2014-07-11T16:04:36Z
Feature selector, fix of lazyness
commit 774b5ca9d4155315b388aae12e58d32b90c479fe
Author: Alexander Ulanov <[email protected]>
Date: 2014-07-14T16:52:28Z
Combinations and chi-squared values test
commit 43a1169687db70ea52753e7e86eccb55ed0bf43e
Author: Alexander Ulanov <[email protected]>
Date: 2014-07-17T15:20:31Z
Chi Squared by contingency table. Refactoring
commit 6890617e47f03278d08ea17929adf74dfa668230
Author: Alexander Ulanov <[email protected]>
Date: 2014-07-18T09:41:05Z
Scala style fix
commit 2565f6d9a24c892a6ed28ec1174b5a7077fd8c77
Author: Alexander Ulanov <[email protected]>
Date: 2014-07-18T11:10:48Z
Tests, comments, apache headers and scala style
----