GitHub user sramirez opened a pull request:
https://github.com/apache/spark/pull/5184
An Information Theoretic Feature Selection Framework
**Information Theoretic Feature Selection Framework**
The present framework implements Feature Selection (FS) on Spark for its
application on Big Data problems. This package contains a generic
implementation of greedy Information Theoretic Feature Selection methods. The
implementation is based on the common theoretic framework presented in [1].
Implementations of mRMR, InfoGain, JMI and other commonly used FS filters are
provided. In addition, the framework can be extended with other criteria
provided by the user as long as the process complies with the framework
proposed in [1].
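To make the shared framework concrete, the sketch below (plain Scala, not the PR's actual Spark API; all names are illustrative) implements the generic greedy criterion from [1], where a candidate feature X_k is scored as J(X_k) = I(X_k;C) - beta * sum_j I(X_j;X_k) + gamma * sum_j I(X_j;X_k|C) over the already-selected features X_j, and individual filters such as mRMR or JMI differ only in the beta/gamma weights:

```scala
// Hedged sketch of the unifying criterion of Brown et al. (2012);
// NOT the PR's implementation. Names and signatures are illustrative.
object InfoTheoreticFS {
  // Empirical mutual information I(X;Y) (in nats) for discrete sequences.
  def mi(x: Seq[Int], y: Seq[Int]): Double = {
    val n = x.length.toDouble
    val pxy = x.zip(y).groupBy(identity).map { case (k, v) => k -> v.size / n }
    val px = x.groupBy(identity).map { case (k, v) => k -> v.size / n }
    val py = y.groupBy(identity).map { case (k, v) => k -> v.size / n }
    pxy.map { case ((a, b), p) => p * math.log(p / (px(a) * py(b))) }.sum
  }

  // Conditional MI via the chain rule: I(X;Y|Z) = I(X;(Y,Z)) - I(X;Z).
  // The (Y,Z) pair is encoded as a single Int (assumes small alphabets).
  def cmi(x: Seq[Int], y: Seq[Int], z: Seq[Int]): Double =
    mi(x, y.zip(z).map { case (a, b) => a * 1000 + b }) - mi(x, z)

  // Greedy forward selection of k features under the generic criterion.
  // beta/gamma receive |S|, the number of already-selected features.
  def select(features: IndexedSeq[Seq[Int]], label: Seq[Int], k: Int,
             beta: Int => Double, gamma: Int => Double): List[Int] = {
    var selected = List.empty[Int]
    var remaining = features.indices.toSet
    while (selected.size < k && remaining.nonEmpty) {
      val best = remaining.maxBy { f =>
        val relevance = mi(features(f), label)
        val redundancy = selected.map(s => mi(features(s), features(f))).sum
        val conditional = selected.map(s => cmi(features(s), features(f), label)).sum
        relevance - beta(selected.size) * redundancy +
          gamma(selected.size) * conditional
      }
      selected = selected :+ best
      remaining -= best
    }
    selected
  }

  // mRMR instance: beta = 1/|S|, gamma = 0 (guard the empty-set case).
  def mrmr(features: IndexedSeq[Seq[Int]], label: Seq[Int], k: Int): List[Int] =
    select(features, label, k,
      s => if (s == 0) 0.0 else 1.0 / s, _ => 0.0)
}
```

A user-supplied criterion plugs in simply by providing its own beta/gamma weights; for example JMI corresponds to beta = gamma = 1/|S| in this scheme.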
-- Main features:
* Support for sparse data (in progress).
* Pool optimization for high-dimensional datasets.
* Improved performance over the previous version.
This work has two associated contributions submitted to international
journals, which will be attached to this request as soon as they are accepted.
This software has been tested on two large real-world datasets:
- A dataset from the Protein Structure Prediction field, used in the
GECCO-2014 competition (Vancouver, July 13th, 2014;
http://cruncher.ncl.ac.uk/bdcomp/). It has 32 million instances, 631
attributes, 2 classes, 98% negative examples, and occupies about 56GB of disk
space when uncompressed.
- The epsilon dataset
(http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#epsilon):
400K instances and 2K attributes.
-- Brief benchmark results:
* 150 seconds per selected feature on a 65M dataset with 631 attributes.
* For the epsilon dataset, we outperformed the no-selection results for
three classifiers from Spark using only 2.5% of the original features.
-- References:
[1] Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012).
"Conditional likelihood maximisation: a unifying framework for information
theoretic feature selection."
The Journal of Machine Learning Research, 13(1), 27-66.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sramirez/spark fselection
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/5184.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #5184
----
commit 4433c3632a90089468fa197d7249f8a41a3b4b9a
Author: Sergio Ramirez <[email protected]>
Date: 2015-03-20T14:43:29Z
Version compatible with Spark 1.2.0
commit ce6b7f815557d0aa8b261f845f87aaee754a18a6
Author: Sergio Ramirez <[email protected]>
Date: 2015-03-20T14:43:55Z
Version compatible with Spark 1.2.0, new changes
commit 1f77e35de740f6c57659e19d9314d76e2c56fd56
Author: Sergio Ramirez <[email protected]>
Date: 2015-03-23T11:49:27Z
Fixed rare problem with import RDD
commit 96e68146aa7b32082019877f81a0dba3cdee2ad2
Author: Sergio Ramirez <[email protected]>
Date: 2015-03-23T17:03:19Z
Changed default values for both algorithms. Fixed a bug in discretization
transform.
commit d86967d9ba6f76c022eb452761a8b9e949d18161
Author: Sergio Ramirez <[email protected]>
Date: 2015-03-25T10:15:00Z
Removed discretization part from PR. ChiSqSelected returned to its original
version.
----