[
https://issues.apache.org/jira/browse/SPARK-6531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved SPARK-6531.
------------------------------
Resolution: Won't Fix
> An Information Theoretic Feature Selection Framework
> ----------------------------------------------------
>
> Key: SPARK-6531
> URL: https://issues.apache.org/jira/browse/SPARK-6531
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Affects Versions: 1.3.0
> Reporter: Sergio Ramírez
>
> **Information Theoretic Feature Selection Framework**
> The present framework implements Feature Selection (FS) on Spark for Big
> Data problems. This package contains a generic implementation of greedy
> Information Theoretic Feature Selection methods, based on the common
> theoretical framework presented in [1]. Implementations of mRMR, InfoGain,
> JMI and other commonly used FS filters are provided. In addition, users can
> extend the framework with their own criteria, as long as these comply with
> the framework proposed in [1].
> -- Main features:
> * Support for sparse data (in progress).
> * Pool optimization for high-dimensional datasets.
> * Improved performance over the previous version.
> Two related contributions have been submitted to international journals
> and will be attached to this request as soon as they are accepted.
> This software has been tested on two large real-world datasets:
> - A dataset selected for the GECCO-2014 competition (Vancouver, July 13th,
> 2014), which comes from the Protein Structure Prediction field
> (http://cruncher.ncl.ac.uk/bdcomp/). The dataset has 32 million instances,
> 631 attributes, 2 classes, 98% negative examples and occupies about 56GB
> of disk space when uncompressed.
> - Epsilon dataset:
> http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#epsilon.
> 400K instances and 2K attributes.
> -- Brief benchmark results:
> * 150 seconds per selected feature for a 65M dataset with 631 attributes.
> * For the epsilon dataset, three classifiers (from MLlib) using only 2.5% of
> the original features outperformed the same classifiers without FS.
> Design doc:
> https://docs.google.com/document/d/1HOaPL_HJzTbL2tVdzbTjhr5wxVvPe9e-23S7rc2VcsY/edit?usp=sharing
> References
> [1] Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012).
> "Conditional likelihood maximisation: a unifying framework for information
> theoretic feature selection."
> The Journal of Machine Learning Research, 13(1), 27-66.
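For readers unfamiliar with the greedy filters named in the issue, here is a
minimal single-machine sketch of mRMR-style selection on discrete features.
It is not the code submitted with this issue (which targets Spark), and the
names mutual_information and greedy_mrmr, as well as the toy data, are
hypothetical. It assumes the usual mRMR criterion: relevance I(X;Y) minus the
mean redundancy I(X;X_s) over the already selected features X_s.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y), in bits, of two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def greedy_mrmr(columns, labels, k):
    """Greedily pick up to k column indices using the mRMR criterion.

    Hypothetical helper: score(j) = I(X_j;Y) - mean over selected s of I(X_j;X_s).
    Ties are broken in favour of the lowest column index.
    """
    relevance = [mutual_information(col, labels) for col in columns]
    selected = []
    remaining = set(range(len(columns)))

    def score(j):
        if not selected:
            return relevance[j]
        redundancy = sum(
            mutual_information(columns[j], columns[s]) for s in selected
        ) / len(selected)
        return relevance[j] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=lambda j: (score(j), -j))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (made up for illustration): a strong feature, an exact duplicate
# of it, a weaker but less redundant feature, and a feature unrelated to y.
y      = [0, 0, 0, 0, 1, 1, 1, 1]
strong = [0, 0, 0, 1, 1, 1, 1, 1]
dup    = list(strong)
weak   = [0, 1, 0, 0, 1, 0, 1, 1]
noise  = [0, 1, 0, 1, 0, 1, 0, 1]

if __name__ == "__main__":
    print(greedy_mrmr([strong, dup, weak, noise], y, k=2))
```

The duplicate column is as relevant as the strong one, but once the strong
column is selected its redundancy term penalises the duplicate, which is the
behaviour that distinguishes mRMR from ranking by InfoGain alone.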
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)