GitHub user sramirez opened a pull request:

    https://github.com/apache/spark/pull/5184

    An Information Theoretic Feature Selection Framework

    **Information Theoretic Feature Selection Framework**
    
    The present framework implements Feature Selection (FS) on Spark for its 
application on Big Data problems. This package contains a generic 
implementation of greedy Information Theoretic Feature Selection methods. The 
implementation is based on the common theoretic framework presented in [1]. 
Implementations of mRMR, InfoGain, JMI and other commonly used FS filters are 
provided. In addition, the framework can be extended with other criteria 
provided by the user as long as the process complies with the framework 
proposed in [1].
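    
    As a purely illustrative sketch (not the PR's Scala/Spark code), the unified 
framework of [1] scores each candidate feature X against the target Y and the 
already-selected set S as J(X) = I(X;Y) - beta * sum_j I(X;Xj) + gamma * sum_j 
I(X;Xj|Y), where choosing beta and gamma recovers the listed filters 
(MIM/InfoGain: beta = gamma = 0; mRMR: beta = 1/|S|, gamma = 0; JMI: 
beta = gamma = 1/|S|). All function names below are hypothetical and for 
illustration only:

```python
from collections import Counter
from math import log

def mutual_info(xs, ys):
    """I(X;Y) in nats for two discrete sequences of equal length."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def cond_mutual_info(xs, ys, zs):
    """I(X;Y|Z) = sum over z of p(z) * I(X;Y | Z=z)."""
    n = len(zs)
    total = 0.0
    for z, cnt in Counter(zs).items():
        idx = [i for i in range(n) if zs[i] == z]
        total += (cnt / n) * mutual_info([xs[i] for i in idx],
                                         [ys[i] for i in idx])
    return total

def greedy_select(features, y, k, beta, gamma):
    """Greedy forward selection under the unified criterion of [1]:
    J(X) = I(X;Y) - beta(|S|) * sum_j I(X;Xj) + gamma(|S|) * sum_j I(X;Xj|Y).
    `features` is a list of discrete columns; beta/gamma take |S| so that
    averaging criteria such as mRMR's 1/|S| fit the same interface."""
    selected, remaining = [], list(range(len(features)))
    for _ in range(k):
        def score(i):
            relevance = mutual_info(features[i], y)
            if not selected:
                return relevance
            redundancy = sum(mutual_info(features[i], features[j])
                             for j in selected)
            conditional = sum(cond_mutual_info(features[i], features[j], y)
                              for j in selected)
            return (relevance
                    - beta(len(selected)) * redundancy
                    + gamma(len(selected)) * conditional)
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# mRMR as an instance of the generic criterion:
# mrmr = greedy_select(cols, y, k, beta=lambda s: 1.0 / s, gamma=lambda s: 0.0)
```

This is a single-machine toy over discrete columns; the actual framework 
distributes the mutual-information computations over RDDs.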
    
    -- Main features:
    * Support for sparse data (in progress).
    * Pool optimization for high-dimensional data.
    * Improved performance over the previous version.
    
    Two contributions associated with this work have been submitted to 
international journals and will be attached to this request as soon as they are 
accepted. This software has been tested on two large real-world datasets:
    
    - A dataset from the GECCO-2014 competition (Vancouver, July 13th, 2014), 
which comes from the Protein Structure Prediction field 
(http://cruncher.ncl.ac.uk/bdcomp/). The dataset has 32 million instances, 631 
attributes, 2 classes, 98% negative examples, and occupies about 56 GB of disk 
space when uncompressed.
    - Epsilon dataset: 
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#epsilon. 
400K instances and 2K attributes.
    
    -- Brief benchmark results:
    
    * 150 seconds per selected feature on a 65-million-instance dataset with 
631 attributes.
    * For the epsilon dataset, we outperformed the no-FS results for three 
classifiers from Spark using only 2.5% of the original features.
    
    References
    
    [1] Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). 
    "Conditional likelihood maximisation: a unifying framework for information 
theoretic feature selection." 
    The Journal of Machine Learning Research, 13(1), 27-66.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sramirez/spark fselection

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5184.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5184
    
----
commit 4433c3632a90089468fa197d7249f8a41a3b4b9a
Author: Sergio Ramirez <[email protected]>
Date:   2015-03-20T14:43:29Z

    Version compatible with Spark 1.2.0

commit ce6b7f815557d0aa8b261f845f87aaee754a18a6
Author: Sergio Ramirez <[email protected]>
Date:   2015-03-20T14:43:55Z

    Version compatible with Spark 1.2.0, new changes

commit 1f77e35de740f6c57659e19d9314d76e2c56fd56
Author: Sergio Ramirez <[email protected]>
Date:   2015-03-23T11:49:27Z

    Fixed rare problem with import RDD

commit 96e68146aa7b32082019877f81a0dba3cdee2ad2
Author: Sergio Ramirez <[email protected]>
Date:   2015-03-23T17:03:19Z

    Changed default values for both algorithms. Fixed a bug in discretization 
transform.

commit d86967d9ba6f76c022eb452761a8b9e949d18161
Author: Sergio Ramirez <[email protected]>
Date:   2015-03-25T10:15:00Z

    Removed discretization part from PR. ChiSqSelected returned to its original 
version.

----

