[ https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ignacio Zendejas updated SPARK-1473:
------------------------------------

    Description: 
For classification tasks involving large feature spaces on the order of tens 
of thousands of features or more (e.g., text classification with n-grams, 
where n > 1), it is often useful to rank features and filter out irrelevant 
ones, thereby reducing the feature space by one to two orders of magnitude 
without hurting key evaluation metrics (accuracy/precision/recall).

A flexible feature evaluation interface should be designed, and at least two 
selection methods implemented, with Information Gain as a priority since it 
has been shown to be among the most reliable criteria.
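As a rough sketch of this filter approach (plain Python, independent of any MLlib API; all names below are hypothetical), Information Gain for a single discrete feature can be computed from (feature value, label) pairs:

```python
# Illustrative sketch, not MLlib code: information gain IG(Y; X) of one
# discrete feature X with respect to a class label Y, computed as
# H(Y) - sum_x p(x) * H(Y | X = x).
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(Y) of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG = H(Y) minus the entropy of Y conditioned on the feature value."""
    n = len(labels)
    cond = 0.0
    for x in set(feature_values):
        subset = [y for fv, y in zip(feature_values, labels) if fv == x]
        cond += (len(subset) / n) * entropy(subset)
    return entropy(labels) - cond

# A perfectly predictive feature recovers the full label entropy:
x = [1, 1, 0, 0]
y = ["spam", "spam", "ham", "ham"]
print(information_gain(x, y))  # 1.0 for this balanced, separable case
```

A feature whose values are independent of the label scores 0, so ranking features by this value and keeping the top k is exactly the filter step described above.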

The design should also take special care to account for wrapper methods (see 
the research papers below), which are more practical for lower-dimensional 
data.
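By contrast with filter scoring, a wrapper method repeatedly retrains and evaluates a model on candidate feature subsets. A minimal sketch of greedy forward selection (hypothetical names; the caller supplies the evaluation function, e.g. cross-validated accuracy):

```python
# Illustrative wrapper-method sketch: greedy forward selection that adds,
# at each step, the feature whose inclusion most improves a user-supplied
# evaluation function, stopping when no candidate improves the score.
def forward_selection(features, evaluate, max_features):
    """features: candidate feature names; evaluate: callable mapping a
    feature subset to a score (higher is better)."""
    selected = []
    remaining = list(features)
    best_score = float("-inf")
    while remaining and len(selected) < max_features:
        candidate = max(remaining, key=lambda f: evaluate(selected + [f]))
        score = evaluate(selected + [candidate])
        if score <= best_score:
            break  # no improvement over the current subset; stop early
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = score
    return selected

# Toy scorer: pretend only "a" and "b" carry signal.
evaluate = lambda subset: sum(1 for f in subset if f in {"a", "b"})
print(forward_selection(["a", "b", "c"], evaluate, max_features=3))
```

Each step costs one model evaluation per remaining feature, which is why wrappers are only practical when the feature space is small.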

Relevant research:
* Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional 
likelihood maximisation: a unifying framework for information theoretic 
feature selection. The Journal of Machine Learning Research, 13, 27-66.
* Forman, G. (2003). An extensive empirical study of feature selection 
metrics for text classification. The Journal of Machine Learning Research, 
3, 1289-1305.

  was:
For classification tasks involving large feature spaces on the order of tens 
of thousands of features (e.g., text classification with n-grams, where 
n > 1), it is often useful to rank features and filter out irrelevant ones, 
reducing the feature space by one to two orders of magnitude without hurting 
key evaluation metrics (accuracy/precision/recall).

A flexible feature evaluation interface should be designed, and at least two 
selection methods implemented, with Information Gain as a priority since it 
has been shown to be among the most reliable criteria.

The design should also take special care to account for wrapper methods (see 
the research papers below), which are more practical for lower-dimensional 
data.

Relevant research:
* Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional 
likelihood maximisation: a unifying framework for information theoretic 
feature selection. The Journal of Machine Learning Research, 13, 27-66.
* Forman, G. (2003). An extensive empirical study of feature selection 
metrics for text classification. The Journal of Machine Learning Research, 
3, 1289-1305.


> Feature selection for high dimensional datasets
> -----------------------------------------------
>
>                 Key: SPARK-1473
>                 URL: https://issues.apache.org/jira/browse/SPARK-1473
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Ignacio Zendejas
>            Priority: Minor
>              Labels: features
>             Fix For: 1.1.0
>



--
This message was sent by Atlassian JIRA
(v6.2#6252)
