Ignacio Zendejas created SPARK-1473:
---------------------------------------
Summary: Feature selection for high dimensional datasets
Key: SPARK-1473
URL: https://issues.apache.org/jira/browse/SPARK-1473
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: Ignacio Zendejas
Priority: Minor
Fix For: 1.1.0
For classification tasks involving large feature spaces on the order of tens of
thousands of features (e.g., text classification with n-grams, where n > 1), it
is often useful to rank features and filter out the irrelevant ones, reducing
the feature space by at least one or two orders of magnitude without hurting
key evaluation metrics (accuracy/precision/recall).
A flexible feature evaluation interface needs to be designed, and at least two
methods should be implemented, with Information Gain as the priority because it
has been shown to be among the most reliable.
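To make the idea concrete, here is a minimal single-machine sketch of
Information Gain ranking behind a pluggable scoring function. It is not a
proposed MLlib API: the names `entropy`, `information_gain`, `rank_features`
and `select_top_k` are hypothetical, features are assumed to be discrete, and
the data is assumed to fit in memory as a list of rows.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """IG(Y; X) = H(Y) - H(Y | X) for one discrete feature column."""
    total = len(labels)
    # Group the labels by the value the feature takes.
    by_value = {}
    for x, y in zip(feature_values, labels):
        by_value.setdefault(x, []).append(y)
    conditional = sum(len(ys) / total * entropy(ys) for ys in by_value.values())
    return entropy(labels) - conditional


def rank_features(rows, labels, score=information_gain):
    """Return (feature_index, score) pairs, most informative first.

    `score` is any function (feature_column, labels) -> float, so other
    metrics can be swapped in without changing the ranking code.
    """
    n_features = len(rows[0])
    scores = [(j, score([r[j] for r in rows], labels)) for j in range(n_features)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def select_top_k(rows, labels, k, score=information_gain):
    """Keep only the k highest-scoring feature columns (hypothetical helper)."""
    keep = sorted(j for j, _ in rank_features(rows, labels, score)[:k])
    return keep, [[r[j] for j in keep] for r in rows]
```

Passing the metric as a plain function is only one way to keep the interface
flexible; a second method would be added by supplying another `score` function
with the same signature.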
The design should also account for wrapper methods (see the research papers
below), which are more practical for lower-dimensional data.
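For contrast with filter-style ranking, a wrapper method scores whole feature
subsets by evaluating a model on them. The sketch below shows greedy forward
selection, one common wrapper strategy; it is an illustration only, and both
the function name `forward_selection` and the caller-supplied `evaluate`
callback (which would typically train and validate a classifier on the given
subset) are assumptions rather than anything specified in this issue.

```python
def forward_selection(n_features, evaluate, max_features):
    """Greedy wrapper: repeatedly add the feature that most improves the score.

    `evaluate(subset)` takes a list of feature indices and returns a number
    where higher is better, e.g. cross-validated accuracy.
    """
    selected = []
    best_score = evaluate(selected)
    while len(selected) < max_features:
        candidates = [j for j in range(n_features) if j not in selected]
        if not candidates:
            break
        # One model evaluation per remaining candidate feature.
        score, best_j = max((evaluate(selected + [j]), j) for j in candidates)
        if score <= best_score:
            break  # no candidate improves on the current subset
        selected.append(best_j)
        best_score = score
    return selected, best_score
```

Each round costs one evaluation per remaining feature, which is why this style
of search suits lower-dimensional data better than feature spaces in the tens
of thousands.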
Relevant research:
* Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional
likelihood maximisation: a unifying framework for information theoretic
feature selection. *The Journal of Machine Learning Research*, *13*, 27-66.
* Forman, G. (2003). An extensive empirical study of feature selection metrics
for text classification. *The Journal of Machine Learning Research*, *3*,
1289-1305.