[
https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183075#comment-14183075
]
Gavin Brown commented on SPARK-1473:
------------------------------------
Sorry, one extra point I didn't notice before....
Sam said "how independence is to be assumed (to solve the exponential sparsity
problem) is left as arbitrary, and so is the choice of probability estimation"
.... that's not quite true. The latter point stands: we did not specify how to
estimate probabilities. On the former, however, we conducted a major empirical
study, judging which independence assumptions had the best properties in terms
of accuracy and stability with small and large samples, as well as which
permitted the most efficient implementations, across a wide range of data sets
(26 sets, if memory serves).
Our conclusions were empirical, but in this case there really can be no other
way. As we stated, we reached the limit of what is possible in theoretical
terms and had to continue with empirical studies: to identify which
independence assumptions hold in any particular data set is to tackle the model
fitting problem itself. Therefore all that can be done is to use the empirical
study results, which clearly recommend the JMI and CMIM criteria. Certain other
criteria (e.g. MRMR) we find are misleading and dangerous to use.
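For intuition, here is a minimal, hypothetical sketch of the greedy CMIM
criterion on discrete data. The brute-force counting estimator and all names
are illustrative; this is not the paper's reference implementation, and a real
system would need a proper probability estimator:

```python
from collections import Counter
from math import log2

def cond_mutual_info(x, y, z):
    """I(X; Y | Z) in bits, for discrete sequences, via plug-in counts."""
    n = len(x)
    pz = Counter(z)
    pxz, pyz = Counter(zip(x, z)), Counter(zip(y, z))
    pxyz = Counter(zip(x, y, z))
    mi = 0.0
    for (xi, yi, zi), c in pxyz.items():
        p_xyz = c / n
        mi += p_xyz * log2(p_xyz * (pz[zi] / n)
                           / ((pxz[(xi, zi)] / n) * (pyz[(yi, zi)] / n)))
    return mi

def mutual_info(x, y):
    """I(X; Y) as I(X; Y | Z) with a constant Z."""
    return cond_mutual_info(x, y, [0] * len(x))

def cmim_select(features, labels, k):
    """Greedy CMIM: pick the feature maximising min_{s in S} I(f; Y | s)."""
    selected, remaining = [], list(range(len(features)))
    while remaining and len(selected) < k:
        if not selected:
            # First pick: plain mutual information with the class label.
            best = max(remaining, key=lambda f: mutual_info(features[f], labels))
        else:
            # Later picks: worst-case conditional MI over already-selected
            # features, which penalises redundant copies of chosen features.
            best = max(remaining,
                       key=lambda f: min(cond_mutual_info(features[f], labels,
                                                          features[s])
                                         for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```

The `min` over already-selected features is what distinguishes CMIM from a
plain information-gain ranking: a duplicate of an already-chosen feature scores
zero and is never picked again.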
> Feature selection for high dimensional datasets
> -----------------------------------------------
>
> Key: SPARK-1473
> URL: https://issues.apache.org/jira/browse/SPARK-1473
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Ignacio Zendejas
> Assignee: Alexander Ulanov
> Priority: Minor
> Labels: features
>
> For classification tasks involving large feature spaces on the order of tens
> of thousands of features or more (e.g., text classification with n-grams,
> where n > 1), it is often useful to rank and filter out irrelevant features,
> thereby reducing the feature space by at least one or two orders of magnitude
> without impacting performance on key evaluation metrics
> (accuracy/precision/recall).
> A flexible feature evaluation interface needs to be designed, and at least
> two methods should be implemented, with Information Gain being a priority as
> it has been shown to be amongst the most reliable.
> Special consideration should be taken in the design to account for wrapper
> methods (see the research papers below), which are more practical for
> lower-dimensional data.
> Relevant research:
> * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional
> likelihood maximisation: a unifying framework for information theoretic
> feature selection. The Journal of Machine Learning Research, 13, 27-66.
> * Forman, G. (2003). An extensive empirical study of feature selection
> metrics for text classification. The Journal of Machine Learning Research,
> 3, 1289-1305.
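As a sketch of what the Information Gain filter described in the issue above
computes, a minimal ranker on discrete features might look like the following.
This is illustrative plain Python with hypothetical names, not Spark/MLlib
API, and it assumes features have already been discretised:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(Y) in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature, labels):
    """IG(Y; X) = H(Y) - sum_v p(X=v) * H(Y | X=v)."""
    n = len(labels)
    by_value = {}
    for x, y in zip(feature, labels):
        by_value.setdefault(x, []).append(y)
    cond = sum((len(ys) / n) * entropy(ys) for ys in by_value.values())
    return entropy(labels) - cond

def rank_features(features, labels, top_k):
    """Rank feature indices by information gain, keep the top_k (the filter)."""
    ranked = sorted(range(len(features)),
                    key=lambda i: info_gain(features[i], labels),
                    reverse=True)
    return ranked[:top_k]
```

Because information gain scores each feature independently of the others, it
is cheap at this scale but, unlike the JMI/CMIM criteria discussed in the
comment above, it cannot detect redundancy between features.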
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]