[ https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14157047#comment-14157047 ]
David Martinez Rego commented on SPARK-1473:
--------------------------------------------
Sorry for having my name incomplete when I first posted. I am David Martinez,
currently at UCL in London. This project was abandoned for some time, but we
will submit a new pull request shortly. You can see the current version of the
code at https://github.com/LIDIAgroup/SparkFeatureSelection.
In response to an earlier post: the framework that Dr Gavin Brown presents is a
single unified framework precisely because it makes no assumptions when stating
the basic probabilistic model of the feature selection problem. The probability
estimation problem he mentions is not a philosophical question, nor is it a
shortcoming specific to feature selection. The problem is that, without any
independence assumptions, the number of parameters you need to estimate (one
probability per possible joint event) grows exponentially with the number of
variables. Good estimates of those parameters therefore require an exponential
number of samples, because observing all possible events takes exponentially
many observations. That is why you need to make independence assumptions and
follow a greedy strategy to be able to draw any conclusions.
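
To make the blow-up concrete: a full joint distribution over d binary variables
has 2^d - 1 free parameters. The following minimal sketch (plain Python with
made-up toy data; nothing here is taken from the SparkFeatureSelection code)
counts those parameters and ranks features by marginal mutual information with
the label, which is the simplest criterion (MIM) in Brown et al.'s unified
framework:

import math
from collections import Counter

def num_joint_parameters(d):
    # A full joint distribution over d binary variables assigns one
    # probability to each of the 2^d events, minus the sum-to-one constraint.
    return 2 ** d - 1

def mutual_information(xs, ys):
    # I(X; Y) for two discrete sequences, estimated from raw counts.
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def rank_features(features, labels):
    # Rank every feature by its marginal mutual information with the label
    # (the MIM criterion, the simplest member of the unified framework).
    return sorted(features, key=lambda f: -mutual_information(features[f], labels))

# Toy data: "f1" determines the label, "f2" is noise.
features = {"f1": [0, 0, 1, 1, 0, 1], "f2": [0, 1, 0, 1, 1, 0]}
labels = [0, 0, 1, 1, 0, 1]
print(num_joint_parameters(20))         # 1048575 free parameters for just 20 binary features
print(rank_features(features, labels))  # ['f1', 'f2']

Criteria that also condition on already-selected features (JMI, CMIM, and the
other members of the framework) refine this ranking, but all of them keep the
estimated distributions low-dimensional for exactly the sample-complexity
reason above.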
> Feature selection for high dimensional datasets
> -----------------------------------------------
>
> Key: SPARK-1473
> URL: https://issues.apache.org/jira/browse/SPARK-1473
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Ignacio Zendejas
> Assignee: Alexander Ulanov
> Priority: Minor
> Labels: features
>
> For classification tasks involving large feature spaces on the order of tens
> of thousands of features or more (e.g., text classification with n-grams,
> where n > 1), it is often useful to rank features and filter out the
> irrelevant ones, thereby reducing the feature space by at least one or two
> orders of magnitude without hurting key evaluation metrics
> (accuracy/precision/recall).
> A flexible feature evaluation interface needs to be designed, and at least
> two methods should be implemented, with Information Gain being a priority
> since it has been shown to be among the most reliable (a sketch of such an
> interface follows the references below).
> Special consideration should be given in the design to wrapper methods (see
> the research papers below), which are more practical for lower-dimensional
> data.
> Relevant research:
> * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional
> likelihood maximisation: a unifying framework for information theoretic
> feature selection. The Journal of Machine Learning Research, 13, 27-66.
> * Forman, G. (2003). An extensive empirical study of feature selection
> metrics for text classification. The Journal of Machine Learning Research,
> 3, 1289-1305.
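
Purely as an illustration of the flexible evaluation interface the description
calls for, here is a minimal, self-contained Python sketch; every name in it
(FeatureScorer, InfoGainScorer, select_top_k) is hypothetical and not taken
from the SparkFeatureSelection repository:

import math
from abc import ABC, abstractmethod
from collections import Counter

class FeatureScorer(ABC):
    # Pluggable scoring strategy: a higher score means a more relevant feature.
    @abstractmethod
    def score(self, feature_values, labels):
        ...

class InfoGainScorer(FeatureScorer):
    # Information Gain of a discrete feature, i.e. its mutual information
    # with the class label, estimated from raw counts.
    def score(self, feature_values, labels):
        n = len(labels)
        px, py = Counter(feature_values), Counter(labels)
        pxy = Counter(zip(feature_values, labels))
        return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
                   for (x, y), c in pxy.items())

def select_top_k(scorer, features, labels, k):
    # Rank every feature with the given scorer and keep only the k best.
    return sorted(features, key=lambda f: -scorer.score(features[f], labels))[:k]

# Toy usage with two hypothetical n-gram indicator features.
features = {"unigram:good": [1, 1, 0, 0], "unigram:the": [1, 0, 1, 0]}
labels = [1, 1, 0, 0]
print(select_top_k(InfoGainScorer(), features, labels, 1))  # ['unigram:good']

A wrapper method would slot into the same shape by scoring candidate feature
subsets with a trained model rather than a per-feature statistic, which is the
design consideration the description raises.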