[
https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168590#comment-14168590
]
sam commented on SPARK-1473:
----------------------------
[~torito1984] Thank you for the response, and apologies for my delay in
responding.
Yes, the difficulty of estimating probabilities when no independence assumptions
are made does indeed make it necessary to treat some features as independent. My
question is *how* we should do this: is there any literature that has attempted
to **formalize the way we introduce independence** in *information-theoretic*
terms? Moreover, I see this problem, and feature selection in general, as tightly
coupled with the way probability estimation is performed.
Suppose, in the simplest case, we wish to decide whether features F_1 and F_2 are
dependent (we could consider arbitrary conjunctions too). An information theorist
would look at the Mutual Information, i.e. the KL divergence between the joint
distribution and the product of the marginals:
KL( p(F_1, F_2) || p(F_1) * p(F_2) )
and then threshold this value, or rank feature pairs by it, to decide which pairs
to treat as dependent.
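To make that concrete, here is a minimal sketch (Scala, since this is the Spark
project, but it is not an MLlib API) of the empirical MI between two discrete
features, estimated with plain relative frequencies (Maximum Likelihood, no
smoothing) and compared against an arbitrary threshold:
{code:scala}
// Minimal sketch, not an MLlib API: empirical mutual information between two
// discrete features, i.e. KL( p(F1, F2) || p(F1) * p(F2) ), from raw frequencies.
object MutualInfoSketch {
  def mutualInformation(pairs: Seq[(String, String)]): Double = {
    val n = pairs.size.toDouble
    val joint = pairs.groupBy(identity).map { case (k, v) => k -> v.size / n }
    val p1    = pairs.groupBy(_._1).map { case (k, v) => k -> v.size / n }
    val p2    = pairs.groupBy(_._2).map { case (k, v) => k -> v.size / n }
    // Sum over observed (F1, F2) cells; unobserved cells contribute 0 * log 0 = 0.
    joint.map { case ((a, b), pab) => pab * math.log(pab / (p1(a) * p2(b))) }.sum
  }

  def main(args: Array[String]): Unit = {
    val sample = Seq(("hot", "dry"), ("hot", "dry"), ("cold", "wet"),
                     ("cold", "wet"), ("hot", "wet"))
    val mi = mutualInformation(sample)
    val threshold = 0.05 // arbitrary, which is exactly the problem discussed here
    println(f"MI = $mi%.4f, treat as dependent: ${mi > threshold}")
  }
}
{code}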
This is exactly where we become tightly coupled to the means by which we estimate
the probabilities p(F_1, F_2), p(F_1) and p(F_2). We could use Maximum Likelihood
with Laplace smoothing, MAP estimation / regularization, etc., or the much less
well-known Carnap's Continuum of Inductive Methods. The method we choose, along
with the usual arbitrary choice of some constant (e.g. alpha in Laplace/additive
smoothing), determines p(F_1, F_2), p(F_1) and p(F_2), and therefore determines
whether or not F_1 and F_2 are to be considered dependent.
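As a rough illustration of that coupling (again hypothetical, not from the paper),
the sketch below re-estimates the probabilities with additive smoothing on made-up
2x2 counts; the MI estimate, and with it the dependence decision, moves with the
arbitrary constant alpha:
{code:scala}
// Additive (Laplace) smoothing: p(x) = (count(x) + alpha) / (N + alpha * K),
// applied to hypothetical contingency counts for binary features F1, F2.
object SmoothingSensitivity {
  def smoothed(count: Long, total: Long, cells: Int, alpha: Double): Double =
    (count + alpha) / (total + alpha * cells)

  def main(args: Array[String]): Unit = {
    val counts = Map((0, 0) -> 9L, (0, 1) -> 1L, (1, 0) -> 2L, (1, 1) -> 8L) // made-up data
    val n = counts.values.sum
    for (alpha <- Seq(0.0, 1.0, 10.0)) {
      val pJoint = counts.map { case (k, c) => k -> smoothed(c, n, 4, alpha) }
      val p1 = (0 to 1).map(a =>
        a -> smoothed(counts.collect { case ((x, _), c) if x == a => c }.sum, n, 2, alpha)).toMap
      val p2 = (0 to 1).map(b =>
        b -> smoothed(counts.collect { case ((_, y), c) if y == b => c }.sum, n, 2, alpha)).toMap
      val mi = pJoint.map { case ((a, b), pab) => pab * math.log(pab / (p1(a) * p2(b))) }.sum
      // The estimate shrinks toward 0 (independence) as alpha grows.
      println(f"alpha = $alpha%5.1f  ->  MI estimate = $mi%.4f")
    }
  }
}
{code}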
Current practice in machine learning has been to choose an estimation method based
on cross-validation results rather than on any deep philosophical justification.
The work of Prof. Jeff Paris and his colleagues is the only work I've seen that
attempts to use information-theoretic principles to estimate probabilities, but
unfortunately it is a little incomplete with regard to practical application.
To summarize, although I like the paper, especially its principled approach
(versus the "just test and see" attitude common in data science), how independence
is to be assumed (to deal with the exponential sparsity problem) is left arbitrary,
and so is the choice of probability estimation; it is therefore not fully
principled nor fully foundational.
Please do not interpret this comment as a rejection of or attack on the paper;
rather, I consider it a little incomplete and was hoping someone might have found
a line of research more successful than my own to fill in the gaps.
> Feature selection for high dimensional datasets
> -----------------------------------------------
>
> Key: SPARK-1473
> URL: https://issues.apache.org/jira/browse/SPARK-1473
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Ignacio Zendejas
> Assignee: Alexander Ulanov
> Priority: Minor
> Labels: features
>
> For classification tasks involving large feature spaces on the order of tens of
> thousands of features or more (e.g., text classification with n-grams, where
> n > 1), it is often useful to rank and filter out irrelevant features, thereby
> reducing the feature space by at least one or two orders of magnitude without
> impacting performance on key evaluation metrics (accuracy/precision/recall).
> A flexible feature evaluation interface needs to be designed, and at least two
> methods should be implemented, with Information Gain being a priority as it has
> been shown to be among the most reliable.
> Special consideration should be given in the design to wrapper methods (see the
> research papers below), which are more practical for lower-dimensional data.
> Relevant research:
> * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional
> likelihood maximisation: a unifying framework for information theoretic
> feature selection. The Journal of Machine Learning Research, 13, 27-66.
> * Forman, G. (2003). An extensive empirical study of feature selection metrics
> for text classification. The Journal of Machine Learning Research, 3, 1289-1305.
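For illustration, a minimal sketch of the Information Gain ranking described in
the quoted issue, assuming binary features and labels; this is hypothetical and
not the MLlib interface being proposed:
{code:scala}
// Rank binary features by Information Gain, i.e. the mutual information
// I(F; Y) between each feature and the class label, then keep the top ones.
object InfoGainRanking {
  def entropy(probs: Seq[Double]): Double =
    -probs.filter(_ > 0).map(p => p * math.log(p) / math.log(2)).sum

  // data: rows of (featureVector, label); features and labels are 0/1.
  def informationGain(data: Seq[(Array[Int], Int)], f: Int): Double = {
    val n = data.size.toDouble
    val hY = entropy(data.map(_._2).groupBy(identity).values.map(_.size / n).toSeq)
    // Conditional entropy H(Y | F_f), weighting each feature value by its frequency.
    val hYgivenF = data.groupBy(_._1(f)).values.map { group =>
      val w = group.size / n
      w * entropy(group.map(_._2).groupBy(identity).values
        .map(_.size / group.size.toDouble).toSeq)
    }.sum
    hY - hYgivenF
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(
      (Array(1, 0, 1), 1), (Array(1, 1, 0), 1),
      (Array(0, 0, 1), 0), (Array(0, 1, 0), 0))
    val ranked = (0 until 3).map(f => f -> informationGain(data, f)).sortBy(-_._2)
    println(ranked.mkString(", ")) // feature 0 perfectly predicts the label here
  }
}
{code}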