[
https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168590#comment-14168590
]
sam commented on SPARK-1473:
----------------------------
[~torito1984] Thank you for the response, and apologies for my delay in
responding.
Yes, the difficulty of estimating probabilities when no independence assumptions
are made does indeed make it necessary to treat some features as independent. My
question is *how* we should do this: is there any literature that has attempted
to **formalize the way we introduce independence** in *information-theoretic*
terms? Moreover, I see this problem, and feature selection in general, as tightly
coupled with the way probability estimation is performed.
Suppose, in the simplest case, we wish to decide whether features F_1 and F_2 are
dependent (we could consider arbitrary conjunctions too). An information theorist
would look at the Mutual Information, i.e. the KL divergence between the joint
distribution and the product of the marginals:
KL( p(F_1, F_2) || p(F_1) * p(F_2) )
and then threshold this value, or rank feature pairs by it, to decide which pairs
to treat as dependent.
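To make that concrete, here is a minimal sketch (Scala, since this is the Spark
project, but it is not an MLlib API) of the empirical MI between two discrete
features, estimated with plain relative frequencies (Maximum Likelihood, no
smoothing) and compared against an arbitrary threshold:
{code:scala}
// Minimal sketch, not an MLlib API: empirical mutual information between two
// discrete features, i.e. KL( p(F1, F2) || p(F1) * p(F2) ), from raw frequencies.
object MutualInfoSketch {
  def mutualInformation(pairs: Seq[(String, String)]): Double = {
    val n = pairs.size.toDouble
    val joint = pairs.groupBy(identity).map { case (k, v) => k -> v.size / n }
    val p1    = pairs.groupBy(_._1).map { case (k, v) => k -> v.size / n }
    val p2    = pairs.groupBy(_._2).map { case (k, v) => k -> v.size / n }
    // Sum over observed (F1, F2) cells; unobserved cells contribute 0 * log 0 = 0.
    joint.map { case ((a, b), pab) => pab * math.log(pab / (p1(a) * p2(b))) }.sum
  }

  def main(args: Array[String]): Unit = {
    val sample = Seq(("hot", "dry"), ("hot", "dry"), ("cold", "wet"),
                     ("cold", "wet"), ("hot", "wet"))
    val mi = mutualInformation(sample)
    val threshold = 0.05 // arbitrary, which is exactly the problem discussed here
    println(f"MI = $mi%.4f, treat as dependent: ${mi > threshold}")
  }
}
{code}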
This is exactly where we become tightly coupled to the means by which we estimate
the probabilities p(F_1, F_2), p(F_1) and p(F_2). We could use Maximum Likelihood
with Laplace smoothing, MAP estimation / regularization, etc., or the much less
well-known Carnap's Continuum of Inductive Methods. The method we choose, along
with the usual arbitrary choice of some constant (e.g. alpha in Laplace/additive
smoothing), determines p(F_1, F_2), p(F_1) and p(F_2), and therefore determines
whether or not F_1 and F_2 are to be considered dependent.
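As a rough illustration of that coupling (again hypothetical, not from the paper),
the sketch below re-estimates the probabilities with additive smoothing on made-up
2x2 counts; the MI estimate, and with it the dependence decision, moves with the
arbitrary constant alpha:
{code:scala}
// Additive (Laplace) smoothing: p(x) = (count(x) + alpha) / (N + alpha * K),
// applied to hypothetical contingency counts for binary features F1, F2.
object SmoothingSensitivity {
  def smoothed(count: Long, total: Long, cells: Int, alpha: Double): Double =
    (count + alpha) / (total + alpha * cells)

  def main(args: Array[String]): Unit = {
    val counts = Map((0, 0) -> 9L, (0, 1) -> 1L, (1, 0) -> 2L, (1, 1) -> 8L) // made-up data
    val n = counts.values.sum
    for (alpha <- Seq(0.0, 1.0, 10.0)) {
      val pJoint = counts.map { case (k, c) => k -> smoothed(c, n, 4, alpha) }
      val p1 = (0 to 1).map(a =>
        a -> smoothed(counts.collect { case ((x, _), c) if x == a => c }.sum, n, 2, alpha)).toMap
      val p2 = (0 to 1).map(b =>
        b -> smoothed(counts.collect { case ((_, y), c) if y == b => c }.sum, n, 2, alpha)).toMap
      val mi = pJoint.map { case ((a, b), pab) => pab * math.log(pab / (p1(a) * p2(b))) }.sum
      // The estimate shrinks toward 0 (independence) as alpha grows.
      println(f"alpha = $alpha%5.1f  ->  MI estimate = $mi%.4f")
    }
  }
}
{code}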
Current practice in machine learning has been to choose an estimation method based
on cross-validation results rather than on any deep philosophical justification.
The work of Prof. Jeff Paris and his colleagues is the only work I've seen that
attempts to use information-theoretic principles to estimate probabilities, but
unfortunately it is a little incomplete with regard to practical application.
To summarize, although I like the paper, especially its principled approach
(versus the "just test and see" attitude common in data science), how independence
is to be assumed (to deal with the exponential sparsity problem) is left arbitrary,
and so is the choice of probability estimation; it is therefore not fully
principled nor fully foundational.
Please do not interpret this comment as a rejection of or attack on the paper;
rather, I consider it a little incomplete and was hoping someone might have found
a line of research more successful than my own to fill in the gaps.
> Feature selection for high dimensional datasets
> -----------------------------------------------
>
> Key: SPARK-1473
> URL: https://issues.apache.org/jira/browse/SPARK-1473
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Ignacio Zendejas
> Assignee: Alexander Ulanov
> Priority: Minor
> Labels: features
>
> For classification tasks involving large feature spaces on the order of tens of
> thousands of features or more (e.g., text classification with n-grams, where
> n > 1), it is often useful to rank and filter out irrelevant features, thereby
> reducing the feature space by at least one or two orders of magnitude without
> impacting performance on key evaluation metrics (accuracy/precision/recall).
> A flexible feature evaluation interface needs to be designed, and at least two
> methods should be implemented, with Information Gain being a priority as it has
> been shown to be among the most reliable.
> Special consideration should be given in the design to wrapper methods (see the
> research papers below), which are more practical for lower-dimensional data.
> Relevant research:
> * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional
> likelihood maximisation: a unifying framework for information theoretic
> feature selection. The Journal of Machine Learning Research, 13, 27-66.
> * Forman, G. (2003). An extensive empirical study of feature selection metrics
> for text classification. The Journal of Machine Learning Research, 3, 1289-1305.
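For illustration, a minimal sketch of the Information Gain ranking described in
the quoted issue, assuming binary features and labels; this is hypothetical and
not the MLlib interface being proposed:
{code:scala}
// Rank binary features by Information Gain, i.e. the mutual information
// I(F; Y) between each feature and the class label, then keep the top ones.
object InfoGainRanking {
  def entropy(probs: Seq[Double]): Double =
    -probs.filter(_ > 0).map(p => p * math.log(p) / math.log(2)).sum

  // data: rows of (featureVector, label); features and labels are 0/1.
  def informationGain(data: Seq[(Array[Int], Int)], f: Int): Double = {
    val n = data.size.toDouble
    val hY = entropy(data.map(_._2).groupBy(identity).values.map(_.size / n).toSeq)
    // Conditional entropy H(Y | F_f), weighting each feature value by its frequency.
    val hYgivenF = data.groupBy(_._1(f)).values.map { group =>
      val w = group.size / n
      w * entropy(group.map(_._2).groupBy(identity).values
        .map(_.size / group.size.toDouble).toSeq)
    }.sum
    hY - hYgivenF
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(
      (Array(1, 0, 1), 1), (Array(1, 1, 0), 1),
      (Array(0, 0, 1), 0), (Array(0, 1, 0), 0))
    val ranked = (0 until 3).map(f => f -> informationGain(data, f)).sortBy(-_._2)
    println(ranked.mkString(", ")) // feature 0 perfectly predicts the label here
  }
}
{code}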