[ https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184453#comment-14184453 ]

sam commented on SPARK-1473:
----------------------------

[~gbr...@cs.man.ac.uk] Thanks for taking the time to respond to my questions, 
and thank you again for writing the paper; I always enjoy reading foundational 
(i.e. information-theoretic) approaches to Machine Learning.

Regarding your final point about empiricism: yes, that is better than 
"arbitrary", so my original comment was too strong. I guess I was hoping for 
the same kind of foundational approach that was used to define the feature 
selection criteria, and I am optimistic that a principled way of defining 
independence does exist (one which, I think, would also link with estimation).

I notice that your email address indicates you are at the University of 
Manchester (I must have overlooked this when reading the paper - typical 
mathematician). That is where I learnt about Information Theory - in the maths 
department; Jeff Paris, George Wilmers, Vencovska, etc. have all done sterling 
work.

Do you ever come to London? Do you have any interest in applications? We have a 
Spark Meetup in London, and it would be great if you could attend - it is much 
easier to share ideas in person. Perhaps you and [~torito1984] would even be 
willing to give a talk on "Information Theoretic Feature Selection with 
Implementation in Spark"?

> Feature selection for high dimensional datasets
> -----------------------------------------------
>
>                 Key: SPARK-1473
>                 URL: https://issues.apache.org/jira/browse/SPARK-1473
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Ignacio Zendejas
>            Assignee: Alexander Ulanov
>            Priority: Minor
>              Labels: features
>
> For classification tasks involving large feature spaces, on the order of tens 
> of thousands of features or more (e.g., text classification with n-grams, 
> where n > 1), it is often useful to rank features and filter out irrelevant 
> ones, reducing the feature space by at least one or two orders of magnitude 
> without impacting performance on key evaluation metrics 
> (accuracy/precision/recall).
> A flexible feature evaluation interface needs to be designed, and at least 
> two methods should be implemented, with Information Gain as a priority since 
> it has been shown to be among the most reliable (a sketch of such a filter 
> method appears after the references below).
> Special consideration should be given in the design to wrapper methods (see 
> the research papers below), which are more practical for lower-dimensional 
> data.
> Relevant research:
> * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional 
> likelihood maximisation: a unifying framework for information theoretic 
> feature selection. The Journal of Machine Learning Research, 13, 27-66.
> * Forman, G. (2003). An extensive empirical study of feature selection 
> metrics for text classification. The Journal of Machine Learning Research, 3, 
> 1289-1305.
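
For concreteness, here is a minimal sketch of the filter approach described in 
the issue: ranking features by Information Gain, i.e. the mutual information 
I(F; Y) between a feature F and the class label Y, computed from co-occurrence 
counts. Everything here is an illustrative assumption for discussion - the 
object and method names, the dense Array[Int] 0/1 feature layout, and the 
top-k selection - not a proposed MLlib API; real n-gram data would need sparse 
vectors.

import org.apache.spark.SparkContext._ // pair-RDD implicits (Spark 1.x)
import org.apache.spark.rdd.RDD

object InformationGainSketch {

  // Shannon entropy (in bits) of a distribution given as raw counts.
  private def entropy(counts: Iterable[Long]): Double = {
    val total = counts.sum.toDouble
    counts.filter(_ > 0).map { c =>
      val p = c / total
      -p * math.log(p) / math.log(2)
    }.sum
  }

  // Rank features by IG(F_j) = H(Y) - H(Y | F_j) and return the top k indices.
  // `data` holds (classLabel, dense 0/1 feature array).
  def topKByInformationGain(data: RDD[(Int, Array[Int])], k: Int): Array[Int] = {
    val n = data.count().toDouble

    // H(Y) from the label marginal.
    val hY = entropy(
      data.map { case (label, _) => (label, 1L) }
        .reduceByKey(_ + _).values.collect())

    // One pass of (featureIndex, featureValue, label) co-occurrence counts.
    val joint = data.flatMap { case (label, features) =>
      features.zipWithIndex.map { case (v, j) => ((j, v, label), 1L) }
    }.reduceByKey(_ + _)

    // H(Y | F_j) = sum over values v of p(F_j = v) * H(Y | F_j = v).
    val conditional = joint
      .map { case ((j, v, label), c) => ((j, v), (label, c)) }
      .groupByKey() // one (label, count) pair per class, so groups stay tiny
      .map { case ((j, v), perLabel) =>
        val counts = perLabel.map(_._2)
        (j, (counts.sum / n) * entropy(counts))
      }
      .reduceByKey(_ + _) // sum over the values v of feature j

    // IG(F_j) = H(Y) - H(Y | F_j); keep the k highest-scoring feature indices.
    conditional
      .mapValues(hYGivenF => hY - hYGivenF)
      .top(k)(Ordering.by[(Int, Double), Double](_._2))
      .map(_._1)
  }
}

A wrapper method would instead re-train and score the downstream classifier on 
each candidate feature subset, which is exactly why the description above flags 
wrappers as practical only for lower-dimensional data.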


