[ 
https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14184457#comment-14184457
 ] 

David Martinez Rego commented on SPARK-1473:
--------------------------------------------

Dear Sam, 

Thank you for the invitation. Funnily enough, I am a regular at the meetups, 
and Martin Goodson has already invited me to give a talk about ... "selected 
topics on ML in Big Data". Currently I have a lab in Spain polishing the code 
and deploying it on a cluster to prove its performance (and support a future 
pull request). Dr. Brown has suggested a couple of improvements to me using 
semi-supervised data. When we have solid results, at least on my side, I would 
love to share them with the community.

> Feature selection for high dimensional datasets
> -----------------------------------------------
>
>                 Key: SPARK-1473
>                 URL: https://issues.apache.org/jira/browse/SPARK-1473
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Ignacio Zendejas
>            Assignee: Alexander Ulanov
>            Priority: Minor
>              Labels: features
>
> For classification tasks involving large feature spaces on the order of tens 
> of thousands or higher (e.g., text classification with n-grams, where n > 1), 
> it is often useful to rank features and filter out the irrelevant ones, 
> thereby reducing the feature space by at least one or two orders of magnitude 
> without impacting performance on key evaluation metrics 
> (accuracy/precision/recall).
> A flexible feature evaluation interface needs to be designed, and at least 
> two methods should be implemented, with Information Gain being a priority, 
> as it has been shown to be among the most reliable.
> Special consideration should be taken in the design to account for wrapper 
> methods (see research papers below), which are more practical for lower 
> dimensional data.
> Relevant research:
> * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional
> likelihood maximisation: a unifying framework for information theoretic
> feature selection. The Journal of Machine Learning Research, 13, 27-66.
> * Forman, G. (2003). An extensive empirical study of feature selection
> metrics for text classification. The Journal of Machine Learning Research,
> 3, 1289-1305.
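
For readers unfamiliar with the Information Gain criterion named in the issue description, the following is a minimal single-machine sketch of how features could be ranked and filtered by it. It is not MLlib code and not the implementation discussed in this thread; the function names (`entropy`, `information_gain`, `top_k_features`) are hypothetical, and the sketch assumes discrete feature values and a small in-memory dataset.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a non-empty sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG(Y; X) = H(Y) - H(Y | X) for a single discrete feature X."""
    total = len(labels)
    # Group the labels by the value the feature takes on each example.
    labels_by_value = {}
    for x, y in zip(feature_values, labels):
        labels_by_value.setdefault(x, []).append(y)
    # H(Y | X): entropy of each group, weighted by the group's relative size.
    conditional = sum(
        (len(ys) / total) * entropy(ys) for ys in labels_by_value.values()
    )
    return entropy(labels) - conditional


def top_k_features(rows, labels, k):
    """Return the column indices of the k features with the highest IG."""
    n_features = len(rows[0])
    scores = [
        information_gain([row[j] for row in rows], labels)
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


# Toy data (made up for illustration): column 0 determines the label,
# column 1 is independent of it.
rows = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels = [1, 1, 0, 0]
best = top_k_features(rows, labels, k=1)
```

On the toy data, column 0 is selected because it fully determines the label while column 1 carries no information about it. A distributed version would need to aggregate the per-feature value/label counts across partitions before computing the entropies; that part is outside the scope of this sketch.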



