[
https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993742#comment-13993742
]
Erik J. Erlandson commented on SPARK-1473:
------------------------------------------
I'm fairly new to Spark, and hopefully what follows isn't old news...
Feature subsetting ought (imo) to be considered as part of a larger picture
that involves various ETL-like tasks, such as:
*) data assessment -- examining data columns to infer data types (real,
integer, categorical/binary), identify noise in the data (empty/missing or bad
values), and suggest possible quantizations
*) data quantization -- mapping values into byte encodings, sparse binary, etc
*) dataset transposition -- moving from sample-wise to feature-wise orientation
(e.g. decision tree training can work more efficiently when data can be
traversed by feature)
*) feature extraction, augmentation, reduction
I don't yet have a strong feel for how these tasks should best work in Spark,
but in my previous lives I've found they are common and closely integrated
tasks when preparing for the care and feeding of ML models.
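To make the transposition point concrete: moving from sample-wise to
feature-wise orientation amounts to regrouping rows of feature values into
per-feature columns, so a learner (e.g. a decision tree) can scan one feature
at a time. A minimal sketch in plain Python (not Spark code -- in Spark this
would be a regrouping over (featureIndex, sampleId, value) records; all names
here are hypothetical):

```python
# Sketch: transpose a sample-wise dataset (one row per sample) into a
# feature-wise orientation (one column per feature). Plain-Python stand-in
# for a distributed regroup; names are hypothetical.

def transpose_samples(rows):
    """rows: list of equal-length feature-value lists (sample-wise).
    Returns a list of feature columns (feature-wise)."""
    if not rows:
        return []
    num_features = len(rows[0])
    columns = [[] for _ in range(num_features)]
    for row in rows:
        for j, value in enumerate(row):
            columns[j].append(value)
    return columns

samples = [[1.0, 0.0, 5.0],
           [2.0, 1.0, 6.0]]
print(transpose_samples(samples))  # -> [[1.0, 2.0], [0.0, 1.0], [5.0, 6.0]]
```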
> Feature selection for high dimensional datasets
> -----------------------------------------------
>
> Key: SPARK-1473
> URL: https://issues.apache.org/jira/browse/SPARK-1473
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Ignacio Zendejas
> Priority: Minor
> Labels: features
> Fix For: 1.1.0
>
>
> For classification tasks involving large feature spaces in the order of tens
> of thousands or higher (e.g., text classification with n-grams, where n > 1),
> it is often useful to rank and filter out irrelevant features, thereby
> reducing the feature space by at least one or two orders of magnitude without
> impacting performance on key evaluation metrics (accuracy/precision/recall).
> A flexible feature evaluation interface needs to be designed, and at least
> two methods should be implemented, with Information Gain being a priority
> as it has been shown to be among the most reliable.
> Special consideration should be taken in the design to account for wrapper
> methods (see research papers below), which are more practical for
> lower-dimensional data.
> Relevant research:
> * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional
> likelihood maximisation: a unifying framework for information theoretic
> feature selection. The Journal of Machine Learning Research, 13, 27-66.
> * Forman, G. (2003). An extensive empirical study of feature selection
> metrics for text classification. The Journal of Machine Learning Research,
> 3, 1289-1305.
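As a concrete illustration of the information-gain filter method the issue
proposes, here is a minimal sketch in plain Python (not Spark/MLlib code;
the function names are hypothetical): score each feature column by
IG(Y; X) = H(Y) - sum_x p(x) H(Y | X = x), then keep the top k.

```python
# Sketch: rank features by information gain against the class label and
# keep the top k. Plain-Python illustration of an information-gain filter;
# all names are hypothetical.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(Y; X) = H(Y) - sum over x of p(x) * H(Y | X = x)."""
    n = len(labels)
    cond = 0.0
    for x in set(feature_values):
        subset = [y for v, y in zip(feature_values, labels) if v == x]
        cond += (len(subset) / n) * entropy(subset)
    return entropy(labels) - cond

def select_top_k(feature_columns, labels, k):
    """feature_columns: list of feature-wise value lists.
    Returns the indices of the k highest-scoring features."""
    scored = [(information_gain(col, labels), j)
              for j, col in enumerate(feature_columns)]
    scored.sort(reverse=True)
    return [j for _, j in scored[:k]]

labels = [1, 1, 0, 0]
features = [[1, 1, 0, 0],   # perfectly predictive of the label
            [1, 0, 1, 0]]   # uninformative
print(select_top_k(features, labels, 1))  # -> [0]
```

A real implementation would compute the per-feature counts in a single
distributed pass, which is where the feature-wise orientation discussed
above pays off.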
--
This message was sent by Atlassian JIRA
(v6.2#6252)