[
https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Takeshi Yamamuro updated HIVEMALL-181:
--------------------------------------
Summary: Plan rewriting rules to filter meaningful training data before
feature selections (was: Plan rewrting rules to filter meaningful training
data before feature selections)
> Plan rewriting rules to filter meaningful training data before feature
> selections
> ---------------------------------------------------------------------------------
>
> Key: HIVEMALL-181
> URL: https://issues.apache.org/jira/browse/HIVEMALL-181
> Project: Hivemall
> Issue Type: Improvement
> Reporter: Takeshi Yamamuro
> Assignee: Takeshi Yamamuro
> Priority: Major
> Labels: spark
>
> In machine learning and statistics, feature selection is a useful technique
> for choosing a subset of relevant features during model construction, which
> simplifies models and shortens training times. scikit-learn has some
> APIs for feature selection
> ([http://scikit-learn.org/stable/modules/feature_selection.html]), but this
> selection is too time-consuming when the training data has a large number
> of columns (the number frequently exceeds 1,000 in business use cases).
> The objective of this ticket is to add new optimizer rules in Spark that
> filter meaningful training data before feature selection. As a simple
> example, Spark might filter out columns with low variance (this
> corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a
> `Project` node at the top of a user plan. The Spark optimizer might then
> push this `Project` node down into leaf nodes (e.g., `LogicalRelation`),
> and plan execution could be significantly faster.
> More sophisticated techniques have also been proposed in [1, 2].
> I will submit pull requests as sub-tasks and record related work (papers
> and features in other OSS projects) in this ticket to track it.
> References:
> [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join
> or Not to Join?: Thinking Twice about Joins before Feature Selection,
> Proceedings of SIGMOD, 2016.
> [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe
> to avoid when learning high-capacity classifiers?, Proceedings of the VLDB
> Endowment, Volume 11 Issue 3, Pages 366-379, 2017.
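As a rough illustration of the low-variance column filter described in the ticket, here is a minimal pure-Python sketch. It is not Spark optimizer code and not the implementation proposed here: the helper names `high_variance_columns` and `project`, the sample rows, and the 0.01 threshold are all hypothetical. It assumes the same convention as scikit-learn's `VarianceThreshold`, that is, population variance, keeping a column only when its variance is strictly greater than the threshold.

```python
from statistics import pvariance


def high_variance_columns(rows, columns, threshold=0.0):
    """Return names of columns whose population variance exceeds threshold.

    Hypothetical helper: mirrors the idea behind VarianceThreshold,
    but operates on a plain list of row tuples.
    """
    kept = []
    for idx, name in enumerate(columns):
        values = [row[idx] for row in rows]
        if pvariance(values) > threshold:
            kept.append(name)
    return kept


def project(rows, columns, kept):
    """Keep only the selected columns, analogous to a Project node."""
    indices = [columns.index(name) for name in kept]
    return [tuple(row[i] for i in indices) for row in rows]


# Hypothetical training data: "a" is constant, "c" barely varies.
columns = ["a", "b", "c"]
rows = [
    (1.0, 5.0, 0.0),
    (1.0, 7.0, 0.1),
    (1.0, 9.0, 0.0),
    (1.0, 11.0, 0.1),
]

kept = high_variance_columns(rows, columns, threshold=0.01)
projected = project(rows, columns, kept)
print(kept)
print(projected)
```

In an optimizer rule the projection would be added to the plan rather than applied to materialized rows, so that it can be pushed down to the data source; the sketch only shows which columns such a projection would retain.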