Takeshi Yamamuro created HIVEMALL-181:
-----------------------------------------
Summary: Plan rewriting rules to filter out meaningless columns
before feature selection
Key: HIVEMALL-181
URL: https://issues.apache.org/jira/browse/HIVEMALL-181
Project: Hivemall
Issue Type: Improvement
Reporter: Takeshi Yamamuro
Assignee: Takeshi Yamamuro
In machine learning and statistics, feature selection is a useful technique for
choosing a subset of relevant features
during model construction, which simplifies models and shortens training times.
scikit-learn has some APIs for feature selection
(http://scikit-learn.org/stable/modules/feature_selection.html), but
this selection is too time-consuming if the training data have a large
number of columns
(the number can frequently exceed 1,000 in business use cases).
An objective of this ticket is to add new optimizer rules in Spark to filter
out meaningless columns before feature selection.
As a simple example, Spark might be able to filter out columns with low
variances (this process corresponds to `VarianceThreshold` in scikit-learn)
by implicitly adding a `Project` node on top of a user plan.
Then, the Spark optimizer might push this `Project` node down into leaf nodes
(e.g., `LogicalRelation`), and
the plan execution could be significantly faster.
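
To make the low-variance example concrete, here is a minimal sketch in plain
Python rather than actual Spark optimizer code. It computes the population
variance of each column and keeps only the columns whose variance exceeds a
threshold, which is the column list an implicit `Project` node would carry.
The function names (`columns_above_variance`, `project`), the sample rows, and
the threshold value are hypothetical and only for illustration.

```python
from statistics import pvariance


def columns_above_variance(rows, threshold=0.0):
    """Return the names of numeric columns whose population variance
    is strictly greater than `threshold` (hypothetical helper)."""
    if not rows:
        return []
    keep = []
    for name in rows[0].keys():
        values = [row[name] for row in rows]
        if pvariance(values) > threshold:
            keep.append(name)
    return keep


def project(rows, columns):
    """Keep only the given columns, mimicking what a projection does."""
    return [{name: row[name] for name in columns} for row in rows]


# Made-up sample data: f0 is constant and f2 barely varies.
rows = [
    {"f0": 0.0, "f1": 1.0, "f2": 5.0},
    {"f0": 0.0, "f1": 2.0, "f2": 5.1},
    {"f0": 0.0, "f1": 3.0, "f2": 4.9},
]

kept = columns_above_variance(rows, threshold=0.05)
pruned = project(rows, kept)
print(kept)
```

In an optimizer rule, the variances would presumably come from column
statistics rather than a full scan of the data; that part is an assumption
and is not shown here.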
Moreover, more sophisticated techniques have been proposed in [1, 2].
I will make pull requests as sub-tasks and put relevant activities (papers and
other OSS functionalities)
in this ticket to track them.
References:
[1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or
Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings
of SIGMOD, 2016.
[2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to
avoid when learning high-capacity classifiers?, Proceedings of the VLDB
Endowment, Volume 11 Issue 3, Pages 366-379, 2017.