[
https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Takeshi Yamamuro updated HIVEMALL-181:
--------------------------------------
Description:
In machine learning and statistics, feature selection is a useful technique for
choosing a subset of relevant data during model construction, both to simplify
models and to shorten training times. scikit-learn provides several APIs for
feature selection
([http://scikit-learn.org/stable/modules/feature_selection.html]), but the
selection process becomes prohibitively time-consuming when the training data
has a large number of columns (the count frequently exceeds 1,000 in business
use cases).
The objective of this ticket is to add new optimizer rules to Spark that filter
down to meaningful training data before feature selection. As a simple example,
Spark could filter out columns with low variances (this corresponds to
`VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node on
top of a user plan. The Spark optimizer could then push this `Project` node
down into leaf nodes (e.g., `LogicalRelation`), and plan execution could become
significantly faster. More sophisticated techniques along these lines have been
proposed in [1, 2].
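To make the pruning idea concrete, here is a minimal sketch of
`VarianceThreshold`-style column filtering in plain Python (not Spark; a real
implementation would be a Catalyst optimizer rule). The column names, the data,
and the `threshold` default are illustrative assumptions, not part of any
existing API:

```python
# Sketch: find zero/low-variance columns, then drop them -- mimicking the
# effect of an implicitly added Project node that excludes those columns.

def low_variance_columns(rows, threshold=0.0):
    """Return names of columns whose (population) variance is <= threshold."""
    n = len(rows)
    pruned = []
    for col in rows[0].keys():
        vals = [row[col] for row in rows]
        mean = sum(vals) / n
        var = sum((v - mean) ** 2 for v in vals) / n
        if var <= threshold:
            pruned.append(col)
    return pruned

def project_out(rows, drop):
    """Mimic a Project node that keeps only columns not in `drop`."""
    return [{c: v for c, v in row.items() if c not in drop} for row in rows]

# Illustrative training data: "f2" is constant, so it carries no signal.
data = [
    {"f1": 1.0, "f2": 0.0, "label": 1},
    {"f1": 2.0, "f2": 0.0, "label": 0},
    {"f1": 3.0, "f2": 0.0, "label": 1},
]

drop = low_variance_columns(data)      # finds the constant column "f2"
filtered = project_out(data, drop)     # feature selection now sees fewer columns
```

In Spark, the same effect would come from pushing the implicit `Project` down
toward `LogicalRelation`, so low-variance columns are never even read from the
data source.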
I will open pull requests as sub-tasks and record relevant activities (papers
and other OSS functionality) in this ticket to track them.
References:
[1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join
or Not to Join?: Thinking Twice about Joins before Feature Selection,
Proceedings of SIGMOD, 2016.
[2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to
avoid when learning high-capacity classifiers?, Proceedings of the VLDB
Endowment, Volume 11 Issue 3, Pages 366-379, 2017.
> Plan rewriting rules to filter meaningful training data before feature
> selection
> -------------------------------------------------------------------------------
>
> Key: HIVEMALL-181
> URL: https://issues.apache.org/jira/browse/HIVEMALL-181
> Project: Hivemall
> Issue Type: Improvement
> Reporter: Takeshi Yamamuro
> Assignee: Takeshi Yamamuro
> Priority: Major
> Labels: spark
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)