[ 
https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated HIVEMALL-181:
--------------------------------------
    Description: 
In machine learning and statistics, feature selection is a useful technique for 
choosing a subset of relevant data during model construction, both to simplify 
models and to shorten training times; for example, scikit-learn provides 
several APIs for feature selection 
([http://scikit-learn.org/stable/modules/feature_selection.html]). However, 
this selection becomes a time-consuming process when training data have a large 
number of columns and rows (for example, the number of columns frequently 
exceeds 1,000 in real business use cases).

The objective of this ticket is to implement plan rewriting rules in Spark 
Catalyst that filter training data down to meaningful columns before feature 
selection. We assume the following workflow from data extraction to model 
training:

!fig1.png!

In the example workflow above, one prepares the raw training data, R(v1, v2, 
v3, v4) in the figure, by joining and projecting input data (R1, R2, and R3) 
stored in various data sources (HDFS, S3, JDBC, ...); then, to choose a 
relevant subset (the red box) of the raw data, sampling and feature selection 
are applied to it. In real business use cases, it sometimes happens that raw 
training data contain many meaningless columns for historical reasons (e.g., 
redundant schema designs). So, if we could filter out these meaningless data 
during data extraction, we could process both the extraction itself and the 
subsequent feature selection more efficiently. In the example above, we 
actually need not join the relation R3, because all of its columns are filtered 
out by feature selection. Also, the join processing should be faster if we 
could sample data directly from the inputs (R1 and R2). The optimized workflow 
is as follows:

!fig2.png!

This optimization might be achieved by rewriting the plan tree for data 
extraction as follows:

!fig3.png!

Since Spark already has a pluggable optimizer interface 
(extendedOperatorOptimizationRules) and a framework to collect statistics for 
input data in data sources, the major task of this ticket is to add plan 
rewriting rules that filter training data down to meaningful columns before 
feature selection.
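To illustrate the idea outside of Spark, the sketch below models a tiny logical plan in plain Python and applies a hypothetical rewrite rule: the column set of a `Project` on top of a `Join` is pushed into the join, and any join input that contributes none of the requested columns is dropped, mirroring how R3 could be eliminated above. The class and function names here are illustrative, not actual Catalyst APIs, and dropping a join input like this is only safe for row-preserving key-foreign-key joins (see [1, 2] for when avoiding a join is safe):

```python
# Toy logical-plan nodes (illustrative only; not real Catalyst classes).
class Relation:
    def __init__(self, name, columns):
        self.name, self.columns = name, columns

class Join:
    def __init__(self, inputs):
        self.inputs = inputs  # list of Relation, joined on a common key

class Project:
    def __init__(self, columns, child):
        self.columns, self.child = columns, child

def prune_unused_join_inputs(plan):
    """Rewrite rule: push a Project's column set into its child Join and
    drop join inputs that contribute no requested columns. Only valid when
    the dropped input is a row-preserving key-foreign-key dimension."""
    if isinstance(plan, Project) and isinstance(plan.child, Join):
        needed = set(plan.columns)
        kept = [r for r in plan.child.inputs if needed & set(r.columns)]
        return Project(plan.columns, Join(kept))
    return plan

# R3's only payload column (v4) is pruned by feature selection,
# so R3 can leave the join entirely.
r1 = Relation("R1", ["k", "v1"])
r2 = Relation("R2", ["k", "v2", "v3"])
r3 = Relation("R3", ["k", "v4"])
plan = Project(["v1", "v2", "v3"], Join([r1, r2, r3]))

optimized = prune_unused_join_inputs(plan)
print([r.name for r in optimized.child.inputs])  # ['R1', 'R2']
```

A real Catalyst rule would additionally have to verify join-key constraints before eliminating an input; this sketch only shows the structural rewrite.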

As a fairly simple first task, Spark might have a rule that filters out columns 
with low variance (this corresponds to `VarianceThreshold` in scikit-learn) by 
implicitly adding a `Project` node on top of a user plan. Then, the Spark 
optimizer could push this `Project` node down into leaf nodes (e.g., 
`LogicalRelation`), and plan execution could become significantly faster. 
Moreover, more sophisticated techniques have been proposed in [1, 2, 3].
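As a minimal sketch of the variance-threshold idea, in plain Python rather than Spark and assuming per-column values are already collected, the hypothetical helper below identifies the columns such a rewrite rule would prune; the computation mirrors scikit-learn's `VarianceThreshold`, which removes features whose population variance does not exceed the threshold:

```python
def variance(values):
    # Population variance (divide by n), as used by VarianceThreshold.
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def low_variance_columns(table, threshold=0.0):
    """Return the names of columns whose variance is at or below
    `threshold`; these are the columns a rewrite rule could drop
    from the implicit Project before pushdown."""
    return [name for name, values in table.items()
            if variance(values) <= threshold]

# v3 is constant, so it carries no information for model training.
table = {
    "v1": [1.0, 2.0, 3.0, 4.0],
    "v2": [0.0, 1.0, 0.0, 1.0],
    "v3": [5.0, 5.0, 5.0, 5.0],
}
print(low_variance_columns(table))  # ['v3']
```

In the actual rule, these variances would come from the statistics that Spark collects for data sources, so no extra scan of the data would be needed.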

I will make pull requests as sub-tasks and record relevant activities 
(research and related OSS functionality) in this ticket to track them.

 

References:
[1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or 
Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings 
of SIGMOD, 2016.
[2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are Key-Foreign Key Joins Safe to 
Avoid when Learning High-Capacity Classifiers?, Proceedings of the VLDB 
Endowment, Volume 11, Issue 3, Pages 366-379, 2017.
[3] Zhuoyue Zhao, Robert Christensen, Feifei Li, Xiao Hu, and Ke Yi, Random 
Sampling over Joins Revisited, Proceedings of SIGMOD, 2018.


> Plan rewriting rules to filter meaningful training data before feature 
> selections
> ---------------------------------------------------------------------------------
>
>                 Key: HIVEMALL-181
>                 URL: https://issues.apache.org/jira/browse/HIVEMALL-181
>             Project: Hivemall
>          Issue Type: Improvement
>            Reporter: Takeshi Yamamuro
>            Assignee: Takeshi Yamamuro
>            Priority: Major
>              Labels: spark
>         Attachments: fig1.png, fig2.png, fig3.png
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)