[ 
https://issues.apache.org/jira/browse/HIVEMALL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated HIVEMALL-181:
--------------------------------------
    Attachment:     (was: fig3.png)

> Plan rewriting rules to filter meaningful training data before feature 
> selections
> ---------------------------------------------------------------------------------
>
>                 Key: HIVEMALL-181
>                 URL: https://issues.apache.org/jira/browse/HIVEMALL-181
>             Project: Hivemall
>          Issue Type: Improvement
>            Reporter: Takeshi Yamamuro
>            Assignee: Takeshi Yamamuro
>            Priority: Major
>              Labels: spark
>         Attachments: fig1.png
>
>
> In machine learning and statistics, feature selection is a useful technique 
> for choosing a subset of relevant data in model construction, both to 
> simplify models and to shorten training times; for example, scikit-learn has 
> some APIs for feature selection 
> ([http://scikit-learn.org/stable/modules/feature_selection.html]). But this 
> selection is a time-consuming process if the training data have a large 
> number of columns and rows (for example, the number of columns frequently 
> goes over 1,000 in real business use cases).
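As a concrete illustration of the kind of selection meant above, here is a minimal pure-Python sketch of variance-based feature selection, analogous to scikit-learn's `VarianceThreshold`; the column names and toy data are made up for illustration:

```python
# Variance-based feature selection: columns whose variance does not exceed
# a threshold carry (almost) no information and are dropped.

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def select_features(rows, columns, threshold=0.0):
    """Keep only the columns whose variance exceeds the threshold."""
    kept = []
    for i, name in enumerate(columns):
        col = [row[i] for row in rows]
        if variance(col) > threshold:
            kept.append(name)
    return kept

# Toy training data: v3 is constant, so it carries no information.
columns = ["v1", "v2", "v3"]
rows = [(1.0, 0.0, 5.0),
        (2.0, 1.0, 5.0),
        (3.0, 0.0, 5.0)]

print(select_features(rows, columns))  # v3 is filtered out
```

With many columns, every column must be scanned to compute its variance, which is why doing this after a wide multi-way join is expensive.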
> The objective of this ticket is to implement plan rewriting rules in Spark 
> Catalyst that filter meaningful training data before feature selection. We 
> assume the workflow below, from data extraction to model training:
> !fig1.png!
> In the example workflow above, one prepares raw training data, R(v1, v2, v3, 
> v4) in the figure, by joining and projecting input data (R1, R2, and R3) from 
> various data sources (HDFS, S3, JDBC, ...); then, to choose a relevant subset 
> (the red box) of the raw data, sampling and feature selection are applied to it.
> In real business use cases, it sometimes happens that raw training data have 
> many meaningless columns for historical reasons (e.g., redundant schema 
> designs). So, if we could filter out these meaningless data in the data 
> extraction phase, we could process the extraction itself and the subsequent 
> feature selection more efficiently. In the example above, we actually need 
> not join the relation R3, because all the columns in that relation are 
> filtered out by feature selection. Also, the join processing would be faster 
> if we could sample data directly from the input data (R1 and R2). The 
> optimized workflow is as follows:
> !fig2.png!
> This optimization might be achieved by rewriting the plan tree for data 
> extraction as follows:
> !fig3.png!
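The join-skipping idea can be sketched in a few lines; the relation names (R1, R2, R3) follow the figure, while the schemas and the join key `k` are made-up assumptions for illustration:

```python
# Toy sketch: if feature selection keeps no (non-key) columns from a
# relation, that relation contributes nothing and need not be joined.

relations = {
    "R1": ["k", "v1"],
    "R2": ["k", "v2"],
    "R3": ["k", "v3", "v4"],
}

def prune_joins(relations, selected_columns, join_key="k"):
    """Drop relations that contribute no selected (non-key) columns."""
    kept = {}
    for name, cols in relations.items():
        payload = [c for c in cols if c != join_key]
        if any(c in selected_columns for c in payload):
            kept[name] = cols
    return kept

# Suppose feature selection decided only v1 and v2 are relevant:
print(prune_joins(relations, {"v1", "v2"}))  # R3 is no longer joined
```

In the real rule this decision would be made from column statistics at planning time, not after the join has already been executed.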
> Since Spark already has a pluggable optimizer interface 
> (extendedOperatorOptimizationRules) and a framework for collecting data 
> statistics on input data in data sources, the major task of this ticket is 
> to add plan rewriting rules that filter meaningful training data before 
> feature selection.
> As a pretty simple first task, Spark might have a rule that filters out 
> columns with low variances (this corresponds to `VarianceThreshold` in 
> scikit-learn) by implicitly adding a `Project` node on top of a user plan. 
> Then, the Spark optimizer might push this `Project` node down into the leaf 
> nodes (e.g., `LogicalRelation`), and the plan execution could be 
> significantly faster. Moreover, more sophisticated techniques have been 
> proposed in [1, 2].
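To show what the pushdown buys, here is a toy model of the rule described above: an implicit Project listing only the wanted columns is pushed down into the leaf scans so each source reads fewer columns. The plan representation (nested dicts) and the join key `k` are made up for illustration and are not Catalyst's API:

```python
# Toy logical plan nodes: Scan leaves and binary Joins.
def scan(name, columns):
    return {"op": "Scan", "name": name, "columns": columns}

def join(left, right):
    return {"op": "Join", "left": left, "right": right}

def push_down_project(plan, wanted, join_key="k"):
    """Push a column-pruning Project into the leaves of a toy plan tree."""
    if plan["op"] == "Scan":
        kept = [c for c in plan["columns"] if c in wanted or c == join_key]
        return scan(plan["name"], kept)
    return join(push_down_project(plan["left"], wanted, join_key),
                push_down_project(plan["right"], wanted, join_key))

user_plan = join(scan("R1", ["k", "v1"]), scan("R2", ["k", "v2", "v3"]))
optimized = push_down_project(user_plan, wanted={"v1", "v2"})
print(optimized["right"]["columns"])  # v3 is pruned at the scan
```

In Spark itself, the equivalent rule would be a `Rule[LogicalPlan]` registered through the extended optimizer interface, and the existing column-pruning machinery would carry the `Project` down to `LogicalRelation` leaves.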
> I will make pull requests as sub-tasks and track relevant activities (papers 
> and other OSS functionalities) in this ticket.
>  
> References:
>  [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join 
> or Not to Join?: Thinking Twice about Joins before Feature Selection, 
> Proceedings of SIGMOD, 2016.
>  [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe 
> to avoid when learning high-capacity classifiers?, Proceedings of the VLDB 
> Endowment, Volume 11 Issue 3, Pages 366-379, 2017.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
