Takeshi Yamamuro updated HIVEMALL-181:
--------------------------------------
    Attachment: fig3.png
                fig2.png

> Plan rewriting rules to filter meaningful training data before feature selections
> ---------------------------------------------------------------------------------
>
>                 Key: HIVEMALL-181
>                 URL: https://issues.apache.org/jira/browse/HIVEMALL-181
>             Project: Hivemall
>          Issue Type: Improvement
>            Reporter: Takeshi Yamamuro
>            Assignee: Takeshi Yamamuro
>            Priority: Major
>              Labels: spark
>         Attachments: fig1.png, fig2.png, fig3.png
>
> In machine learning and statistics, feature selection is a useful technique for choosing a subset of relevant data in model construction, both to simplify models and to shorten training times; for example, scikit-learn provides several APIs for feature selection ([http://scikit-learn.org/stable/modules/feature_selection.html]). However, this selection becomes a very time-consuming process when the training data have a large number of columns and rows (for example, the number of columns frequently exceeds 1,000 in real business use cases).
>
> The objective of this ticket is to implement plan rewriting rules in Spark Catalyst to filter meaningful training data before feature selection. We assume the workflow below, from data extraction to model training:
>
> !fig1.png!
>
> In the example workflow above, one prepares raw training data, R(v1, v2, v3, v4) in the figure, by joining and projecting input data (R1, R2, and R3) from various data sources (HDFS, S3, JDBC, ...); then, to choose a relevant subset (the red box) of the raw data, sampling and feature selection are applied to them. In real business use cases, it sometimes happens that raw training data have many meaningless columns for historical reasons (e.g., redundant schema designs). So, if we could filter out these meaningless data in the data-extraction phase, we could process both the data extraction itself and the following feature selection more efficiently.
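The observation that entire source relations can drop out of the extraction join can be sketched in plain Python; the relation names (R1, R2, R3), column names, and the mapping below are toy stand-ins for the figure, not part of any Spark API:

```python
# Sketch: given which columns survive feature selection, decide which
# source relations still need to participate in the extraction join.
# column_owners maps each output column to the toy relation it comes from.

def relations_to_join(column_owners, surviving_columns):
    """Return the set of relations contributing at least one surviving column."""
    return {column_owners[col] for col in surviving_columns}

column_owners = {"v1": "R1", "v2": "R1", "v3": "R2", "v4": "R3"}

# If feature selection keeps only v1 and v3, R3 contributes nothing,
# so the join with R3 can be removed from the extraction plan.
needed = relations_to_join(column_owners, ["v1", "v3"])
print(sorted(needed))  # ['R1', 'R2']
```

This mirrors the reasoning in the workflow above: a rewrite rule that knows the surviving columns can prune the join with R3 entirely.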
> In the example above, we actually do not need to join the relation R3, because all the columns in that relation are filtered out in feature selection. Also, the join processing would be faster if we could sample data directly from the input data (R1 and R2). This optimized workflow is as follows:
>
> !fig2.png!
>
> This optimization could be achieved by rewriting a plan tree for data extraction as follows:
>
> !fig3.png!
>
> Since Spark already has a pluggable optimizer interface (extendedOperatorOptimizationRules) and a framework for collecting data statistics on input data in data sources, the major tasks of this ticket are to add plan rewriting rules that filter meaningful training data before feature selection.
>
> As a pretty simple first task, Spark could have a rule that filters out columns with low variances (this process corresponds to `VarianceThreshold` in scikit-learn) by implicitly adding a `Project` node at the top of a user plan. The Spark optimizer could then push this `Project` node down into leaf nodes (e.g., `LogicalRelation`), and plan execution could become significantly faster. Moreover, more sophisticated techniques have been proposed in [1, 2].
>
> I will make pull requests as sub-tasks and track relevant activities (papers and other OSS functionalities) in this ticket.
>
> References:
> [1] Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu, To Join or Not to Join?: Thinking Twice about Joins before Feature Selection, Proceedings of SIGMOD, 2016.
> [2] Vraj Shah, Arun Kumar, and Xiaojin Zhu, Are key-foreign key joins safe to avoid when learning high-capacity classifiers?, Proceedings of the VLDB Endowment, Volume 11 Issue 3, Pages 366-379, 2017.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
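The low-variance filter the ticket proposes (the analogue of scikit-learn's `VarianceThreshold`) can be sketched in pure Python; the column data and threshold value below are illustrative, and no Spark machinery is involved:

```python
# Sketch of the low-variance column filter described in the ticket,
# analogous to scikit-learn's VarianceThreshold. The surviving column
# names are what the implicitly added `Project` node would retain.

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def low_variance_filter(columns, threshold=0.0):
    """Keep only columns whose variance exceeds the threshold.

    `columns` maps a column name to its list of values.
    """
    return [name for name, values in columns.items()
            if variance(values) > threshold]

columns = {
    "v1": [1.0, 2.0, 3.0, 4.0],  # informative column, variance 1.25
    "v2": [5.0, 5.0, 5.0, 5.0],  # constant column, variance 0.0
    "v3": [0.9, 1.1, 0.9, 1.1],  # nearly constant, variance 0.01
}

print(low_variance_filter(columns))                 # ['v1', 'v3']
print(low_variance_filter(columns, threshold=0.05)) # ['v1']
```

In the envisioned rewrite rule, the surviving names would populate a `Project` node that the optimizer then pushes down toward the leaf `LogicalRelation` nodes, so low-variance columns are never read from the data source at all.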