[ 
https://issues.apache.org/jira/browse/DRILL-6118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346883#comment-16346883
 ] 

ASF GitHub Bot commented on DRILL-6118:
---------------------------------------

GitHub user arina-ielchiieva opened a pull request:

    https://github.com/apache/drill/pull/1104

    DRILL-6118: Handle item star columns during project / filter push down 
and directory pruning
    
    1. Added DrillFilterItemStarReWriterRule to re-write item star fields to 
regular field references.
    2. Refactored DrillPushProjectIntoScanRule to handle item star fields, 
factored out helper classes and methods from PrelUtil.class.
    3. Fixed issue with dynamic star usage (after Calcite upgrade old usage of 
star was still present, replaced WILDCARD -> DYNAMIC_STAR for clarity).
    4. Added unit tests to check project / filter push down and directory 
pruning with item star.
    
    Details in [DRILL-6118](https://issues.apache.org/jira/browse/DRILL-6118).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/arina-ielchiieva/drill DRILL-6118

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/1104.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1104
    
----
commit 4673bfb593ca6422d58fa9e0e6eb281a69f1ed69
Author: Arina Ielchiieva <arina.yelchiyeva@...>
Date:   2017-12-21T17:31:00Z

    DRILL-6118: Handle item star columns during project / filter push down and 
directory pruning
    
    1. Added DrillFilterItemStarReWriterRule to re-write item star fields to 
regular field references.
    2. Refactored DrillPushProjectIntoScanRule to handle item star fields, 
factored out helper classes and methods from PrelUtil.class.
    3. Fixed issue with dynamic star usage (after Calcite upgrade old usage of 
star was still present, replaced WILDCARD -> DYNAMIC_STAR for clarity).
    4. Added unit tests to check project / filter push down and directory 
pruning with item star.

----


> Handle item star columns during project / filter push down and directory 
> pruning
> ----------------------------------------------------------------------------------
>
>                 Key: DRILL-6118
>                 URL: https://issues.apache.org/jira/browse/DRILL-6118
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.12.0
>            Reporter: Arina Ielchiieva
>            Assignee: Arina Ielchiieva
>            Priority: Major
>              Labels: doc-impacting
>             Fix For: 1.13.0
>
>
> Project push down, filter push down and partition pruning do not work with a 
> dynamically expanded column which is represented as a star in the ITEM 
> operator: _ITEM($0, 'column_name')_ where $0 is a star.
>  This often occurs when a view, sub-select or CTE with a star is issued.
>  To solve this issue we can create {{DrillFilterItemStarReWriterRule}} which 
> will rewrite such ITEM operators before filter push down and directory 
> pruning. Project into scan push down logic will be handled separately in the 
> already existing rule {{DrillPushProjectIntoScanRule}}. Basically, we can 
> consider the following queries the same: 
>  {{select col1 from t}}
>  {{select col1 from (select * from t)}}
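> The rewrite described above can be illustrated with a toy expression model 
> (a hypothetical sketch only; Drill's actual rule operates on Calcite RexNode 
> trees, and all class and method names below are invented for the example):

```java
// Illustrative sketch: rewrite ITEM(star, 'name') into a plain field
// reference, mirroring what DrillFilterItemStarReWriterRule does on
// Calcite expression trees. This mini expression model is hypothetical.
import java.util.Objects;

public class ItemStarRewriteSketch {
    interface Expr {}
    // The dynamic star ($0 expanded from `*`).
    static final class StarRef implements Expr {}
    // A named column reference, e.g. `o_orderdate`.
    static final class FieldRef implements Expr {
        final String name;
        FieldRef(String name) { this.name = name; }
        @Override public boolean equals(Object o) {
            return o instanceof FieldRef && ((FieldRef) o).name.equals(name);
        }
        @Override public int hashCode() { return Objects.hash(name); }
    }
    // ITEM(expr, 'name'), e.g. ITEM($0, 'o_orderdate').
    static final class Item implements Expr {
        final Expr input; final String name;
        Item(Expr input, String name) { this.input = input; this.name = name; }
    }

    // Rewrite ITEM(star, 'name') -> FieldRef(name); leave other exprs alone.
    static Expr rewrite(Expr e) {
        if (e instanceof Item) {
            Item item = (Item) e;
            Expr input = rewrite(item.input);
            if (input instanceof StarRef) {
                return new FieldRef(item.name);  // the key transformation
            }
            return new Item(input, item.name);
        }
        return e;
    }

    public static void main(String[] args) {
        Expr before = new Item(new StarRef(), "o_orderdate");
        Expr after = rewrite(before);
        // ITEM($0, 'o_orderdate') becomes the field reference `o_orderdate`,
        // which filter push down and pruning rules can then recognize.
        System.out.println(after.equals(new FieldRef("o_orderdate")));  // true
    }
}
```

> Once the ITEM call is reduced to an ordinary field reference, the existing 
> push down and pruning rules treat both query shapes identically.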
> *Use cases*
> Since item star columns were not considered during project / filter push 
> down and directory pruning, push down and pruning did not happen. This was 
> causing Drill to read all columns from a file (when only several are needed) 
> or to read all files instead of a subset. Views with star queries are the 
> most common example. Such behavior significantly degrades performance for 
> item star queries compared to queries without item star.
> *EXAMPLES*
> *Data set* 
> The following statements create a table with three files, each in a 
> dedicated sub-folder:
> {noformat}
> use dfs.tmp;
> create table `order_ctas/t1` as select cast(o_orderdate as date) as 
> o_orderdate from cp.`tpch/orders.parquet` where o_orderdate between date 
> '1992-01-01' and date '1992-01-03';
> create table `order_ctas/t2` as select cast(o_orderdate as date) as 
> o_orderdate from cp.`tpch/orders.parquet` where o_orderdate between date 
> '1992-01-04' and date '1992-01-06';
> create table `order_ctas/t3` as select cast(o_orderdate as date) as 
> o_orderdate from cp.`tpch/orders.parquet` where o_orderdate between date 
> '1992-01-07' and date '1992-01-09';
> {noformat}
> *Filter push down*
> {{select * from order_ctas where o_orderdate = date '1992-01-01'}} will read 
> only one file
> {noformat}
> 00-00    Screen
> 00-01      Project(**=[$0])
> 00-02        Project(T1¦¦**=[$0])
> 00-03          SelectionVectorRemover
> 00-04            Filter(condition=[=($1, 1992-01-01)])
> 00-05              Project(T1¦¦**=[$0], o_orderdate=[$1])
> 00-06                Scan(groupscan=[ParquetGroupScan 
> [entries=[ReadEntryWithPath [path=/tmp/order_ctas/t1/0_0_0.parquet]], 
> selectionRoot=/tmp/order_ctas, numFiles=1, numRowGroups=1, 
> usedMetadataFile=false, columns=[`**`]]])
> {noformat}
> {{select * from (select * from order_ctas) where o_orderdate = date 
> '1992-01-01'}} will read all three files
> {noformat}
> 00-00    Screen
> 00-01      Project(**=[$0])
> 00-02        SelectionVectorRemover
> 00-03          Filter(condition=[=(ITEM($0, 'o_orderdate'), 1992-01-01)])
> 00-04            Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
> [path=/tmp/order_ctas/t1/0_0_0.parquet], ReadEntryWithPath 
> [path=/tmp/order_ctas/t2/0_0_0.parquet], ReadEntryWithPath 
> [path=/tmp/order_ctas/t3/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, 
> numFiles=3, numRowGroups=3, usedMetadataFile=false, columns=[`**`]]])
> {noformat}
> *Directory pruning*
> {{select * from order_ctas where dir0 = 't1'}} will read data only from one 
> folder
> {noformat}
> 00-00    Screen
> 00-01      Project(**=[$0])
> 00-02        Project(**=[$0])
> 00-03          Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
> [path=/tmp/order_ctas/t1/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, 
> numFiles=1, numRowGroups=1, usedMetadataFile=false, columns=[`**`]]])
> {noformat}
> {{select * from (select * from order_ctas) where dir0 = 't1'}} will read the 
> contents of all three folders
> {noformat}
> 00-00    Screen
> 00-01      Project(**=[$0])
> 00-02        SelectionVectorRemover
> 00-03          Filter(condition=[=(ITEM($0, 'dir0'), 't1')])
> 00-04            Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
> [path=/tmp/order_ctas/t1/0_0_0.parquet], ReadEntryWithPath 
> [path=/tmp/order_ctas/t2/0_0_0.parquet], ReadEntryWithPath 
> [path=/tmp/order_ctas/t3/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, 
> numFiles=3, numRowGroups=3, usedMetadataFile=false, columns=[`**`]]])
> {noformat}
> *Project into Scan push down*
> {{select o_orderdate, count(1) from order_ctas group by o_orderdate}} will 
> read only one column from the files
> {noformat}
> 00-00    Screen
> 00-01      Project(o_orderdate=[$0], EXPR$1=[$1])
> 00-02        HashAgg(group=[{0}], EXPR$1=[COUNT()])
> 00-03          Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
> [path=/tmp/order_ctas/t1/0_0_0.parquet], ReadEntryWithPath 
> [path=/tmp/order_ctas/t2/0_0_0.parquet], ReadEntryWithPath 
> [path=/tmp/order_ctas/t3/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, 
> numFiles=3, numRowGroups=3, usedMetadataFile=false, columns=[`o_orderdate`]]])
> {noformat}
> {{select o_orderdate, count(1) from (select * from order_ctas) group by 
> o_orderdate}} will read all columns from the files
> {noformat}
> 00-00    Screen
> 00-01      Project(col_vrchr=[$0], EXPR$1=[$1])
> 00-02        StreamAgg(group=[{0}], EXPR$1=[COUNT()])
> 00-03          Sort(sort0=[$0], dir0=[ASC])
> 00-04            Project(col_vrchr=[ITEM($0, 'o_orderdate')])
> 00-05              Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath 
> [path=/tmp/order_ctas/t1/0_0_0.parquet], ReadEntryWithPath 
> [path=/tmp/order_ctas/t2/0_0_0.parquet], ReadEntryWithPath 
> [path=/tmp/order_ctas/t3/0_0_0.parquet]], selectionRoot=/tmp/order_ctas, 
> numFiles=3, numRowGroups=3, usedMetadataFile=false, columns=[`**`]]])
> {noformat}
> This Jira aims to fix all three cases described above in order to improve 
> performance for queries with item star columns.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
