[jira] [Commented] (DRILL-4387) Improve execution side when it handles skipAll query

Jinfeng Ni (JIRA) Wed, 17 Feb 2016 09:10:03 -0800

    [ https://issues.apache.org/jira/browse/DRILL-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15150792#comment-15150792 ]


Jinfeng Ni commented on DRILL-4387:
-----------------------------------

Couple of comments I would like to make:

1. If a physical plan comes from the SQL query planner, then after DRILL-4279 the 
column list should be empty, instead of NULL, for a skipAll query. The empty 
column list will go through GroupScan and ScanBatchCreator. It is in the 
RecordReader that the different ways of handling a skipAll query are applied.

2. If a physical plan does not come from the query planner, it is possible that the 
"columns" section is missing, which leaves that field NULL. This mainly 
comes from the old hand-written physical plans used in many unit tests from a long 
time ago. When the column list is NULL, Drill still uses the "no words means all 
columns" policy, to ensure compatibility with those old physical plans.
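
To make the two cases above concrete, here is a minimal sketch of the distinction in Java. The class, enum, and method names (ColumnListPolicy, ScanMode, resolve) are hypothetical and chosen only for illustration; they are not Drill's actual classes or the way the RecordReaders implement this.

```java
import java.util.List;

// Hypothetical illustration only: these names are NOT part of Drill's codebase.
final class ColumnListPolicy {

    /** How a reader would treat the column list it receives. */
    enum ScanMode { ALL_COLUMNS, SKIP_ALL, PROJECTED }

    /**
     * null  : "columns" section missing (old hand-written physical plans),
     *         so fall back to "no words means all columns".
     * empty : skipAll query coming from the planner, so read no regular columns.
     * other : read only the listed columns.
     */
    static ScanMode resolve(List<String> columns) {
        if (columns == null) {
            return ScanMode.ALL_COLUMNS;
        }
        if (columns.isEmpty()) {
            return ScanMode.SKIP_ALL;
        }
        return ScanMode.PROJECTED;
    }

    private ColumnListPolicy() {
        // Static helper; not meant to be instantiated.
    }
}
```

Under these assumptions, resolve(null) yields ALL_COLUMNS while resolve(Collections.emptyList()) yields SKIP_ALL, which is the difference that gets lost if an earlier stage rewrites an empty list into the star column.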





> Improve execution side when it handles skipAll query
> ----------------------------------------------------
>
>                 Key: DRILL-4387
>                 URL: https://issues.apache.org/jira/browse/DRILL-4387
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>             Fix For: 1.6.0
>
>
> DRILL-4279 changes the planner side and the RecordReader in the execution 
> side when they handle a skipAll query. However, it seems there are other 
> places in the codebase that do not handle a skipAll query efficiently. In 
> particular, in GroupScan or ScanBatchCreator, we will replace a NULL or empty 
> column list with the star column. This essentially forces the execution side 
> (RecordReader) to fetch all the columns from the data source. Such behavior will 
> lead to a big performance overhead for the SCAN operator.
> To improve Drill's performance, we should change those places as well, as 
> follow-up work after DRILL-4279.
> One simple example of this problem is:
> {code}
>    SELECT DISTINCT substring(dir1, 5) from  dfs.`/Path/To/ParquetTable`;  
> {code}
> The query does not require any regular column from the parquet file. However, 
> ParquetRowGroupScan and ParquetScanBatchCreator will put the star column as the 
> column list. If the table has dozens or hundreds of columns, this will make the 
> SCAN operator much more expensive than necessary. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
