[ https://issues.apache.org/jira/browse/DRILL-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15342463#comment-15342463 ]

Khurram Faraaz commented on DRILL-4387:
---------------------------------------

The queries below return wrong results. (The problem seems to have been 
present for quite some time.)

{noformat}
Directory structure is

[root@centos-01 DRILL_4589]# ls
1990  1992  1994  1996  1998  2000  2002  2004  2006  2008  2010  2012  2014
1991  1993  1995  1997  1999  2001  2003  2005  2007  2009  2011  2013  2015
[root@centos-01 DRILL_4589]# cd 1990
[root@centos-01 1990]# ls
Q1  Q2  Q3  Q4
and so on...

The two queries below return 0. I don't think the results are correct; please review.

0: jdbc:drill:schema=dfs.tmp> select count(dir0) from `DRILL_4589`;
+---------+
| EXPR$0  |
+---------+
| 0       |
+---------+
1 row selected (9.117 seconds)
0: jdbc:drill:schema=dfs.tmp> select count(dir1) from `DRILL_4589`;
+---------+
| EXPR$0  |
+---------+
| 0       |
+---------+
1 row selected (8.97 seconds)

0: jdbc:drill:schema=dfs.tmp> explain plan for select count(dir0) from `DRILL_4589`;
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      Project(EXPR$0=[$0])
00-02        Project(EXPR$0=[$0])
00-03          Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@5275c59a[columns = null, isStarQuery = false, isSkipQuery = false]])


0: jdbc:drill:schema=dfs.tmp> explain plan for select count(dir1) from `DRILL_4589`;
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      Project(EXPR$0=[$0])
00-02        Project(EXPR$0=[$0])
00-03          Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@337121ac[columns = null, isStarQuery = false, isSkipQuery = false]])
{noformat}

> Improve execution side when it handles skipAll query
> ----------------------------------------------------
>
>                 Key: DRILL-4387
>                 URL: https://issues.apache.org/jira/browse/DRILL-4387
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>             Fix For: 1.6.0
>
>
> DRILL-4279 changes how the planner side and the RecordReader on the execution 
> side handle skipAll queries. However, there seem to be other places in the 
> codebase that do not handle skipAll queries efficiently. In particular, 
> GroupScan or ScanBatchCreator replaces a NULL or empty column list with the 
> star column. This essentially forces the execution side (RecordReader) to 
> fetch all the columns from the data source. Such behavior leads to 
> significant performance overhead for the SCAN operator.
> To improve Drill's performance, we should change those places as well, as 
> follow-up work after DRILL-4279.
> One simple example of this problem is:
> One simple example of this problem is:
> {code}
>    SELECT DISTINCT substring(dir1, 5) from  dfs.`/Path/To/ParquetTable`;  
> {code}
> The query does not require any regular column from the Parquet file. However, 
> ParquetRowGroupScan and ParquetScanBatchCreator will use the star column as 
> the column list. If the table has dozens or hundreds of columns, this makes 
> the SCAN operator much more expensive than necessary. 
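
The cost difference described in the issue can be illustrated with a minimal, self-contained Java sketch. This is not Drill code: the class `SkipAllSketch`, its methods, and the four-column table are hypothetical, and the "preserve skipAll" variant assumes that only an empty list means "no regular columns needed" while NULL still means star.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; none of these names come from the Drill codebase.
class SkipAllSketch {
    static final String STAR = "*";

    // Assumed table with four regular columns.
    static final List<String> TABLE_COLUMNS = Arrays.asList("c1", "c2", "c3", "c4");

    // Behavior described in the issue: a NULL or empty column list becomes star.
    static List<String> withStarFallback(List<String> requested) {
        if (requested == null || requested.isEmpty()) {
            return Collections.singletonList(STAR);
        }
        return requested;
    }

    // Assumed alternative: only NULL becomes star; an empty list stays empty (skipAll).
    static List<String> preservingSkipAll(List<String> requested) {
        if (requested == null) {
            return Collections.singletonList(STAR);
        }
        return requested;
    }

    // Number of regular columns a reader would have to fetch for a projection.
    static int columnsRead(List<String> projected) {
        return projected.contains(STAR) ? TABLE_COLUMNS.size() : projected.size();
    }

    public static void main(String[] args) {
        // A query that needs no regular columns, e.g. one that only uses dir1.
        List<String> skipAll = Collections.emptyList();
        System.out.println("star fallback reads "
                + columnsRead(withStarFallback(skipAll)) + " columns");
        System.out.println("skipAll preserved reads "
                + columnsRead(preservingSkipAll(skipAll)) + " columns");
    }
}
```

With the star fallback, the sketch's reader fetches every column of the assumed table even though the query needs none of them; keeping the empty list avoids that work.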



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
