[ https://issues.apache.org/jira/browse/DRILL-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15342463#comment-15342463 ]
Khurram Faraaz commented on DRILL-4387:
---------------------------------------

The queries below return wrong results. (The problem seems to have been there for quite some time.)

{noformat}
Directory structure is

[root@centos-01 DRILL_4589]# ls
1990  1992  1994  1996  1998  2000  2002  2004  2006  2008  2010  2012  2014
1991  1993  1995  1997  1999  2001  2003  2005  2007  2009  2011  2013  2015
[root@centos-01 DRILL_4589]# cd 1990
[root@centos-01 1990]# ls
Q1  Q2  Q3  Q4

and so on...

Below two queries return 0, I don't think the results are correct, please review

0: jdbc:drill:schema=dfs.tmp> select count(dir0) from `DRILL_4589`;
+---------+
| EXPR$0  |
+---------+
| 0       |
+---------+
1 row selected (9.117 seconds)

0: jdbc:drill:schema=dfs.tmp> select count(dir1) from `DRILL_4589`;
+---------+
| EXPR$0  |
+---------+
| 0       |
+---------+
1 row selected (8.97 seconds)

0: jdbc:drill:schema=dfs.tmp> explain plan for select count(dir0) from `DRILL_4589`;
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      Project(EXPR$0=[$0])
00-02        Project(EXPR$0=[$0])
00-03          Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@5275c59a[columns = null, isStarQuery = false, isSkipQuery = false]])

0: jdbc:drill:schema=dfs.tmp> explain plan for select count(dir1) from `DRILL_4589`;
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      Project(EXPR$0=[$0])
00-02        Project(EXPR$0=[$0])
00-03          Scan(groupscan=[org.apache.drill.exec.store.pojo.PojoRecordReader@337121ac[columns = null, isStarQuery = false, isSkipQuery = false]])
{noformat}

> Improve execution side when it handles skipAll query
> ----------------------------------------------------
>
>                 Key: DRILL-4387
>                 URL: https://issues.apache.org/jira/browse/DRILL-4387
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>             Fix For: 1.6.0
>
>
> DRILL-4279 changes the planner side and the RecordReader in the execution
> side when they handle a skipAll query. However, it seems there are other
> places in the codebase that do not handle a skipAll query efficiently. In
> particular, in GroupScan or ScanBatchCreator, we replace a NULL or empty
> column list with the star column. This essentially forces the execution side
> (RecordReader) to fetch all the columns from the data source. Such behavior
> leads to a big performance overhead for the SCAN operator.
> To improve Drill's performance, we should change those places as well, as
> follow-up work after DRILL-4279.
> One simple example of this problem is:
> {code}
> SELECT DISTINCT substring(dir1, 5) from dfs.`/Path/To/ParquetTable`;
> {code}
> The query does not require any regular column from the parquet file. However,
> ParquetRowGroupScan and ParquetScanBatchCreator will put the star column as
> the column list. If the table has dozens or hundreds of columns, this will
> make the SCAN operator much more expensive than necessary.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
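For readers who want to see the cost described in the issue in concrete terms, here is a minimal, self-contained Java sketch. It is not Drill code: the class ColumnListSketch and the methods widenToStar, keepSkipAll, and columnsToRead are hypothetical names made up for this illustration, and treating a null list the same as an empty one is an assumption taken from the issue's wording ("a NULL or empty column list"). It only contrasts widening an empty projection to the star column with keeping it empty.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Apache Drill.
final class ColumnListSketch {
    static final String STAR = "*";

    // Behavior the issue describes: a null or empty projection is replaced
    // with the star column, so the reader ends up fetching every column.
    static List<String> widenToStar(List<String> projected) {
        if (projected == null || projected.isEmpty()) {
            return Collections.singletonList(STAR);
        }
        return projected;
    }

    // Alternative the issue argues for: an empty projection stays empty,
    // meaning "skip all regular columns". (Null handling is an assumption.)
    static List<String> keepSkipAll(List<String> projected) {
        if (projected == null) {
            return Collections.emptyList();
        }
        return projected;
    }

    // Columns a reader would have to materialize for a given projection.
    static List<String> columnsToRead(List<String> projected, List<String> tableColumns) {
        if (projected.contains(STAR)) {
            return tableColumns;
        }
        List<String> result = new ArrayList<>();
        for (String column : tableColumns) {
            if (projected.contains(column)) {
                result.add(column);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> table = Arrays.asList("a", "b", "c");
        List<String> noRegularColumns = Collections.emptyList();

        // Widened to star: every table column is read.
        System.out.println(columnsToRead(widenToStar(noRegularColumns), table)); // [a, b, c]
        // Kept as skip-all: no regular column is read.
        System.out.println(columnsToRead(keepSkipAll(noRegularColumns), table)); // []
    }
}
```

With a table of dozens or hundreds of columns, the first call is the expensive path the issue wants to avoid for queries such as the substring(dir1, 5) example, which need only directory metadata.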