[jira] [Commented] (DRILL-4279) Improve query plan when no column is required from SCAN

Jinfeng Ni (JIRA) Thu, 28 Jan 2016 12:10:13 -0800

    [ https://issues.apache.org/jira/browse/DRILL-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122243#comment-15122243 ]


Jinfeng Ni commented on DRILL-4279:
-----------------------------------

I revised the patch for DRILL-4279, based on Jacques's comment. See the revised 
patch in PR: https://github.com/apache/drill/pull/342/files

For a skipAll query, the planner now always puts "columns = []" in the 
scan section. When columns = [] appears in the plan, it tells the execution 
layer that this is a skipAll scan.  At execution time, each reader is free to 
decide how to handle the skipAll scan.  By default, a reader uses "*", which 
is the same as today's behavior.  The behavior for Text / JSON / Parquet is 
overridden so that these readers perform better than they do today. 
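
To illustrate the idea, here is a minimal sketch of how a reader might 
interpret an empty column list. All class and method names below are 
hypothetical and made up for this example; they are not the classes touched 
by the patch.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: names are illustrative, not Drill's actual classes.
final class SkipAllSketch {

    /** An empty projected column list in the plan marks a skipAll scan. */
    static boolean isSkipAll(List<String> plannedColumns) {
        return plannedColumns.isEmpty();
    }

    /** Default reader: falls back to "*" for a skipAll scan (today's behavior). */
    static class DefaultReader {
        List<String> columnsToRead(List<String> plannedColumns) {
            if (isSkipAll(plannedColumns)) {
                return Collections.singletonList("*");
            }
            return plannedColumns;
        }
    }

    /** Reader that overrides the default and materializes no columns at all. */
    static class SkipAwareReader extends DefaultReader {
        @Override
        List<String> columnsToRead(List<String> plannedColumns) {
            if (isSkipAll(plannedColumns)) {
                return Collections.emptyList();
            }
            return plannedColumns;
        }
    }
}
```

With this shape, a reader that has not been taught about skipAll keeps 
working unchanged, and each format can opt in to the cheaper path on its own.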

For the other readers, I'm thinking of opening a separate JIRA to make them 
support skipAll mode.  [~jnadeau], does this sound reasonable to you?



> Improve query plan when no column is required from SCAN
> -------------------------------------------------------
>
>                 Key: DRILL-4279
>                 URL: https://issues.apache.org/jira/browse/DRILL-4279
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>
> When a query does not require any specific column to be returned from SCAN, 
> for instance,
> {code}
> Q1:  select count(*) from T1;
> Q2:  select 1 + 100 from T1;
> Q3:  select  1.0 + random() from T1; 
> {code}
> Drill's planner would use a ColumnList with the * column, plus a SKIP_ALL 
> mode. However, the MODE is not serialized / deserialized. This leads to two 
> problems.
> 1) The EXPLAIN plan is confusing, since there is no way to distinguish a 
> "SELECT *" query from this SKIP_ALL mode. 
> For instance, 
> {code}
> explain plan for select count(*) from dfs.`/Users/jni/work/data/yelp/t1`;
> 00-03          Project($f0=[0])
> 00-04            Scan(groupscan=[EasyGroupScan 
> [selectionRoot=file:/Users/jni/work/data/yelp/t1, numFiles=2, columns=[`*`], 
> files= ... 
> {code} 
> 2) If the query is executed in distributed / parallel mode, the missing 
> serialization of the mode means that some fragments fetch all the columns 
> while other fragments skip all the columns. That causes an execution 
> error.
> For instance, changing slice_target to force the query to be executed in 
> multiple fragments triggers an execution error. 
> {code}
> select count(*) from dfs.`/Users/jni/work/data/yelp/t1`;
> org.apache.drill.common.exceptions.UserRemoteException: DATA_READ ERROR: 
> Error parsing JSON - You tried to start when you are using a ValueWriter of 
> type NullableBitWriterImpl.
> {code}
> Directory "t1" contains just two yelp JSON files. 
> Ideally, I think when no column is required from SCAN, the explain plan 
> should show an empty column list. The SKIP_ALL MODE together with the star 
> (*) column seems confusing and error prone. 
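
The second problem in the description above can be sketched roughly as 
follows. This is a toy model, not Drill's serialization code: the ScanSpec 
type, the comma-separated wire format, and the rule that the receiver infers 
the mode from the column list are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical toy model: not Drill's real classes or wire format.
final class ModeLostSketch {
    enum Mode { ALL, SKIP_ALL }

    static final class ScanSpec {
        final List<String> columns;
        final Mode mode;

        ScanSpec(List<String> columns, Mode mode) {
            this.columns = columns;
            this.mode = mode;
        }
    }

    /** Writes only the column list; the mode is silently dropped. */
    static String serialize(ScanSpec spec) {
        return String.join(",", spec.columns);
    }

    /** The receiver can only infer the mode from the columns it was sent. */
    static ScanSpec deserialize(String wire) {
        List<String> columns = wire.isEmpty()
                ? Collections.<String>emptyList()
                : Arrays.asList(wire.split(","));
        Mode mode = columns.isEmpty() ? Mode.SKIP_ALL : Mode.ALL;
        return new ScanSpec(columns, mode);
    }
}
```

In this model, a spec built as ["*"] plus SKIP_ALL comes back from a round 
trip as ALL, so a remote fragment would disagree with the local one, whereas 
a spec with an empty column list survives the round trip as SKIP_ALL.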



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
