[ 
https://issues.apache.org/jira/browse/DRILL-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15120588#comment-15120588
 ] 

Jinfeng Ni commented on DRILL-4279:
-----------------------------------

I worked on a patch to address the issues reported in DRILL-4279.

The basic idea is to use different approaches when no column is required 
from the scan operator:
1) If the data source is schemed, use the first column in the schema.
2) Use 'columns[0]' for the text reader.
3) Use the current skip_all reader for JSON input.
4) Use a default column name for other schema-less input.
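
The four cases above can be sketched as a small dispatch function. This is only an illustration in Python, not Drill's actual code (Drill itself is written in Java). The function name `pick_scan_columns`, the source-kind strings, and the placeholder `DEFAULT_COLUMN` value are all hypothetical; the comment does not say what the default column is called.

```python
from typing import List, Optional

# Hypothetical placeholder; the real default column name is not given here.
DEFAULT_COLUMN = "default_col"


def pick_scan_columns(source_kind: str, schema: Optional[List[str]] = None) -> List[str]:
    """Return the columns a reader could fetch when the plan requests none."""
    if schema:
        # 1) Schemed source: the first column of the schema is enough.
        return [schema[0]]
    if source_kind == "text":
        # 2) Text reader: first element of the implicit `columns` array.
        return ["columns[0]"]
    if source_kind == "json":
        # 3) JSON: keep the existing skip_all reader, so no column is read.
        return []
    # 4) Any other schema-less input: fall back to a default column name.
    return [DEFAULT_COLUMN]


# Illustrative calls; the source kinds and schema are made up for the sketch.
print(pick_scan_columns("table", ["a", "b"]))
print(pick_scan_columns("text"))
print(pick_scan_columns("json"))
```

In every case the reader touches at most one column, which is all a row count needs.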

Therefore, for the query: 
{code}
select count(*) from some_text_file; 
{code}

Previously, the plan would have "columns = [*]" in the SCAN operator. With 
this patch, it will have "columns = [ ]". The empty column list indicates 
that no column is required by the downstream operator.
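
To illustrate why the empty list is less ambiguous, here is a minimal Python sketch (not Drill code). It assumes, as the issue description states, that only the column list survives plan serialization and the mode does not. The `ScanSpec` class, its field names, and the `"ALL"` mode string are hypothetical stand-ins.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class ScanSpec:
    columns: List[str]
    mode: str = "ALL"  # assumed to be lost when the plan is serialized

    def to_json(self) -> str:
        # Only the column list is written out, mimicking the missing mode.
        return json.dumps({"columns": self.columns})

    @staticmethod
    def from_json(text: str) -> "ScanSpec":
        # The mode falls back to its default because it was never serialized.
        return ScanSpec(columns=json.loads(text)["columns"])


def needs_no_columns(spec: ScanSpec) -> bool:
    # With the empty-list convention, the column list alone carries the meaning.
    return len(spec.columns) == 0


# Old convention: star column plus a SKIP_ALL mode.
old_count_star = ScanSpec(columns=["*"], mode="SKIP_ALL")
select_star = ScanSpec(columns=["*"], mode="ALL")
# After a round trip the two plans can no longer be told apart.
print(ScanSpec.from_json(old_count_star.to_json()) == ScanSpec.from_json(select_star.to_json()))

# New convention: empty column list, no separate mode needed.
new_count_star = ScanSpec(columns=[])
print(needs_no_columns(ScanSpec.from_json(new_count_star.to_json())))
```

Both calls print `True`: the old encoding collapses into a plain star scan after the round trip, while the empty list keeps its meaning.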

On the execution side (RecordReader), the text reader will use 'columns[0]', 
the JSON reader will use the current skip_all reader, and the Parquet reader 
will read a default column instead of * (which means reading all the columns).

For a schemed table, if the downstream operator does not require any column 
from SCAN, a query planner rule will select the first column of the table, 
and the RecordReader will read only that column.

I did a preliminary performance comparison with TPC-DS-like text format files.

Both numbers are for warm runs. 

On master:
{code}
0: jdbc:drill:zk=local> select count(*) from 
dfs.`/Users/jni/work/data/text/tpcds_catalog_sales`;
+-----------+
|  EXPR$0   |
+-----------+
| 17298576  |
+-----------+
1 row selected (7 seconds)
{code} 

On this patch:
{code}
0: jdbc:drill:zk=local> select count(*) from 
dfs.`/Users/jni/work/data/text/tpcds_catalog_sales`;
+-----------+
|  EXPR$0   |
+-----------+
| 17298576  |
+-----------+
1 row selected (4.088 seconds)
0: jdbc:drill:zk=local> explain plan for select count(*) from 
dfs.`/Users/jni/work/data/text/tpcds_catalog_sales`;
{code}

The query running time is reduced from 7 seconds to about 4 seconds. The 
profile shows that the memory used by each SCAN minor fragment is reduced 
from 6M to 2M. 

> Improve query plan when no column is required from SCAN
> -------------------------------------------------------
>
>                 Key: DRILL-4279
>                 URL: https://issues.apache.org/jira/browse/DRILL-4279
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>
> When a query does not require any specific column to be returned from SCAN, 
> for instance,
> {code}
> Q1:  select count(*) from T1;
> Q2:  select 1 + 100 from T1;
> Q3:  select  1.0 + random() from T1; 
> {code}
> Drill's planner would use a ColumnList with a * column, plus a SKIP_ALL mode. 
> However, the mode is not serialized / deserialized. This leads to two 
> problems.
> 1) The EXPLAIN plan is confusing, since there is no way to distinguish a 
> "SELECT *" query from one that uses this SKIP_ALL mode. 
> For instance, 
> {code}
> explain plan for select count(*) from dfs.`/Users/jni/work/data/yelp/t1`;
> 00-03          Project($f0=[0])
> 00-04            Scan(groupscan=[EasyGroupScan 
> [selectionRoot=file:/Users/jni/work/data/yelp/t1, numFiles=2, columns=[`*`], 
> files= ... 
> {code} 
> 2) If the query is executed in distributed / parallel mode, the missing 
> serialization of the mode means that some fragments fetch all the columns 
> while others skip all the columns. That causes an execution error.
> For instance, after changing slice_target to force the query to be executed 
> in multiple fragments, it hits an execution error. 
> {code}
> select count(*) from dfs.`/Users/jni/work/data/yelp/t1`;
> org.apache.drill.common.exceptions.UserRemoteException: DATA_READ ERROR: 
> Error parsing JSON - You tried to start when you are using a ValueWriter of 
> type NullableBitWriterImpl.
> {code}
> Directory "t1" contains just two yelp JSON files. 
> Ideally, I think that when no column is required from SCAN, the explain plan 
> should show an empty column list. The SKIP_ALL mode together with the star 
> (*) column is confusing and error-prone. 



