[
https://issues.apache.org/jira/browse/DRILL-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15120752#comment-15120752
]
Jinfeng Ni commented on DRILL-4279:
-----------------------------------
Yes, for Parquet, JSON, and text, the plan would be exactly the same as what you
described (though for Parquet, count(*) would be converted to DirectScan).
For a schemed table, the planner will put one column from rowType into the column
list. Another choice is to delay that decision to the record reader, which would
then have to figure out which column to read for each minor fragment. I think it
makes sense to do this in the planner for two reasons: 1) the planner already has
this information, so why not use it instead of delaying to the record reader?
2) the plan would be consistent with execution in terms of which column to read.
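The planner-side choice described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `ScanColumnChooser`, `chooseColumns`, and the use of plain strings for column names are hypothetical and are not Drill's actual planner API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; not Drill's planner code.
class ScanColumnChooser {

    /**
     * Decides which columns the SCAN should list in the plan.
     *
     * @param required      columns the query actually references (may be empty)
     * @param rowTypeFields field names known from the table's rowType;
     *                      empty for schema-less sources such as JSON or text
     */
    static List<String> chooseColumns(List<String> required, List<String> rowTypeFields) {
        if (!required.isEmpty()) {
            // The query needs specific columns: scan exactly those.
            return required;
        }
        if (!rowTypeFields.isEmpty()) {
            // Schemed table and no column required: pick one column in the
            // planner, so every minor fragment reads the same column.
            return Collections.singletonList(rowTypeFields.get(0));
        }
        // No schema and no column required: an empty column list.
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        List<String> none = Collections.emptyList();
        System.out.println(chooseColumns(none, Arrays.asList("id", "name")));
        System.out.println(chooseColumns(none, none));
    }
}
```

Because the decision is made once and recorded in the plan, the record readers do not each have to make it independently.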
I agree that we should make the plan output clean and meaningful. The format
you list makes sense to me. Making such a change would require changes to many
pre-commit test cases, since the expected output has to be updated accordingly.
I'll open a separate JIRA.
> Improve query plan when no column is required from SCAN
> -------------------------------------------------------
>
> Key: DRILL-4279
> URL: https://issues.apache.org/jira/browse/DRILL-4279
> Project: Apache Drill
> Issue Type: Bug
> Components: Query Planning & Optimization
> Reporter: Jinfeng Ni
> Assignee: Jinfeng Ni
>
> When a query does not require any specific column to be returned from SCAN, for
> instance,
> {code}
> Q1: select count(*) from T1;
> Q2: select 1 + 100 from T1;
> Q3: select 1.0 + random() from T1;
> {code}
> Drill's planner uses a ColumnList with the * column, plus a SKIP_ALL mode.
> However, the MODE is not serialized / deserialized. This leads to two
> problems.
> 1) The EXPLAIN plan is confusing, since there is no way to distinguish a
> "SELECT *" query from this SKIP_ALL mode.
> For instance,
> {code}
> explain plan for select count(*) from dfs.`/Users/jni/work/data/yelp/t1`;
> 00-03 Project($f0=[0])
> 00-04 Scan(groupscan=[EasyGroupScan
> [selectionRoot=file:/Users/jni/work/data/yelp/t1, numFiles=2, columns=[`*`],
> files= ...
> {code}
> 2) If the query is executed in distributed / parallel mode, the missing
> serialization of the mode means that some fragments fetch all the columns
> while other fragments skip all the columns. That causes an execution
> error.
> For instance, changing slice_target to force the query to be executed in
> multiple fragments triggers the execution error.
> {code}
> select count(*) from dfs.`/Users/jni/work/data/yelp/t1`;
> org.apache.drill.common.exceptions.UserRemoteException: DATA_READ ERROR:
> Error parsing JSON - You tried to start when you are using a ValueWriter of
> type NullableBitWriterImpl.
> {code}
> Directory "t1" just contains two yelp JSON files.
> Ideally, I think when no column is required from SCAN, the explain plan
> should show an empty column list. The MODE of SKIP_ALL together with the star
> * column seems to be confusing and error prone.
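To illustrate the serialization problem described in the issue, here is a minimal sketch in Java. It is not Drill's code: `ScanSpec`, its `skipAll` flag, and the comma-joined wire format are hypothetical stand-ins for a scan definition whose mode is not carried over when the plan is sent to another fragment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; not Drill's serialization code.
class ScanSpecRoundTrip {

    static final class ScanSpec {
        final List<String> columns;
        final boolean skipAll; // stands in for the SKIP_ALL mode

        ScanSpec(List<String> columns, boolean skipAll) {
            this.columns = columns;
            this.skipAll = skipAll;
        }

        // Only the column list goes on the wire; the mode is left out,
        // mimicking the missing serialization described in the issue.
        // (Column names containing commas are not handled in this sketch.)
        String serialize() {
            return String.join(",", columns);
        }

        static ScanSpec deserialize(String wire) {
            List<String> cols = wire.isEmpty()
                    ? new ArrayList<String>()
                    : Arrays.asList(wire.split(","));
            // The receiver cannot know the mode, so it falls back to the default.
            return new ScanSpec(cols, false);
        }
    }

    public static void main(String[] args) {
        // Planned as "* column plus SKIP_ALL": the flag is lost in transit.
        ScanSpec planned = new ScanSpec(Collections.singletonList("*"), true);
        ScanSpec remote = ScanSpec.deserialize(planned.serialize());
        System.out.println(planned.skipAll + " vs " + remote.skipAll);

        // Planned with an empty column list: the intent survives the round
        // trip, because it is carried by the serialized data itself.
        ScanSpec empty = new ScanSpec(Collections.<String>emptyList(), false);
        ScanSpec remoteEmpty = ScanSpec.deserialize(empty.serialize());
        System.out.println(remoteEmpty.columns.isEmpty());
    }
}
```

In the first case the local and remote copies disagree about whether to skip all columns, which corresponds to the mixed-fragment situation in problem 2. In the second case the empty column list is unambiguous in both the plan text and the deserialized copy.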