Jinfeng Ni created DRILL-5542:
---------------------------------
Summary: Scan unnecessarily adds implicit columns to ScanRecordBatch
for select * query
Key: DRILL-5542
URL: https://issues.apache.org/jira/browse/DRILL-5542
Project: Apache Drill
Issue Type: Bug
Components: Execution - Relational Operators
Reporter: Jinfeng Ni
Drill appears to add several implicit columns (`fqn`, `filepath`,
`filename`, `suffix`) to ScanBatch even when no downstream operator requires
them. Although those implicit columns are dropped later on, populating them
increases both memory and CPU overhead.
1. JSON
```
{a: 100}
```
{code}
select * from dfs.tmp.`1.json`;
+------+
| a |
+------+
| 100 |
+------+
{code}
The schema from ScanRecordBatch is:
{code}
[ schema:
BatchSchema [fields=[fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL),
filename(VARCHAR:OPTIONAL), suffix(VARCHAR:OPTIONAL), a(BIGINT:OPTIONAL)],
selectionVector=NONE],
{code}
2. Parquet
{code}
select * from cp.`tpch/nation.parquet`;
+--------------+-----------------+--------------+---------------------------------------------------------------------------------------------------------------------+
| n_nationkey | n_name | n_regionkey | n_comment |
+--------------+-----------------+--------------+---------------------------------------------------------------------------------------------------------------------+
| 0 | ALGERIA | 0 | haggle. carefully final deposits detect slyly agai |
...
{code}
The schema of ScanRecordBatch:
{code}
schema:
BatchSchema [fields=[n_nationkey(INT:REQUIRED), n_name(VARCHAR:REQUIRED),
n_regionkey(INT:REQUIRED), n_comment(VARCHAR:REQUIRED), fqn(VARCHAR:OPTIONAL),
filepath(VARCHAR:OPTIONAL), filename(VARCHAR:OPTIONAL),
suffix(VARCHAR:OPTIONAL)], selectionVector=NONE],
{code}
3. Text
{code}
cat 1.csv
a, b, c
select * from dfs.tmp.`1.csv`;
+----------------+
| columns |
+----------------+
| ["a","b","c"] |
+----------------+
{code}
The schema of ScanRecordBatch:
{code}
schema:
BatchSchema [fields=[columns(VARCHAR:REPEATED)[$data$(VARCHAR:REQUIRED)],
fqn(VARCHAR:OPTIONAL), filepath(VARCHAR:OPTIONAL), filename(VARCHAR:OPTIONAL),
suffix(VARCHAR:OPTIONAL)], selectionVector=NONE],
{code}
If implicit columns are not part of the result of a `select *` query, then the
Scan operator should not populate them.
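
The proposed behaviour could be sketched roughly as follows. This is only an illustration in Python, not Drill's Java code: the function `implicit_columns_to_populate` and the constant `IMPLICIT_COLUMNS` are hypothetical names, and the sketch assumes the scan is given the list of column names the query projects, with `*` standing for the star.

```python
# Hypothetical sketch; none of these names exist in Drill.
# The four implicit columns listed in this issue.
IMPLICIT_COLUMNS = ("fqn", "filepath", "filename", "suffix")


def implicit_columns_to_populate(projected):
    """Return the implicit columns the scan would need to materialize.

    `projected` is the list of column names from the SELECT list, where
    "*" denotes a star projection. The star alone never requests an
    implicit column; only an explicit reference does.
    """
    # Assumption: column names are compared case-insensitively.
    requested = {name.lower() for name in projected}
    return [col for col in IMPLICIT_COLUMNS if col in requested]


if __name__ == "__main__":
    # select * ...            -> no implicit columns populated
    print(implicit_columns_to_populate(["*"]))
    # select *, filename ...  -> only `filename` populated
    print(implicit_columns_to_populate(["*", "filename"]))
```

With a rule like this, the three `select *` examples above would produce batch schemas containing only the table columns, while a query that names an implicit column explicitly would still receive it.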