Arina Ielchiieva created DRILL-5991:
---------------------------------------
Summary: Performance improvements for Hive tables with skip header
/ footer logic
Key: DRILL-5991
URL: https://issues.apache.org/jira/browse/DRILL-5991
Project: Apache Drill
Issue Type: Improvement
Components: Storage - Hive
Affects Versions: 1.12.0
Reporter: Arina Ielchiieva
Currently, when a Hive table has a header / footer, all input splits of the file are
processed by one reader. This hurts performance. A better approach would be to
keep one reader per split and find a way to tell each reader
how many rows it should skip.
To create a reader for each input split and maintain skip header / footer
functionality, we need to know how many rows are in each input split. Unfortunately,
an input split does not hold such information, only the [number of
bytes|https://hadoop.apache.org/docs/r2.7.0/api/org/apache/hadoop/mapred/FileSplit.html].
We can't simply apply skip header to the first input split and skip
footer to the last one either: since we don't know how many of the rows to skip
fall within each split, we may need to skip the whole first input
split and part of the second. Also, we use the [Hadoop
reader|https://hadoop.apache.org/docs/r2.7.0/api/org/apache/hadoop/mapred/RecordReader.html]
for the data, and it gives us no information about the number of rows in an input split.
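To make the problem concrete, here is a minimal, self-contained sketch (plain Java, no Hadoop or Drill classes) that cuts a small file into fixed-size byte ranges and counts how many rows start in each one. The class name `SplitRowCounter`, the sample data and the 8-byte split size are all hypothetical, and the rule "a row belongs to the split in which its first byte lies" is a simplifying assumption about how line-oriented readers treat rows that straddle a boundary.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only: not Drill or Hadoop code.
class SplitRowCounter {

    // Counts rows per fixed-size split. Assumption: a row is attributed to
    // the split that contains its first byte.
    static int[] rowsPerSplit(byte[] file, int splitSize) {
        int splitCount = (file.length + splitSize - 1) / splitSize;
        int[] rows = new int[splitCount];
        int rowStart = 0;
        for (int i = 0; i < file.length; i++) {
            // A row ends at a newline or at the last byte of the file.
            if (file[i] == '\n' || i == file.length - 1) {
                rows[rowStart / splitSize]++;
                rowStart = i + 1;
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Three short header rows followed by three data rows (21 bytes).
        String text = "h1\nh2\nh3\na,1\nb,2\nc,3\n";
        byte[] file = text.getBytes(StandardCharsets.UTF_8);

        // Splits know only their byte ranges: [0,8), [8,16), [16,21).
        int[] rows = rowsPerSplit(file, 8);
        System.out.println(Arrays.toString(rows)); // prints [3, 2, 1]
    }
}
```

With a header of four rows, the first split (three rows) would have to be skipped entirely and the second split partially, which cannot be decided from the byte lengths alone.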
Possible improvements:
1. For a table with a header only: before creating readers, we can start skipping
the header and, when done, create a reader at that position. For the other, untouched input
splits, create separate readers, though all readers will be on the same node.
2. Consider using the Drill text reader instead of the Hadoop one (as we do for parquet
files), which might provide more flexibility in terms of offsetting bytes etc.
This should be investigated further.
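Improvement 1 could be sketched roughly as follows. This is a byte-level illustration under stated assumptions, not the proposed implementation: `HeaderSkipPlanner`, `Split`, `offsetAfterHeader` and `plan` are hypothetical names, the file is held in memory, and rows are assumed to end with `\n`. In a real implementation, the reader that performed the skipping would presumably keep reading from where it stopped rather than being re-created from a byte offset.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: not Drill or Hadoop code.
class HeaderSkipPlanner {

    // Stand-in for a file split: start offset and length in bytes.
    static final class Split {
        final long start;
        final long length;

        Split(long start, long length) {
            this.start = start;
            this.length = length;
        }

        long end() {
            return start + length;
        }

        @Override
        public String toString() {
            return start + "+" + length;
        }
    }

    // Returns the offset of the first byte after `headerRows` complete rows,
    // or the file length if the file has fewer rows than that.
    static long offsetAfterHeader(byte[] file, int headerRows) {
        if (headerRows <= 0) {
            return 0;
        }
        int skipped = 0;
        for (int i = 0; i < file.length; i++) {
            if (file[i] == '\n' && ++skipped == headerRows) {
                return i + 1;
            }
        }
        return file.length;
    }

    // Drops splits that lie wholly inside the header, trims the split that
    // contains the first data byte, and leaves later splits untouched.
    static List<Split> plan(List<Split> splits, long dataStart) {
        List<Split> result = new ArrayList<>();
        for (Split s : splits) {
            if (s.end() <= dataStart) {
                continue; // header only, nothing to read
            }
            if (s.start < dataStart) {
                result.add(new Split(dataStart, s.end() - dataStart));
            } else {
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] file = "h1\nh2\nh3\na,1\nb,2\nc,3\n".getBytes(StandardCharsets.UTF_8);
        List<Split> splits = Arrays.asList(
            new Split(0, 8), new Split(8, 8), new Split(16, 5));

        long dataStart = offsetAfterHeader(file, 4); // skip 4 rows
        System.out.println(plan(splits, dataStart)); // prints [13+3, 16+5]
    }
}
```

The first split disappears, the second is trimmed to start at the first unskipped row, and the third keeps its own reader. Footers are not handled here, matching the "header only" scope of improvement 1.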
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)