Github user arina-ielchiieva commented on the issue:
https://github.com/apache/drill/pull/1030
@ppadma the performance impact will be limited to tables that use the skip
header / footer functionality (one reader per file); for other tables,
processing stays the same (one reader per input split). Currently, for tables
that use skip header / footer and have multiple input splits, the header /
footer rows are skipped for each input split, leaving the user with an
incorrect data set.
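To illustrate the problem, here is a minimal sketch in plain Java. It is not Drill or Hadoop code: the class and method names are hypothetical, and each split is modeled simply as a list of rows. It assumes a file divided into two splits with one header line to skip.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; these names do not exist in Drill.
public class SkipHeaderSketch {

  // Skips the header in every split, which drops real data rows
  // from every split after the first.
  static List<String> skipPerSplit(List<List<String>> splits, int headerLines) {
    List<String> out = new ArrayList<>();
    for (List<String> split : splits) {
      int from = Math.min(headerLines, split.size());
      out.addAll(split.subList(from, split.size()));
    }
    return out;
  }

  // Skips the header once for the whole file, as a single reader per file would.
  static List<String> skipPerFile(List<List<String>> splits, int headerLines) {
    List<String> out = new ArrayList<>();
    int remaining = headerLines;
    for (List<String> split : splits) {
      for (String row : split) {
        if (remaining > 0) {
          remaining--;
          continue;
        }
        out.add(row);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<List<String>> splits = Arrays.asList(
        Arrays.asList("header", "r1", "r2"),
        Arrays.asList("r3", "r4"));
    System.out.println(skipPerSplit(splits, 1)); // r3 is lost
    System.out.println(skipPerFile(splits, 1));  // all data rows kept
  }
}
```

With one header line, the per-split variant loses `r3`, the first row of the second split, while the per-file variant keeps every data row.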
Regarding your suggestion: as I mentioned in my previous comment (please see
the links to the code there as well), we use the Hadoop reader for the data
and don't have information about the number of rows in an input split (I wish
we did, it would make life much easier).
I have an idea for how to parallelize readers when we have only a header
(though all readers will still be on the same node), but when we have a footer
we'll have to read one file per reader. We could also consider using the Drill
text reader instead of the Hadoop one (as we do for Parquet files).
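As a rough sketch of why the footer case is harder: without a row count, a reader cannot tell that a row belongs to the footer until it has reached the end of the file. The plain Java below (hypothetical names, not Drill or Hadoop code) holds back the last `footerLines` rows in a small buffer and only emits a row once enough later rows have been seen.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; these names do not exist in Drill.
public class SkipFooterSketch {

  // Streams rows and drops the last footerLines of them. A row is emitted
  // only after footerLines newer rows have arrived behind it.
  static List<String> skipFooter(Iterator<String> rows, int footerLines) {
    Deque<String> pending = new ArrayDeque<>();
    List<String> out = new ArrayList<>();
    while (rows.hasNext()) {
      pending.addLast(rows.next());
      if (pending.size() > footerLines) {
        out.add(pending.removeFirst());
      }
    }
    // Whatever is still pending at end of file is the footer.
    return out;
  }

  public static void main(String[] args) {
    List<String> file = Arrays.asList("r1", "r2", "r3", "f1", "f2");
    System.out.println(skipFooter(file.iterator(), 2));
  }
}
```

This only works if the same reader sees the file through to its end, which is why the footer case ends up as one file per reader.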
Anyway, I suggest we commit these changes and create a new Jira for future
improvement of skip header / footer performance. Would that be acceptable?
---