szlta opened a new pull request, #3368:
URL: https://github.com/apache/hive/pull/3368
This change refactors the Parquet record reader implementations in Hive
(ParquetRecordReaderWrapper, VectorizedParquetRecordReader, and their base
class, ParquetRecordReaderBase). As currently implemented, the file footer
is read at least twice.
This happens even without multiple splits of the same file: with the
vectorized reader, every file is opened twice. In LLAP's case those are two
separate streams, since LLAP has custom logic to load the metadata section of
the file into its metadata cache. On a cache hit, one of the reads is served
from the cache, but the other still reads the footer from the file directly.
Key changes:
- After the refactor, Parquet metadata loading happens via a virtual method
and the result is stored in a field, so the non-vectorized, vectorized, and
LLAP cases can each load it in their own way.
- Also did some cleanup and moved common logic into ParquetRecordReaderBase.
- VectorizedParquetRecordReader originally had a code path for `if
(rowGroupOffsets == null) {`. I don't think this is a valid scenario, and even
if it were, it could never have worked correctly: in `range(split.getStart(),
split.getEnd())` the split instance is a ParquetInputSplit, whose `getEnd()`
returns the total number of bytes read in that split, not a file position.
Projections that omit some columns would therefore produce bad results. This
code path is removed in this refactor as well.
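To make the first bullet concrete, here is a minimal, self-contained sketch of the template-method shape described above. The class and field names (`RecordReaderBase`, `loadMetadata`, `FileMetadata`) are hypothetical stand-ins, not the actual Hive or Parquet classes: the base class calls an overridable hook to obtain the footer metadata and caches the result in a field, so each subclass (plain read vs. a cache-backed read, as in LLAP) decides how the load happens, and it happens at most once per reader.

```java
import java.util.concurrent.atomic.AtomicInteger;

abstract class RecordReaderBase {
    /** Hypothetical stand-in for Parquet's footer metadata object. */
    static final class FileMetadata {
        final String source;
        FileMetadata(String source) { this.source = source; }
    }

    private FileMetadata metadata; // loaded at most once, then reused

    /** Subclasses decide HOW the footer is obtained (plain read, cache, ...). */
    protected abstract FileMetadata loadMetadata();

    final FileMetadata getMetadata() {
        if (metadata == null) {
            metadata = loadMetadata();
        }
        return metadata;
    }
}

class PlainReader extends RecordReaderBase {
    static final AtomicInteger footerReads = new AtomicInteger();

    @Override
    protected FileMetadata loadMetadata() {
        footerReads.incrementAndGet(); // simulate one footer read from the file
        return new FileMetadata("file");
    }
}

class CachingReader extends RecordReaderBase {
    @Override
    protected FileMetadata loadMetadata() {
        // e.g. served from a metadata cache instead of re-reading the file
        return new FileMetadata("metadata-cache");
    }
}

public class FooterReadSketch {
    public static void main(String[] args) {
        PlainReader r = new PlainReader();
        r.getMetadata();
        r.getMetadata(); // second call reuses the cached field; no extra read
        System.out.println("footer reads: " + PlainReader.footerReads.get());
        System.out.println("cached source: " + new CachingReader().getMetadata().source);
    }
}
```

The point of the pattern is that callers in the base class only ever invoke `getMetadata()`, so the "read the footer twice" behavior cannot reappear regardless of which subclass is in play.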
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]