[GitHub] [hive] szlta opened a new pull request, #3368: HIVE-25827: Parquet file footer is read multiple times, when multiple splits are created in same file

GitBox Tue, 14 Jun 2022 04:46:41 -0700


szlta opened a new pull request, #3368:
URL: https://github.com/apache/hive/pull/3368


   This change refactors the Parquet record reader implementations in Hive 
(ParquetRecordReaderWrapper, VectorizedParquetRecordReader and their base 
class, ParquetRecordReaderBase). Currently the file footer is read at least 
twice.
   Also, multiple splits of the same file are not needed for this to happen: 
with the vectorized reader every file is opened twice. In LLAP's case these 
are two different streams, because there is custom logic that loads the 
metadata part of the file into the metadata cache. On a cache hit one of the 
reads is served from the cache, but the other one still reads the footer 
directly from the file.
   
   Key changes:
   
   - After the refactor, Parquet metadata is loaded through a virtual 
method and the result is stored in a field, so the loading can be done 
differently depending on whether the job is non-vectorized, vectorized or LLAP.
   - Also did some cleanup and restructured common logic into 
ParquetRecordReaderBase.
   - Originally VectorizedParquetRecordReader had a code path for 
`if (rowGroupOffsets == null) {`. I don't think this is a valid scenario, and 
even if it were, it would never have worked correctly: in 
`range(split.getStart(), split.getEnd())` the split is a ParquetInputSplit, 
whose `getEnd()` calculates how many bytes would be read in that split 
(summed) and does not denote a position. Therefore projections that omit 
some columns would produce a bad result. So this code path is removed in 
this refactor too.
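
   To illustrate the first key change, here is a minimal sketch of the 
"virtual method plus cached field" idea. It is not the code in this PR: all 
class and method names (`FooterSketch`, `RecordReaderBaseSketch`, 
`loadFooter`, `getFooter`, the `Map`-based cache) are hypothetical stand-ins, 
and the footer is reduced to a single integer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a parsed Parquet footer (not the real ParquetMetadata).
final class FooterSketch {
    final int rowGroupCount;
    FooterSketch(int rowGroupCount) { this.rowGroupCount = rowGroupCount; }
}

// Sketch of a base reader: subclasses decide HOW the footer is loaded,
// the base class keeps the result in a field so it is loaded once per reader.
abstract class RecordReaderBaseSketch {
    private FooterSketch footer; // filled on first use

    // The overridable ("virtual") loading step.
    protected abstract FooterSketch loadFooter(String path);

    protected final FooterSketch getFooter(String path) {
        if (footer == null) {
            footer = loadFooter(path);
        }
        return footer;
    }
}

// Assumed plain variant: always goes to the file; counts reads for the demo.
class FileReaderSketch extends RecordReaderBaseSketch {
    int fileReads = 0;

    @Override
    protected FooterSketch loadFooter(String path) {
        fileReads++;                 // pretend we opened the file and parsed the footer
        return new FooterSketch(3);
    }

    int rowGroups(String path) { return getFooter(path).rowGroupCount; }
}

// Assumed cache-aware variant: consults a shared metadata cache first.
final class CachedReaderSketch extends FileReaderSketch {
    private final Map<String, FooterSketch> cache;

    CachedReaderSketch(Map<String, FooterSketch> cache) { this.cache = cache; }

    @Override
    protected FooterSketch loadFooter(String path) {
        FooterSketch hit = cache.get(path);
        if (hit != null) {
            return hit;              // served from cache, no file read
        }
        FooterSketch loaded = super.loadFooter(path);
        cache.put(path, loaded);
        return loaded;
    }
}

final class FooterOnceDemo {
    public static void main(String[] args) {
        FileReaderSketch plain = new FileReaderSketch();
        plain.rowGroups("a.parquet");
        plain.rowGroups("a.parquet");
        System.out.println("file reads (plain): " + plain.fileReads);

        Map<String, FooterSketch> cache = new HashMap<>();
        new CachedReaderSketch(cache).rowGroups("a.parquet");   // fills the cache
        CachedReaderSketch second = new CachedReaderSketch(cache);
        second.rowGroups("a.parquet");
        System.out.println("file reads (second cached reader): " + second.fileReads);
    }
}
```

   In this sketch both consumers within one reader go through `getFooter`, 
so the footer is loaded once however many times it is needed, and only the 
loading strategy differs between subclasses.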



