theosib-amazon opened a new pull request #953:
URL: https://github.com/apache/parquet-mr/pull/953


   This PR is purely a performance optimization. In benchmarks with Trino, 
query performance improves by 5% to 15%, depending on the query, and that 
figure includes all the I/O time from S3.
   
   The main modification is to merge all of LittleEndianDataInputStream 
functionality into ByteBufferInputStream, which yields the following benefits:
   - Eliminates extra layers of abstraction and method call overhead
   - Enables the use of intrinsics for readInt, readLong, etc.
   - Provides faster access methods like readFully and skipFully, 
without the need for helper functions
   - Reduces some object creation in the performance-critical path
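
   As a rough illustration of the first two points, here is a minimal sketch 
that contrasts a layered, byte-at-a-time little-endian readInt with a single 
bulk read on a ByteBuffer set to little-endian order. The class and method 
names are hypothetical and are not taken from the PR, and whether the JIT 
treats the bulk read as an intrinsic depends on the JVM in use.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch; not the code from this PR.
public class LittleEndianReadSketch {

    // Layered style: four single-byte reads assembled by hand,
    // least significant byte first.
    static int readIntBytewise(ByteBuffer buf) {
        int b0 = buf.get() & 0xFF;
        int b1 = buf.get() & 0xFF;
        int b2 = buf.get() & 0xFF;
        int b3 = buf.get() & 0xFF;
        return (b3 << 24) | (b2 << 16) | (b1 << 8) | b0;
    }

    // Wrap a byte array in a buffer configured for little-endian reads.
    static ByteBuffer wrapLittleEndian(byte[] data) {
        return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
    }

    // Direct style: one bulk read. The buffer must already be in
    // little-endian order (see wrapLittleEndian).
    static int readIntDirect(ByteBuffer buf) {
        return buf.getInt();
    }

    public static void main(String[] args) {
        byte[] data = {0x78, 0x56, 0x34, 0x12};
        int bytewise = readIntBytewise(ByteBuffer.wrap(data));
        int direct = readIntDirect(wrapLittleEndian(data));
        System.out.println(Integer.toHexString(bytewise));
        System.out.println(Integer.toHexString(direct));
    }
}
```

   Both methods return the same value for the same input. The direct version 
simply gives the JVM one call to optimize instead of four reads plus shifts.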
   
   This change also includes, and enables, performance optimizations in:
   - ByteBitPackingValuesReader
   - PlainValuesReader
   - RunLengthBitPackingHybridDecoder
   
   Context:
   I've been working on improving Parquet reading performance in Trino, mostly 
by profiling while running performance benchmarks and TPCDS queries. This PR is 
a subset of the changes I made that have more than doubled the performance of a 
lot of TPCDS queries (wall clock time, including the S3 access time). If you 
are kind enough to accept these changes, I have more I would like to contribute.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


