theosib-amazon opened a new pull request, #960:
URL: https://github.com/apache/parquet-mr/pull/960

   I broke up https://github.com/apache/parquet-mr/pull/953 into more 
digestible pieces. This new PR is the lowest-level set of changes. By 
themselves, these additions to ByteBufferInputStream don't yield much 
improvement, so future PRs will modify other source files to take 
advantage of this new functionality.
   
   The complete set of changes (including subsequent PRs) is a performance 
optimization. In benchmarks with Trino, query performance improves by 5% to 
15%, depending on the query, and that figure includes all the I/O time from 
S3.
   
   All of LittleEndianDataInputStream's functionality is moved into 
ByteBufferInputStream, without changing any pre-existing interfaces or 
functionality. These changes yield the following benefits:
   - Eliminates extra layers of abstraction and method call overhead
   - Enables the use of intrinsics for readInt, readLong, etc.
   - Makes faster access methods like readFully and skipFully available, 
without the need for helper functions
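
   To make the benefits above concrete, here is a minimal sketch of the two 
styles of little-endian read. It is not code from this PR or from parquet-mr: 
the class name LittleEndianReadSketch and its method names are hypothetical, 
and it works on a plain java.nio.ByteBuffer rather than on 
ByteBufferInputStream. The byte-wise method stands in for a layered stream 
reader; the direct methods rely on ByteBuffer.getInt and ByteBuffer.getLong, 
which the JIT can typically compile down to a single load.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only; not the implementation in this PR.
final class LittleEndianReadSketch {

  // Layered style: four virtual InputStream.read() calls, then the
  // little-endian int is assembled by hand.
  static int readIntByteWise(InputStream in) throws IOException {
    int b0 = in.read();
    int b1 = in.read();
    int b2 = in.read();
    int b3 = in.read();
    if ((b0 | b1 | b2 | b3) < 0) { // read() returns -1 at end of stream
      throw new EOFException();
    }
    return (b3 << 24) | (b2 << 16) | (b1 << 8) | b0;
  }

  // Direct style: one accessor call on a buffer whose byte order is
  // already set to LITTLE_ENDIAN by the caller.
  static int readIntDirect(ByteBuffer buf) {
    return buf.getInt();
  }

  static long readLongDirect(ByteBuffer buf) {
    return buf.getLong();
  }

  // skipFully-style helper: advance by exactly n bytes or fail.
  static void skipFully(ByteBuffer buf, int n) throws EOFException {
    if (n < 0) {
      throw new IllegalArgumentException("n must be non-negative");
    }
    if (n > buf.remaining()) {
      throw new EOFException();
    }
    buf.position(buf.position() + n);
  }

  public static void main(String[] args) throws IOException {
    // Bytes 0-3: int 0x12345678, bytes 4-11: long 1, bytes 12-15: int 42.
    byte[] data = {0x78, 0x56, 0x34, 0x12, 1, 0, 0, 0, 0, 0, 0, 0, 42, 0, 0, 0};
    ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);

    System.out.println(readIntByteWise(new ByteArrayInputStream(data)));
    System.out.println(readIntDirect(buf));
    skipFully(buf, 8);
    System.out.println(readIntDirect(buf));
  }
}
```

   Both styles return the same values; the difference is only in how many 
calls and bounds checks sit between the caller and the underlying bytes.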
   
   This PR also marks LittleEndianDataInputStream as deprecated.
   
   Context:
   I've been working on improving Parquet reading performance in Trino, mostly 
by profiling while running performance benchmarks and TPC-DS queries. This PR 
is a subset of the changes I made, which together have more than doubled the 
performance of many TPC-DS queries (wall clock time, including the S3 access 
time). If you are kind enough to accept these changes, I look forward to 
offering further performance improvements.


-- 
This is an automated message from the Apache Git Service.