Timothy Miller created PARQUET-2135:
---------------------------------------
Summary: Performance optimizations: Merged all
LittleEndianDataInputStream functionality into ByteBufferInputStream
Key: PARQUET-2135
URL: https://issues.apache.org/jira/browse/PARQUET-2135
Project: Parquet
Issue Type: Improvement
Components: parquet-mr
Affects Versions: 1.12.2
Reporter: Timothy Miller
This PR consists entirely of performance optimizations. In benchmarks with
Trino, query performance improves by 5% to 15%, depending on the query, and
that measurement includes all the I/O time from S3.
The main modification is to merge all of LittleEndianDataInputStream
functionality into ByteBufferInputStream, which yields the following benefits:
* Eliminates extra layers of abstraction and method call overhead
* Enables the use of intrinsics for readInt, readLong, etc.
* Makes faster access methods such as readFully and skipFully available without
the need for helper functions
* Reduces some object creation in the performance-critical path
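To illustrate the first two points, here is a minimal sketch that uses only the
JDK, not the actual parquet-mr code. The class and method names
(LittleEndianReadSketch, readIntByteByByte, readIntDirect, skipFully) are
hypothetical and chosen for illustration. It contrasts assembling a
little-endian int from four single-byte reads, as a wrapper stream layered on
top of another stream typically does, with a single getInt() on a ByteBuffer
set to little-endian order, which the JIT can usually compile down to one load.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class LittleEndianReadSketch {
    // Layered style: four separate byte reads, combined by hand.
    public static int readIntByteByByte(ByteBuffer buf) {
        int b0 = buf.get() & 0xFF;
        int b1 = buf.get() & 0xFF;
        int b2 = buf.get() & 0xFF;
        int b3 = buf.get() & 0xFF;
        return (b3 << 24) | (b2 << 16) | (b1 << 8) | b0;
    }

    // Direct style: one call; assumes buf was set to ByteOrder.LITTLE_ENDIAN.
    public static int readIntDirect(ByteBuffer buf) {
        return buf.getInt();
    }

    // Hypothetical skipFully: advance the position, or fail if too few bytes remain.
    public static void skipFully(ByteBuffer buf, int n) {
        if (n < 0 || n > buf.remaining()) {
            throw new BufferUnderflowException();
        }
        buf.position(buf.position() + n);
    }

    public static void main(String[] args) {
        byte[] data = {0x78, 0x56, 0x34, 0x12, 0x01, 0x00, 0x00, 0x00};
        ByteBuffer a = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        ByteBuffer b = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);

        // Both approaches decode the same value from the first four bytes.
        System.out.println(Integer.toHexString(readIntByteByByte(a))); // 12345678
        System.out.println(Integer.toHexString(readIntDirect(b)));     // 12345678

        // Skip two of the remaining four bytes without copying them.
        skipFully(b, 2);
        System.out.println(b.remaining()); // 2
    }
}
```

Both methods return the same value. The difference is only in how much work
sits between the caller and the underlying bytes.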
This also includes and enables performance optimizations to:
* ByteBitPackingValuesReader
* PlainValuesReader
* RunLengthBitPackingHybridDecoder
Context:
I've been working on improving Parquet reading performance in Trino, mostly by
profiling while running performance benchmarks and TPC-DS queries. This PR is a
subset of the changes I made, which together have more than doubled the
performance of many TPC-DS queries (wall clock time, including the S3 access
time). If you are kind enough to accept these changes, I have more I would like
to contribute.