[ 
https://issues.apache.org/jira/browse/PARQUET-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516909#comment-17516909
 ] 

Timothy Miller commented on PARQUET-2135:
-----------------------------------------

Extra note:

PlainValuesReader still includes an unused LittleEndianDataInputStream 
member because if I remove it, the build fails, reporting an incompatible 
API change.
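
To illustrate the compatibility constraint, here is a minimal sketch, not code from the patch. LegacyStream and ReaderSketch are hypothetical stand-ins for LittleEndianDataInputStream and PlainValuesReader, and the field name "in" is an assumption. The idea is that a protected field is part of the API that subclasses see, so it stays declared even though nothing reads it.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

// Hypothetical stand-in for LittleEndianDataInputStream.
final class LegacyStream {}

// Hypothetical stand-in for PlainValuesReader.
class ReaderSketch {
    // Never read or written here. It is retained only because removing a
    // protected member changes the API that existing subclasses compile against.
    @Deprecated
    protected LegacyStream in;

    private final byte[] data;
    private int position;

    ReaderSketch(byte[] data) {
        this.data = data;
    }

    // The actual read path no longer touches the legacy field.
    int readByte() {
        return data[position++] & 0xFF;
    }

    // Reflection check that the legacy member is still declared as before.
    static boolean legacyFieldStillVisible() {
        try {
            Field f = ReaderSketch.class.getDeclaredField("in");
            return Modifier.isProtected(f.getModifiers())
                    && f.getType() == LegacyStream.class;
        } catch (NoSuchFieldException e) {
            return false;
        }
    }
}
```

A cleaner long-term option would be to remove the member in a release where API breaks are allowed, but that is a project decision outside this sketch.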

> Performance optimizations: Merged all LittleEndianDataInputStream 
> functionality into ByteBufferInputStream
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: PARQUET-2135
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2135
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.12.2
>            Reporter: Timothy Miller
>            Priority: Major
>
> This PR is all performance optimization. In benchmarking with Trino, we find 
> that query performance improves by 5% to 15%, depending on the query, and that 
> includes all the I/O time from S3.
> The main modification is to merge all of the LittleEndianDataInputStream 
> functionality into ByteBufferInputStream, which yields the following benefits:
>  * Eliminates extra layers of abstraction and method call overhead
>  * Enables the use of intrinsics for readInt, readLong, etc.
>  * Makes faster access methods like readFully and skipFully available, 
> without the need for helper functions
>  * Reduces some object creation in the performance-critical path
> This also includes and enables performance optimizations to:
>  * ByteBitPackingValuesReader
>  * PlainValuesReader
>  * RunLengthBitPackingHybridDecoder
> Context:
> I've been working on improving Parquet reading performance in Trino, mostly 
> by profiling while running performance benchmarks and TPC-DS queries. This PR 
> is a subset of the changes I made that have more than doubled the performance 
> of many TPC-DS queries (wall clock time, including the S3 access time). If 
> you are kind enough to accept these changes, I have more I would like to 
> contribute.
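
As a rough illustration of the readInt point in the description above, the sketch below contrasts assembling a little-endian int one byte at a time with a single typed read from a ByteBuffer set to little-endian order. It uses only java.nio and is not code from the patch. The class and method names are hypothetical, and whether the JIT actually turns the typed read into an intrinsic depends on the JVM.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only; not taken from PARQUET-2135.
final class LittleEndianReadSketch {

    // Byte-at-a-time assembly: four separate reads plus shifts and ORs.
    static int readIntByteByByte(ByteBuffer buf) {
        int b0 = buf.get() & 0xFF;
        int b1 = buf.get() & 0xFF;
        int b2 = buf.get() & 0xFF;
        int b3 = buf.get() & 0xFF;
        return (b3 << 24) | (b2 << 16) | (b1 << 8) | b0;
    }

    // Single typed read: the buffer's byte order handles the layout.
    static int readIntDirect(ByteBuffer buf) {
        return buf.getInt();
    }

    // Four bytes that encode 0x12345678 in little-endian order.
    static ByteBuffer sample() {
        byte[] bytes = {0x78, 0x56, 0x34, 0x12};
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(readIntByteByByte(sample()))); // 12345678
        System.out.println(Integer.toHexString(readIntDirect(sample())));     // 12345678
    }
}
```

Both methods return the same value. The second leaves the byte-order work to the buffer, which is the kind of call the description expects to benefit from intrinsics.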



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
