salim achouche commented on DRILL-6301:

Created the following 
 which contains several JMH benchmarks to help answer the questions posed by 
this Jira.

> Parquet Performance Analysis
> ----------------------------
>                 Key: DRILL-6301
>                 URL: https://issues.apache.org/jira/browse/DRILL-6301
>             Project: Apache Drill
>          Issue Type: Task
>          Components: Storage - Parquet
>            Reporter: salim achouche
>            Assignee: salim achouche
>            Priority: Major
>             Fix For: 1.14.0
> _*Description -*_
>  * DRILL-5846 is meant to improve the Flat Parquet reader performance
>  * The associated implementation resulted in a 2x - 4x performance improvement
>  * During the review process ([pull 
> request|https://github.com/apache/drill/pull/1060]), however, a few key 
> questions arose
> *_Intermediary Processing via Direct Memory vs Byte Arrays_*
>  * The main reasons for using byte arrays for intermediary processing are to 
> a) avoid the high cost of the DrillBuf checks (especially the reference 
> counting) and b) benefit from some observed Java optimizations when accessing 
> byte arrays
>  * Starting with version 1.12.0, the DrillBuf enablement checks have been 
> refined so that memory access and reference counting checks can be enabled 
> independently
>  * Benchmarking of Java's Direct Memory unsafe method using JMH indicates the 
> performance gap between heap and direct memory is very narrow except for a few 
> use cases
>  * There are also concerns that the extra copy step (from direct memory into 
> byte arrays) will have a negative effect on performance; note that this 
> overhead was not observed using Intel's VTune as the intermediary buffers were 
> a) pinned to a single CPU, b) reused, and c) small enough to remain in the L1 
> cache during columnar processing.
> _*Goal*_ 
>  * The Flat Parquet reader is one of the few Drill columnar operators
>  * It is imperative that we agree on the optimal processing pattern so 
> that the decisions that we take within this Jira are applied not only to 
> Parquet but to all Drill columnar operators
> _*Methodology*_ 
>  # Assess the performance impact of using intermediary byte arrays (as 
> described above)
>  # Prototype a solution using Direct Memory with the DrillBuf checks in three 
> configurations: all checks off, access checks on, and all checks on
>  # Make an educated decision on which processing pattern should be adopted
>  # Decide whether it is ok to use Java's unsafe API (and through what 
> mechanism) on byte arrays (when the use of byte arrays is a necessity)
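
For readers who want a concrete picture of the two processing patterns being compared, here is a minimal sketch. It is not Drill code and not one of the JMH benchmarks mentioned in the comment: java.nio.ByteBuffer stands in for DrillBuf, the class and method names are hypothetical, and the 4 KB scratch size is an arbitrary assumption chosen to fit comfortably in L1 cache. It shows only the shape of each pattern (reading straight from direct memory vs. bulk-copying into a small, reused byte array first) and makes no claim about which is faster.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch, not Drill code: ByteBuffer stands in for DrillBuf.
class IntermediaryBufferSketch {

    // Fills a direct (off-heap) buffer with a repeating byte pattern.
    static ByteBuffer fill(int size) {
        ByteBuffer direct = ByteBuffer.allocateDirect(size);
        for (int i = 0; i < size; i++) {
            direct.put(i, (byte) (i % 127)); // absolute put, position stays at 0
        }
        return direct;
    }

    // Pattern A: process every value straight out of direct memory.
    static long sumDirect(ByteBuffer direct) {
        long sum = 0;
        for (int i = 0; i < direct.capacity(); i++) {
            sum += direct.get(i);
        }
        return sum;
    }

    // Pattern B: bulk-copy chunks into a small, reused byte array,
    // then process the array. This is the "extra copy step".
    static long sumViaByteArray(ByteBuffer direct, byte[] scratch) {
        ByteBuffer view = direct.duplicate(); // independent position and limit
        view.clear();                         // position = 0, limit = capacity
        long sum = 0;
        while (view.hasRemaining()) {
            int n = Math.min(scratch.length, view.remaining());
            view.get(scratch, 0, n);          // one bulk copy per chunk
            for (int i = 0; i < n; i++) {
                sum += scratch[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        ByteBuffer direct = fill(1 << 20);
        byte[] scratch = new byte[4096];      // assumed size, reused for every chunk
        long a = sumDirect(direct);
        long b = sumViaByteArray(direct, scratch);
        System.out.println("sums match: " + (a == b));
    }
}
```

A meaningful timing comparison would need a harness such as JMH (warm-up, forking, dead-code elimination guards); timing the main method above by hand would not give reliable numbers.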
