[ 
https://issues.apache.org/jira/browse/DRILL-6301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

salim achouche updated DRILL-6301:
----------------------------------
    Description: 
_*Description*_
 * DRILL-5846 is meant to improve the Flat Parquet reader performance
 * The associated implementation resulted in a 2x - 4x performance improvement
 * However, during the review process ([pull request|https://github.com/apache/drill/pull/1060]) a few key questions arose

 

*_Intermediary Processing via Direct Memory vs Byte Arrays_*
 * The main reasons for using byte arrays for intermediary processing are to a) avoid the high cost of the DrillBuf checks (especially the reference counting) and b) benefit from some observed Java optimizations when accessing byte arrays
 * Starting with version 1.12.0, the DrillBuf enablement checks have been refined so that the memory access and reference counting checks can be enabled independently
 * Benchmarking Java's unsafe direct memory access with JMH indicates that the performance gap between heap and direct memory is very narrow except for a few use cases
 * There are also concerns that the extra copy step (from direct memory into byte arrays) will have a negative effect on performance; note that this overhead was not observed with Intel's VTune because the intermediary buffers were a) pinned to a single CPU, b) reused, and c) small enough to remain in the L1 cache during columnar processing (a minimal sketch of this copy pattern follows this list)
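
The following is a minimal, illustrative sketch of the copy pattern described above; it is not Drill's reader code, and the class ChunkedColumnCopy and method sumIntsFrom are hypothetical names. A plain java.nio direct ByteBuffer stands in for a DrillBuf, and a small heap buffer is reused across chunks so that it stays resident in the L1 cache:

{code:java}
import java.nio.ByteBuffer;

/**
 * Illustrative sketch only: copy small chunks out of direct memory into a
 * reused heap buffer, then do the columnar work on the heap copy.
 */
public class ChunkedColumnCopy {
  private static final int CHUNK_SIZE = 4 * 1024;       // small enough to stay in L1
  private final byte[] scratch = new byte[CHUNK_SIZE];  // reused, never reallocated

  /** Sums little-endian signed 32-bit values held in a direct buffer (stand-in for a DrillBuf). */
  long sumIntsFrom(ByteBuffer direct) {
    long sum = 0;
    while (direct.remaining() > 0) {
      int len = Math.min(CHUNK_SIZE, direct.remaining());
      direct.get(scratch, 0, len);                       // one bulk copy out of direct memory
      for (int i = 0; i + 4 <= len; i += 4) {            // columnar processing on the heap copy
        sum += (scratch[i] & 0xFF)
            | (scratch[i + 1] & 0xFF) << 8
            | (scratch[i + 2] & 0xFF) << 16
            | (scratch[i + 3] & 0xFF) << 24;
      }
    }
    return sum;
  }
}
{code}

Pinning the processing thread to a single CPU, as mentioned above, is an operational concern (e.g., taskset or numactl) and is not shown in the sketch.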

_*Goal*_
 * The Flat Parquet reader is among the few Drill columnar operators
 * It is imperative that we agree on the optimal processing pattern so that the decisions made within this Jira apply not only to Parquet but to all Drill columnar operators

_*Methodology*_
 # Assess the performance impact of using intermediary byte arrays (as described above)
 # Prototype a solution using Direct Memory under three DrillBuf configurations: all checks off, access checks only, and all checks on (a JMH-style sketch of such a comparison follows this list)
 # Make an educated decision on which processing pattern should be adopted
 # Decide whether it is acceptable to use Java's unsafe API (and through what mechanism) on byte arrays (when the use of byte arrays is a necessity)
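
The following is a hedged, JMH-style sketch of the kind of heap versus direct memory comparison referred to above; it is not the benchmark actually run, and the class HeapVsDirectRead and its methods are illustrative names only. It reads the same amount of data as longs from a heap byte[] through sun.misc.Unsafe and from a direct ByteBuffer through absolute getLong calls:

{code:java}
import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

/** Illustrative JMH sketch: heap byte[] via Unsafe vs direct memory reads. */
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class HeapVsDirectRead {
  private static final sun.misc.Unsafe UNSAFE = loadUnsafe();
  private static final long BYTE_ARRAY_OFFSET = UNSAFE.arrayBaseOffset(byte[].class);
  private static final int SIZE = 1 << 20;               // 1 MiB of data per invocation

  private byte[] heap;
  private ByteBuffer direct;

  @Setup
  public void setup() {
    heap = new byte[SIZE];
    direct = ByteBuffer.allocateDirect(SIZE).order(ByteOrder.nativeOrder());
  }

  @Benchmark
  public long heapUnsafe() {
    long sum = 0;
    for (int i = 0; i < SIZE; i += Long.BYTES) {
      sum += UNSAFE.getLong(heap, BYTE_ARRAY_OFFSET + i); // unsafe read on a heap array
    }
    return sum;                                           // returned so the JIT cannot drop the reads
  }

  @Benchmark
  public long directBuffer() {
    long sum = 0;
    for (int i = 0; i < SIZE; i += Long.BYTES) {
      sum += direct.getLong(i);                           // absolute read from direct memory
    }
    return sum;
  }

  private static sun.misc.Unsafe loadUnsafe() {
    try {
      Field f = sun.misc.Unsafe.class.getDeclaredField("theUnsafe");
      f.setAccessible(true);
      return (sun.misc.Unsafe) f.get(null);
    } catch (ReflectiveOperationException e) {
      throw new AssertionError(e);
    }
  }
}
{code}

A variant that toggles the DrillBuf access and reference counting checks (item 2 above) would be built on DrillBuf itself rather than on a plain ByteBuffer and is therefore not sketched here.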

 

 

 

 

  was:
_*Description*_
 * DRILL-5846 is meant to improve the Flat Parquet reader performance
 * The associated implementation resulted in a 2x - 4x performance improvement
 * However, during the review process ([pull request|https://github.com/apache/drill/pull/1060]) a few key questions arose

 

*_Intermediary Processing via Direct Memory vs Byte Arrays_*
 * The main reasons for using byte arrays for intermediary processing are to a) avoid the high cost of the DrillBuf checks (especially the reference counting) and b) benefit from some observed Java optimizations when accessing byte arrays
 * Starting with version 1.12.0, the DrillBuf enablement checks have been refined so that the memory access and reference counting checks can be enabled independently
 * Benchmarking Java's unsafe direct memory access with JMH indicates that the performance gap between heap and direct memory is very narrow except for a few use cases
 * There are also concerns that the extra copy step (from direct memory into byte arrays) will have a negative effect on performance; note that this overhead was not observed with Intel's VTune because the intermediary buffers were a) pinned to a single CPU, b) reused, and c) small enough to remain in the L1 cache during columnar processing

_*Goal*_
 * The Flat Parquet reader is among the few Drill columnar operators
 * It is imperative that we agree on the optimal processing pattern so that the decisions made within this Jira apply not only to Parquet but to all Drill columnar operators

_*Methodology*_
 * A lot of profiling and optimization has already been done within DRILL-5846
 * We need to:
 ## Assess the performance impact of using intermediary byte arrays (as described above)
 ## Prototype a solution using Direct Memory under three DrillBuf configurations: all checks off, access checks only, and all checks on
 ## Make an educated decision on which processing pattern should be adopted

 

 

 

 


> Parquet Performance Analysis
> ----------------------------
>
>                 Key: DRILL-6301
>                 URL: https://issues.apache.org/jira/browse/DRILL-6301
>             Project: Apache Drill
>          Issue Type: Task
>          Components: Storage - Parquet
>            Reporter: salim achouche
>            Assignee: salim achouche
>            Priority: Major
>             Fix For: 1.14.0
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
