[ https://issues.apache.org/jira/browse/DRILL-5847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16194035#comment-16194035 ]
salim achouche edited comment on DRILL-5847 at 10/6/17 8:26 PM:
----------------------------------------------------------------
Summary of the CPU-based enhancements:

*Improve CPU Cache Locality -*
* The current implementation performs row-based processing (for each row; for each column; ...)
* This is done to compute a batch size that can fit within a fixed memory budget
* Unfortunately, even this expensive logic doesn't work, because data is populated through the Value Vector setSafe(...) APIs
* These APIs give the caller no feedback, as they automatically extend the allocation whenever a new value doesn't fit
* We propose switching to columnar processing to maximize CPU cache locality

*NOTE -*
* Memory batch-size enforcement will be done under another JIRA that will target all operators (not just Parquet)
* [~Paul.Rogers] is leading this effort; I'll be implementing the Parquet scanner memory enforcement

*CPU Vectorization -*
* The Java HotSpot JIT can produce vectorized code for hot loops
* Unfortunately, this automatic optimization is brittle, as many conditions must hold for the optimizer to emit vectorized code
* Please refer to [Vectorization in HotSpot JVM|http://cr.openjdk.java.net/~vlivanov/talks/2017_Vectorization_in_HotSpot_JVM.pdf] for more information on this topic
* Prototyping indicated the JVM optimizes best over scalar heap arrays (byte and int)
* New bulk APIs will be added to the Column Reader and the Value Vectors so that parsing and loading are done in vectorized loops

*NOTE -*
* The main downside of this implementation is the need to buffer data
* The issue per se is not the extra memory, as it is fairly small (~128K per column)
* Instead, it is the extra memory accesses, which can hurt overall memory bandwidth
* Testing indicated this will not be a problem, as each CPU core has plenty of memory bandwidth; please refer to the attached document (appendix), where I discuss when to use this optimization pattern
* Of course,
avoiding the extra memory accesses would be ideal, but the alternative solutions are not currently very appealing:
* a) Use JNI; following such a pattern might slowly pull us towards a C++-centric implementation
* b) Use a Java CPU vectorization library that would let us optimize our code without overhead; Java 9's Project Panama is still under development and thus not available for at least a couple of years

> Flat Parquet Reader Performance Analysis
> ----------------------------------------
>
> Key: DRILL-5847
> URL: https://issues.apache.org/jira/browse/DRILL-5847
> Project: Apache Drill
> Issue Type: Sub-task
> Components: Storage - Parquet
> Affects Versions: 1.11.0
> Reporter: salim achouche
> Assignee: salim achouche
> Labels: performance
> Fix For: 1.12.0
>
> Attachments: Drill Framework Enhancements.pdf
>
> This task is to analyze the Flat Parquet Reader logic, looking for performance improvement opportunities.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
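The cache-locality argument in the comment above can be sketched in plain Java. This is an illustrative toy, not Drill code: two functions compute the same reduction over a batch of column arrays, one in row order (the inner loop hops between column buffers) and one in column order (the inner loop streams through a single contiguous array, which is also the shape HotSpot's auto-vectorizer handles best). All names here are hypothetical.

{code:java}
public class ColumnarSketch {
    public static final int ROWS = 1024;
    public static final int COLS = 8;

    // Row-based: for each row, touch every column buffer (poor locality).
    public static long rowBased(int[][] columns) {
        long sum = 0;
        for (int r = 0; r < ROWS; r++) {
            for (int c = 0; c < COLS; c++) {
                sum += columns[c][r];
            }
        }
        return sum;
    }

    // Columnar: process one column's contiguous buffer at a time;
    // the tight inner loop over a flat int[] is auto-vectorization friendly.
    public static long columnar(int[][] columns) {
        long sum = 0;
        for (int c = 0; c < COLS; c++) {
            int[] col = columns[c];
            for (int r = 0; r < ROWS; r++) {
                sum += col[r];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[][] columns = new int[COLS][ROWS];
        for (int c = 0; c < COLS; c++)
            for (int r = 0; r < ROWS; r++)
                columns[c][r] = r + c;
        // Both orders compute the same result; only the access pattern differs.
        System.out.println(rowBased(columns) == columnar(columns)); // true
    }
}
{code}

Both orders do identical arithmetic; the proposed win comes purely from memory-access order and from giving the JIT a loop shape it can vectorize.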
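The setSafe(...) feedback problem can also be sketched. The point of the comment above is that a setSafe-style call silently grows the allocation, so the caller can never learn whether a batch budget was exceeded. A bulk API can instead write a bounded run of values and report how many fit. This is a hypothetical sketch, not Drill's actual ValueVector API; the method name setBounded and the use of a plain ByteBuffer are assumptions for illustration.

{code:java}
import java.nio.ByteBuffer;

public class BulkWriteSketch {
    // Write as many fixed-width values as the destination can hold,
    // in one tight loop, and report the count back to the caller.
    public static int setBounded(ByteBuffer dest, int[] values) {
        int fit = Math.min(values.length, dest.remaining() / Integer.BYTES);
        for (int i = 0; i < fit; i++) {
            dest.putInt(values[i]);   // contiguous writes, no resize
        }
        return fit;                   // feedback a setSafe-style API lacks
    }

    public static void main(String[] args) {
        ByteBuffer dest = ByteBuffer.allocate(8 * Integer.BYTES);
        int[] values = new int[12];
        int written = setBounded(dest, values);
        System.out.println(written);  // prints 8: only 8 of 12 values fit
    }
}
{code}

With this shape, the scanner could stop at a batch boundary instead of discovering after the fact that the allocator grew past the intended memory budget.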