[ https://issues.apache.org/jira/browse/DRILL-5847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16194035#comment-16194035 ]
salim achouche edited comment on DRILL-5847 at 10/6/17 8:26 PM:
----------------------------------------------------------------
Summary of the CPU-based enhancements:

*Improve CPU Cache Locality -*
* The current implementation performs row-based processing (for each row; for each column; ...)
* This is done to compute a batch size that can fit within a fixed memory budget
* Unfortunately, even this expensive logic doesn't work, because data is populated through the Value Vector setSafe(...) APIs
* These APIs give the caller no feedback, as they automatically extend the allocation whenever a new value doesn't fit
* We propose switching to columnar processing to maximize CPU cache locality

*NOTE -*
* Memory batch-size enforcement will be done under another JIRA that will target all operators (not just Parquet)
* [~Paul.Rogers] is leading this effort; I'll be implementing the Parquet scanner memory enforcement

*CPU Vectorization -*
* The Java HotSpot JIT can produce vectorized code for hot loops
* Unfortunately, this automatic optimization is brittle, as many conditions must hold for the optimizer to emit vectorized code
* Please refer to [Vectorization in HotSpot JVM|http://cr.openjdk.java.net/~vlivanov/talks/2017_Vectorization_in_HotSpot_JVM.pdf] for more information on this topic
* Prototyping indicated the JVM optimizes best over scalar heap arrays (byte and int)
* New bulk APIs will be added to the Column Reader and the Value Vectors so that parsing and loading are done in vectorized loops

*NOTE -*
* The main downside of this implementation is the need to buffer data
* The issue per se is not the extra memory, as it is fairly small (~128K per column)
* Instead, it is the extra memory accesses, which can hurt overall memory bandwidth
* Testing indicated this will not be a problem, as each CPU core has plenty of memory bandwidth; please refer to the attached document (appendix), where I discuss when to use this optimization pattern
* Of course,
avoiding the extra memory accesses would be ideal, but the alternative solutions are not currently very appealing:
* a) Use JNI; following such a pattern might slowly pull us towards a C++-centric implementation
* b) Use a Java CPU vectorization library that would let us optimize our code without overhead; Java 9's Project Panama is still under development and thus not available for at least a couple of years

> Flat Parquet Reader Performance Analysis
> ----------------------------------------
>
> Key: DRILL-5847
> URL: https://issues.apache.org/jira/browse/DRILL-5847
> Project: Apache Drill
> Issue Type: Sub-task
> Components: Storage - Parquet
> Affects Versions: 1.11.0
> Reporter: salim achouche
> Assignee: salim achouche
> Labels: performance
> Fix For: 1.12.0
>
> Attachments: Drill Framework Enhancements.pdf
>
> This task is to analyze the Flat Parquet Reader logic, looking for performance improvement opportunities.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
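The cache-locality argument in the comment above can be sketched in plain Java. This is an illustrative toy, not Drill code: two functions compute the same reduction over a batch of column arrays, one in row order (the inner loop hops between column buffers) and one in column order (the inner loop streams through a single contiguous array, which is also the shape HotSpot's auto-vectorizer handles best). All names here are hypothetical.

{code:java}
public class ColumnarSketch {
    public static final int ROWS = 1024;
    public static final int COLS = 8;

    // Row-based: for each row, touch every column buffer (poor locality).
    public static long rowBased(int[][] columns) {
        long sum = 0;
        for (int r = 0; r < ROWS; r++) {
            for (int c = 0; c < COLS; c++) {
                sum += columns[c][r];
            }
        }
        return sum;
    }

    // Columnar: process one column's contiguous buffer at a time;
    // the tight inner loop over a flat int[] is auto-vectorization friendly.
    public static long columnar(int[][] columns) {
        long sum = 0;
        for (int c = 0; c < COLS; c++) {
            int[] col = columns[c];
            for (int r = 0; r < ROWS; r++) {
                sum += col[r];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[][] columns = new int[COLS][ROWS];
        for (int c = 0; c < COLS; c++)
            for (int r = 0; r < ROWS; r++)
                columns[c][r] = r + c;
        // Both orders compute the same result; only the access pattern differs.
        System.out.println(rowBased(columns) == columnar(columns)); // true
    }
}
{code}

Both orders do identical arithmetic; the proposed win comes purely from memory-access order and from giving the JIT a loop shape it can vectorize.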
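The setSafe(...) feedback problem can also be sketched. The point of the comment above is that a setSafe-style call silently grows the allocation, so the caller can never learn whether a batch budget was exceeded. A bulk API can instead write a bounded run of values and report how many fit. This is a hypothetical sketch, not Drill's actual ValueVector API; the method name setBounded and the use of a plain ByteBuffer are assumptions for illustration.

{code:java}
import java.nio.ByteBuffer;

public class BulkWriteSketch {
    // Write as many fixed-width values as the destination can hold,
    // in one tight loop, and report the count back to the caller.
    public static int setBounded(ByteBuffer dest, int[] values) {
        int fit = Math.min(values.length, dest.remaining() / Integer.BYTES);
        for (int i = 0; i < fit; i++) {
            dest.putInt(values[i]);   // contiguous writes, no resize
        }
        return fit;                   // feedback a setSafe-style API lacks
    }

    public static void main(String[] args) {
        ByteBuffer dest = ByteBuffer.allocate(8 * Integer.BYTES);
        int[] values = new int[12];
        int written = setBounded(dest, values);
        System.out.println(written);  // prints 8: only 8 of 12 values fit
    }
}
{code}

With this shape, the scanner could stop at a batch boundary instead of discovering after the fact that the allocator grew past the intended memory budget.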