salim achouche created DRILL-6301:
-------------------------------------
Summary: Parquet Performance Analysis
Key: DRILL-6301
URL: https://issues.apache.org/jira/browse/DRILL-6301
Project: Apache Drill
Issue Type: Task
Components: Storage - Parquet
Reporter: salim achouche
Assignee: salim achouche
Fix For: 1.14.0
_*Description -*_
* DRILL-5846 is meant to improve the Flat Parquet reader performance
* The associated implementation resulted in a 2x - 4x performance improvement
* During the review process ([pull
request|https://github.com/apache/drill/pull/1060]), however, a few key questions arose
*_Intermediary Processing via Direct Memory vs Byte Arrays_*
* The main reasons for using byte arrays for intermediary processing are to a)
avoid the high cost of the DrillBuf checks (especially the reference counting)
and b) benefit from some observed Java optimizations when accessing byte arrays
* Starting with version 1.12.0, the DrillBuf enablement checks have been
refined so that memory access and reference counting checks can be enabled
independently
* Benchmarking Java's Direct Memory unsafe methods with JMH indicates that the
performance gap between heap and direct memory is very narrow except for a few
use cases
* There are also concerns that the extra copy step (from direct memory into
byte arrays) will have a negative effect on performance; note that this
overhead was not observed using Intel's VTune, as the intermediary buffers were
a) pinned to a single CPU, b) reused, and c) small enough to remain in the L1
cache during columnar processing
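The two access patterns discussed above can be sketched roughly as follows. This is a minimal illustration and not code from DRILL-5846: it uses a plain java.nio.ByteBuffer in place of DrillBuf, and the class name, method names, and the 4 KB chunk size are all assumptions made for the example.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration; not code from Drill or DRILL-5846.
final class IntermediaryBufferSketch {
    // Assumed scratch size, chosen only to be small enough to stay cache-resident.
    static final int CHUNK_SIZE = 4096;

    // Pattern A: read every byte straight from the direct buffer.
    static long sumDirect(ByteBuffer direct) {
        long sum = 0;
        for (int i = 0; i < direct.limit(); i++) {
            sum += direct.get(i) & 0xFF; // absolute get, checked on every call
        }
        return sum;
    }

    // Pattern B: bulk-copy into a reused byte array, then process the array.
    static long sumViaByteArray(ByteBuffer direct, byte[] scratch) {
        long sum = 0;
        ByteBuffer view = direct.duplicate(); // own position, shared content
        view.position(0);
        while (view.hasRemaining()) {
            int n = Math.min(scratch.length, view.remaining());
            view.get(scratch, 0, n); // one bulk copy per chunk
            for (int i = 0; i < n; i++) {
                sum += scratch[i] & 0xFF;
            }
        }
        return sum;
    }

    // Builds a direct buffer filled with the repeating byte pattern 0, 1, 2, ...
    static ByteBuffer sample(int size) {
        ByteBuffer buf = ByteBuffer.allocateDirect(size);
        for (int i = 0; i < size; i++) {
            buf.put(i, (byte) i);
        }
        return buf;
    }
}
```

Both methods return the same result for the same buffer; only the access pattern differs. Passing the same scratch array (for example `new byte[IntermediaryBufferSketch.CHUNK_SIZE]`) to every call mirrors the "reused" and "small" properties listed above, while CPU pinning is outside what this sketch can show.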
_*Goal*_
* The Flat Parquet reader is one of the few Drill columnar operators
* It is imperative that we agree on the best processing pattern so
that the decisions we take within this Jira are applied not only to
Parquet but to all Drill columnar operators
_*Methodology*_
* A lot of profiling and optimization has been done within DRILL-5846
* We need to
## Assess the performance impact of using intermediary byte arrays (as
described above)
## Prototype a solution using Direct Memory under three DrillBuf check
configurations: all checks off, access checks only, and all checks on
## Make an educated decision on which processing pattern should be adopted
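The three check configurations named in the second step can be modelled as below, purely for illustration. This is a hypothetical wrapper around java.nio.ByteBuffer and not DrillBuf itself; the enum, class, and method names are assumptions, and the real DrillBuf settings are not shown.

```java
import java.nio.ByteBuffer;

// Hypothetical model of the three configurations; this is not DrillBuf.
final class CheckedBufferSketch {
    enum CheckMode { NONE, ACCESS_ONLY, ALL }

    private final ByteBuffer data;
    private final CheckMode mode;
    private int refCount = 1;

    CheckedBufferSketch(int capacity, CheckMode mode) {
        this.data = ByteBuffer.allocateDirect(capacity);
        this.mode = mode;
    }

    // Drops one reference; later accesses fail only when mode is ALL.
    void release() {
        refCount--;
    }

    byte getByte(int index) {
        guard(index);
        return data.get(index);
    }

    void setByte(int index, byte value) {
        guard(index);
        data.put(index, value);
    }

    private void guard(int index) {
        // Reference-count check: enabled only when all checks are on.
        if (mode == CheckMode.ALL && refCount <= 0) {
            throw new IllegalStateException("buffer already released");
        }
        // Access (bounds) check: enabled unless all checks are off.
        if (mode != CheckMode.NONE && (index < 0 || index >= data.capacity())) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        // Note: ByteBuffer performs its own bounds check even in NONE mode;
        // a real prototype would use unchecked access on that path.
    }
}
```

Running the same columnar loop against an instance built with each `CheckMode` value gives the three data points the prototype step asks for.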
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)