[ https://issues.apache.org/jira/browse/DRILL-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15941356#comment-15941356 ]
ASF GitHub Bot commented on DRILL-5351:
---------------------------------------
Github user jinfengni commented on the issue:
https://github.com/apache/drill/pull/781
+1
> Excessive bounds checking in the Parquet reader
> ------------------------------------------------
>
> Key: DRILL-5351
> URL: https://issues.apache.org/jira/browse/DRILL-5351
> Project: Apache Drill
> Issue Type: Improvement
> Reporter: Parth Chandra
> Assignee: Parth Chandra
>
> In profiling the Parquet reader, variable-length decoding appears to be a
> major bottleneck, making the reader CPU bound rather than disk bound.
> A YourKit profile identifies the following methods as severe bottlenecks:
> VarLenBinaryReader.determineSizeSerial(long)
> NullableVarBinaryVector$Mutator.setSafe(int, int, int, int, DrillBuf)
> DrillBuf.chk(int, int)
> NullableVarBinaryVector$Mutator.fillEmpties()
> The problem is that each of these methods performs some form of bounds
> checking, and of course the actual write to the ByteBuf is ultimately
> bounds checked as well.
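>
> To make the layering concrete, here is a minimal toy sketch (simplified
> illustrative classes, not the actual Drill source) of how a single
> variable-length write can re-check the same range three or four times:
>
>     import java.nio.ByteBuffer;
>
>     /** Toy model of the layered checks; names and types are simplified. */
>     class LayeredChecks {
>       private final ByteBuffer data = ByteBuffer.allocate(1 << 16);
>       private final int[] offsets = new int[4096];
>
>       /** Layer 1: mutator-level checks (mirrors the setSafe pattern). */
>       boolean setSafe(int index, byte[] value, int start, int length) {
>         if (index >= offsets.length - 1) {
>           return false;                        // check #1: vector capacity
>         }
>         int writePos = offsets[index];
>         if (writePos + length > data.capacity()) {
>           return false;                        // check #2: data buffer capacity
>         }
>         setBytes(writePos, value, start, length);
>         offsets[index + 1] = writePos + length;
>         return true;
>       }
>
>       /** Layer 2: a DrillBuf.chk-style range check before the raw copy. */
>       private void setBytes(int bufIndex, byte[] src, int srcStart, int length) {
>         chk(bufIndex, length);                 // check #3: same range again
>         ByteBuffer dup = data.duplicate();
>         dup.position(bufIndex);
>         dup.put(src, srcStart, length);        // check #4: ByteBuffer checks internally
>       }
>
>       private void chk(int index, int length) {
>         if (index < 0 || index + length > data.capacity()) {
>           throw new IndexOutOfBoundsException(index + " + " + length);
>         }
>       }
>     }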
> DrillBuf.chk can be disabled by a configuration setting. Disabling it does
> improve the performance of TPCH queries, and all regression, unit, and
> TPCH-SF100 tests pass with the check disabled.
> I would recommend we allow users to turn this check off for
> performance-critical queries.
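>
> The usual way to make such a check free when disabled is to guard it behind
> a static final flag read once at class load, so the JIT compiles the
> disabled branch away entirely. A minimal sketch of that pattern, using a
> hypothetical system property name (the actual Drill setting differs):
>
>     /** Sketch of a flag-guarded bounds check; property name is illustrative. */
>     final class BoundsChecking {
>       // Read once at class load; a constant static final lets the JIT
>       // eliminate the dead branch entirely when the check is turned off.
>       static final boolean ENABLED =
>           Boolean.parseBoolean(System.getProperty("drill.bounds.check", "true"));
>
>       static void chk(long index, long length, long capacity) {
>         if (ENABLED && (index < 0 || index + length > capacity)) {
>           throw new IndexOutOfBoundsException(index + " + " + length + " > " + capacity);
>         }
>       }
>     }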
> Removing the bounds checking at every level is going to be a fair amount of
> work. In the meantime, it appears that a few simple changes to variable-length
> vectors improve query performance by about 10% across the board; one such
> change is sketched below.
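>
> The sketch below assumes the value sizes for a batch are known up front (as
> they are after a size-determination pass over the Parquet page), so the
> per-value checks can be hoisted into a single per-batch check:
>
>     /** Sketch: check capacity once per batch instead of once per value. */
>     class BatchWriter {
>       private final byte[] data = new byte[1 << 20];
>
>       void writeBatch(byte[][] values, int startPos) {
>         long total = 0;
>         for (byte[] v : values) {
>           total += v.length;               // sizes known before writing
>         }
>         if (startPos + total > data.length) {
>           throw new IndexOutOfBoundsException("batch of " + total + " bytes does not fit");
>         }
>         int pos = startPos;
>         for (byte[] v : values) {          // hot loop: no explicit per-value check
>           System.arraycopy(v, 0, data, pos, v.length);
>           pos += v.length;
>         }
>       }
>     }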
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)