[
https://issues.apache.org/jira/browse/DRILL-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15931514#comment-15931514
]
ASF GitHub Bot commented on DRILL-5351:
---------------------------------------
Github user paul-rogers commented on a diff in the pull request:
https://github.com/apache/drill/pull/781#discussion_r106795253
--- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java ---
@@ -502,11 +502,15 @@ public void setSafe(int index, byte[] bytes) {
       assert index >= 0;
       final int currentOffset = offsetVector.getAccessor().get(index);
-      while (data.capacity() < currentOffset + bytes.length) {
-        reAlloc();
-      }
       offsetVector.getMutator().setSafe(index + 1, currentOffset + bytes.length);
-      data.setBytes(currentOffset, bytes, 0, bytes.length);
+      try {
+        data.setBytes(currentOffset, bytes, 0, bytes.length);
+      } catch (IndexOutOfBoundsException e) {
--- End diff ---
By the way, if the suggested change blows up the scope of this fix, then
the original proposal is fine; we can always adjust it later if needed when we
solve the low-density batch problem.
> Excessive bounds checking in the Parquet reader
> ------------------------------------------------
>
> Key: DRILL-5351
> URL: https://issues.apache.org/jira/browse/DRILL-5351
> Project: Apache Drill
> Issue Type: Improvement
> Reporter: Parth Chandra
> Assignee: Parth Chandra
>
> In profiling the Parquet reader, the variable-length decoding appears to be a
> major bottleneck, making the reader CPU bound rather than disk bound.
> A YourKit profile indicates the following methods are severe bottlenecks:
> VarLenBinaryReader.determineSizeSerial(long)
> NullableVarBinaryVector$Mutator.setSafe(int, int, int, int, DrillBuf)
> DrillBuf.chk(int, int)
> NullableVarBinaryVector$Mutator.fillEmpties()
> The problem is that each of these methods does some form of bounds checking,
> and eventually, of course, the actual write to the ByteBuf is bounds checked
> as well.
> DrillBuf.chk can be disabled by a configuration setting. Disabling this does
> improve performance of TPCH queries. In addition, all regression, unit, and
> TPCH-SF100 tests pass.
> I would recommend we allow users to turn this check off for
> performance-critical queries.
> Removing the bounds checking at every level is going to be a fair amount of
> work. In the meantime, it appears that a few simple changes to the variable
> length vectors improve query performance by about 10% across the board.
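To make the layering in the description concrete, here is a minimal sketch (not Drill's actual classes; the class name and the system property used here are made up for illustration) of the pattern it alludes to: a chk()-style bounds check guarded by a static final flag read from a configuration property, so that when the check is disabled the JIT can drop it and only the unavoidable low-level check on the underlying array or buffer remains.

public class GuardedBoundsCheckSketch {

  // Hypothetical switch; Drill exposes a comparable setting, but the exact name differs.
  static final boolean BOUNDS_CHECKING =
      Boolean.parseBoolean(System.getProperty("sketch.bounds.checking", "true"));

  private final byte[] data = new byte[1024];

  /** Analogue of a chk(start, length) helper: active only when the flag is on. */
  private void chk(int start, int length) {
    if (BOUNDS_CHECKING) {
      if (start < 0 || start + length > data.length) {
        throw new IndexOutOfBoundsException(start + ".." + (start + length));
      }
    }
  }

  /** Higher-level write that layers its own check on top of the array's implicit one. */
  public void setBytes(int offset, byte[] src) {
    chk(offset, src.length);                             // optional, flag-controlled check
    System.arraycopy(src, 0, data, offset, src.length);  // the JVM still bounds-checks the copy
  }

  public static void main(String[] args) {
    GuardedBoundsCheckSketch buf = new GuardedBoundsCheckSketch();
    buf.setBytes(0, "varchar value".getBytes());
    System.out.println("bounds checking enabled: " + BOUNDS_CHECKING);
  }
}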
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)