[ https://issues.apache.org/jira/browse/DRILL-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15931481#comment-15931481 ]
ASF GitHub Bot commented on DRILL-5351:
---------------------------------------
Github user paul-rogers commented on a diff in the pull request:
https://github.com/apache/drill/pull/781#discussion_r106794181
--- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java ---
@@ -502,11 +502,15 @@ public void setSafe(int index, byte[] bytes) {
assert index >= 0;
final int currentOffset = offsetVector.getAccessor().get(index);
- while (data.capacity() < currentOffset + bytes.length) {
- reAlloc();
- }
offsetVector.getMutator().setSafe(index + 1, currentOffset + bytes.length);
- data.setBytes(currentOffset, bytes, 0, bytes.length);
+ try {
+ data.setBytes(currentOffset, bytes, 0, bytes.length);
+ } catch (IndexOutOfBoundsException e) {
--- End diff ---
The Netty/DrillBuf code is complex, so this does boil down to details...
Yes, I agree that Netty does the bounds checks -- if we call Netty code.
Consider this code in DrillBuf:
```
@Override
public ByteBuf setBytes(int index, ByteBuf src, int srcIndex, int length) {
  udle.setBytes(index + offset, src, srcIndex, length);
  return this;
}

public ByteBuf setBytes(int index, ByteBuffer src, int srcIndex, int length) {
  if (src.isDirect()) {
    checkIndex(index, length);
    PlatformDependent.copyMemory(PlatformDependent.directBufferAddress(src) + srcIndex,
        this.memoryAddress() + index, length);
  ...
```
We have one version that delegates to the "udle", which calls another buf,
which calls... all the way down to a bounds check.
But we have another method that cuts out the middleman and just does a
direct memory copy. Given that, we could certainly add a version that does a
bounds check and copies from heap into direct memory. Just use this Netty method:
```
class PlatformDependent ...
  public static void copyMemory(byte[] src, int srcIndex, long dstAddr, long length) ...
```
So, something like this:
```
class DrillBuf ...
  public boolean setIfCapacity(int offset, byte[] data, int len) {
    // Refuse the write if it would overflow the current capacity.
    if (offset + len > capacity()) { return false; }
    // Copy from heap directly to this buffer's memory address.
    PlatformDependent.copyMemory(data, 0, memoryAddress() + offset, len);
    return true;
  }
```
Of course, this assumes a particular implementation of the underlying direct
memory, but, as we saw, the existing code already makes the same assumption.
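For comparison, the call site in setSafe() would then become a retry loop
rather than a try/catch. A sketch, assuming the hypothetical setIfCapacity()
method above:
```
public void setSafe(int index, byte[] bytes) {
  assert index >= 0;
  final int currentOffset = offsetVector.getAccessor().get(index);
  offsetVector.getMutator().setSafe(index + 1, currentOffset + bytes.length);
  // Expand the backing buffer until the bounds-checked copy fits.
  while (!data.setIfCapacity(currentOffset, bytes, bytes.length)) {
    reAlloc();
  }
}
```
That keeps the common case to one capacity check and one memory copy, with no
exception machinery on the hot path.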
Would this work and perform as well as the exception-based approach?
> Excessive bounds checking in the Parquet reader
> ------------------------------------------------
>
> Key: DRILL-5351
> URL: https://issues.apache.org/jira/browse/DRILL-5351
> Project: Apache Drill
> Issue Type: Improvement
> Reporter: Parth Chandra
> Assignee: Parth Chandra
>
> In profiling the Parquet reader, variable-length decoding appears to be a
> major bottleneck, making the reader CPU bound rather than disk bound.
> A YourKit profile indicates that the following methods are severe bottlenecks:
> VarLenBinaryReader.determineSizeSerial(long)
> NullableVarBinaryVector$Mutator.setSafe(int, int, int, int, DrillBuf)
> DrillBuf.chk(int, int)
> NullableVarBinaryVector$Mutator.fillEmpties()
> The problem is that each of these methods does some form of bounds checking
> and eventually, of course, the actual write to the ByteBuf is bounds checked
> as well.
> DrillBuf.chk can be disabled by a configuration setting. Disabling this does
> improve performance of TPCH queries. In addition, all regression, unit, and
> TPCH-SF100 tests pass.
> I would recommend we allow users to turn this check off for
> performance-critical queries.
> Removing the bounds checking at every level is going to be a fair amount of
> work. In the meantime, it appears that a few simple changes to variable-length
> vectors improve query performance by about 10% across the board.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)