[ https://issues.apache.org/jira/browse/DRILL-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15931426#comment-15931426 ]

ASF GitHub Bot commented on DRILL-5351:
---------------------------------------

Github user parthchandra commented on a diff in the pull request:

    https://github.com/apache/drill/pull/781#discussion_r106792968
  
    --- Diff: exec/vector/src/main/codegen/templates/VariableLengthVectors.java ---
    @@ -502,11 +502,15 @@ public void setSafe(int index, byte[] bytes) {
           assert index >= 0;
     
           final int currentOffset = offsetVector.getAccessor().get(index);
    -      while (data.capacity() < currentOffset + bytes.length) {
    -        reAlloc();
    -      }
           offsetVector.getMutator().setSafe(index + 1, currentOffset + bytes.length);
    -      data.setBytes(currentOffset, bytes, 0, bytes.length);
    +      try {
    +        data.setBytes(currentOffset, bytes, 0, bytes.length);
    +      } catch (IndexOutOfBoundsException e) {
    --- End diff --
    
    The check for a DrillBuf write exceeding bounds is already done once in
Netty (which also throws the IndexOutOfBoundsException). What we want is to
avoid doing more than one bounds check for every vector write operation. By
adding the suggested check, we would simply move the check from every vector
setSafe method to every DrillBuf write. This could impact performance in other
parts of the code.
    I deliberately changed only the variable length vectors to address a
hotspot in Parquet reader performance because, as you are suggesting, the right
way to fix this is to address the low-density batch problem at the same time as
the vector overflow problem. Perhaps a longer discussion is required here.
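
For reference, a rough sketch of the pattern the diff moves toward (the catch
body is elided in the excerpt above, so this assumes it grows the buffer and
retries; it is illustrative, not the committed code):

    // Illustrative sketch only: assumes the elided catch block reallocates and
    // retries. Field names (offsetVector, data) and reAlloc() are taken from
    // the vector template shown in the diff.
    public void setSafe(int index, byte[] bytes) {
      assert index >= 0;

      final int currentOffset = offsetVector.getAccessor().get(index);
      offsetVector.getMutator().setSafe(index + 1, currentOffset + bytes.length);

      try {
        // Optimistic write: rely on the single bounds check the underlying
        // buffer already performs (it throws IndexOutOfBoundsException).
        data.setBytes(currentOffset, bytes, 0, bytes.length);
      } catch (IndexOutOfBoundsException e) {
        // Assumed recovery path: grow the buffer until the value fits, then retry.
        while (data.capacity() < currentOffset + bytes.length) {
          reAlloc();
        }
        data.setBytes(currentOffset, bytes, 0, bytes.length);
      }
    }

The common case then pays only the one bounds check that Netty performs inside
setBytes; the capacity-growing loop runs only on the rare overflow.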



> Excessive bounds checking in the Parquet reader 
> ------------------------------------------------
>
>                 Key: DRILL-5351
>                 URL: https://issues.apache.org/jira/browse/DRILL-5351
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: Parth Chandra
>            Assignee: Parth Chandra
>
> In profiling the Parquet reader, the variable length decoding appears to be a
> major bottleneck, making the reader CPU bound rather than disk bound.
> A YourKit profile indicates that the following methods are severe bottlenecks:
>   VarLenBinaryReader.determineSizeSerial(long)
>   NullableVarBinaryVector$Mutator.setSafe(int, int, int, int, DrillBuf)
>   DrillBuf.chk(int, int)
>   NullableVarBinaryVector$Mutator.fillEmpties()
> The problem is that each of these methods does some form of bounds checking,
> and the actual write to the ByteBuf is, of course, eventually bounds checked
> as well.
> DrillBuf.chk can be disabled by a configuration setting. Disabling it does
> improve the performance of TPCH queries, and all regression, unit, and
> TPCH-SF100 tests still pass.
> I would recommend we allow users to turn this check off for
> performance-critical queries.
> Removing the bounds checking at every level is going to be a fair amount of
> work. In the meantime, it appears that a few simple changes to the variable
> length vectors improve query performance by about 10% across the board.
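
As a point of reference, the configuration-controlled check mentioned above
amounts to a flag guarding the index validation. The sketch below is
illustrative only; the system property name is hypothetical and the actual
Drill setting is not named in this issue:

    // Illustrative sketch of a toggleable bounds check in the spirit of
    // DrillBuf.chk. The property name is hypothetical, and capacity() stands
    // in for the surrounding buffer's capacity accessor.
    private static final boolean BOUNDS_CHECKING_ENABLED =
        Boolean.parseBoolean(System.getProperty("example.bounds.checking", "true"));

    private void chk(int index, int width) {
      if (!BOUNDS_CHECKING_ENABLED) {
        return;  // hot path: skip the redundant check entirely
      }
      if (index < 0 || width < 0 || index + width > capacity()) {
        throw new IndexOutOfBoundsException(
            "index: " + index + ", width: " + width + ", capacity: " + capacity());
      }
    }

Even with this check disabled, the final write into the underlying Netty buffer
is still bounds checked, which is the single check the pull request above tries
to rely on.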



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
