[GitHub] drill issue #887: DRILL-5688: Add repeated map support to column accessors

paul-rogers Mon, 31 Jul 2017 08:59:29 -0700

Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/887
  
    This set of commits provide performance tuning of the column writer 
implementation. Discussion during the code review identified a number of tuning 
opportunities that lead to this work. The commits divide the files by layer:
    
    * Value vectors
    * Column accessors
    * Row set classes used for testing
    * Secondary files affected by the above changes
    
    The result is a very nice performance improvement relative to the original 
vector mutator methods. The writers now provide:
    
    * 42% faster writes for required values
    * 73% faster writes for nullable values
    * 408% (4x) faster writes for repeated values
    
    Not a bad outcome from a casual code review conversation!
    
    ### Merge Vector Overflow into Column Writers
    
    DRILL-5688 and DRILL-5657 originally separated the tasks of detecting and 
handling vector overflow. (Vector overflow occurs when we are about to write 
past the maximum buffer size.) The result was that each âsetâ operation 
needed to traverse two layers: the âcolumn loaderâ which handles overflow, 
and the âcolumn writerâ which writes the value to a vector.
    
    Discussion identified an opportunity to combine these into a single layer. 
The revised column accessors in this PR achieve that revision: each accessor 
detects overflow, call a handler to handle the overflow, then continues to 
write operation into the (presumably) new vector.
    
    The âresult set loaderâ (DRILL-5657) abstraction contains the logic to 
handle overflows. The original ârow setâ abstractions for tests donât 
handle overflow; instead they throw a index out-of-bounds exception. (Tests 
seldom write enough data to trigger an overflow unless deliberately testing the 
overflow feature.)
    
    ### Bypass the Vector Mutator
    
    The big performance improvement was to avoid the overhead of the original 
vector mutator. In the original code, each write travels through many layers of 
code:
    
    * Vector mutator
    * Drillbuf
    * UDLE
    * Netty platform dependent
    * Java Unsafe
    
    The revised column writers bypass the first three layers; and instead use 
the Netty platform dependent methods to write to vector memory. In this, they 
follow the text reader implementation which does something similar.
    
    ### Specialized Offset Vector Writer
    
    For arrays (repeated) and variable-width values, the original code is quite 
inefficient as it must do a direct memory read, and two writes, for each value. 
The revised writer code for variable-width fields does two writes. For arrays 
(repeated) values, the writer code does just one write per value and one write 
per row. (This is why the writer code is so much faster: the test writes five 
values per row: 15 direct memory I/Os in the original code, just 6 in the new 
writer code.)
    
    A new `OffsetVectorWriter` class manages the details of writing to the 
offset vector.
    
    ### Highly-Optimized Writer Code
    
    The unfortunate side-effect of the optimization is that the writers are no 
longer simple. The previous version (which should be reviewed first) delegate 
work to the vector mutator. The new version provides an integrated set of code 
that:
    
    * Does one single bounds check per value to
    * Check if the vector size must increase, and if so
    * Check if the growth would cause overflow
    * Also check if the âfill emptiesâ logic is needed to fill in missing 
values
    
    On the other hand, it became clear that the writers could be simplified. 
The previous version, because it worked with the mutators in each vector class, 
needed a separate writer for required, optional and repeated vectors. The new 
code, because it works directly with Nettyâs platform dependent layer, can 
use a single writer for all three data modes.
    
    Now, one might note that the single implementation has some very slight 
overhead:
    
    * The nullable vector need not fill empty âdataâ values (null values 
can be garbage, empty filling is really only needed for the âbitsâ vector)
    * Repeated writers need not fill empty values (there may be empties in the 
offset vector, but not the data vector.)
    
    However, the cost is a single comparison, so it did not seem worth the 
effort to generate three complex classes to avoid a single comparison. (The 
performance test results seem to back up this assumption.)
    
    Because of the above, the `ColumnAccessors.java` template has become a bit 
more complex (as if it was not already plenty complex.) Best is to look at the 
generated `ColumnAccessors` class, and its nested classes, to see the result of 
the template.
    
    ### Non-zeroing Vector Reallocation
    
    The existing `reAlloc` methods double vector size and zero-fill the upper 
half of the new vector. This causes the `setSafe()` methods to use loops to 
call `reAlloc()` multiple times in case the vector may need to quadruple or 
more.
    
    The revised column writer code avoids these two performance penalties by
    
    * Computing the required new vector size so it can be allocated in one go 
and
    * Skipping the zeroing step.
    
    Existing code continues to use the original implementation.
    
    ### Collateral Code Cleanup
    
    Because vector overflow is now integrated into the writers, there is no 
longer a need to throw the prior `VectorOverflowException`, which is now 
removed. The code changes in prior commits to handle this exception are removed 
in this commit.
    
    DRILL-5517 added âsize awareâ methods to value vectors to support the 
earlier version of the writers which used the vector mutators. Since that logic 
is now integrated into the writers themselves, this commit backs out those 
changes to tidy up the value vector code.
    
    Previous versions of the row set writers flattened maps for easier creation 
of test data in the `RowSetBuilder`. Earlier commits changed this behavior to 
require an array of values to populate a map. A number of tests needed updating 
as a result.
    
    Retrofitted a couple of tests to use the `SubOperatorTest` base class.
    
    Split the `MapWriter` into two distinct base classes so that the code can 
set the value count which is implemented in two different classes without a 
common parent class method definition.
    
    Earlier versions had a way to set every column type from an int value for 
testing. Moved this code around to make use of the generic object-based set 
methods now available.
    
    Added a `PerformanceTool` used to compare writer performance to the 
original vector & mutator implementations.
    
    Added a `TestFillEmpties` test which writes every fifth value to every data 
type and every mode to ensure that the âfill emptiesâ logic works for all. 
This also has the handy side-effect of verifying null handling and basic scalar 
writer behavior.
    
    ### Readers
    
    A logical question is, âif the column writers benefited from 
optimization, what about the column readers?â Writers are the focus of the 
âmemory fragmentationâ project which will replace the low-level vector 
writers in various operators.
    
    For now, readers are only used in test code - which is not performance 
critical. We can apply the same technique to readers when they find themselves 
on the critical performance path.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

[GitHub] drill issue #887: DRILL-5688: Add repeated map support to column accessors

Reply via email to