Github user paul-rogers commented on the issue:
https://github.com/apache/drill/pull/914
The second group of five commits refines the result set loader mechanism.
### Model Layer Revision
Drill is unique for a query engine in that it handles structured types as
first-class data types. For example, Drill supports maps (called "structs"
by Hive and Impala), but also supports arrays of maps. Drill supports simple
scalars, and also arrays of scalars. Put these together and we have maps that
contain arrays of maps that contain arrays of scalars. The result is a tree
structure described in the Drill documentation as modeled on JSON.
The prior version used a "model" layer to construct internal structures
that model the tree structure of Drill's data types. The model "reified"
the structure into a set of objects. While this worked well, it added more
complexity than necessary, especially when dynamically evolving a schema or
working out how to handle scan-time projection.
This version retains the model layer, but as a series of algorithms that
walk the vector container structure rather than as a separate data structure.
The model layer still provides tools to build readers for "single" and
"hyper" vectors, to extract a metadata schema from a set of vectors, to
create writers for a "single" vector container, and so on.
Replacing the object structure with algorithms required changes to both the
row set abstractions and the result set loader.
The key loss in this change is the set of "visitors" from the previous
revision. The reified model allowed all visitors to use a common structure. The
new solution still has visitors, but now they are ad-hoc, walking the container
tree in different ways depending on whether the code can work with columns
generically, or needs to deal with individual (single) vectors or hyper vectors.
### Revised Result Set Loader Column and Vector State
To understand the need for many of the changes in this commit, it helps to
take a step back and remember what we're trying to do. We want to write to
vectors, but control the resulting batch size.
#### Background
Writing to vectors is easy if we deal only with flat rows and don't worry
about batch size:
* Vectors provide `Mutator` classes that write to single vectors.
* A set of "legacy" vector writers are available, and are used by some
readers.
* Generated code uses `Mutator` and `Accessor` classes to work with vectors.
* The "Row Set" classes, used for testing, provide a refined column
writer to populate batches.
The above are the easy parts. Some challenges include:
* None of the above limits batch memory; they limit only the row count.
* Writing directly to vectors requires that the client deal with the
complexity of tracking a common position across vectors.
* Drill's tree structure makes everything more complex, as positions must
be tracked across multiple repetition levels (see below).
The "result set loader" (along with the column writers) provides a next
level of completeness by tackling the vector memory size problem, for the
entire set of Drill structured types.
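To make the position-tracking burden concrete, here is a minimal sketch using plain Java lists in place of real value vectors; `WriterIndex` and `IntColumnWriter` are illustrative names, not Drill classes. All column writers share one write position, and a writer back-fills empties when its column was skipped on earlier rows:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: a shared writer index keeps per-column writers in step. */
public class SharedIndexDemo {
  /** One mutable write position shared by all column writers. */
  static class WriterIndex {
    int rowIndex = -1;
    void nextRow() { rowIndex++; }
  }

  /** Toy scalar writer: writes at whatever row the shared index points to. */
  static class IntColumnWriter {
    final WriterIndex index;
    final List<Integer> vector = new ArrayList<>();
    IntColumnWriter(WriterIndex index) { this.index = index; }
    void setInt(int v) {
      // Fill any rows this column skipped, so all columns stay the same length.
      while (vector.size() <= index.rowIndex) vector.add(null);
      vector.set(index.rowIndex, v);
    }
  }

  public static void main(String[] args) {
    WriterIndex index = new WriterIndex();
    IntColumnWriter a = new IntColumnWriter(index);
    IntColumnWriter b = new IntColumnWriter(index);
    index.nextRow(); a.setInt(10); b.setInt(100);
    index.nextRow(); b.setInt(200);   // column "a" skipped this row: missing value
    index.nextRow(); a.setInt(30); b.setInt(300);
    System.out.println(a.vector);     // [10, null, 30]
    System.out.println(b.vector);     // [100, 200, 300]
  }
}
```

Without a shared index, each client would have to advance every column's position by hand; that is the complexity the result set loader hides.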
#### Overflow and Rollover
The key trick is to handle vector "overflow" by seamlessly shifting
writes, mid-row, from a full batch to a new "look-ahead" batch. The
process of shifting data is called "rollover."
To implement rollover, we need to work with two sets of vectors:
* The "active" set: the vectors "under" the writers and returned
downstream.
* The "backup" set: holds the buffers not currently in use.
During an overflow event, buffers are shifted between the active and backup
vectors:
* On overflow, the full buffers reside in the active vectors.
* After rollover, the full buffers reside in the backup vectors, and new
buffers, now holding the in-flight row (called the "look-ahead" row),
reside in the active vectors.
* When "harvesting" the full batch, the full buffers and look-ahead
buffers are exchanged, so the full buffers are back in the active vectors.
* When starting the next batch for writing, the look-ahead row is shifted
from the backup vectors into the active vectors and the cycle repeats.
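The exchange cycle above can be sketched with toy Java buffers (a hypothetical `RolloverDemo`; the real mechanism swaps direct-memory buffers between value vectors, not lists):

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the active/backup buffer exchange used during rollover. */
public class RolloverDemo {
  List<Integer> active = new ArrayList<>();  // under the writers, sent downstream
  List<Integer> backup = new ArrayList<>();  // look-ahead storage

  void swap() { List<Integer> t = active; active = backup; backup = t; }

  /** Overflow hit mid-row: full data moves to backup; in-flight row restarts in active. */
  void rollover(List<Integer> inFlightRow) {
    swap();                       // full buffers now in the backup set
    active.clear();
    active.addAll(inFlightRow);   // look-ahead row restarts in the active set
  }

  /** Harvest: exchange again so the full batch is back in the active container. */
  List<Integer> harvest() {
    swap();                       // full buffers back in active, look-ahead in backup
    return active;
  }

  /** Start the next batch: pull the look-ahead row back under the writers. */
  void startBatch() {
    active = new ArrayList<>(backup);
    backup = new ArrayList<>();
  }
}
```

A usage sketch: after writing rows 1, 2, 3 into `active` and overflowing on row 4, `rollover(List.of(4))` parks the full data, `harvest()` returns the batch `[1, 2, 3]`, and `startBatch()` resumes writing with `[4]` already in place.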
#### Column and Vector State
When writing without overflow, we need only one vector and so the usual
Drill vector container is sufficient. With overflow, we have two sets of
vectors, and must perform operations on them, so we need a place to store the
data. This is the purpose of the "column state" and "vector state"
classes.
Think of the overall result set loader structure as having three key
parts:
* Result set loader: manages the entire batch.
* Column writers: accept writes to vectors.
* Column state: manages the vectors themselves.
The result set loader and column writers are part of the public API. Column
state is an internal concept.
Drill is columnar. Each column in a query is represented by a column state
object. This object tracks the state of the column: normal, overflow, etc.
Every column state must hold two vectors. The vector state object holds the
active and backup vectors, and performs the required operations on these two
vectors.
Vector state is separate from column state because some columns use
multiple vectors. For example, an array of scalars has an offset vector and a
data vector. This column is represented by a single column state object, but
two vector state objects.
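As a sketch of why such a column needs two vector states, here is a toy repeated-int column built from an offset list and a data list (illustrative only; Drill's actual vectors manage direct-memory buffers):

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: a repeated-scalar column stored as an offset vector plus a data vector. */
public class RepeatedIntColumn {
  // Offsets hold one more entry than there are rows; position 0 is always 0.
  final List<Integer> offsets = new ArrayList<>(List.of(0));
  final List<Integer> data = new ArrayList<>();

  void writeArray(int... values) {
    for (int v : values) data.add(v);
    offsets.add(data.size());      // end offset of this row's array
  }

  int[] arrayAt(int row) {
    int start = offsets.get(row), end = offsets.get(row + 1);
    int[] out = new int[end - start];
    for (int i = start; i < end; i++) out[i - start] = data.get(i);
    return out;
  }
}
```

Writing `[1, 2]`, `[]`, `[3, 4, 5]` leaves `offsets = [0, 2, 2, 5]` and `data = [1, 2, 3, 4, 5]`: two independent buffers that must roll over together, hence two vector states under one column state.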
Maps are special: they have a column state for the map as a whole, along
with a "tuple state" to represent the columns within the map. The row
itself is a tuple, and so also uses the tuple state object.
Repeated maps have not just the tuple state, but also an offset vector that
indexes into the tuple.
#### Schema Subtlety
Suppose an application uses the result set loader to write a batch up
through overflow. Suppose that 1000 rows fit into the batch and the 1001st
causes overflow. The above description handles this case.
Now suppose that on row 1001 the application adds a new column. Should the
column appear within the batch that contains the first 1000 rows? One can argue
either way. Consider this argument. If the application just used row counts,
the application would have read 1000 rows and sent the batch downstream. Then,
on the 1001st row (the first row of the next batch) the application would have
added the new column.
Using this reasoning, we adopt a rule that a new column appears only in the
batch that contains the row in which the column was added. So, in the overflow
case, the new column does not appear in the first batch, but does appear in the
second.
To make this work, the column state tracks schema version numbers and can
hold onto columns that are not yet visible. The result set loader builds an
"output" schema that contains only those columns that are visible. This
means that the vector container sent downstream *must* be distinct from the
structure used to hold vectors in the result set loader.
For this reason, the tuple and column states are the mechanism to hold the
entire schema, while the output vector container is built to include only the
visible columns.
This mechanism is even more necessary in the next commit which will add
projection back into the result set loader mechanism.
The point is: the tuple and column states allow the result set loader to
hold internally a larger set of columns than are made visible to the downstream
operators via the output vector container.
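The visibility rule can be sketched by stamping each column with the schema version at which it was added, and filtering the output schema by the version in effect when the batch is harvested (hypothetical names; a simplification of the actual column-state bookkeeping):

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: hiding columns added after the rows of the harvested batch. */
public class SchemaVersionDemo {
  static class ColumnState {
    final String name;
    final int addedAtVersion;   // schema version when the column appeared
    ColumnState(String name, int version) { name_ck: this.name = name; this.addedAtVersion = version; }
  }

  final List<ColumnState> allColumns = new ArrayList<>(); // full internal schema
  int schemaVersion = 0;

  void addColumn(String name) { allColumns.add(new ColumnState(name, ++schemaVersion)); }

  /** Output schema for a batch harvested at the given version: later columns stay hidden. */
  List<String> outputSchema(int harvestVersion) {
    List<String> visible = new ArrayList<>();
    for (ColumnState c : allColumns)
      if (c.addedAtVersion <= harvestVersion) visible.add(c.name);
    return visible;
  }
}
```

In the overflow scenario above: columns `a` and `b` exist when the 1000-row batch overflows, and `c` is added on the look-ahead row, so harvesting at the overflow version yields `[a, b]` while the next batch sees `[a, b, c]`.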
### "Torture" Test
This commit adds a new result set loader test called the "torture
test." This single test combines all the complexities that the result set
loader must handle:
* Deeply nested repeated structures
* Null values
* Missing values
* Omitted rows
* Vector Rollover
This test has lived up to its name, revealing several subtle bugs in each
mechanism. These bugs led to an increased understanding of the requirements for
the underlying mechanisms, which resulted in many subtle changes to the
accessor and result set loader layers. Little new functionality was added, but
the existing functionality was refined to handle many odd cases.
### Repetition Levels
The tests mostly revealed issues with managing offsets in the layers of
repeated types. Drill has two kinds of nesting in its tree structure:
structures and arrays. Structures just add a level to the name space. (For
example, a top-level map introduces a name space, "m", say, with its
columns nested within that name space: "m.a", "m.b".) The other kind of
nesting introduces a "repetition level." Drill has four kinds of repetition
levels:
* Rows within a row set (batch)
* Scalars within an array of scalars
* Maps (structs) within an array of structs (repeated map)
* Lists (not yet supported in this code)
Getting all the offset vectors, pointers, and rollover logic to work was
non-trivial. A persistent source of errors is the fact that offsets written for
a row are one ahead of the row itself. That is, row 10 writes its end offset in
position 11, and so on. This requires many, many special cases.
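One such special case shows up at rollover: because offsets are written one ahead, the look-ahead batch's offset vector must be rebased so that position 0 again holds 0. A sketch of that rebasing, with plain lists standing in for offset vectors (illustrative, not Drill's actual rollover code):

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: rebasing one-ahead offsets when an in-flight row rolls to a new batch. */
public class OffsetRebaseDemo {
  /**
   * Given a full batch's offset vector and the index of the in-flight row,
   * produce the offset vector for the look-ahead batch, shifted so the
   * moved row's data starts at position 0 of its new data buffer.
   */
  static List<Integer> rebase(List<Integer> offsets, int inFlightRow) {
    int base = offsets.get(inFlightRow);  // data start of the row being moved
    List<Integer> rebased = new ArrayList<>();
    for (int i = inFlightRow; i < offsets.size(); i++)
      rebased.add(offsets.get(i) - base); // position 0 holds 0 again
    return rebased;
  }
}
```

For example, with offsets `[0, 2, 2, 5, 9]` (row 3's end offset sits at position 4), overflow on row 3 gives a rebased vector of `[0, 4]`: the look-ahead row's 4 values now start at data position 0.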
### Rollover Refinement
When a vector overflows, data must be "rolled over" from the in-flight
batch to a new one. Rollover is simple with "flat" rows. But, again, many
subtle problems crop up when dealing with repetition levels and their
corresponding offset vectors and nested indexes. The code here refines rollover
handling with several new "writer events" to prepare for, and clean up
after, rollover.
Rollover requires work at both the vector and writer level. Work is divided
as follows:
* A `preRollover()` event in the writers prepares vectors for rollover
by filling empties and setting value counts.
* The result set loader "column state" and "vector state" classes
move data from one set of vectors to another.
* A `postRollover()` event in the writers resets writer indexes,
addresses and other pointers to reflect the new data positions.
### Even More Javadoc
As the code has stabilized, it has become worthwhile describing the
structure in detail in Javadoc, complete with "ASCII-art" to illustrate
some of the concepts when working with Drill's tree-structured vector model.
### More Tests
Additional lower-level "row set" tests were added for various bits of
the code. Also, a couple of tools for printing structures were added to aid
debugging. (It is easier to fix things when we can visualize the complex data
structures and vector contents.)
### Other Changes
A few nested classes have grown larger and were pulled out into their own
files.
---