[GitHub] drill pull request #887: DRILL-5688: Add repeated map support to column acce...

paul-rogers Wed, 26 Jul 2017 22:49:13 -0700

GitHub user paul-rogers opened a pull request:

    https://github.com/apache/drill/pull/887


    DRILL-5688: Add repeated map support to column accessors

    Restructures the existing "column accessor" code to adopt a JSON-like 
structure that works for all of Drill's data types. This PR focuses on the 
"repeated map" vector. (Support for repeated lists is still to come, but they 
fit into the revised JSON structure.)
    
    This PR has four commits that highlight different parts of the changes:
    
    * The core accessors themselves
    * Changes to vector classes along with a new "tuple metadata" class
    * Revisions to the "row set" test framework which uses, and tests, the 
accessors
    * "Collateral damage" changes that pick up changes to the row set classes 
and add a number of small test framework improvements.
    
    ### Accessors
    
    The accessor structure is explained in the `package-info.java` files in the 
accessor packages. Basically, the structure is:
    
    * The accessor types are: tuple, array and scalar
    * A tuple is a set of (name, type) pairs
    * Maps and rows are both tuples
    * Arrays are a series of one of the three types
    
    The accessors add an "object" layer that represents any of the three types. 
So, a tuple is really a list of (name, object accessor) pairs, where the object 
accessor provides access to a scalar, an array or a tuple as appropriate for 
each column.
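
    To make the layering concrete, here is a minimal, self-contained sketch of 
the tuple/array/scalar structure with an "object" wrapper around each. Every 
name in it (`AccessorSketch`, `ObjectReader`, `render()` and so on) is 
hypothetical and chosen for illustration; the real interfaces are the ones 
described in the accessor packages, and they read from vectors rather than 
from in-memory values as this sketch does.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Illustrative only: every name here is hypothetical, not Drill's accessor API.
class AccessorSketch {

  enum ObjectType { SCALAR, ARRAY, TUPLE }

  /** The "object" layer: each instance wraps exactly one of the three kinds. */
  interface ObjectReader {
    ObjectType type();
    default Object scalar() { throw new UnsupportedOperationException("not a scalar"); }
    default List<ObjectReader> array() { throw new UnsupportedOperationException("not an array"); }
    default Map<String, ObjectReader> tuple() { throw new UnsupportedOperationException("not a tuple"); }
  }

  static ObjectReader scalar(final Object value) {
    return new ObjectReader() {
      @Override public ObjectType type() { return ObjectType.SCALAR; }
      @Override public Object scalar() { return value; }
    };
  }

  /** An array is a series of objects, each of which may be any of the three kinds. */
  static ObjectReader array(ObjectReader... elements) {
    final List<ObjectReader> list = Arrays.asList(elements);
    return new ObjectReader() {
      @Override public ObjectType type() { return ObjectType.ARRAY; }
      @Override public List<ObjectReader> array() { return list; }
    };
  }

  /** Builds a tuple from alternating (name, reader) arguments, kept in column order. */
  static ObjectReader tuple(Object... namesAndReaders) {
    final Map<String, ObjectReader> columns = new LinkedHashMap<>();
    for (int i = 0; i < namesAndReaders.length; i += 2) {
      columns.put((String) namesAndReaders[i], (ObjectReader) namesAndReaders[i + 1]);
    }
    return new ObjectReader() {
      @Override public ObjectType type() { return ObjectType.TUPLE; }
      @Override public Map<String, ObjectReader> tuple() { return columns; }
    };
  }

  /** Renders any object as JSON-like text by dispatching on its kind. */
  static String render(ObjectReader obj) {
    switch (obj.type()) {
      case SCALAR:
        return String.valueOf(obj.scalar());
      case ARRAY: {
        StringJoiner items = new StringJoiner(", ", "[", "]");
        for (ObjectReader element : obj.array()) {
          items.add(render(element));
        }
        return items.toString();
      }
      default: {
        StringJoiner fields = new StringJoiner(", ", "{", "}");
        for (Map.Entry<String, ObjectReader> column : obj.tuple().entrySet()) {
          fields.add(column.getKey() + ": " + render(column.getValue()));
        }
        return fields.toString();
      }
    }
  }

  public static void main(String[] args) {
    // A row (tuple) holding a scalar column and a repeated map (array of tuples).
    ObjectReader row = tuple(
        "id", scalar(1),
        "points", array(
            tuple("x", scalar(10), "y", scalar(20)),
            tuple("x", scalar(30), "y", scalar(40))));
    System.out.println(render(row)); // {id: 1, points: [{x: 10, y: 20}, {x: 30, y: 40}]}
  }
}
```

    In this sketch the repeated map is simply an array whose elements are 
tuples, so it needs no special case of its own.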
    
    The structure appears complex (since it must model JSON). But, an app using 
this code would use just the leaf scalar readers and writers. These classes 
currently access data via the value vector `Mutator` and `Accessor` classes. 
But, the goal is to eventually access the Netty `PlatformDependent` methods 
directly so that there is a single layer between the application and the call 
into direct memory. (Today there are multiple layers.)
    
    There is quite a bit of code change here to provide the new structure. But, 
the core functionality of reading and writing to vectors has not changed much. 
And, this code has extensive unit tests, which should avoid the need to 
"mentally execute" each line of code.
    
    ### Supporting Classes
    
    A new `TupleMetadata` class is a superset of the existing `BatchSchema`, 
but also provides "classic" tuple-like access by position or name. Eventually, 
this will also hold additional information such as actual size and so on 
(information now laboriously rediscovered by the "record batch sizer.") Since 
the accessors use a "tuple" abstraction to model both rows and maps, the tuple 
metadata provides this same view. The top-most tuple models the row. Columns 
within the row can be maps, which have their own tuple schema, and so on.
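
    As a rough illustration of the "access by position or name" idea with 
nested map schemas, consider the sketch below. It is not the `TupleMetadata` 
API: the class `TupleSchemaSketch`, its methods, and the use of plain strings 
for types are all assumptions made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: hypothetical names, not the actual TupleMetadata API.
class TupleSchemaSketch {

  static final class Column {
    final String name;
    final String type;                  // e.g. "INT", "VARCHAR", "MAP"
    final TupleSchemaSketch mapSchema;  // non-null only for map columns

    Column(String name, String type, TupleSchemaSketch mapSchema) {
      this.name = name;
      this.type = type;
      this.mapSchema = mapSchema;
    }

    boolean isMap() { return mapSchema != null; }
  }

  private final List<Column> columns = new ArrayList<>();             // access by position
  private final Map<String, Integer> indexByName = new HashMap<>();   // access by name

  TupleSchemaSketch add(String name, String type) {
    return addColumn(new Column(name, type, null));
  }

  /** A map column carries its own tuple schema, so schemas nest to any depth. */
  TupleSchemaSketch addMap(String name, TupleSchemaSketch mapSchema) {
    return addColumn(new Column(name, "MAP", mapSchema));
  }

  private TupleSchemaSketch addColumn(Column column) {
    if (indexByName.containsKey(column.name)) {
      throw new IllegalArgumentException("Duplicate column: " + column.name);
    }
    indexByName.put(column.name, columns.size());
    columns.add(column);
    return this;
  }

  int size() { return columns.size(); }

  /** Returns the column's position, or -1 if no such column exists. */
  int index(String name) {
    Integer position = indexByName.get(name);
    return position == null ? -1 : position;
  }

  Column column(int index) { return columns.get(index); }

  /** Returns the named column, or null if no such column exists. */
  Column column(String name) {
    int position = index(name);
    return position < 0 ? null : columns.get(position);
  }

  public static void main(String[] args) {
    // The top-most tuple models the row; "customer" is a map with its own schema.
    TupleSchemaSketch row = new TupleSchemaSketch()
        .add("id", "INT")
        .addMap("customer", new TupleSchemaSketch()
            .add("name", "VARCHAR")
            .add("zip", "INT"));

    System.out.println(row.column(1).name);                                   // customer
    System.out.println(row.column("customer").mapSchema.column("zip").type);  // INT
  }
}
```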
    
    `TupleNameSpace` moves locations (so it can be used in the vector package) 
but otherwise remains unchanged.
    
    `DrillBuf` provides an experimental `putInt()` method that does bounds 
checking and sets a value, to minimize calls. This will probably move into the 
writer in a later PR.
    
    This PR fixes DRILL-5690, a bug in repeated vectors that did not pass along 
Decimal scale and precision. See `RepeatedValueVectors.java`.
    
    `MaterializedField` gains an `isEquivalent()` method that compares 
two fields while ignoring internal (`$offset$`, `$bits$`, etc.) vectors.
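
    The idea behind such an equivalence check can be sketched as follows. This 
is a simplified stand-in, not the `MaterializedField` code: `FieldSketch` is a 
hypothetical class, types are plain strings, names are compared exactly, and 
"internal" is assumed to mean any child whose name is wrapped in `$` 
characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a hypothetical stand-in, not Drill's MaterializedField.
class FieldSketch {
  final String name;
  final String type;
  final List<FieldSketch> children;

  FieldSketch(String name, String type, FieldSketch... children) {
    this.name = name;
    this.type = type;
    this.children = Arrays.asList(children);
  }

  /** Assumption: internal vectors are the children whose names are wrapped in '$'. */
  static boolean isInternal(FieldSketch field) {
    return field.name.length() > 1
        && field.name.startsWith("$")
        && field.name.endsWith("$");
  }

  /** Children that are part of the user-visible schema, in their original order. */
  List<FieldSketch> visibleChildren() {
    List<FieldSketch> visible = new ArrayList<>();
    for (FieldSketch child : children) {
      if (!isInternal(child)) {
        visible.add(child);
      }
    }
    return visible;
  }

  /** True if both fields match in name, type and (recursively) visible children. */
  boolean isEquivalent(FieldSketch other) {
    if (!name.equals(other.name) || !type.equals(other.type)) {
      return false;
    }
    List<FieldSketch> mine = visibleChildren();
    List<FieldSketch> theirs = other.visibleChildren();
    if (mine.size() != theirs.size()) {
      return false;
    }
    for (int i = 0; i < mine.size(); i++) {
      if (!mine.get(i).isEquivalent(theirs.get(i))) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    FieldSketch withInternal = new FieldSketch("m", "MAP",
        new FieldSketch("$offset$", "UINT4"),
        new FieldSketch("x", "INT"));
    FieldSketch withoutInternal = new FieldSketch("m", "MAP",
        new FieldSketch("x", "INT"));
    System.out.println(withInternal.isEquivalent(withoutInternal)); // true
  }
}
```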
    
    ### Row Set Classes and Tests
    
    The `RowSet` family of classes changed in response to the accessor changes.
    
    * The reader and writer are moved to separate files.
    * Row sets now use a "parsed" form of "storage" classes to hold vectors 
(more below).
    * Static factory methods were added to hide constructor complexity.
    * The `RowSetBuilder` and `RowSetComparison` test tools added support for 
repeated maps.
    * Code to handle generic object writing moved from the `RowSetBuilder` into 
the accessors.
    * The old `RowSetSchema` evolved to become the `TupleMetadata` mentioned 
above.
    * Tests were greatly enhanced to test all modes of all supported scalar 
types, as well as the new JSON-like structure.
    
    In the previous version, the row set classes contained the logic to figure 
out what kind of accessor to create for each vector, and that logic became 
overly complex. In this version, the row set "parses" a vector container to 
create "storage" objects that represent tuples and columns. A column can, 
itself, be a tuple. (Note: there is no need to model lists since lists are just 
vectors at this level of abstraction, and so need no special handling.)
    
    With this change, accessor creation is a simple matter of walking a tree to 
assemble the JSON-structure.
    
    This structure is also used to create a batch's vectors from a schema.
    
    ### Other Changes
    
    The last commit contains various other changes, mostly reflecting the 
changes above.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/paul-rogers/drill DRILL-5688

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/887.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #887
    
----
commit 022c5e9ed08c6393166e66ac5e862168bc6c5e77
Author: Paul Rogers <[email protected]>
Date:   2017-07-27T05:03:50Z

    DRILL-5688: Add repeated map support to column accessors
    
    Includes the core JSON-like reader and writer interfaces and
    implementations.

commit 170101b177c113ebbdf1d0f890b1d80487c0ea2f
Author: Paul Rogers <[email protected]>
Date:   2017-07-27T05:05:36Z

    Supporting vector and related classes
    
    Includes changes to value vectors, DrillBuf and other low-level classes.

commit f1ce8ffa6caa3120316ba538a5dc3e918c61da58
Author: Paul Rogers <[email protected]>
Date:   2017-07-27T05:08:07Z

    Row set test classes
    
    Modifications to the row set abstraction (used for testing) for the
    changed accessors. Row sets also act as tests for the accessor classes,
    including a number of tests that test the classes used for testing.
    (Yes, somewhat recursive…)

commit 0310772c1920948c487c4789bb5d0f3fc5e3d012
Author: Paul Rogers <[email protected]>
Date:   2017-07-27T05:09:07Z

    Test code affected by the row set changes
    
    Changes to unit tests, and the unit test framework, required by the
    changes to the accessor and row set classes.

----

