GitHub user paul-rogers opened a pull request:
https://github.com/apache/drill/pull/887
DRILL-5688: Add repeated map support to column accessors
Restructures the existing "column accessor" code to adopt a JSON-like
structure that works for all of Drill's data types. This PR focuses on the
"repeated map" vector. (Support for repeated lists is still to come, but they
fit into the revised JSON-like structure.)
This PR has four commits that highlight different parts of the changes:
* The core accessors themselves
* Changes to vector classes along with a new "tuple metadata" class
* Revisions to the "row set" test framework which uses, and tests, the
accessors
* "Collateral damage" changes that pick up changes to the row set classes
and add a number of small test framework improvements.
### Accessors
The accessor structure is explained in the `package-info.java` files in the
accessor packages. Basically, the structure is:
* The accessor types are: tuple, array and scalar
* A tuple is a set of (name, type) pairs
* Maps and rows are both tuples
* Arrays are a series of one of the three types
The accessors add an "object" layer that represents any of the three types.
So, a tuple is really a list of (name, object accessor) pairs, where the object
accessor provides access to a scalar, an array or a tuple as appropriate for
each column.
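
As a rough illustration of the shape described above, here is a minimal sketch of an "object" layer that wraps a scalar, an array or a tuple. All names (`Kind`, `ObjectSketch`, `with()`, and so on) are hypothetical and are not the interfaces in this PR. The sketch holds plain Java values rather than reading value vectors, so it shows only the structure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; none of these names come from the PR. */
enum Kind { SCALAR, ARRAY, TUPLE }

/** The "object" layer: holds exactly one of scalar, array or tuple. */
final class ObjectSketch {
  final Kind kind;
  private final Object scalar;                      // used when kind == SCALAR
  private final List<ObjectSketch> elements;        // used when kind == ARRAY
  private final Map<String, ObjectSketch> columns;  // used when kind == TUPLE

  private ObjectSketch(Kind kind, Object scalar,
      List<ObjectSketch> elements, Map<String, ObjectSketch> columns) {
    this.kind = kind;
    this.scalar = scalar;
    this.elements = elements;
    this.columns = columns;
  }

  static ObjectSketch scalar(Object value) {
    return new ObjectSketch(Kind.SCALAR, value, null, null);
  }

  static ObjectSketch array(ObjectSketch... items) {
    List<ObjectSketch> list = new ArrayList<>(Arrays.asList(items));
    return new ObjectSketch(Kind.ARRAY, null, list, null);
  }

  /** Creates an empty tuple; rows and maps are both built this way. */
  static ObjectSketch tuple() {
    Map<String, ObjectSketch> cols = new LinkedHashMap<>();
    return new ObjectSketch(Kind.TUPLE, null, null, cols);
  }

  /** Adds a (name, object) pair to a tuple and returns the tuple for chaining. */
  ObjectSketch with(String name, ObjectSketch child) {
    require(Kind.TUPLE);
    columns.put(name, child);
    return this;
  }

  ObjectSketch column(String name) { require(Kind.TUPLE); return columns.get(name); }
  ObjectSketch element(int index) { require(Kind.ARRAY); return elements.get(index); }
  int size() { require(Kind.ARRAY); return elements.size(); }
  Object value() { require(Kind.SCALAR); return scalar; }

  private void require(Kind expected) {
    if (kind != expected) {
      throw new IllegalStateException("Expected " + expected + " but was " + kind);
    }
  }
}
```

In this model a repeated map is simply an array whose elements are tuples, for example `ObjectSketch.array(ObjectSketch.tuple().with("x", ObjectSketch.scalar(1)))`.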
The structure appears complex (since it must model JSON). But, an app using
this code would use just the leaf scalar readers and writers. These classes
currently access data via the value vector `Mutator` and `Accessor` classes.
But, the goal is to eventually access the Netty `PlatformDependent` methods
directly so that there is a single layer between the application and the call
into direct memory. (Today there are multiple layers.)
There is quite a bit of code change here to provide the new structure. But,
the core functionality of reading and writing to vectors has not changed much.
And, this code has extensive unit tests, which should avoid the need to
"mentally execute" each line of code.
### Supporting Classes
A new `TupleMetadata` class is a superset of the existing `BatchSchema`,
but also provides "classic" tuple-like access by position or name. Eventually,
this will also hold additional information such as actual size and so on
(information now laboriously rediscovered by the "record batch sizer.") Since
the accessors use a "tuple" abstraction to model both rows and maps, the tuple
metadata provides this same view. The top-most tuple models the row. Columns
within the row can be maps, which have their own tuple schema, and so on.
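
To make the "access by position or name" idea concrete, here is a minimal sketch of a nested tuple schema. The class and method names (`TupleSchemaSketch`, `ColumnSketch`, `addMap()`) and the use of strings for types are assumptions made for illustration; the real `TupleMetadata` may look quite different.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical column description; a non-null child schema marks a map column. */
final class ColumnSketch {
  final String name;
  final String type;
  final TupleSchemaSketch children;

  ColumnSketch(String name, String type, TupleSchemaSketch children) {
    this.name = name;
    this.type = type;
    this.children = children;
  }

  boolean isMap() { return children != null; }
}

/** Hypothetical tuple schema: an ordered column list plus a name index. */
final class TupleSchemaSketch {
  private final List<ColumnSketch> byIndex = new ArrayList<>();
  private final Map<String, Integer> indexByName = new HashMap<>();

  TupleSchemaSketch add(String name, String type) {
    return add(new ColumnSketch(name, type, null));
  }

  /** A map column carries its own tuple schema, so schemas nest. */
  TupleSchemaSketch addMap(String name, TupleSchemaSketch children) {
    return add(new ColumnSketch(name, "MAP", children));
  }

  private TupleSchemaSketch add(ColumnSketch col) {
    if (indexByName.containsKey(col.name)) {
      throw new IllegalArgumentException("Duplicate column: " + col.name);
    }
    indexByName.put(col.name, byIndex.size());
    byIndex.add(col);
    return this;
  }

  int size() { return byIndex.size(); }

  /** Returns the position of the named column, or -1 if there is none. */
  int index(String name) {
    Integer i = indexByName.get(name);
    return i == null ? -1 : i;
  }

  ColumnSketch column(int index) { return byIndex.get(index); }

  /** Returns the named column, or null if there is none. */
  ColumnSketch column(String name) {
    int i = index(name);
    return i < 0 ? null : byIndex.get(i);
  }
}
```

The top-level instance plays the role of the row schema, and each map column holds another instance of the same class.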
`TupleNameSpace` moves locations (so it can be used in the vector package)
but otherwise remains unchanged.
`DrillBuf` provides an experimental `putInt()` method that checks bounds
and sets the value in a single call, to minimize the number of calls. This will
probably move into the writer in a later PR.
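
The idea of a combined check-and-set can be sketched as follows. This is not the `DrillBuf` code: it is a stand-in built on `java.nio.ByteBuffer`, and the class name `CheckedBufferSketch` is hypothetical. `ByteBuffer` already checks bounds on its own, so the explicit check here only illustrates the shape of the method.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical stand-in for a direct-memory buffer; not DrillBuf itself. */
final class CheckedBufferSketch {
  private final ByteBuffer buf;

  CheckedBufferSketch(int capacity) {
    buf = ByteBuffer.allocateDirect(capacity).order(ByteOrder.LITTLE_ENDIAN);
  }

  /** Validates the offset, then writes four bytes at that offset, in one call. */
  void putInt(int offset, int value) {
    if (offset < 0 || offset > buf.capacity() - Integer.BYTES) {
      throw new IndexOutOfBoundsException(
          "offset " + offset + " does not fit in capacity " + buf.capacity());
    }
    buf.putInt(offset, value);
  }

  int getInt(int offset) {
    return buf.getInt(offset);
  }
}
```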
This PR fixes DRILL-5690, a bug in repeated vectors that did not pass along
Decimal scale and precision. See `RepeatedValueVectors.java`.
`MaterializedField` gains an `isEquivalent()` method that compares
two fields while ignoring internal (`$offset$`, `$bits$`, etc.) vectors.
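
A comparison of this kind might look roughly like the sketch below. It is not the `MaterializedField` implementation. The class `FieldSketch`, the string-valued types, and the rule that an internal vector is any child whose name starts and ends with `$` are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical field description; not Drill's MaterializedField. */
final class FieldSketch {
  final String name;
  final String type;
  final List<FieldSketch> children;

  FieldSketch(String name, String type, FieldSketch... kids) {
    this.name = name;
    this.type = type;
    this.children = new ArrayList<>(Arrays.asList(kids));
  }

  /** Assumption: internal vectors are those whose names are wrapped in '$'. */
  static boolean isInternal(FieldSketch f) {
    return f.name.length() >= 2 && f.name.startsWith("$") && f.name.endsWith("$");
  }

  private List<FieldSketch> visibleChildren() {
    List<FieldSketch> result = new ArrayList<>();
    for (FieldSketch child : children) {
      if (!isInternal(child)) {
        result.add(child);
      }
    }
    return result;
  }

  /** Compares name, type and user-visible children, skipping internal vectors. */
  boolean isEquivalent(FieldSketch other) {
    if (!name.equals(other.name) || !type.equals(other.type)) {
      return false;
    }
    List<FieldSketch> mine = visibleChildren();
    List<FieldSketch> theirs = other.visibleChildren();
    if (mine.size() != theirs.size()) {
      return false;
    }
    for (int i = 0; i < mine.size(); i++) {
      if (!mine.get(i).isEquivalent(theirs.get(i))) {
        return false;
      }
    }
    return true;
  }
}
```

With this rule, a nullable field that lists its `$bits$` child compares as equivalent to the same field declared without it, while map fields are still compared member by member.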
### Row Set Classes and Tests
The `RowSet` family of classes changed in response to the accessor changes.
* The reader and writer are moved to separate files.
* Row sets now use a "parsed" form of "storage" classes to hold vectors
(more below).
* Static factory methods were added to hide constructor complexity.
* The `RowSetBuilder` and `RowSetComparison` test tools added support for
repeated maps.
* Code to handle generic object writing moved from the `RowSetBuilder` into
the accessors.
* The old `RowSetSchema` evolved to become the `TupleMetadata` mentioned
above.
* Tests were greatly enhanced to test all modes of all supported scalar
types, as well as the new JSON-like structure.
In the previous version, the row set classes had complex logic to figure
out what kind of accessor to create for each vector. This became overly
complex. In this version, the row set "parses" a vector container to create
"storage" objects that represent tuples and columns. A column can, itself, be a
tuple. (Note: there is no need to model lists, since lists are just vectors at
this level of abstraction and so need no special handling.)
With this change, accessor creation is a simple matter of walking a tree to
assemble the JSON-like structure.
This structure is also used to create a batch's vectors from a schema.
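
The tree walk can be pictured with the following sketch. The `StorageSketch` class and its methods are hypothetical, and the walk produces only a textual description of which accessor each node would get, not real accessors. It assumes, as described above, that a column is either a scalar or a tuple and that a repeated column is wrapped in an array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical "storage" node: one column, which may itself be a tuple. */
final class StorageSketch {
  final String name;
  final boolean repeated;
  final List<StorageSketch> children;  // null for a scalar column

  private StorageSketch(String name, boolean repeated, List<StorageSketch> children) {
    this.name = name;
    this.repeated = repeated;
    this.children = children;
  }

  static StorageSketch scalar(String name, boolean repeated) {
    return new StorageSketch(name, repeated, null);
  }

  static StorageSketch map(String name, boolean repeated, StorageSketch... members) {
    return new StorageSketch(name, repeated, new ArrayList<>(Arrays.asList(members)));
  }

  /** Walks the tree and describes the accessor each node would receive. */
  String describeAccessor() {
    String base;
    if (children == null) {
      base = "scalar";
    } else {
      StringBuilder sb = new StringBuilder("tuple{");
      for (int i = 0; i < children.size(); i++) {
        if (i > 0) {
          sb.append(", ");
        }
        StorageSketch child = children.get(i);
        sb.append(child.name).append(": ").append(child.describeAccessor());
      }
      base = sb.append('}').toString();
    }
    // A repeated column wraps its scalar or tuple accessor in an array accessor.
    return repeated ? "array[" + base + "]" : base;
  }
}
```

Because each node decides only its own accessor type and then recurses into its children, no central logic is needed to inspect every vector type.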
### Other Changes
The last commit contains various other changes, mostly reflecting the
changes above.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/paul-rogers/drill DRILL-5688
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/drill/pull/887.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #887
----
commit 022c5e9ed08c6393166e66ac5e862168bc6c5e77
Author: Paul Rogers <[email protected]>
Date: 2017-07-27T05:03:50Z
DRILL-5688: Add repeated map support to column accessors
Includes the core JSON-like reader and writer interfaces and
implementations.
commit 170101b177c113ebbdf1d0f890b1d80487c0ea2f
Author: Paul Rogers <[email protected]>
Date: 2017-07-27T05:05:36Z
Supporting vector and related classes
Includes changes to value vectors, DrillBuf and other low-level classes.
commit f1ce8ffa6caa3120316ba538a5dc3e918c61da58
Author: Paul Rogers <[email protected]>
Date: 2017-07-27T05:08:07Z
Row set test classes
Modifications to the row set abstraction (used for testing) for the
changed accessors. Row sets also act as tests for the accessor classes,
including a number of tests that test the classes used for testing.
(Yes, somewhat recursive…)
commit 0310772c1920948c487c4789bb5d0f3fc5e3d012
Author: Paul Rogers <[email protected]>
Date: 2017-07-27T05:09:07Z
Test code affected by the row set changes
Changes to unit tests, and the unit test framework, required by the
changes to the accessor and row set classes.
----