GitHub user paul-rogers opened a pull request:
https://github.com/apache/drill/pull/1244
DRILL-6373: Refactor Result Set Loader for Union, List support
This PR builds on the previous refactoring of the column accessors to
prepare for Union, (non-repeated) List and Repeated List support. The PR
includes four closely related changes divided across four commits:
### Correct the Type of the Data Vector in a Nullable Vector
The nullable vectors contain a "bits" vector and a "data" vector. The data
vector has historically been created using the same `MaterializedField` as the
nullable vector, meaning that the data vector is labeled as "nullable" even
though it has no bits vector.
This PR instead gives the data vector a cloned `MaterializedField` with the
same name as the outer nullable vector, but with a Required type.
This change ensures that the overflow logic works correctly, because it uses
the vector metadata (in the `MaterializedField`) to know what kind of vector to
create for the "lookahead" vector.
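As a rough illustration of the idea (not the actual Drill code), the sketch below uses hypothetical stand-in classes, `FieldInfo`, `Mode` and `NullableVectorSketch`, rather than Drill's real `MaterializedField` and vector classes. It shows only the metadata relationship described above: the inner data vector keeps the outer field's name and type but is marked Required.

```java
// Hypothetical stand-ins for field metadata; these are NOT Drill classes.
enum Mode { REQUIRED, OPTIONAL, REPEATED }

final class FieldInfo {
    final String name;
    final String type;   // e.g. "INT" or "VARCHAR"
    final Mode mode;

    FieldInfo(String name, String type, Mode mode) {
        this.name = name;
        this.type = type;
        this.mode = mode;
    }

    // Returns a copy with the same name and type but a different mode.
    FieldInfo withMode(Mode newMode) {
        return new FieldInfo(name, type, newMode);
    }
}

final class NullableVectorSketch {
    final FieldInfo field;      // metadata of the outer nullable vector
    final FieldInfo dataField;  // metadata of the inner data vector

    NullableVectorSketch(FieldInfo field) {
        if (field.mode != Mode.OPTIONAL) {
            throw new IllegalArgumentException("outer field must be nullable");
        }
        this.field = field;
        // The data vector has no bits vector of its own, so label it Required
        // instead of reusing the nullable metadata of the outer vector.
        this.dataField = field.withMode(Mode.REQUIRED);
    }
}
```

With metadata arranged this way, code that inspects `dataField.mode` to decide what kind of lookahead vector to allocate would see `REQUIRED`, while the outer `field` remains `OPTIONAL`.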
### Result Set Loader Refactor
The second commit pretty much just rearranges the deck chairs so that we can
slot in the new types in the next PR. The need for the changes can be seen in
the full code set (the union and list support was pulled out for this PR).
A union is a container, like a map, so the tuple state was refactored to
create a common parent container state.
Lists and unions are very complex to build, so the code to build the
internal workings of each vector was pulled out into a separate builder class.
### Projection Handling and the Vector Cache
Previous versions of the result set loader handled projection and a cache
for vectors reused across readers in the same Scan operator. Once we introduce
nested maps, projection within maps, unions and lists, projection gets much
more complex, as does vector caching.
This PR adds logic to support projection and vector caching to any
arbitrary level of maps. It turns out that handling projection of an entire
map, and projection of fields within maps, is far more complex than you'd
think, requiring quite a bit of internal state to keep everything straight. The
result is that we can now handle a map `m` with three fields `{a, b, c}` and
project just one of them, `m.a`, say.
Further, Drill allows projection of non-existent columns. So, we might ask
for field `m.d` which does not exist in the above map. The projection mechanism
handles this case as well, creating the right kind of null column.
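To make the `m.a` / `m.d` example concrete, here is a minimal sketch of one way to model nested projection. It is an assumption-laden illustration, not the mechanism in this PR: `ProjectionNode` and its methods are hypothetical names, and the real code tracks considerably more state.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical projection tree; not a Drill class.
final class ProjectionNode {
    // Explicitly projected members, keyed by name, in the order requested.
    final Map<String, ProjectionNode> children = new LinkedHashMap<>();
    // True if this node was itself named in the project list (e.g. "m"),
    // which means every member below it is projected.
    boolean all;

    // Adds a dotted path such as "m.a" to the tree.
    void add(String path) {
        ProjectionNode node = this;
        for (String part : path.split("\\.")) {
            node = node.children.computeIfAbsent(part, k -> new ProjectionNode());
        }
        node.all = true;
    }

    // Is the named member of this tuple projected?
    boolean isProjected(String name) {
        return all || children.containsKey(name);
    }

    // Projected members the reader did not supply; each needs a null column.
    List<String> missing(List<String> actualFields) {
        List<String> result = new ArrayList<>();
        for (String name : children.keySet()) {
            if (!actualFields.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }
}
```

Adding `m.a` and `m.d` to a root node and then looking up the child for `m` gives a node for which `isProjected("a")` is true and `isProjected("b")` is false. Calling `missing` with the actual fields `a`, `b`, `c` returns just `d`, the column that would have to be filled with nulls.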
### Unit Tests
New tests are added to exercise the projection and cache mechanisms.
Existing tests were updated for the changes made in the refactoring.
### Reference Design
All of this work is done in support of the overall "batch sizing" project
explained
[here](https://github.com/paul-rogers/drill/wiki/Batch-Handling-Upgrades).
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/paul-rogers/drill DRILL-6373
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/drill/pull/1244.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1244
----
commit 7df4280f3011862b84b43240c8a07e0bf019745d
Author: Paul Rogers <progers@...>
Date: 2018-05-01T02:10:53Z
DRILL-6373: Fix nullable vector data vector type
Fixes the type of the data vector within a nullable vector. The data vector
is Required (has no bits vector.) Accurate metadata is required for proper
overflow handling in the result set loader.
commit 9496ef681f19f03ccd735f2c9b18f6d914eae3e2
Author: Paul Rogers <progers@...>
Date: 2018-05-01T02:15:33Z
DRILL-6373: Refactor result set loader
commit 74675436ae1efdf66deafeaa27b281d169e274ad
Author: Paul Rogers <progers@...>
Date: 2018-05-01T02:16:48Z
DRILL-6373: Revised projection & vector cache
commit 04598c0dbdbbff5ecbc2f89d02b14f66982f86bd
Author: Paul Rogers <progers@...>
Date: 2018-05-01T02:17:22Z
DRILL-6373: Revised & added unit tests
----