[GitHub] drill pull request #1244: DRILL-6373: Refactor Result Set Loader for Union, ...

paul-rogers Mon, 30 Apr 2018 19:32:55 -0700

GitHub user paul-rogers opened a pull request:

    https://github.com/apache/drill/pull/1244


    DRILL-6373: Refactor Result Set Loader for Union, List support

    This PR builds on the previous refactoring of the column accessors to 
prepare for Union, (non-repeated) List and Repeated List support. The PR 
includes four closely related changes divided across four commits:
    
    ### Correct the Type of the Data Vector in a Nullable Vector
    
    The nullable vectors contain a "bits" vector and a "data" vector. The data 
vector has historically been created using the same `MaterializedField` as the 
nullable vector, meaning that the data vector is labeled as "nullable" even 
though it has no bits vector.
    
    This PR creates a clone `MaterializedField` with the same name as the outer 
nullable vector, but with a Required type.
    
    This change ensures that the overflow logic works correctly as it uses the 
vector metadata (in the `MaterializedField`) to know what kind of vector to 
create for the "lookahead" vector.
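
    As a rough illustration of the idea (not the actual Drill code), the sketch below uses hypothetical stand-in types, `FieldSketch` and `DataMode`, rather than Drill's real `MaterializedField`. It only shows the intent: the data vector's metadata keeps the outer name and type but is marked Required.

```java
import java.util.Objects;

// Hypothetical stand-ins for Drill's metadata classes; names are assumptions.
enum DataMode { REQUIRED, OPTIONAL, REPEATED }

final class FieldSketch {
    final String name;
    final String minorType;   // for example "INT" or "VARCHAR"
    final DataMode mode;

    FieldSketch(String name, String minorType, DataMode mode) {
        this.name = Objects.requireNonNull(name);
        this.minorType = Objects.requireNonNull(minorType);
        this.mode = Objects.requireNonNull(mode);
    }

    // Metadata for the inner "data" vector of a nullable vector:
    // same name and minor type, but Required, since it has no bits vector.
    FieldSketch dataVectorField() {
        if (mode != DataMode.OPTIONAL) {
            throw new IllegalStateException(
                "Only a nullable vector has a separate data vector");
        }
        return new FieldSketch(name, minorType, DataMode.REQUIRED);
    }
}
```

    With metadata like this, code that inspects the data vector's field to build a "lookahead" vector would see a Required type rather than a nullable one.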
    
    ### Result Set Loader Refactor
    
    The second commit pretty much just rearranges the deck chairs so that 
we can slot in the new types in the next PR. The need for the changes can be 
seen in the full code set (the union and list support was pulled out for this 
PR).
    
    A union is a container, like a map, so the tuple state was refactored to 
create a common parent container state.
    
    Lists and unions are very complex to build, so the code to build the 
internal workings of each vector was pulled out into a separate builder class.
    
    ### Projection Handling and the Vector Cache
    
    Previous versions of the result set loader handled projection and a cache 
for vectors reused across readers in the same Scan operator. Once we introduce 
nested maps, projection within maps, unions and lists, projection gets much 
more complex, as does vector caching.
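
    To make the caching problem concrete, here is a minimal sketch of a cache that nests to any depth of maps. The class `VectorCacheSketch` and its methods are hypothetical, and keying on the column name alone is a simplifying assumption; this is not Drill's actual cache.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch: vectors cached by column name, with one child cache
// per map column so that m.a and a top-level a do not collide.
final class VectorCacheSketch<V> {
    private final Map<String, V> vectors = new HashMap<>();
    private final Map<String, VectorCacheSketch<V>> children = new HashMap<>();

    // Return the cached vector for a column, creating it on first use.
    V vectorFor(String name, Supplier<V> factory) {
        return vectors.computeIfAbsent(name, k -> factory.get());
    }

    // Return the cache that holds the members of the named map column.
    VectorCacheSketch<V> childCache(String mapName) {
        return children.computeIfAbsent(mapName, k -> new VectorCacheSketch<>());
    }
}
```

    In this sketch, a second reader that asks for `m.a` through `childCache("m").vectorFor("a", ...)` gets back the same object the first reader created.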
    
    This PR adds logic to support projection and vector caching to any 
arbitrary level of maps. It turns out that handling projection of an entire 
map, and projection of fields within maps, is far more complex than you'd 
think, requiring quite a bit of internal state to keep everything straight. The 
result is that we can now handle a map `m` with three fields `{a, b, c}` and 
project just one of them, `m.a`, say.
    
    Further, Drill allows projection of non-existent columns. So, we might ask 
for field `m.d` which does not exist in the above map. The projection mechanism 
handles this case as well, creating the right kind of null column.
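
    The cases described above can be sketched as a small projection tree. Everything here (`ProjectionSketch`, `parse`, `member`, `resolve`, and the three outcome strings) is a hypothetical illustration under simplifying assumptions, not the mechanism in this PR.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a projection tree built from dotted column names.
final class ProjectionSketch {
    // Shared node that projects nothing; used for unprojected maps.
    private static final ProjectionSketch NONE = new ProjectionSketch();

    private final Map<String, ProjectionSketch> children = new LinkedHashMap<>();
    private boolean projectAll;   // true when the column itself was named

    static ProjectionSketch parse(List<String> paths) {
        ProjectionSketch root = new ProjectionSketch();
        for (String path : paths) {
            ProjectionSketch node = root;
            for (String part : path.split("\\.")) {
                node = node.children.computeIfAbsent(part, k -> new ProjectionSketch());
            }
            node.projectAll = true;   // "m" alone projects every member of m
        }
        return root;
    }

    boolean isProjected(String name) {
        return projectAll || children.containsKey(name);
    }

    // Projection that applies to the members of the named map column.
    ProjectionSketch member(String name) {
        if (projectAll) {
            return this;              // whole map projected: so is everything below
        }
        ProjectionSketch child = children.get(name);
        return child != null ? child : NONE;
    }

    // Decide what to do with each column, given the fields the reader has.
    Map<String, String> resolve(List<String> readerFields) {
        Map<String, String> plan = new LinkedHashMap<>();
        for (String field : readerFields) {
            plan.put(field, isProjected(field) ? "materialize" : "skip");
        }
        for (String requested : children.keySet()) {
            plan.putIfAbsent(requested, "null column");   // asked for, but absent
        }
        return plan;
    }
}
```

    For the projection list `m.a, m.d` and a reader that supplies `m` with fields `{a, b, c}`, calling `parse(...).member("m").resolve(...)` in this sketch marks `a` to be materialized, `b` and `c` to be skipped, and `d` to be filled with a null column.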
    
    ### Unit Tests
    
    New tests are added to exercise the projection and cache mechanisms. 
Existing tests were updated for the changes made in the refactoring.
    
    ### Reference Design
    
    All of this work is done in support of the overall "batch sizing" project 
explained 
[here](https://github.com/paul-rogers/drill/wiki/Batch-Handling-Upgrades).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/paul-rogers/drill DRILL-6373

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/1244.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1244
    
----
commit 7df4280f3011862b84b43240c8a07e0bf019745d
Author: Paul Rogers <progers@...>
Date:   2018-05-01T02:10:53Z

    DRILL-6373: Fix nullable vector data vector type
    
    Fixes the type of the data vector within a nullable vector. The data vector 
is Required (has no bits vector). Accurate metadata is required for proper 
overflow handling in the result set loader.

commit 9496ef681f19f03ccd735f2c9b18f6d914eae3e2
Author: Paul Rogers <progers@...>
Date:   2018-05-01T02:15:33Z

    DRILL-6373: Refactor result set loader

commit 74675436ae1efdf66deafeaa27b281d169e274ad
Author: Paul Rogers <progers@...>
Date:   2018-05-01T02:16:48Z

    DRILL-6373: Revised projection & vector cache

commit 04598c0dbdbbff5ecbc2f89d02b14f66982f86bd
Author: Paul Rogers <progers@...>
Date:   2018-05-01T02:17:22Z

    DRILL-6373: Revised & added unit tests

----

