[GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure

paul-rogers Thu, 21 Sep 2017 14:47:30 -0700

Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/914
  
    This next commit reintroduces a projection feature. With this change, a 
client can:
    
    * Define the set of columns to project
    * Define the schema of the data source (table, file, etc.)
    * Write columns according to the schema
    * Harvest only the projected columns.
    
    ### Example
    
    Here is a simple example adapted from `TestResultSetLoaderProjection`. 
First, declare the projection:
    
    ```
        List<SchemaPath> selection = Lists.newArrayList(
            SchemaPath.getSimplePath("c"),
            SchemaPath.getSimplePath("b"),
            SchemaPath.getSimplePath("e"));
    ```
    
    Then, declare the schema. (Here, we declare the schema up-front. Projection 
also works if the schema is defined as columns are discovered while creating a 
batch.)
    
    ```
        TupleMetadata schema = new SchemaBuilder()
            .add("a", MinorType.INT)
            .add("b", MinorType.INT)
            .add("c", MinorType.INT)
            .add("d", MinorType.INT)
            .buildSchema();
    ```
    
    Then, use the options mechanisms to pass the information to the result set 
loader:
    
    ```
        ResultSetOptions options = new OptionBuilder()
            .setProjection(selection)
            .setSchema(schema)
            .build();
        ResultSetLoader rsLoader = new ResultSetLoaderImpl(
            fixture.allocator(), options);
    ```
    
    Now, we can write the four columns in the data source:
    
    ```
        RowSetLoader rootWriter = rsLoader.writer();
        rsLoader.startBatch();
        ...
        rootWriter.start();
        rootWriter.scalar("a").setInt(10);
        rootWriter.scalar("b").setInt(20);
        rootWriter.scalar("c").setInt(30);
        rootWriter.scalar("d").setInt(40);
        rootWriter.save();
    ```
    
    But when we harvest the results, we get only the projected columns. Notice 
that "e" is in the projection list but does not exist in the table, and so does 
not appear in the output. A higher level of code will handle this case.
    
    ```
    #: b, c
    0: 20, 30
    ```
    
    ### Maps
    
    Although the above example does not show the feature, the mechanism also 
handles maps and arrays of maps. The rules are:
    
    * If the projection list includes specific map members (such as "m.b"), 
then project only those map members.
    * If the projection list includes just the map name (such as "m"), then 
project all map members (such as "m.a" and "m.b").
    * If the projection list does not include the map at all, then project 
neither the map nor any of its members.
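
    As a rough illustration of the three rules above, here is a minimal, standalone sketch. It is not Drill's implementation: the class `MapProjectionRules` and its methods are hypothetical, projection paths are plain dotted strings rather than `SchemaPath` objects, and only one level of map nesting is handled.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not part of Drill): applies the three map projection
// rules to dotted column paths such as "m" or "m.b".
final class MapProjectionRules {
    private final Set<String> projected;

    MapProjectionRules(List<String> projectionList) {
        this.projected = new HashSet<>(projectionList);
    }

    // True if the map itself must be materialized: either the bare map name
    // or at least one of its members appears in the projection list.
    boolean isMapProjected(String mapName) {
        if (projected.contains(mapName)) {
            return true;
        }
        String prefix = mapName + ".";
        for (String path : projected) {
            if (path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    // True if the given member of the map must be materialized.
    boolean isMemberProjected(String mapName, String memberName) {
        // Rule 2: the bare map name projects every member.
        if (projected.contains(mapName)) {
            return true;
        }
        // Rule 1: otherwise only explicitly listed members are projected.
        // Rule 3 follows: if nothing is listed, nothing is projected.
        return projected.contains(mapName + "." + memberName);
    }
}
```

    With a projection list of just "m.b", this sketch reports the map "m" and the member "b" as projected, but not the member "a".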
    
    ### Implementation
    
    The implementation builds on previous commits. The idea is that we create a 
"dummy" column and writer, but we do not create the underlying value 
vector. This allows the client to be blissfully ignorant of whether the column 
is projected or not. On the other hand, if the client wants to know if a column 
is projected (perhaps to optimize away certain operations), then the projection 
status is available in the column metadata.
    
    #### Projection Set
    
    Projection starts with a `ProjectionSet` abstraction. Each tuple (row, map) 
has a projection set. The projection set can be a set of names 
(`ProjectionSetImpl`) or a default (`NullProjectionSet`).
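
    To make the abstraction concrete, here is a simplified sketch of the idea, not the actual Drill classes. The names `ProjectionSketch`, `NamedProjectionSketch`, `ProjectAllSketch` and `ProjectionSketchDemo` are invented for illustration, and the sketch assumes that the "default" case means no explicit list was given, so every column is projected.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the projection set abstraction.
interface ProjectionSketch {
    boolean isProjected(String columnName);
}

// Projection driven by an explicit set of column names.
final class NamedProjectionSketch implements ProjectionSketch {
    private final Set<String> names;

    NamedProjectionSketch(List<String> names) {
        this.names = new HashSet<>(names);
    }

    @Override
    public boolean isProjected(String columnName) {
        return names.contains(columnName);
    }
}

// Assumed default: with no explicit list, every column is projected.
final class ProjectAllSketch implements ProjectionSketch {
    @Override
    public boolean isProjected(String columnName) {
        return true;
    }
}

final class ProjectionSketchDemo {
    // Returns the schema columns that survive projection, in schema order.
    // Projected names that are missing from the schema simply never appear.
    static List<String> harvestedColumns(ProjectionSketch projection,
                                         List<String> schema) {
        List<String> result = new ArrayList<>();
        for (String column : schema) {
            if (projection.isProjected(column)) {
                result.add(column);
            }
        }
        return result;
    }
}
```

    Applied to the earlier example (projection of c, b, e over a schema of a, b, c, d), `harvestedColumns` returns b and c, matching the harvested output shown above.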
    
    #### Result Set Loader
    
    The result set loader now checks whether each column is projected. If so, 
the code flow is the same as before. If not, the code creates the dummy vector 
state and dummy writers described above.
    
    Adding support for non-projected columns involved the usual amount of 
refactoring and moving bits of code around to get a simple solution.
    
    #### Accessor Factories
    
    Prior versions had a `ColumnAccessorFactory` class that created both 
readers and writers. This commit splits the class into separate reader and 
writer factories. The writer factory now creates dummy writers if asked to 
create a writer when the backing vector is null. To make this easier, factory 
code that previously appeared in each writer has moved into the writer factory. 
(Note that readers don't support projection: there is no need.)
    
    #### Dummy Writers
    
    The accessor layer is modified to create a set of dummy writers. Scalar 
writers have a wide (internal) interface. Dummy scalar writers simply ignore 
the unsupported operations. Dummy array and tuple writers are also provided.
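
    The dummy-writer idea is essentially the null object pattern. The sketch below shows the shape of it under loud assumptions: `IntWriterSketch`, `RealIntWriterSketch`, `DummyIntWriterSketch` and `WriterFactorySketch` are hypothetical names, and a `List<Integer>` stands in for a value vector. The real Drill writer interfaces are much wider than this.

```java
import java.util.List;

// Hypothetical, narrowed-down scalar writer interface.
interface IntWriterSketch {
    void setInt(int value);
}

// Writer for a projected column: appends to the backing "vector".
final class RealIntWriterSketch implements IntWriterSketch {
    private final List<Integer> vector;

    RealIntWriterSketch(List<Integer> vector) {
        this.vector = vector;
    }

    @Override
    public void setInt(int value) {
        vector.add(value);
    }
}

// Writer for a non-projected column: accepts the call and discards the value.
final class DummyIntWriterSketch implements IntWriterSketch {
    @Override
    public void setInt(int value) {
        // Intentionally empty: there is no backing vector to write to.
    }
}

final class WriterFactorySketch {
    // A null backing vector signals a non-projected column, so the factory
    // hands back a dummy writer instead of a real one.
    static IntWriterSketch writerFor(List<Integer> vector) {
        return vector == null
            ? new DummyIntWriterSketch()
            : new RealIntWriterSketch(vector);
    }
}
```

    Client code calls `setInt()` the same way in both cases, which is what lets it stay unaware of whether the column is projected.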
    
    #### Unit Test
    
    The new `TestResultSetLoaderProjection` test exercises the new code. The 
new `DummyWriterTest` exercises the dummy writers.
