GitHub user nongli opened a pull request:
https://github.com/apache/spark/pull/10820
[WIP] [SPARK-12854][SQL] Implement complex types support in ColumnarBatch
WIP: this patch adds some random row generation testing. The test code needs
to be cleaned up as it duplicates functionality from elsewhere. The non-test
code should be good to review.
This patch adds support for complex types in ColumnarBatch. ColumnarBatch
supports structs and arrays; there is a simple mapping from the richer
Catalyst types to these two. Strings are treated as arrays of bytes.
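As a minimal sketch of the string mapping described above (illustrative only,
not Spark's actual classes): a string column stores (offset, length) pairs
into a shared byte buffer, exactly like any other array-of-bytes column.

```python
# Hypothetical model of "strings as arrays of bytes": each row is an
# (offset, length) pair into one contiguous byte buffer.
byte_buffer = bytearray(b"sparkcolumnar")
offsets = [0, 5]      # row 0 starts at byte 0, row 1 at byte 5
lengths = [5, 8]      # row 0 is 5 bytes ("spark"), row 1 is 8 ("columnar")

def read_string(row_id):
    """Slice this row's bytes out of the shared buffer and decode."""
    o, l = offsets[row_id], lengths[row_id]
    return bytes(byte_buffer[o:o + l]).decode("utf-8")
```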
ColumnarBatch contains a column for each node of the schema; non-complex
schemas consist of just leaf nodes. A struct is an internal node with one
child per field, and an array is an internal node with a single child.
Struct columns carry only nullability; array columns carry offsets and
lengths into the child column. This structure handles arbitrary nesting,
and it has the key property that the layout stays columnar throughout:
primitive values are stored only in the leaf nodes, contiguous across rows.
For example, if the schema is array<array<int>>, all of the int data is
stored consecutively.
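The nested layout above can be sketched as follows (a hypothetical model,
not Spark's actual API): for a schema of array<array<int>>, each array level
keeps (offset, length) pairs into its child, and all int values sit
contiguously in the leaf buffer.

```python
# Rows being modeled: [[ [1, 2], [3] ], [ [4, 5, 6] ]]

leaf_ints = [1, 2, 3, 4, 5, 6]   # leaf node: all primitives, contiguous

# inner array<int> node: one (offset, length) pair per inner array,
# indexing into leaf_ints
inner_offsets = [0, 2, 3]
inner_lengths = [2, 1, 3]

# outer array<array<int>> node: (offset, length) pairs indexing into
# the inner node's entries
outer_offsets = [0, 2]
outer_lengths = [2, 1]

def read_row(row_id):
    """Reassemble one top-level row from the columnar buffers."""
    start, n = outer_offsets[row_id], outer_lengths[row_id]
    row = []
    for i in range(start, start + n):
        o, l = inner_offsets[i], inner_lengths[i]
        row.append(leaf_ints[o:o + l])
    return row
```

Note that reading never copies the leaf data into per-row objects until a
row is actually materialized; the int values stay in one flat buffer.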
As part of this, this patch adds append APIs alongside the put APIs
(e.g. putLong(rowId, v) vs. appendLong(v)). The append APIs are necessary
when the batch contains variable-length elements: the vectors are not fixed
length and grow as needed, which makes usage much simpler for the writer.
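To illustrate the put-vs-append distinction (a sketch with illustrative
names, not Spark's actual vectorized classes): a growable long column can
expose both a positional put and an append that tracks the next write slot,
which is what a writer emitting variable-length nested data wants.

```python
class GrowableLongColumn:
    """Hypothetical growable column of longs with put and append APIs."""

    def __init__(self, capacity=4):
        self.data = [0] * capacity
        self.elements = 0            # number of rows written so far

    def _reserve(self, needed):
        # double the backing storage until there is room, as a
        # growable vector would
        while needed > len(self.data):
            self.data.extend([0] * len(self.data))

    def put_long(self, row_id, v):
        # put API: the caller must know the row id up front
        self._reserve(row_id + 1)
        self.data[row_id] = v
        self.elements = max(self.elements, row_id + 1)

    def append_long(self, v):
        # append API: the column tracks the write position itself,
        # which is far simpler when element counts are not known ahead
        self.put_long(self.elements, v)
        return self.elements - 1     # row id the value landed in
```

With fixed-width data either API works, but for variable-length children
(array elements, string bytes) the writer cannot precompute row ids, so
append is the natural interface.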
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/nongli/spark spark-12854
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/10820.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #10820
----
commit 518d9bc64fd687204b157b1b36816620482716bf
Author: Nong Li <[email protected]>
Date: 2016-01-05T04:39:31Z
[WIP] [SPARK-12854][SQL] Implement complex types support in ColumnarBatch.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]