GitHub user nongli opened a pull request:
https://github.com/apache/spark/pull/10908
[SPARK-12992][SQL]: Update parquet reader to support more types when
decoding to ColumnarBatch.
This patch implements support for more types in the vectorized decode path. A few more types remain, but they should be very straightforward to add after this. The code has a few copy-and-paste pieces, but they are difficult to eliminate due to performance considerations.
Specifically, this patch adds support for:
- String, Long, Byte types
- Dictionary encoding for those types (a rough decoding sketch follows below).
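As a rough illustration of what dictionary decoding involves, the sketch below resolves dictionary ids to long values one column at a time. It is only a sketch: the names Dictionary, LongColumnSketch and decodeDictionaryIds are hypothetical and are not the ColumnVector API this patch actually touches.

    // Illustrative sketch only -- not the patch's reader code. Dictionary
    // encoding stores each distinct value once plus a small integer id per
    // row; the vectorized decode resolves ids for a whole batch in one loop.
    final class Dictionary {
      private final long[] longValues;   // distinct values, indexed by id
      Dictionary(long[] longValues) { this.longValues = longValues; }
      long decodeToLong(int id) { return longValues[id]; }
    }

    final class LongColumnSketch {
      private final long[] data;         // decoded column values, one per row
      LongColumnSketch(int capacity) { this.data = new long[capacity]; }

      // Resolve dictionary ids to values column-at-a-time.
      void decodeDictionaryIds(int numRows, int[] ids, Dictionary dict) {
        for (int row = 0; row < numRows; row++) {
          data[row] = dict.decodeToLong(ids[row]);
        }
      }

      long getLong(int rowId) { return data[rowId]; }
    }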
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/nongli/spark spark-12992
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/10908.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #10908
----
commit 518d9bc64fd687204b157b1b36816620482716bf
Author: Nong Li <[email protected]>
Date: 2016-01-05T04:39:31Z
[WIP] [SPARK-12854][SQL] Implement complex types support in ColumnarBatch.
WIP: this patch adds some random row generation. The test code needs to be cleaned up as it duplicates functionality from elsewhere. The non-test code should be good to review.
This patch adds support for complex types in ColumnarBatch. ColumnarBatch supports structs and arrays, and there is a simple mapping from the richer Catalyst types to these two; strings are treated as arrays of bytes.

ColumnarBatch will contain a column for each node of the schema. Non-complex schemas consist of just leaf nodes. Structs are internal nodes with one child per field; arrays are internal nodes with one child. Structs just track nullability, while arrays store offsets and lengths into the child array. This structure can handle arbitrary nesting, and it has the key property that the layout remains columnar throughout: primitive values are stored only in the leaf nodes, contiguously across rows. For example, if the schema is array<array<int>>, all of the int data is stored consecutively.
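To make the array<array<int>> example concrete, here is a minimal, self-contained sketch of the offset/length layout. The class names (IntColumnSketch, ArrayColumnSketch) are hypothetical and are not the classes in this patch; only the layout idea is taken from the description above.

    // Illustrative sketch only. Under schema array<array<int>>, three columns
    // are enough: two offset/length columns and one leaf column that holds
    // every int contiguously.
    final class IntColumnSketch {
      final int[] data;                  // leaf values, contiguous across rows
      IntColumnSketch(int capacity) { data = new int[capacity]; }
    }

    final class ArrayColumnSketch {
      final int[] offsets;               // start index into the child column
      final int[] lengths;               // number of child elements
      ArrayColumnSketch(int capacity) {
        offsets = new int[capacity];
        lengths = new int[capacity];
      }
    }

    final class NestedLayoutExample {
      public static void main(String[] args) {
        // One row whose value is [[1, 2], [3]].
        IntColumnSketch leaf = new IntColumnSketch(16);      // all int data
        ArrayColumnSketch inner = new ArrayColumnSketch(16); // arrays of int
        ArrayColumnSketch outer = new ArrayColumnSketch(16); // arrays of arrays

        leaf.data[0] = 1; leaf.data[1] = 2; leaf.data[2] = 3;

        inner.offsets[0] = 0; inner.lengths[0] = 2;          // [1, 2]
        inner.offsets[1] = 2; inner.lengths[1] = 1;          // [3]

        outer.offsets[0] = 0; outer.lengths[0] = 2;          // two inner arrays
      }
    }

In this layout the leaf column holds every int from every row back to back, which is what keeps the representation columnar and the primitive data contiguous even for nested schemas.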
As part of this, this patch adds append APIs in addition to the put APIs (e.g. putLong(rowid, v) vs. appendLong(v)). These APIs are necessary when the batch contains variable-length elements: the vectors are not fixed length and will grow as necessary. This should make usage a lot simpler for the writer.
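A minimal sketch of the put-vs-append distinction, assuming a simple growable long column rather than this patch's actual ColumnVector implementation:

    import java.util.Arrays;

    // Illustrative sketch only; GrowableLongColumn is a hypothetical name.
    final class GrowableLongColumn {
      private long[] data = new long[4];
      private int elementsAppended = 0;

      // Positional write: the caller already knows the row id and has
      // reserved enough capacity.
      void putLong(int rowId, long value) { data[rowId] = value; }

      // Append-style write: the vector grows as necessary, so writers of
      // variable-length data never have to size it up front.
      void appendLong(long value) {
        if (elementsAppended == data.length) {
          data = Arrays.copyOf(data, data.length * 2);
        }
        data[elementsAppended++] = value;
      }

      long getLong(int rowId) { return data[rowId]; }
    }

The positional put suits fixed-length data where the row count is known up front; the append form lets writers of variable-length data emit values without pre-sizing the vector, which is the simplification for the writer mentioned above.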
commit f579889413c3978355fb0fb0d0ac02719803cb30
Author: Nong Li <[email protected]>
Date: 2016-01-25T22:37:07Z
Updates
commit ea1f406ef9c73c7344128dca23efb1c9ac29a0f0
Author: Nong Li <[email protected]>
Date: 2016-01-20T01:46:29Z
[SPARK-12992][SQL]: Update parquet reader to support more types when
decoding to ColumnarBatch.
This patch implements support for more types in the vectorized decode path. A few more types remain, but they should be very straightforward to add after this. The code has a few copy-and-paste pieces, but they are difficult to eliminate due to performance considerations.
Specifically, this patch adds support for:
- String, Long, Byte types
- Dictionary encoding for those types.
----