GitHub user nongli opened a pull request:
https://github.com/apache/spark/pull/10820
[WIP] [SPARK-12854][SQL] Implement complex types support in ColumnarBatch
WIP: this patch adds some random row generation testing. The test code needs
to be cleaned up as it duplicates functionality from elsewhere. The non-test
code should be good to review.
This patch adds support for complex types in ColumnarBatch. ColumnarBatch
supports structs and arrays; there is a simple mapping from the richer
Catalyst types to these two. Strings are treated as arrays of bytes.
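As a minimal sketch of the string mapping described above (illustrative only,
not Spark's actual classes): a string column stores (offset, length) pairs
into a shared byte buffer, exactly like any other array-of-bytes column.

```python
# Hypothetical model of "strings as arrays of bytes": each row is an
# (offset, length) pair into one contiguous byte buffer.
byte_buffer = bytearray(b"sparkcolumnar")
offsets = [0, 5]      # row 0 starts at byte 0, row 1 at byte 5
lengths = [5, 8]      # row 0 is 5 bytes ("spark"), row 1 is 8 ("columnar")

def read_string(row_id):
    """Slice this row's bytes out of the shared buffer and decode."""
    o, l = offsets[row_id], lengths[row_id]
    return bytes(byte_buffer[o:o + l]).decode("utf-8")
```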
ColumnarBatch contains a column for each node of the schema; non-complex
schemas consist of just leaf nodes. A struct is an internal node with one
child per field, and an array is an internal node with a single child.
Struct columns carry only nullability; array columns carry offsets and
lengths into the child column. This structure handles arbitrary nesting,
and it has the key property that the layout stays columnar throughout:
primitive values are stored only in the leaf nodes, contiguous across rows.
For example, if the schema is array<array<int>>, all of the int data is
stored consecutively.
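The nested layout above can be sketched as follows (a hypothetical model,
not Spark's actual API): for a schema of array<array<int>>, each array level
keeps (offset, length) pairs into its child, and all int values sit
contiguously in the leaf buffer.

```python
# Rows being modeled: [[ [1, 2], [3] ], [ [4, 5, 6] ]]

leaf_ints = [1, 2, 3, 4, 5, 6]   # leaf node: all primitives, contiguous

# inner array<int> node: one (offset, length) pair per inner array,
# indexing into leaf_ints
inner_offsets = [0, 2, 3]
inner_lengths = [2, 1, 3]

# outer array<array<int>> node: (offset, length) pairs indexing into
# the inner node's entries
outer_offsets = [0, 2]
outer_lengths = [2, 1]

def read_row(row_id):
    """Reassemble one top-level row from the columnar buffers."""
    start, n = outer_offsets[row_id], outer_lengths[row_id]
    row = []
    for i in range(start, start + n):
        o, l = inner_offsets[i], inner_lengths[i]
        row.append(leaf_ints[o:o + l])
    return row
```

Note that reading never copies the leaf data into per-row objects until a
row is actually materialized; the int values stay in one flat buffer.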
As part of this, this patch adds append APIs alongside the put APIs
(e.g. putLong(rowId, v) vs. appendLong(v)). The append APIs are necessary
when the batch contains variable-length elements: the vectors are not fixed
length and grow as needed, which makes usage much simpler for the writer.
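To illustrate the put-vs-append distinction (a sketch with illustrative
names, not Spark's actual vectorized classes): a growable long column can
expose both a positional put and an append that tracks the next write slot,
which is what a writer emitting variable-length nested data wants.

```python
class GrowableLongColumn:
    """Hypothetical growable column of longs with put and append APIs."""

    def __init__(self, capacity=4):
        self.data = [0] * capacity
        self.elements = 0            # number of rows written so far

    def _reserve(self, needed):
        # double the backing storage until there is room, as a
        # growable vector would
        while needed > len(self.data):
            self.data.extend([0] * len(self.data))

    def put_long(self, row_id, v):
        # put API: the caller must know the row id up front
        self._reserve(row_id + 1)
        self.data[row_id] = v
        self.elements = max(self.elements, row_id + 1)

    def append_long(self, v):
        # append API: the column tracks the write position itself,
        # which is far simpler when element counts are not known ahead
        self.put_long(self.elements, v)
        return self.elements - 1     # row id the value landed in
```

With fixed-width data either API works, but for variable-length children
(array elements, string bytes) the writer cannot precompute row ids, so
append is the natural interface.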
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/nongli/spark spark-12854
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/10820.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #10820
----
commit 518d9bc64fd687204b157b1b36816620482716bf
Author: Nong Li <[email protected]>
Date: 2016-01-05T04:39:31Z
[WIP] [SPARK-12854][SQL] Implement complex types support in ColumnarBatch.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]