GitHub user nongli opened a pull request:
https://github.com/apache/spark/pull/10908
[SPARK-12992][SQL]: Update parquet reader to support more types when
decoding to ColumnarBatch.
This patch implements support for more types in the vectorized decode path. A few more types remain, but they should be very straightforward to add after this. The code has a few copy-and-paste pieces, but they are difficult to eliminate due to performance considerations.
Specifically, this patch adds support for:
- String, Long, Byte types
- Dictionary encoding for those types (a rough decoding sketch follows below).
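As a rough illustration of what dictionary decoding involves, the sketch below resolves dictionary ids to long values one column at a time. It is only a sketch: the names Dictionary, LongColumnSketch and decodeDictionaryIds are hypothetical and are not the ColumnVector API this patch actually touches.

    // Illustrative sketch only -- not the patch's reader code. Dictionary
    // encoding stores each distinct value once plus a small integer id per
    // row; the vectorized decode resolves ids for a whole batch in one loop.
    final class Dictionary {
      private final long[] longValues;   // distinct values, indexed by id
      Dictionary(long[] longValues) { this.longValues = longValues; }
      long decodeToLong(int id) { return longValues[id]; }
    }

    final class LongColumnSketch {
      private final long[] data;         // decoded column values, one per row
      LongColumnSketch(int capacity) { this.data = new long[capacity]; }

      // Resolve dictionary ids to values column-at-a-time.
      void decodeDictionaryIds(int numRows, int[] ids, Dictionary dict) {
        for (int row = 0; row < numRows; row++) {
          data[row] = dict.decodeToLong(ids[row]);
        }
      }

      long getLong(int rowId) { return data[rowId]; }
    }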
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/nongli/spark spark-12992
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/10908.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #10908
----
commit 518d9bc64fd687204b157b1b36816620482716bf
Author: Nong Li <[email protected]>
Date: 2016-01-05T04:39:31Z
[WIP] [SPARK-12854][SQL] Implement complex types support in ColumnarBatch.
WIP: this patch adds some random row generation. The test code needs to be cleaned up as it duplicates functionality from elsewhere. The non-test code should be good to review.
This patch adds support for complex types in ColumnarBatch. ColumnarBatch supports structs and arrays, and there is a simple mapping from the richer Catalyst types to these two; strings are treated as arrays of bytes.

ColumnarBatch will contain a column for each node of the schema. Non-complex schemas consist of just leaf nodes. Structs are internal nodes with one child per field; arrays are internal nodes with one child. Structs just track nullability, while arrays store offsets and lengths into the child array. This structure can handle arbitrary nesting, and it has the key property that the layout remains columnar throughout: primitive values are stored only in the leaf nodes, contiguously across rows. For example, if the schema is array<array<int>>, all of the int data is stored consecutively.
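To make the array<array<int>> example concrete, here is a minimal, self-contained sketch of the offset/length layout. The class names (IntColumnSketch, ArrayColumnSketch) are hypothetical and are not the classes in this patch; only the layout idea is taken from the description above.

    // Illustrative sketch only. Under schema array<array<int>>, three columns
    // are enough: two offset/length columns and one leaf column that holds
    // every int contiguously.
    final class IntColumnSketch {
      final int[] data;                  // leaf values, contiguous across rows
      IntColumnSketch(int capacity) { data = new int[capacity]; }
    }

    final class ArrayColumnSketch {
      final int[] offsets;               // start index into the child column
      final int[] lengths;               // number of child elements
      ArrayColumnSketch(int capacity) {
        offsets = new int[capacity];
        lengths = new int[capacity];
      }
    }

    final class NestedLayoutExample {
      public static void main(String[] args) {
        // One row whose value is [[1, 2], [3]].
        IntColumnSketch leaf = new IntColumnSketch(16);      // all int data
        ArrayColumnSketch inner = new ArrayColumnSketch(16); // arrays of int
        ArrayColumnSketch outer = new ArrayColumnSketch(16); // arrays of arrays

        leaf.data[0] = 1; leaf.data[1] = 2; leaf.data[2] = 3;

        inner.offsets[0] = 0; inner.lengths[0] = 2;          // [1, 2]
        inner.offsets[1] = 2; inner.lengths[1] = 1;          // [3]

        outer.offsets[0] = 0; outer.lengths[0] = 2;          // two inner arrays
      }
    }

In this layout the leaf column holds every int from every row back to back, which is what keeps the representation columnar and the primitive data contiguous even for nested schemas.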
As part of this, this patch adds append APIs in addition to the put APIs (e.g. putLong(rowid, v) vs. appendLong(v)). These APIs are necessary when the batch contains variable-length elements: the vectors are not fixed length and will grow as necessary. This should make usage a lot simpler for the writer.
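A minimal sketch of the put-vs-append distinction, assuming a simple growable long column rather than this patch's actual ColumnVector implementation:

    import java.util.Arrays;

    // Illustrative sketch only; GrowableLongColumn is a hypothetical name.
    final class GrowableLongColumn {
      private long[] data = new long[4];
      private int elementsAppended = 0;

      // Positional write: the caller already knows the row id and has
      // reserved enough capacity.
      void putLong(int rowId, long value) { data[rowId] = value; }

      // Append-style write: the vector grows as necessary, so writers of
      // variable-length data never have to size it up front.
      void appendLong(long value) {
        if (elementsAppended == data.length) {
          data = Arrays.copyOf(data, data.length * 2);
        }
        data[elementsAppended++] = value;
      }

      long getLong(int rowId) { return data[rowId]; }
    }

The positional put suits fixed-length data where the row count is known up front; the append form lets writers of variable-length data emit values without pre-sizing the vector, which is the simplification for the writer mentioned above.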
commit f579889413c3978355fb0fb0d0ac02719803cb30
Author: Nong Li <[email protected]>
Date: 2016-01-25T22:37:07Z
Updates
commit ea1f406ef9c73c7344128dca23efb1c9ac29a0f0
Author: Nong Li <[email protected]>
Date: 2016-01-20T01:46:29Z
[SPARK-12992][SQL]: Update parquet reader to support more types when
decoding to ColumnarBatch.
This patch implements support for more types in the vectorized decode path. A few more types remain, but they should be very straightforward to add after this. The code has a few copy-and-paste pieces, but they are difficult to eliminate due to performance considerations.
Specifically, this patch adds support for:
- String, Long, Byte types
- Dictionary encoding for those types.
----