GitHub user liancheng opened a pull request:
https://github.com/apache/spark/pull/8070
[SPARK-9340] [SQL] Fixes converting unannotated Parquet lists
This PR is inspired by #8063 authored by @dguy. In particular, the Parquet
test files added here are all taken from that PR.
**Committer who merges this PR should attribute it to "Damian Guy
<[email protected]>".**
----
SPARK-6776 and SPARK-6777 followed `parquet-avro` to implement the
backwards-compatibility rules defined in the `parquet-format` spec. However,
both Spark SQL and `parquet-avro` neglected the following statement in
`parquet-format`:
> ...
> This does not affect repeated fields that are not annotated: A repeated
> field that is neither contained by a `LIST`- or `MAP`-annotated group nor
> annotated by `LIST` or `MAP` should be interpreted as a required list of
> required elements where the element type is the type of the field.
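As a rough illustration of the quoted rule, the sketch below maps a repeated, unannotated field to a non-nullable array of non-null elements. All types here (`PType`, `DType`, `Field`, `UnannotatedLists`) are hypothetical, simplified stand-ins and are not the parquet-mr or Catalyst APIs; `LIST`- and `MAP`-annotated groups are deliberately left out.

```scala
// Hypothetical, simplified model of a Parquet schema (not the parquet-mr API).
sealed trait Repetition
case object Required extends Repetition
case object Optional extends Repetition
case object Repeated extends Repetition

sealed trait PType { def name: String; def repetition: Repetition }
case class Primitive(name: String, repetition: Repetition, typeName: String) extends PType
case class Group(name: String, repetition: Repetition, fields: List[PType]) extends PType

// Simplified stand-ins for Catalyst data types (assumption, for illustration only).
sealed trait DType
case class Atomic(typeName: String) extends DType
case class ArrayT(element: DType, containsNull: Boolean) extends DType
case class StructT(fields: List[Field]) extends DType
case class Field(name: String, dataType: DType, nullable: Boolean)

object UnannotatedLists {
  // Type of a single value of the field, ignoring the field's own repetition.
  def elementType(t: PType): DType = t match {
    case Primitive(_, _, typeName) => Atomic(typeName)
    case Group(_, _, fields)       => StructT(fields.map(convertField))
  }

  // An unannotated repeated field becomes a required list of required elements.
  def convertField(t: PType): Field = t.repetition match {
    case Repeated => Field(t.name, ArrayT(elementType(t), containsNull = false), nullable = false)
    case Optional => Field(t.name, elementType(t), nullable = true)
    case Required => Field(t.name, elementType(t), nullable = false)
  }
}
```

Under these assumptions, a field declared as `repeated int32 values` is read as a non-nullable array of non-null integers rather than as a scalar.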
One consequence is that Parquet files generated by `parquet-protobuf` that
contain unannotated repeated fields are not correctly converted to Catalyst
arrays.
This PR fixes this issue by:
1. Handling unannotated repeated fields in `CatalystSchemaConverter`.
2. Converting these unannotated repeated fields to Catalyst arrays in
`CatalystRowConverter`.
Two special converters, `RepeatedPrimitiveConverter` and
`RepeatedGroupConverter`, are added. They delegate actual conversion work to a
child `elementConverter` and accumulate elements in an `ArrayBuffer`.
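The delegate-and-accumulate pattern described above might look roughly like the following. This is a minimal sketch and not the code in this PR: `ElementConverter`, `IntElementConverter`, and `RepeatedIntConverter` are hypothetical names, and only a single `addInt` callback is modelled instead of Parquet's full converter interface.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a Parquet primitive converter; only one callback is modelled.
trait ElementConverter {
  def addInt(value: Int): Unit
}

// Converts a single element and hands the result to `sink`.
class IntElementConverter(sink: Any => Unit) extends ElementConverter {
  override def addInt(value: Int): Unit = sink(value)
}

// Sketch of a converter for a repeated primitive field: each callback is
// forwarded to the child `elementConverter`, whose output is appended to
// `currentArray`.
class RepeatedIntConverter extends ElementConverter {
  val currentArray: ArrayBuffer[Any] = ArrayBuffer.empty[Any]

  private val elementConverter: ElementConverter =
    new IntElementConverter(v => { currentArray += v; () })

  override def addInt(value: Int): Unit = elementConverter.addInt(value)
}
```

Keeping the per-element logic in the child converter means the repeated wrapper only has to manage the buffer, which is presumably why the same shape can serve both primitive and group elements.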
Two extra methods, `start()` and `end()`, are added to
`ParentContainerUpdater` so that they can be used to initialize new
`ArrayBuffer`s for unannotated repeated fields and to propagate converted
array values upstream.
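A minimal sketch of how such `start()`/`end()` hooks could be used is shown below. The trait is a simplified stand-in for `ParentContainerUpdater` (the real one is assumed to have more setters), and `RepeatedFieldUpdater` is a hypothetical name introduced only for this illustration.

```scala
import scala.collection.mutable.ArrayBuffer

// Simplified stand-in for ParentContainerUpdater; all methods default to no-ops.
trait ParentContainerUpdater {
  def start(): Unit = ()          // called before a parent record is converted
  def end(): Unit = ()            // called after the parent record is converted
  def set(value: Any): Unit = ()  // receives one converted value
}

// Hypothetical updater for an unannotated repeated field. `parent` receives
// the finished array once per parent record.
class RepeatedFieldUpdater(parent: ParentContainerUpdater) extends ParentContainerUpdater {
  private var currentArray: ArrayBuffer[Any] = ArrayBuffer.empty[Any]

  // A fresh buffer per record, so values never leak between records.
  override def start(): Unit = { currentArray = ArrayBuffer.empty[Any] }

  // Each converted element of the repeated field is accumulated.
  override def set(value: Any): Unit = { currentArray += value }

  // Propagate an immutable snapshot of the accumulated elements upstream.
  override def end(): Unit = parent.set(currentArray.toVector)
}
```

With this shape, a record in which the repeated field has no values still yields an empty array, because `end()` fires even when `set` was never called.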
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/liancheng/spark spark-9340/unannotated-parquet-list
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/8070.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #8070
----
commit a11b7c09f2686d2db210a147acfec18b021d555e
Author: Cheng Lian <[email protected]>
Date: 2015-08-10T16:57:10Z
Fixes converting unannotated Parquet lists
----