GitHub user liancheng opened a pull request:

    https://github.com/apache/spark/pull/8070

    [SPARK-9340] [SQL] Fixes converting unannotated Parquet lists

    This PR is inspired by #8063 authored by @dguy. In particular, the Parquet
    test files added here are all taken from that PR.
    
    **Committer who merges this PR should attribute it to "Damian Guy 
<[email protected]>".**
    
    ----
    
    SPARK-6776 and SPARK-6777 followed `parquet-avro` in implementing the
    backwards-compatibility rules defined in the `parquet-format` spec. However,
    both Spark SQL and `parquet-avro` overlooked the following statement in
    `parquet-format`:
    
    > ...
    > This does not affect repeated fields that are not annotated: A repeated 
field that is neither contained by a `LIST`- or `MAP`-annotated group nor 
annotated by `LIST` or `MAP` should be interpreted as a required list of 
required elements where the element type is the type of the field.
    
    One consequence is that Parquet files generated by `parquet-protobuf` that
    contain unannotated repeated fields are not correctly converted to Catalyst
    arrays.
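
    The quoted rule can be illustrated with a small, self-contained sketch. It
    is written in Python as a toy model and is not the Spark or parquet-mr API:
    `ParquetField`, `ArrayType`, and `interpret` are hypothetical names
    introduced only for this illustration, and every case other than the
    unannotated repeated field is deliberately left out of scope.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParquetField:
    """Toy stand-in for a Parquet schema field (hypothetical, not parquet-mr)."""
    name: str
    repetition: str                    # "required", "optional", or "repeated"
    type_name: str                     # e.g. "int32", "binary"
    annotation: Optional[str] = None   # e.g. "LIST", "MAP", or None


@dataclass(frozen=True)
class ArrayType:
    """Toy stand-in for an array type with an element-nullability flag."""
    element_type: str
    contains_null: bool


def interpret(field, parent_annotation=None):
    """Return (data_type, nullable) for a field, applying the quoted rule."""
    unannotated_repeated = (
        field.repetition == "repeated"
        and field.annotation not in ("LIST", "MAP")
        and parent_annotation not in ("LIST", "MAP")
    )
    if unannotated_repeated:
        # Required list of required elements; the element type is the
        # type of the repeated field itself.
        return ArrayType(field.type_name, contains_null=False), False
    # All other cases are out of scope for this sketch: pass the type through.
    return field.type_name, field.repetition == "optional"
```

    Under these assumptions, a field declared as `repeated int32 f` with no
    annotation maps to a non-nullable array of non-nullable `int32` elements,
    while the same field nested directly inside a `LIST`-annotated group is
    left to the regular list-handling rules.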
    
    This PR fixes this issue by
    
    1. Handling unannotated repeated fields in `CatalystSchemaConverter`.
    2. Converting such unannotated repeated fields to Catalyst arrays in
       `CatalystRowConverter`.
    
       Two special converters, `RepeatedPrimitiveConverter` and
       `RepeatedGroupConverter`, are added. They delegate the actual conversion
       work to a child `elementConverter` and accumulate elements in an
       `ArrayBuffer`.
    
       Two extra methods, `start()` and `end()`, are added to
       `ParentContainerUpdater` so that they can be used to initialize new
       `ArrayBuffer`s for unannotated repeated fields and to propagate converted
       array values upstream.
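
    The interplay between `start()`, `end()`, and the element buffer can be
    sketched roughly as follows. This is a toy model in Python, not the code in
    this PR: `RowFieldUpdater`, `RepeatedFieldUpdater`, `ElementConverter`, and
    `convert_record` are hypothetical names, and a plain Python list stands in
    for the `ArrayBuffer`.

```python
class ParentContainerUpdater:
    """Toy updater interface; start() and end() default to no-ops."""

    def start(self):
        pass

    def end(self):
        pass

    def set(self, value):
        raise NotImplementedError


class RowFieldUpdater(ParentContainerUpdater):
    """Writes a converted value into one slot of the parent row."""

    def __init__(self, row, ordinal):
        self.row = row
        self.ordinal = ordinal

    def set(self, value):
        self.row[self.ordinal] = value


class RepeatedFieldUpdater(ParentContainerUpdater):
    """Buffers the elements of one unannotated repeated field."""

    def __init__(self, parent_updater):
        self.parent_updater = parent_updater
        self.buffer = None

    def start(self):
        # A fresh buffer for every record, so elements never leak across rows.
        self.buffer = []

    def set(self, value):
        self.buffer.append(value)

    def end(self):
        # Hand the completed array to the parent container.
        self.parent_updater.set(list(self.buffer))


class ElementConverter:
    """Hypothetical child converter: converts one raw value, then reports it."""

    def __init__(self, updater, convert):
        self.updater = updater
        self.convert = convert

    def add(self, raw):
        self.updater.set(self.convert(raw))


def convert_record(raw_values):
    """Convert one record that has a single repeated field into a one-slot row."""
    row = [None]
    updater = RepeatedFieldUpdater(RowFieldUpdater(row, 0))
    element_converter = ElementConverter(updater, int)
    updater.start()
    for raw in raw_values:
        element_converter.add(raw)
    updater.end()
    return row
```

    In this sketch a record with no occurrences of the repeated field still
    produces an empty array rather than a null, which matches the "required
    list" reading of the spec quoted above.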

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark spark-9340/unannotated-parquet-list

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8070.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #8070
    
----
commit a11b7c09f2686d2db210a147acfec18b021d555e
Author: Cheng Lian <[email protected]>
Date:   2015-08-10T16:57:10Z

    Fixes converting unannotated Parquet lists

----

