GitHub user liancheng opened a pull request:

    https://github.com/apache/spark/pull/8361

    [SPARK-10136] [SQL] A more robust fix for SPARK-10136

    PR #8341 is a valid fix for SPARK-10136, but it didn't address the real root 
cause.  The real problem can be rather tricky to explain, and requires readers 
to be fairly familiar with the parquet-format spec, especially the details of 
the `LIST` backwards-compatibility rules.  Let me try to give an explanation 
here.
    
    The structure of the problematic Parquet schema generated by parquet-avro 
is something like this:
    
    ```
    message m {
      <repetition> group f (LIST) {         // Level 1
        repeated group array (LIST) {       // Level 2
          repeated <primitive-type> array;  // Level 3
        }
      }
    }
    ```
    
    (The schema generated by parquet-thrift is structurally similar: just 
replace the `array` at level 2 with `f_tuple`, and the one at level 3 with 
`f_tuple_tuple`.)
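    
    For reference, this problematic (parquet-avro style) shape can be 
reconstructed with parquet-mr's schema parser.  A minimal sketch, assuming 
`int32` as the primitive type and `optional` repetition (both illustrative):
    
    ```scala
    import org.apache.parquet.schema.MessageTypeParser

    // Illustrative reconstruction of the problematic schema shape above,
    // assuming int32 as the primitive type and optional repetition.
    val problematicSchema = MessageTypeParser.parseMessageType(
      """message m {
        |  optional group f (LIST) {
        |    repeated group array (LIST) {
        |      repeated int32 array;
        |    }
        |  }
        |}""".stripMargin)
    ```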
    
    This structure consists of two nested legacy 2-level `LIST`-like structures:
    
    1. The repeated group type at level 2 is the element type of the outer 
array defined at level 1
    
       This group should map to a `CatalystArrayConverter.ElementConverter` 
when building converters.
    
    2. The repeated primitive type at level 3 is the element type of the inner 
array defined at level 2
    
       This type should also map to a 
`CatalystArrayConverter.ElementConverter` (see the sketch right after this list).
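    
    For concreteness, the Catalyst type the schema above should decode to looks 
as follows; each of the two nested `ArrayType`s corresponds to one of the 
`ElementConverter`s listed above.  A minimal sketch, again assuming `int32` as 
the primitive type (illustrative only):
    
    ```scala
    import org.apache.spark.sql.types._

    // Expected Catalyst mapping of the schema above, assuming int32 as the primitive.
    // The legacy 2-level encoding implies required elements, hence containsNull = false.
    val expectedType: DataType =
      ArrayType(                                      // outer list (level 1)
        ArrayType(IntegerType, containsNull = false), // inner list (levels 2/3)
        containsNull = false)
    ```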
    
    The root cause of SPARK-10136 is that the group at level 2 isn't properly 
recognized as the element type of the list at level 1.  Thus, according to the 
parquet-format spec, the repeated primitive at level 3 is left as a so-called 
"unannotated repeated primitive type", and is recognized as a required list of 
required primitive type, so a `RepeatedPrimitiveConverter` rather than a 
`CatalystArrayConverter.ElementConverter` is created for it.
    
    According to the parquet-format spec, unannotated repeated types shouldn't 
appear in a `LIST`- or `MAP`-annotated group.  PR #8341 fixed this issue by 
allowing such unannotated repeated types to appear in `LIST`-annotated groups, 
which is a non-standard, hacky, but still valid fix.  (I didn't realize this 
when authoring #8341, though.)
    
    The reason why the group at level 2 isn't recognized as the list element 
type is the following `LIST` backwards-compatibility rule defined in the 
parquet-format spec:
    
    > If the repeated field is a group with one field and is named either 
`array` or uses the `LIST`-annotated group's name with `_tuple` appended then 
the repeated type is the element type and elements are required.
    
    (The `array` part is for parquet-avro compatibility, while the `_tuple` 
part is for parquet-thrift.)
    
    This rule is implemented in [`CatalystSchemaConverter.isElementType`] [1], 
but neglected in [`CatalystRowConverter.isElementType`] [2].  This PR delivers 
a more robust fix by adding this rule to the latter method as well.
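    
    Concretely, the quoted rule boils down to a name-based check roughly along 
these lines.  This is just a sketch against parquet-mr's `Type` API; the helper 
name is illustrative, and it is not the exact method body in either converter:
    
    ```scala
    import org.apache.parquet.schema.Type

    // Sketch of the LIST backwards-compatibility rule quoted above: the repeated
    // field is itself the element type when it is a group with exactly one field
    // that is named either "array" or "<list-name>_tuple".
    def isLegacyElementType(repeatedType: Type, listName: String): Boolean =
      !repeatedType.isPrimitive &&
        repeatedType.asGroupType().getFieldCount == 1 &&
        (repeatedType.getName == "array" ||
          repeatedType.getName == s"${listName}_tuple")
    ```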
    
    Note that parquet-avro 1.7.0 also suffers from this issue. Details can be 
found at [PARQUET-364] [3].
    
    [1]: 
https://github.com/apache/spark/blob/85f9a61357994da5023b08b0a8a2eb09388ce7f8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L259-L305
    [2]: 
https://github.com/apache/spark/blob/85f9a61357994da5023b08b0a8a2eb09388ce7f8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala#L456-L463
    [3]: https://issues.apache.org/jira/browse/PARQUET-364

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/liancheng/spark spark-10136/proper-version

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8361.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #8361
    
----
commit 15cd105f11a411b1cdbcdd678e2a571ed2c86743
Author: Cheng Lian <[email protected]>
Date:   2015-08-21T16:43:18Z

    Fixes SPARK-10136 in a more proper way

----

