GitHub user liancheng opened a pull request:
https://github.com/apache/spark/pull/7231
[SPARK-6776] [SPARK-8811] [SQL] Refactors Parquet read path for
interoperability and backwards-compatibility
This PR ensures interoperability and backwards-compatibility for the read
path. A new `CatalystRowConverter` class is introduced to replace the old
`CatalystConverter` class hierarchy. The new class closely resembles
`AvroIndexedRecordConverter` in parquet-avro.
Now Spark SQL is expected to be able to read legacy Parquet data files
generated by most (if not all) common libraries/tools like parquet-thrift,
parquet-avro, and parquet-hive.
Major changes:
1. `CatalystConverter` class hierarchy refactoring
- Replaces `CatalystConverter` trait with a much simpler
`ParentContainerUpdater`.
Now, instead of extending the original `CatalystConverter` trait, every
converter class accepts an updater, which is responsible for propagating the
converted value to some parent container: for example, appending array
elements to a parent array buffer, appending a key-value pair to a parent
mutable map, or setting a converted value to some specific field of a parent
row (see the first sketch after this list). The root converter doesn't have a
parent and thus uses a `NoopUpdater`.
This simplifies the design, since converters no longer need to know about
the details of their parent converters.
- Unifies `CatalystRootConverter`, `CatalystGroupConverter`, and
`CatalystPrimitiveRowConverter` into `CatalystRowConverter`.
Specifically, all row objects are now represented by
`SpecificMutableRow` during conversion.
- Refactors `CatalystArrayConverter` and removes
`CatalystArrayContainsNullConverter` and `CatalystNativeArrayConverter`.
`CatalystNativeArrayConverter` was probably designed with the
intention of avoiding boxing costs, but the way it uses Scala generics
doesn't actually achieve this goal.
The new `CatalystArrayConverter` handles both nullable and
non-nullable array elements in a consistent way.
- Implements backwards-compatibility rules in `CatalystArrayConverter`.
By the time Parquet records are converted, the schema of the Parquet
file has already been verified, so we only need to care about the structure
rather than the field names in the Parquet schema. Since all MAP structures
written by legacy systems share the layout of the standard one (see
[backwards-compatibility rules for MAP] [1]), only LIST (i.e., arrays) needs
special handling in `CatalystArrayConverter`.
2. Requested columns handling
When specifying requested columns in `RowReadSupport`, we used to use a
Parquet `MessageType` converted from a Catalyst `StructType` containing all
requested columns. This is not ideal for compatibility and interoperability,
because the actual Parquet file may have a physical structure different from
the converted schema.
In this PR, the schema for requested columns is constructed as follows
(see the second sketch after this list):
- For a column that exists in the target Parquet file, we extract the
column type by name from the full file schema and construct a single-field
`MessageType` for that column.
- For a column that doesn't exist in the target Parquet file, we create
a single-field `StructType` and convert it to a `MessageType` using
`CatalystSchemaConverter`.
- Finally, all single-field `MessageType`s are unioned into a full schema
containing all requested fields.
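
To make the updater idea in item 1 concrete, here is a minimal Scala sketch;
the names below (`set`, `ElementConverter`, `ArrayConverter`) are simplified
stand-ins chosen for illustration, not the exact classes added by this PR:

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative sketch of the updater pattern, not the actual PR code.
trait ParentContainerUpdater {
  def set(value: Any): Unit
}

// The root converter has no parent, so its updater does nothing.
object NoopUpdater extends ParentContainerUpdater {
  override def set(value: Any): Unit = ()
}

// A child converter only hands its result to the updater it was given;
// it never reaches into its parent converter directly.
class ElementConverter(updater: ParentContainerUpdater) {
  def addInt(value: Int): Unit = updater.set(value)
}

// An array converter wires its element converter to an updater that
// appends each converted element into the array converter's own buffer.
class ArrayConverter {
  private val buffer = ArrayBuffer.empty[Any]

  val elementConverter = new ElementConverter(new ParentContainerUpdater {
    override def set(value: Any): Unit = buffer += value
  })

  def currentArray: Seq[Any] = buffer
}
```

From a child converter's point of view, array, map, and row converters all
look the same: the child only ever calls `updater.set(...)`.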
[1]:
https://github.com/apache/incubator-parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1
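
And here is a rough sketch of the requested-schema assembly in item 2,
assuming parquet-mr's `org.apache.parquet.schema.MessageType` API; the helper
`buildRequestedSchema` and the `toParquet` parameter (standing in for the
relevant `CatalystSchemaConverter` call) are assumptions made for the
example, not the code added by this PR:

```scala
import scala.collection.JavaConverters._

import org.apache.parquet.schema.MessageType
import org.apache.spark.sql.types.{StructField, StructType}

object RequestedSchemaSketch {
  // Builds the Parquet schema for the requested columns, one single-field
  // MessageType per column, then unions them all together.
  def buildRequestedSchema(
      requestedColumns: Seq[StructField],
      fileSchema: MessageType,
      toParquet: StructType => MessageType): MessageType = {
    // Names of the columns physically present in this particular file.
    val fileFieldNames = fileSchema.getFields.asScala.map(_.getName).toSet

    requestedColumns
      .map { field =>
        if (fileFieldNames.contains(field.name)) {
          // The column exists in the file: reuse its physical type verbatim.
          new MessageType("root", fileSchema.getType(field.name))
        } else {
          // The column is missing from the file: derive a Parquet type from
          // the Catalyst type instead.
          toParquet(StructType(Seq(field)))
        }
      }
      // Union all single-field schemas into one schema covering every
      // requested column. (The sketch ignores the empty-requested-columns
      // edge case that one of the commits below addresses.)
      .reduce(_ union _)
  }
}
```

Passing `toParquet` as a function keeps the sketch self-contained; in the PR
itself that role is played by `CatalystSchemaConverter`.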
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/liancheng/spark spark-6776
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/7231.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #7231
----
commit 1b76491839592f67aceec5974cb7ec69e7ad58f3
Author: Cheng Lian <[email protected]>
Date: 2015-07-05T06:04:35Z
Refactors Parquet read path to implement backwards-compatibility rules
commit 999c27ae6639495a23f3b0493568856315f0be69
Author: Cheng Lian <[email protected]>
Date: 2015-07-05T06:44:54Z
Removes old Parquet record converters
commit 00b068fbaa2260bd96b07f579615ea5753149b7d
Author: Cheng Lian <[email protected]>
Date: 2015-07-05T07:30:37Z
More comments
commit 45b632e44fc5577fe9c31d19df1d62dee85aea96
Author: Cheng Lian <[email protected]>
Date: 2015-07-05T07:30:50Z
Removes the 16-byte restriction of decimals
commit 0c9335bfbebad7ac10864787b63ce9124e6ebd6a
Author: Cheng Lian <[email protected]>
Date: 2015-07-05T09:26:39Z
Assembles requested schema from Parquet file schema
commit c93424be246ff45c37cdb995a2b6d37ba97ab3f1
Author: Cheng Lian <[email protected]>
Date: 2015-07-05T09:27:05Z
Adds test case for SPARK-8811
commit bce07f5953b9675c66e9429e73f543747dbec0b5
Author: Cheng Lian <[email protected]>
Date: 2015-07-05T09:29:35Z
Reverts an unnecessary debugging change
commit 61cce6ccddc0d61ccb28bd97c11c14749b720f0b
Author: Cheng Lian <[email protected]>
Date: 2015-07-06T01:53:00Z
Adds explicit return type
commit 35814975543120e995a93fbf84e6fb57436d956d
Author: Cheng Lian <[email protected]>
Date: 2015-07-06T01:55:03Z
Fixes bugs related to schema merging and empty requested columns
----