GitHub user liancheng opened a pull request:
https://github.com/apache/spark/pull/7231
[SPARK-6776] [SPARK-8811] [SQL] Refactors Parquet read path for
interoperability and backwards-compatibility
This PR ensures interoperability and backwards-compatibility for the read
path. A new `CatalystRowConverter` class is introduced to replace the old
`CatalystConverter` class hierarchy. The new class closely resembles
`AvroIndexedRecordConverter` in parquet-avro.
Now Spark SQL is expected to be able to read legacy Parquet data files
generated by most (if not all) common libraries/tools like parquet-thrift,
parquet-avro, and parquet-hive.
Major changes:
1. `CatalystConverter` class hierarchy refactoring
- Replaces `CatalystConverter` trait with a much simpler
`ParentContainerUpdater`.
Now, instead of extending the original `CatalystConverter` trait, every
converter class accepts an updater, which is responsible for propagating the
converted value to some parent container: for example, appending array
elements to a parent array buffer, appending a key-value pair to a parent
mutable map, or setting a converted value to some specific field of a parent
row (see the first sketch after this list). The root converter doesn't have a
parent and thus uses a `NoopUpdater`.
This simplifies the design, since converters no longer need to know about
the details of their parent converters.
- Unifies `CatalystRootConverter`, `CatalystGroupConverter`, and
`CatalystPrimitiveRowConverter` into `CatalystRowConverter`.
Specifically, all row objects are now represented by
`SpecificMutableRow` during conversion.
- Refactors `CatalystArrayConverter` and removes
`CatalystArrayContainsNullConverter` and `CatalystNativeArrayConverter`.
`CatalystNativeArrayConverter` was probably designed with the
intention of avoiding boxing costs, but the way it uses Scala generics
doesn't actually achieve this goal.
The new `CatalystArrayConverter` handles both nullable and
non-nullable array elements in a consistent way.
- Implements backwards-compatibility rules in `CatalystArrayConverter`.
By the time Parquet records are converted, the schema of the Parquet
file has already been verified, so we only need to care about the structure
rather than the field names in the Parquet schema. Since all MAP structures
written by legacy systems share the layout of the standard one (see
[backwards-compatibility rules for MAP] [1]), only LIST (i.e., arrays) needs
special handling in `CatalystArrayConverter`.
2. Requested columns handling
When specifying requested columns in `RowReadSupport`, we used to use a
Parquet `MessageType` converted from a Catalyst `StructType` containing all
requested columns. This is not ideal for compatibility and interoperability,
because the actual Parquet file may have a physical structure different from
the converted schema.
In this PR, the schema for requested columns is constructed as follows
(see the second sketch after this list):
- For a column that exists in the target Parquet file, we extract the
column type by name from the full file schema and construct a single-field
`MessageType` for that column.
- For a column that doesn't exist in the target Parquet file, we create
a single-field `StructType` and convert it to a `MessageType` using
`CatalystSchemaConverter`.
- Finally, all single-field `MessageType`s are unioned into a full schema
containing all requested fields.
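
To make the updater idea in item 1 concrete, here is a minimal Scala sketch;
the names below (`set`, `ElementConverter`, `ArrayConverter`) are simplified
stand-ins chosen for illustration, not the exact classes added by this PR:

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative sketch of the updater pattern, not the actual PR code.
trait ParentContainerUpdater {
  def set(value: Any): Unit
}

// The root converter has no parent, so its updater does nothing.
object NoopUpdater extends ParentContainerUpdater {
  override def set(value: Any): Unit = ()
}

// A child converter only hands its result to the updater it was given;
// it never reaches into its parent converter directly.
class ElementConverter(updater: ParentContainerUpdater) {
  def addInt(value: Int): Unit = updater.set(value)
}

// An array converter wires its element converter to an updater that
// appends each converted element into the array converter's own buffer.
class ArrayConverter {
  private val buffer = ArrayBuffer.empty[Any]

  val elementConverter = new ElementConverter(new ParentContainerUpdater {
    override def set(value: Any): Unit = buffer += value
  })

  def currentArray: Seq[Any] = buffer
}
```

From a child converter's point of view, array, map, and row converters all
look the same: the child only ever calls `updater.set(...)`.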
[1]:
https://github.com/apache/incubator-parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1
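
And here is a rough sketch of the requested-schema assembly in item 2,
assuming parquet-mr's `org.apache.parquet.schema.MessageType` API; the helper
`buildRequestedSchema` and the `toParquet` parameter (standing in for the
relevant `CatalystSchemaConverter` call) are assumptions made for the
example, not the code added by this PR:

```scala
import scala.collection.JavaConverters._

import org.apache.parquet.schema.MessageType
import org.apache.spark.sql.types.{StructField, StructType}

object RequestedSchemaSketch {
  // Builds the Parquet schema for the requested columns, one single-field
  // MessageType per column, then unions them all together.
  def buildRequestedSchema(
      requestedColumns: Seq[StructField],
      fileSchema: MessageType,
      toParquet: StructType => MessageType): MessageType = {
    // Names of the columns physically present in this particular file.
    val fileFieldNames = fileSchema.getFields.asScala.map(_.getName).toSet

    requestedColumns
      .map { field =>
        if (fileFieldNames.contains(field.name)) {
          // The column exists in the file: reuse its physical type verbatim.
          new MessageType("root", fileSchema.getType(field.name))
        } else {
          // The column is missing from the file: derive a Parquet type from
          // the Catalyst type instead.
          toParquet(StructType(Seq(field)))
        }
      }
      // Union all single-field schemas into one schema covering every
      // requested column. (The sketch ignores the empty-requested-columns
      // edge case that one of the commits below addresses.)
      .reduce(_ union _)
  }
}
```

Passing `toParquet` as a function keeps the sketch self-contained; in the PR
itself that role is played by `CatalystSchemaConverter`.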
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/liancheng/spark spark-6776
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/7231.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #7231
----
commit 1b76491839592f67aceec5974cb7ec69e7ad58f3
Author: Cheng Lian <[email protected]>
Date: 2015-07-05T06:04:35Z
Refactors Parquet read path to implement backwards-compatibility rules
commit 999c27ae6639495a23f3b0493568856315f0be69
Author: Cheng Lian <[email protected]>
Date: 2015-07-05T06:44:54Z
Removes old Parquet record converters
commit 00b068fbaa2260bd96b07f579615ea5753149b7d
Author: Cheng Lian <[email protected]>
Date: 2015-07-05T07:30:37Z
More comments
commit 45b632e44fc5577fe9c31d19df1d62dee85aea96
Author: Cheng Lian <[email protected]>
Date: 2015-07-05T07:30:50Z
Removes the 16-byte restriction of decimals
commit 0c9335bfbebad7ac10864787b63ce9124e6ebd6a
Author: Cheng Lian <[email protected]>
Date: 2015-07-05T09:26:39Z
Assembles requested schema from Parquet file schema
commit c93424be246ff45c37cdb995a2b6d37ba97ab3f1
Author: Cheng Lian <[email protected]>
Date: 2015-07-05T09:27:05Z
Adds test case for SPARK-8811
commit bce07f5953b9675c66e9429e73f543747dbec0b5
Author: Cheng Lian <[email protected]>
Date: 2015-07-05T09:29:35Z
Reverts an unnecessary debugging change
commit 61cce6ccddc0d61ccb28bd97c11c14749b720f0b
Author: Cheng Lian <[email protected]>
Date: 2015-07-06T01:53:00Z
Adds explicit return type
commit 35814975543120e995a93fbf84e6fb57436d956d
Author: Cheng Lian <[email protected]>
Date: 2015-07-06T01:55:03Z
Fixes bugs related to schema merging and empty requested columns
----