GitHub user budde opened a pull request:
https://github.com/apache/spark/pull/5188
[SPARK-6538][SQL] Add missing nullable Metastore fields when merging a
Parquet schema
When Spark SQL infers a schema for a DataFrame, it will take the union of
all field types present in the structured source data (e.g. an RDD of JSON
data). When the source data for a row doesn't define a particular field on the
DataFrame's schema, a null value will simply be assumed for this field. This
workflow makes it very easy to construct tables and query over a set of
structured data with a nonuniform schema. However, this behavior is not
consistent in some cases when dealing with Parquet files that back an external
table managed by a Hive metastore.
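For illustration, a minimal sketch of this inference behavior from a spark-shell session (the JSON records and field names here are hypothetical, not taken from the patch):

```scala
// sc and sqlContext are the instances provided by spark-shell (Spark 1.3).
val rows = sc.parallelize(Seq(
  """{"id": 1, "name": "a"}""",
  """{"id": 2, "score": 0.5}"""))

// Schema inference takes the union of the fields seen across all records.
val df = sqlContext.jsonRDD(rows)
df.printSchema()
// root
//  |-- id: long (nullable = true)
//  |-- name: string (nullable = true)
//  |-- score: double (nullable = true)

// Rows that lack a field simply get null for it.
df.show()
```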
In our particular use case, we use Spark Streaming to parse and transform
our input data and then apply a window function to save each arbitrary-sized
batch of data as a Parquet file, which is then added as a partition to
an external Hive table via an *"ALTER TABLE... ADD PARTITION..."* statement.
Since our input data is nonuniform, not every partition batch will contain
every field present in the table's schema obtained from the Hive metastore. As
such, we expect that the schema of some of our Parquet files will not contain
the same set of fields present in the full metastore schema.
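Roughly, each window ends up doing something like the following sketch (the path, table name, and helper are hypothetical; `batchDf` stands for the DataFrame produced for that window):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext

// Hypothetical helper: write one window's batch and register it as a partition.
def saveBatchAsPartition(hiveContext: HiveContext, batchDf: DataFrame, dt: String): Unit = {
  val path = s"hdfs:///warehouse/events/dt=$dt"

  // Only the fields actually present in this batch end up in the file's Parquet schema.
  batchDf.saveAsParquetFile(path)

  // Register the new file as a partition of the external Hive table.
  hiveContext.sql(
    s"ALTER TABLE events ADD PARTITION (dt='$dt') LOCATION '$path'")
}
```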
In such cases, it seems natural that Spark SQL would simply assume null
values for any missing fields in the partition's Parquet file, assuming these
fields are specified as nullable by the metastore schema. This is not the case
in the current implementation of ParquetRelation2. The
**mergeMetastoreParquetSchema()** method used to reconcile differences between
a Parquet file's schema and a schema retrieved from the Hive metastore will
raise an exception if the Parquet file's schema doesn't contain the same set
of fields specified by the metastore.
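To make the failing case concrete, here is a hypothetical pair of schemas of the kind described above (field names are made up for illustration, not taken from the patch):

```scala
import org.apache.spark.sql.types._

// What the metastore reports for the external table.
val metastoreSchema = StructType(Seq(
  StructField("id", LongType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("score", DoubleType, nullable = true)))

// What a Parquet file written from a batch that never saw "score" contains.
val parquetFileSchema = StructType(Seq(
  StructField("id", LongType, nullable = true),
  StructField("name", StringType, nullable = true)))

// The current ParquetRelation2.mergeMetastoreParquetSchema() rejects this
// pair because the field sets differ, even though "score" is nullable.
```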
This pull request alters the behavior of **mergeMetastoreParquetSchema()**
by having it first add any nullable fields from the metastore schema to the
Parquet file schema if they aren't already present there.
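The idea can be sketched roughly as follows (a simplification of the actual change, ignoring details such as case sensitivity of field names):

```scala
import org.apache.spark.sql.types.{StructField, StructType}

// Append to the Parquet schema any nullable metastore fields it is missing,
// so the subsequent merge no longer sees a mismatched field set.
def addMissingNullableFields(
    metastoreSchema: StructType,
    parquetSchema: StructType): StructType = {
  val existingNames = parquetSchema.fieldNames.toSet
  val missingNullable = metastoreSchema.filter { field =>
    !existingNames.contains(field.name) && field.nullable
  }
  StructType(parquetSchema ++ missingNullable)
}
```

Applied to the example schemas above, this would add the nullable "score" field to the Parquet file schema before the merge, so the missing field simply resolves to null at query time.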
Besides the usual code quality and correctness feedback, I'd appreciate any
comments specifically on:
* should this be the assumed behavior of the
**mergeMetastoreParquetSchema()** method, or should I refactor this pull
request to make this behavior dependent on a configuration option?
* am I correct in submitting a pull request to change newParquet.scala in
branch-1.3 or should I have submitted it to the master branch?
Thanks for taking a look!
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/budde/spark merge-nullable-fields
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/5188.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #5188
----
commit 13ae7bf5fa2fdde16b1c14713a16bdf2c59b28c0
Author: Adam Budde <[email protected]>
Date: 2015-03-25T19:59:34Z
Add missing nullable Metastore fields when merging a Parquet schema
----