[
https://issues.apache.org/jira/browse/SPARK-6538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380663#comment-14380663
]
Apache Spark commented on SPARK-6538:
-------------------------------------
User 'budde' has created a pull request for this issue:
https://github.com/apache/spark/pull/5188
> Add missing nullable Metastore fields when merging a Parquet schema
> -------------------------------------------------------------------
>
> Key: SPARK-6538
> URL: https://issues.apache.org/jira/browse/SPARK-6538
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.3.0
> Reporter: Adam Budde
> Fix For: 1.3.1
>
>
> When Spark SQL infers a schema for a DataFrame, it will take the union of all
> field types present in the structured source data (e.g. an RDD of JSON data).
> When the source data for a row doesn't define a particular field on the
> DataFrame's schema, a null value will simply be assumed for this field. This
> workflow makes it very easy to construct tables and query over a set of
> structured data with a nonuniform schema. However, this behavior is not
> consistent in some cases when dealing with Parquet files that back an
> external table managed by a Hive metastore.
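>
> To illustrate, here is a minimal sketch of that inference behavior using the
> Spark 1.3 API (the field names and records are made up):
>
> {code:scala}
> import org.apache.spark.SparkContext
> import org.apache.spark.sql.SQLContext
>
> // Two JSON records with different fields; Spark SQL infers the union
> // schema {a, b} and fills the missing field with null for each row.
> val sc = new SparkContext("local[*]", "schema-inference-sketch")
> val sqlContext = new SQLContext(sc)
> val rdd = sc.parallelize(Seq("""{"a": 1}""", """{"a": 2, "b": "x"}"""))
> val df = sqlContext.jsonRDD(rdd)
> df.printSchema()  // a: long (nullable), b: string (nullable)
> df.show()         // the first row gets b = null
> {code}
>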
> In our particular use case, we use Spark Streaming to parse and transform our
> input data and then apply a window function to save an arbitrarily sized
> batch of data as a Parquet file, which is then added as a partition to an
> external Hive table via an "ALTER TABLE... ADD PARTITION..." statement. Since
> our input data is nonuniform, we expect that not every partition batch will
> contain every field present in the table's schema obtained from the Hive
> metastore. As such, the schema of some of our Parquet files may not contain
> the same set of fields present in the full metastore schema.
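>
> For context, the per-batch save step looks roughly like the following sketch
> (the table name, path, and partition column are hypothetical, not our actual
> pipeline):
>
> {code:scala}
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.hive.HiveContext
>
> // Write one windowed batch as a Parquet file and register it as a new
> // partition of an external Hive table.
> def saveBatch(hiveContext: HiveContext, batch: DataFrame, dt: String): Unit = {
>   val path = s"hdfs:///warehouse/events/dt=$dt"
>   batch.saveAsParquetFile(path)
>   hiveContext.sql(
>     s"ALTER TABLE events ADD IF NOT EXISTS PARTITION (dt='$dt') LOCATION '$path'")
> }
> {code}
>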
> In such cases, it seems natural that Spark SQL would simply assume null
> values for any fields missing from a partition's Parquet file, provided those
> fields are marked as nullable in the metastore schema. This is not the case
> in the current implementation of ParquetRelation2: the
> mergeMetastoreParquetSchema() method used to reconcile differences between a
> Parquet file's schema and the schema retrieved from the Hive metastore will
> raise an exception if the Parquet file doesn't contain the same set of fields
> specified by the metastore.
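>
> As a concrete (made-up) example of the schemas involved:
>
> {code:scala}
> import org.apache.spark.sql.types._
>
> // The metastore declares a nullable field "b" that this particular
> // Parquet file never wrote.
> val metastoreSchema = StructType(Seq(
>   StructField("a", IntegerType, nullable = true),
>   StructField("b", StringType, nullable = true)))
> val parquetSchema = StructType(Seq(
>   StructField("a", IntegerType, nullable = true)))
> // Reconciling these currently fails, even though "b" could safely be
> // treated as null for every row in the file.
> {code}
>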
> I propose altering this implementation so that any metastore fields that are
> marked as nullable and missing from the Parquet file's schema are merged into
> that schema before the remaining checks in mergeMetastoreParquetSchema() are
> applied.
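>
> A simplified sketch of the proposed merge step (not the actual patch; the
> helper name is made up) would be something like:
>
> {code:scala}
> import org.apache.spark.sql.types.{StructField, StructType}
>
> // Append any metastore field that is nullable and absent from the
> // Parquet schema, so the existing compatibility checks can succeed.
> def mergeMissingNullableFields(
>     metastoreSchema: StructType,
>     parquetSchema: StructType): StructType = {
>   val parquetFieldNames = parquetSchema.fieldNames.toSet
>   val missingNullable = metastoreSchema.filter { f =>
>     f.nullable && !parquetFieldNames.contains(f.name)
>   }
>   StructType(parquetSchema.fields ++ missingNullable)
> }
> {code}
>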
> I'm classifying this as a bug since it exposes inconsistent behavior, IMHO.
> If you feel it should be an improvement or new feature instead, please feel
> free to reclassify this issue.