Adam Budde created SPARK-6538:
---------------------------------
Summary: Add missing nullable Metastore fields when merging a Parquet schema
Key: SPARK-6538
URL: https://issues.apache.org/jira/browse/SPARK-6538
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.3.0
Reporter: Adam Budde
Fix For: 1.3.1
When Spark SQL infers a schema for a DataFrame, it will take the union of all
field types present in the structured source data (e.g. an RDD of JSON data).
When the source data for a row doesn't define a particular field on the
DataFrame's schema, a null value will simply be assumed for this field. This
workflow makes it very easy to construct tables and query over a set of
structured data with a nonuniform schema. However, this behavior is not
consistent in some cases when dealing with Parquet files and an external table
managed by an external Hive metastore.
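The inference behavior described above can be modeled in a few lines. This is a minimal sketch in plain Python, not Spark's implementation: the helper names `infer_schema` and `to_rows` and the sample records are hypothetical, and only field names are tracked (types are ignored for brevity).

```python
from typing import Any, Dict, List, Tuple


def infer_schema(records: List[Dict[str, Any]]) -> List[str]:
    """Return the union of field names across all records, in first-seen order."""
    fields: List[str] = []
    for record in records:
        for name in record:
            if name not in fields:
                fields.append(name)
    return fields


def to_rows(records: List[Dict[str, Any]], fields: List[str]) -> List[Tuple[Any, ...]]:
    """Project each record onto the full schema; absent fields become None."""
    return [tuple(record.get(name) for name in fields) for record in records]


# Hypothetical nonuniform input: neither record defines every field.
records = [
    {"id": 1, "user": "a"},
    {"id": 2, "country": "US"},
]
schema = infer_schema(records)
rows = to_rows(records, schema)
```

Here every row ends up with a value for every field in the unioned schema, with `None` standing in for the null that Spark SQL would assume.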
In our particular use case, we use Spark Streaming to parse and transform our
input data and then apply a window function to save an arbitrary-sized batch of
data as a Parquet file, which itself will be added as a partition to an
external Hive table via an "ALTER TABLE... ADD PARTITION..." statement. Since
our input data is nonuniform, it is expected that not every partition batch
will contain every field present in the table's schema obtained from the Hive
metastore. As such, we expect that the schema of some of our Parquet files may
not contain the full set of fields present in the metastore schema.
In such cases, it seems natural that Spark SQL would simply assume null values
for any missing fields in the partition's Parquet file, assuming these fields
are specified as nullable by the metastore schema. This is not the case in the
current implementation of ParquetRelation2. The mergeMetastoreParquetSchema()
method used to reconcile differences between a Parquet file's schema and a
schema retrieved from the Hive metastore raises an exception if the Parquet
file's fields don't match the set of fields specified by the metastore.
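To make the failure mode concrete, the strict check can be modeled as follows. This is a simplified sketch in plain Python, not the ParquetRelation2 code: the `Field` type, the `strict_merge` function, and the sample schemas are all hypothetical, and the real method does more than compare field names.

```python
from typing import List, NamedTuple


class Field(NamedTuple):
    name: str
    dtype: str
    nullable: bool


def strict_merge(metastore: List[Field], parquet: List[Field]) -> List[Field]:
    """Fail unless both schemas define exactly the same set of field names."""
    if {f.name for f in metastore} != {f.name for f in parquet}:
        raise ValueError("Parquet schema does not match metastore schema")
    # Assumption for this sketch: the metastore's field definitions win.
    return list(metastore)


metastore_schema = [
    Field("id", "bigint", False),
    Field("user", "string", True),
    Field("country", "string", True),
]
# A partition whose batch never saw the nullable "country" field.
partition_schema = [
    Field("id", "bigint", False),
    Field("user", "string", True),
]

try:
    strict_merge(metastore_schema, partition_schema)
    failed = False
except ValueError:
    failed = True
```

In this model the partition is rejected even though the only missing field is nullable.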
I propose altering this implementation so that any missing metastore fields
marked as nullable are merged into the Parquet file's
schema before continuing with the checks present in
mergeMetastoreParquetSchema().
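A rough sketch of the proposed change, in plain Python rather than Spark's Scala: the `Field` type, the function names, and the sample schemas are hypothetical and only illustrate the idea of filling in nullable fields before the existing checks run.

```python
from typing import List, NamedTuple


class Field(NamedTuple):
    name: str
    dtype: str
    nullable: bool


def add_missing_nullable_fields(metastore: List[Field], parquet: List[Field]) -> List[Field]:
    """Append metastore fields that are nullable and absent from the Parquet schema."""
    present = {f.name for f in parquet}
    missing = [f for f in metastore if f.nullable and f.name not in present]
    return list(parquet) + missing


def check_same_fields(metastore: List[Field], parquet: List[Field]) -> None:
    """Stand-in for the existing checks: both schemas must name the same fields."""
    if {f.name for f in metastore} != {f.name for f in parquet}:
        raise ValueError("Parquet schema does not match metastore schema")


metastore_schema = [
    Field("id", "bigint", False),
    Field("user", "string", True),
    Field("country", "string", True),
]

# Missing only the nullable "country" field: filled in, then the check passes.
merged = add_missing_nullable_fields(
    metastore_schema,
    [Field("id", "bigint", False), Field("user", "string", True)],
)
check_same_fields(metastore_schema, merged)

# Missing the non-nullable "id" field: nothing fills it in, so the check still fails.
merged_bad = add_missing_nullable_fields(
    metastore_schema,
    [Field("user", "string", True)],
)
```

Only nullable fields are added, so a Parquet file that lacks a required field is still rejected by the later checks.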
Classifying this as a bug as it exposes inconsistent behavior, IMHO. If you
feel this should be an improvement or new feature instead, please feel free to
reclassify this issue.