Wei,

I think the best practice is to have an overall schema for the data that can be satisfied using all of the currently-written file schemas. For example, you'd read the column with a long schema, which can handle both ints and longs in the data. Ints just get promoted when reading.
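The resolution I mean can be sketched roughly like this (a hedged illustration only — the function names are made up and this is not Parquet's actual API, just the promotion rule in miniature):

```python
# Minimal sketch of read-side type resolution: one expected ("overall")
# schema, per-file written schemas, and int -> long promotion on read.
# All names here are illustrative, not Parquet APIs.

def promote(expected, actual):
    """Return the type to read values as, or fail if incompatible."""
    if expected == actual:
        return expected
    # an int column is safely widened when the expected schema says long
    if expected == "long" and actual == "int":
        return "long"
    raise TypeError(f"cannot read {actual} data as {expected}")

def read_column(expected, file_type, values):
    # every file's written type must resolve against the expected schema
    promote(expected, file_type)
    # int -> long widening is lossless; Python ints model both here
    return [int(v) for v in values]

# files written with different schemas, all read with a "long" expectation
print(read_column("long", "int", [1, 2, 3]))    # ints promoted on read
print(read_column("long", "long", [2 ** 40]))   # longs read as-is
```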

How would merging the schemas help? Hive should do the same resolution that I'm talking about here, but should use the current table definition to generate its expected schema. Spark SQL might be relying on this, which I'll follow up on with the Spark community.

rb

On 05/11/2015 10:52 AM, Wei Yan wrote:
Thanks for the update, Ryan.
Yes, I found this info in https://issues.apache.org/jira/browse/PARQUET-139,
which avoids merging the schemas on the client side.

And for schema merging, is there a plan to define some rules for merging
schemas, like merging an "int" and a "long" into a "long" field? I ask
because we have some Parquet files written with different schemas, due to
some **history** reasons. Allowing this type of merging would help a lot
when we process the data. Besides MapReduce applications, we also hit this
schema problem when using Hive and Spark SQL to load the data.

-Wei



--
Ryan Blue
Software Engineer
Cloudera, Inc.
