Wei,

I think the best practice is to have an overall schema for the data that can be satisfied using all of the currently-written file schemas. For example, you'd read the column with a long schema, which can handle both ints and longs in the data. Ints just get promoted when reading.
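The resolution I mean can be sketched roughly like this (a hedged illustration only — the function names are made up and this is not Parquet's actual API, just the promotion rule in miniature):

```python
# Minimal sketch of read-side type resolution: one expected ("overall")
# schema, per-file written schemas, and int -> long promotion on read.
# All names here are illustrative, not Parquet APIs.

def promote(expected, actual):
    """Return the type to read values as, or fail if incompatible."""
    if expected == actual:
        return expected
    # an int column is safely widened when the expected schema says long
    if expected == "long" and actual == "int":
        return "long"
    raise TypeError(f"cannot read {actual} data as {expected}")

def read_column(expected, file_type, values):
    # every file's written type must resolve against the expected schema
    promote(expected, file_type)
    # int -> long widening is lossless; Python ints model both here
    return [int(v) for v in values]

# files written with different schemas, all read with a "long" expectation
print(read_column("long", "int", [1, 2, 3]))    # ints promoted on read
print(read_column("long", "long", [2 ** 40]))   # longs read as-is
```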

How would merging the schemas help? Hive should do the same resolution that I'm talking about here, but should use the current table definition to generate its expected schema. Spark SQL might be relying on this, which I'll follow up on with the Spark community.

rb

On 05/11/2015 10:52 AM, Wei Yan wrote:
Thanks for the update, Ryan.
Yes, I found this info in https://issues.apache.org/jira/browse/PARQUET-139,
which avoids merging the schemas on the client side.

And for schema merging, is there a plan to define some rules for merging
schemas, like merging an "int" and a "long" into a "long" field? I ask
because we have some Parquet files written with different schemas, due to
some **history** reasons. Allowing this type of merging would help a lot
when we process the data. Besides MapReduce applications, we also hit this
schema problem when using Hive and Spark SQL to load the data.

-Wei



--
Ryan Blue
Software Engineer
Cloudera, Inc.
