Hi,

I've recently hit an issue with the parquet schema produced by
parquet-protobuf. Because it is not consistent with what the Avro and
Thrift writers produce, a downstream app, Spark SQL, couldn't read files
containing repeated types. We've since had this fixed here:
https://issues.apache.org/jira/browse/SPARK-9340

While it is possible for downstream users of parquet to handle the
incompatibilities between the various input formats, it seems a shame
that parquet doesn't produce a consistent schema across all of them.
A single representation would make working with parquet much simpler,
since the rules for converting to/from a list etc. would always be the
same, and it would make for simpler, less error-prone code. Besides,
I thought this was one of the reasons for using parquet...
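
To illustrate the mismatch, here is roughly the shape of the schema each
writer produces for the same repeated string field. This is a sketch from
memory (field names like "tag" are made up, and the exact output may
differ): parquet-protobuf has written the field as a bare repeated
primitive with no LIST annotation, while parquet-avro wraps it in a
LIST-annotated group, so a reader expecting one shape fails on the other.

```
// parquet-protobuf: flat repeated field, no LIST annotation
message ProtoRecord {
  repeated binary tag (UTF8);
}

// parquet-avro: the same data as a LIST-annotated group
message AvroRecord {
  required group tag (LIST) {
    repeated binary array (UTF8);
  }
}
```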

I submitted a pull request that addresses the issue we have been facing:
https://github.com/apache/parquet-mr/pull/253

Is there any reason why you wouldn't want to have a consistent parquet
representation?

Thanks,
Damian
