Hi,

I've recently hit an issue with the parquet schema produced by
parquet-protobuf. Because it is not consistent with what the Avro and
Thrift writers produce, a downstream app, Spark SQL, couldn't read files
containing repeated types. We've since had this fixed here:
https://issues.apache.org/jira/browse/SPARK-9340

While it is possible for downstream users of parquet to handle the
incompatibilities between the various input formats, it seems a shame
that parquet doesn't produce a consistent schema across all of them.
A single representation would make working with parquet much simpler,
since the rules for converting to/from a list etc. would always be the
same, and it would make for simpler, less error-prone code. Besides,
I thought this was one of the reasons for using parquet...
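
To illustrate the mismatch, here is roughly the shape of the schema each
writer produces for the same repeated string field. This is a sketch from
memory (field names like "tag" are made up, and the exact output may
differ): parquet-protobuf has written the field as a bare repeated
primitive with no LIST annotation, while parquet-avro wraps it in a
LIST-annotated group, so a reader expecting one shape fails on the other.

```
// parquet-protobuf: flat repeated field, no LIST annotation
message ProtoRecord {
  repeated binary tag (UTF8);
}

// parquet-avro: the same data as a LIST-annotated group
message AvroRecord {
  required group tag (LIST) {
    repeated binary array (UTF8);
  }
}
```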

I submitted a pull request that addresses the issue we have been facing:
https://github.com/apache/parquet-mr/pull/253

Is there any reason why you wouldn't want to have a consistent parquet
representation?

Thanks,
Damian
