Hi all,

I wanted to start a thread discussing Avro cross-version support in
parquet-java. The parquet-avro module has been on Avro 1.11 since the 1.13
release, but since then we've made fixes and added feature support for Avro
1.8 APIs (ex1 <https://github.com/apache/parquet-java/pull/2957>, ex2
<https://github.com/apache/parquet-java/pull/2993>).

Most of the Avro APIs referenced by parquet-avro are compatible across
versions, with a few exceptions (a compatibility sketch follows the list):

   - Evolution of Schema constructor APIs
   - New logical types (i.e., local timestamp and UUID)
   - Renamed logical type conversion helpers
   - Generated code for datetime types using Java Time vs Joda Time for
   setters/getters
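
To make the list concrete, below is a rough sketch of the kind of
reflection-based probing a single compiled artifact needs in order to
tolerate both old and new Avro runtimes. This is my own illustration, not
parquet-avro's actual code: the Avro names probed for are real as far as I
know (localTimestampMillis() and TimestampMillisConversion arrived after
1.8), but the AvroCompat helper itself is hypothetical.

    // Hypothetical helper; illustrates version probing, not parquet-avro's
    // real implementation.
    public final class AvroCompat {

      // LogicalTypes.localTimestampMillis() only exists on newer Avro
      // runtimes (it is absent in 1.8), so probe for it reflectively.
      public static boolean hasLocalTimestampMillis() {
        try {
          Class<?> logicalTypes = Class.forName("org.apache.avro.LogicalTypes");
          logicalTypes.getMethod("localTimestampMillis");
          return true;
        } catch (ClassNotFoundException | NoSuchMethodException e) {
          return false;
        }
      }

      // Avro 1.8's Joda-based TimeConversions.TimestampConversion was
      // renamed to a Java Time-based TimestampMillisConversion in later
      // releases; probing for the renamed inner class tells the two apart.
      public static boolean hasTimestampMillisConversion() {
        try {
          Class.forName(
              "org.apache.avro.data.TimeConversions$TimestampMillisConversion");
          return true;
        } catch (ClassNotFoundException e) {
          return false;
        }
      }

      private AvroCompat() {}
    }

Direct calls to the newer APIs compile cleanly against Avro 1.11 and only
blow up at runtime (NoSuchMethodError/NoClassDefFoundError) on an older
classpath, which is exactly why a build that only tests against 1.11 can
miss these breaks.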

Some of these breakages are hard to catch when Parquet is compiled and
tested against Avro 1.11 only. Additionally, as a user who currently
relies mostly on Avro 1.8, I'm not sure how much longer Parquet will
continue to support it.

I have two proposals to build confidence and clarity around parquet-avro's
cross-version support:

   - Codifying in the parquet-avro documentation
   <https://github.com/apache/parquet-java/blob/master/parquet-avro/README.md>
   which Avro versions are officially supported and which are
   deprecated/explicitly not supported
   - Adding some kind of automated testing against all supported Avro
   versions. This is tricky because, as noted above, the generated
   SpecificRecord classes use incompatible logical type APIs across Avro
   versions, so we'd need a way to invoke avro-compiler and load the Avro
   core library at each supported version; this would probably require a
   multi-module setup (see the build sketch after this list).
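
For the multi-version testing idea, one possible starting point (a rough
sketch under assumptions, not a worked-out design) is a Maven profile per
supported Avro line that overrides the Avro version the parquet-avro tests
resolve. The avro.version property and profile ids below are assumptions
about the build, not parquet-java's actual configuration:

    <!-- Hypothetical profiles for parquet-avro/pom.xml; the property name
         and profile ids are illustrative only. -->
    <profiles>
      <profile>
        <id>avro-1.8</id>
        <properties>
          <avro.version>1.8.2</avro.version>
        </properties>
      </profile>
      <profile>
        <id>avro-1.11</id>
        <properties>
          <avro.version>1.11.3</avro.version>
        </properties>
      </profile>
    </profiles>

CI could then run the suite once per profile (e.g., mvn test -Pavro-1.8),
though regenerating the SpecificRecord test classes per version would still
need the multi-module split mentioned above.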

I'd love to know what the Parquet community thinks about these ideas.
Additionally, I'm interested to learn which Avro versions other Parquet
users rely on. There seems to be a lot of variance across the data
ecosystem: Spark keeps up to date with the latest Avro release, Hadoop
pins Avro 1.9, and Apache Beam used to be tightly coupled to 1.8 but has
recently refactored to be version-agnostic.

Best,
Claire
