Hi everyone,
Last community sync, we had a few Variant topics that I want to also raise
here.

First, I want to highlight that we talked about spec versioning and there
was agreement to avoid unnecessary complexity by maintaining one version
that covers both encoding and shredding. The two parts of Variant are in
separate markdown docs, but we don’t want to have fragmentation where an
implementation follows the encoding but not shredding. We want to keep the
two fully compatible and ensure that you can rely on shredding for data
skipping. Note that this doesn’t add many requirements for writers
(shredding is optional), but ensures that shredding can be used by
requiring support in readers. This wasn’t very controversial, but please
reply if you have concerns about it.

Second, in the sync I also brought up the need to get the Variant type
annotation into a release. There was some pushback on this, I think,
because non-Java projects have a simpler build process. I don’t think many
people realized how the Java side works:

   - The thrift file is released in the parquet-format Jar
   - The parquet-java maven build has a parquet-format-structures module
   that downloads the parquet-format Jar and generates Java classes from the
   Thrift definition
   - Then there is a translation step in ParquetMetadataConverter that
   converts to Parquet (rather than Thrift) classes
   - The Parquet API adds utilities to use the converted metadata classes,
   like LogicalTypeAnnotationVisitor

We discussed creating sample files to verify Variant compatibility, but in
order to produce those files from the Variant implementation I’m building,
I would need to produce a one-off version of the thrift definition, inject
it into the parquet-thrift-format-structures build, update the Parquet API
to add a VariantLogicalTypeAnnotation, and then find a way to get that
temporary build into a Maven repository so that PRs can depend on it. Those
PRs can’t be merged because they would rely on an unpublished/snapshot Jar.

This is becoming a big headache and I don’t think that it helps us deliver
Variant reliably. We know that we are going to support Variant data and the
type annotation to label that data is going to be needed. There is little
risk in committing that annotation now. The argument against adding it now
was that it may create confusion, but I think it is unlikely that anyone
would see the annotation in parquet-java or parquet-format and assume
support guarantees. The spec is clearly marked experimental and the
annotation is referenced already by LogicalTypes.md
<https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#variant>.
Adding the annotation to the API doesn’t significantly increase the number
of people seeing it, and the people that do are already working with
Parquet internals.

The only other objection I know about is that we haven’t yet finalized what
goes in the annotation, which I agree is a prerequisite for moving forward.
I suggest that we add the encoding/shredding version (1), release
parquet-format, and start working on the API extensions in parquet-java.
Then we can actually work on a Java implementation that stores data in
files in the parquet-java project.

Does that sound reasonable?

Ryan

Reply via email to