In my opinion, we should merge a PR with a caveated annotation to unblock development.
Rather than opine on how easy/hard it should be for a development workflow, given it seems like this is causing serious headaches / slowing down implementation my vote would be to do whatever is helpful to speed up the process Andrew On Tue, Mar 18, 2025 at 4:22 PM Ryan Blue <rdb...@gmail.com> wrote: > Hi everyone, > Last community sync, we had a few Variant topics that I want to also raise > here. > > First, I want to highlight that we talked about spec versioning and there > was agreement to avoid unnecessary complexity by maintaining one version > that covers both encoding and shredding. The two parts of Variant are in > separate markdown docs, but we don’t want to have fragmentation where an > implementation follows the encoding but not shredding. We want to keep the > two fully compatible and ensure that you can rely on shredding for data > skipping. Note that this doesn’t add many requirements for writers > (shredding is optional), but ensures that shredding can be used by > requiring support in readers. This wasn’t very controversial, but please > reply if you have concerns about it. > > Second, in the sync I also brought up the need to get the Variant type > annotation into a release. There was some pushback on this, I think, > because non-Java projects have a simpler build process. I don’t think many > people realized how the Java side works: > > - The thrift file is released in the parquet-format Jar > - The parquet-java maven build has a parquet-format-structures module > that downloads the parquet-format Jar and generates Java classes from > the > Thrift definition > - Then there is a translation step in ParquetMetadataConverter that > converts to Parquet (rather than Thrift) classes > - The Parquet API adds utilities to use the converted metadata classes, > like LogicalTypeAnnotationVisitor > > We discussed creating sample files to verify Variant compatibility, but in > order to produce those files from the Variant implementation I’m building, > I would need to produce a one-off version of the thrift definition, inject > it into the parquet-thrift-format-structures build, update the Parquet API > to add a VariantLogicalTypeAnnotation, and then find a way to get that > temporary build into a Maven repository so that PRs can depend on it. Those > PRs can’t be merged because they would rely on an unpublished/snapshot Jar. > > This is becoming a big headache and I don’t think that it helps us deliver > Variant reliably. We know that we are going to support Variant data and the > type annotation to label that data is going to be needed. There is little > risk in committing that annotation now. The argument against adding it now > was that it may create confusion, but I think it is unlikely that anyone > would see the annotation in parquet-java or parquet-format and assume > support guarantees. The spec is clearly marked experimental and the > annotation is referenced already by LogicalTypes.md > < > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#variant > >. > Adding the annotation to the API doesn’t significantly increase the number > of people seeing it, and the people that do are already working with > Parquet internals. > > The only other objection I know about is that we haven’t yet finalized what > goes in the annotation, which I agree is a prerequisite for moving forward. > I suggest that we add the encoding/shredding version (1), release > parquet-format, and start working on the API extensions in parquet-java. > Then we can actually work on a Java implementation that stores data in > files in the parquet-java project. > > Does that sound reasonable? > > Ryan >