Re: Next steps for Variant

Andrew Lamb Tue, 18 Mar 2025 16:06:19 -0700

In my opinion, we should merge a  PR with a caveated annotation to unblock
development.


Rather than opine on how easy/hard it should be for a development workflow,
given it seems like this is causing serious headaches / slowing down
implementation my vote would be to do whatever is helpful to speed up the
process

Andrew

On Tue, Mar 18, 2025 at 4:22 PM Ryan Blue <rdb...@gmail.com> wrote:

> Hi everyone,
> Last community sync, we had a few Variant topics that I want to also raise
> here.
>
> First, I want to highlight that we talked about spec versioning and there
> was agreement to avoid unnecessary complexity by maintaining one version
> that covers both encoding and shredding. The two parts of Variant are in
> separate markdown docs, but we don’t want to have fragmentation where an
> implementation follows the encoding but not shredding. We want to keep the
> two fully compatible and ensure that you can rely on shredding for data
> skipping. Note that this doesn’t add many requirements for writers
> (shredding is optional), but ensures that shredding can be used by
> requiring support in readers. This wasn’t very controversial, but please
> reply if you have concerns about it.
>
> Second, in the sync I also brought up the need to get the Variant type
> annotation into a release. There was some pushback on this, I think,
> because non-Java projects have a simpler build process. I don’t think many
> people realized how the Java side works:
>
>    - The thrift file is released in the parquet-format Jar
>    - The parquet-java maven build has a parquet-format-structures module
>    that downloads the parquet-format Jar and generates Java classes from
> the
>    Thrift definition
>    - Then there is a translation step in ParquetMetadataConverter that
>    converts to Parquet (rather than Thrift) classes
>    - The Parquet API adds utilities to use the converted metadata classes,
>    like LogicalTypeAnnotationVisitor
>
> We discussed creating sample files to verify Variant compatibility, but in
> order to produce those files from the Variant implementation I’m building,
> I would need to produce a one-off version of the thrift definition, inject
> it into the parquet-thrift-format-structures build, update the Parquet API
> to add a VariantLogicalTypeAnnotation, and then find a way to get that
> temporary build into a Maven repository so that PRs can depend on it. Those
> PRs can’t be merged because they would rely on an unpublished/snapshot Jar.
>
> This is becoming a big headache and I don’t think that it helps us deliver
> Variant reliably. We know that we are going to support Variant data and the
> type annotation to label that data is going to be needed. There is little
> risk in committing that annotation now. The argument against adding it now
> was that it may create confusion, but I think it is unlikely that anyone
> would see the annotation in parquet-java or parquet-format and assume
> support guarantees. The spec is clearly marked experimental and the
> annotation is referenced already by LogicalTypes.md
> <
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#variant
> >.
> Adding the annotation to the API doesn’t significantly increase the number
> of people seeing it, and the people that do are already working with
> Parquet internals.
>
> The only other objection I know about is that we haven’t yet finalized what
> goes in the annotation, which I agree is a prerequisite for moving forward.
> I suggest that we add the encoding/shredding version (1), release
> parquet-format, and start working on the API extensions in parquet-java.
> Then we can actually work on a Java implementation that stores data in
> files in the parquet-java project.
>
> Does that sound reasonable?
>
> Ryan
>

Re: Next steps for Variant

Reply via email to