Thanks to Micah, I think we have the first PR with example Variant values[1] almost ready to merge.
Next up will be figuring out how to create Parquet files with the proper
logical annotations.

Andrew

p.s. In case it isn't obvious, I would like introducing Variant into
Parquet to be a model of how to extend the spec and get wide adoption
across the ecosystem quickly, for two reasons:
1. The actual Variant functionality
2. To counteract the narrative that Parquet is ossified and not possible
to change

I personally think adding the binary examples is critical to helping
other language implementations.

[1]: https://github.com/apache/parquet-testing/pull/76

On Wed, Apr 16, 2025 at 10:59 AM Andrew Lamb <andrewlam...@gmail.com> wrote:

> Update here is I have created a PR with example variant values (not yet
> parquet files, just the variant values)[1].
>
> Since Spark seems to be the only open source software capable of
> creating variants at this time, I generated the examples using Spark.
>
> Please check it out and let me know what you think. If it is acceptable
> I can work on PRs (based on Aihua's example) for actual parquet files
> with encoded values.
>
> Andrew
>
> [1]: https://github.com/apache/parquet-testing/pull/76
>
> On Tue, Apr 8, 2025 at 5:46 AM Andrew Lamb <andrewlam...@gmail.com> wrote:
>
>> Thank you very much.
>>
>> On Mon, Apr 7, 2025 at 5:10 PM Aihua Xu <aihu...@gmail.com> wrote:
>>
>>> I attached them in https://github.com/apache/parquet-testing/issues/75.
>>> Please rename them back to *.parquet so you can use parquet tools to
>>> view them.
>>>
>>> I captured them when working on the Iceberg tests in
>>> https://github.com/apache/iceberg/blob/main/parquet/src/test/java/org/apache/iceberg/parquet/TestVariantWriters.java#L172.
>>>
>>> You can change to OutputFile outputFile =
>>> Files.localOutput("primitive.parquet"); to capture them, but you
>>> probably can follow what David mentioned.
>>>
>>> On Mon, Apr 7, 2025 at 12:26 PM Andrew Lamb <andrewlam...@gmail.com>
>>> wrote:
>>>
>>> > Attaching them on the ticket[1] would also be a way to share them.
>>> >
>>> > It would also be super helpful to share the commands you ran.
>>> >
>>> > Andrew
>>> >
>>> > [1]: https://github.com/apache/parquet-testing/issues/75
>>> >
>>> > On Mon, Apr 7, 2025 at 2:46 PM Adrian Garcia Badaracco
>>> > <adr...@pydantic.dev.invalid> wrote:
>>> >
>>> > > Yes, I am not able to see them. Could you make a PR to the repo, or
>>> > > upload them somewhere so we can make a PR? Even if it doesn't get
>>> > > merged immediately we can pull them from the PR. Thanks!
>>> > >
>>> > > On Mon, Apr 7, 2025 at 1:44 PM Aihua Xu <aihu...@gmail.com> wrote:
>>> > >
>>> > > > Hi Adrian,
>>> > > >
>>> > > > I attached them to my reply and I'm not sure if the files get
>>> > > > filtered. Let me know if you still can't see them. Maybe I should
>>> > > > push to the repo instead.
>>> > > >
>>> > > > On Mon, Apr 7, 2025 at 10:58 AM Adrian Garcia Badaracco
>>> > > > <adr...@pydantic.dev.invalid> wrote:
>>> > > >
>>> > > > > Amazing Aihua, thanks so much!
>>> > > > >
>>> > > > > Sorry if I just missed it but... where are the files you
>>> > > > > created? I don't see them in the repo / issue / this thread.
>>> > > > >
>>> > > > > On Mon, Apr 7, 2025 at 12:54 PM Aihua Xu <aihu...@gmail.com>
>>> > > > > wrote:
>>> > > > >
>>> > > > > > I have created some test files attached during the
>>> > > > > > development; one is without shredding and one is with
>>> > > > > > shredding.
>>> > > > > >
>>> > > > > > As David pointed out, it's missing the Variant logical type,
>>> > > > > > but you can use that as a reference and as a start.
>>> > > > > >
>>> > > > > > On Mon, Apr 7, 2025 at 3:50 AM Andrew Lamb
>>> > > > > > <andrewlam...@gmail.com> wrote:
>>> > > > > >
>>> > > > > >> I filed a ticket to track this work[1] and also perhaps to
>>> > > > > >> gather some additional help / collaboration.
>>> > > > > >>
>>> > > > > >> [1]: https://github.com/apache/parquet-testing/issues/75
>>> > > > > >>
>>> > > > > >> On Mon, Apr 7, 2025 at 6:02 AM Andrew Lamb
>>> > > > > >> <andrewlam...@gmail.com> wrote:
>>> > > > > >>
>>> > > > > >> > Thank you very much, David.
>>> > > > > >> >
>>> > > > > >> > I will try to create some examples this week and report
>>> > > > > >> > back.
>>> > > > > >> >
>>> > > > > >> > Andrew
>>> > > > > >> >
>>> > > > > >> > On Sun, Apr 6, 2025 at 4:48 PM David Cashman
>>> > > > > >> > <david.cash...@databricks.com.invalid> wrote:
>>> > > > > >> >
>>> > > > > >> >> Hi Andrew, you should be able to create shredded files
>>> > > > > >> >> using OSS Spark 4.0. I think the only issue is that it
>>> > > > > >> >> doesn't have the logical type annotation yet, so readers
>>> > > > > >> >> wouldn't be able to distinguish it from a non-variant
>>> > > > > >> >> struct that happens to have the same schema. (Spark is
>>> > > > > >> >> able to infer that it is a Variant from the
>>> > > > > >> >> `org.apache.spark.sql.parquet.row.metadata` metadata.)
>>> > > > > >> >>
>>> > > > > >> >> The ParquetVariantShreddingSuite in Spark has some tests
>>> > > > > >> >> that write and read shredded parquet files. Below is an
>>> > > > > >> >> example that translates the first test into code that
>>> > > > > >> >> runs in spark-shell and writes a Parquet file. The
>>> > > > > >> >> shredding schema is set via conf. If you want to test
>>> > > > > >> >> types that Spark doesn't infer in parse_json (e.g.
>>> > > > > >> >> timestamp, binary), you can use `to_variant_object` to
>>> > > > > >> >> cast structured values to Variant.
>>> > > > > >> >>
>>> > > > > >> >> I won't have time to work on this in the next couple of
>>> > > > > >> >> weeks, but am happy to answer any questions.
>>> > > > > >> >>
>>> > > > > >> >> Thanks,
>>> > > > > >> >>
>>> > > > > >> >> David
>>> > > > > >> >>
>>> > > > > >> >> scala> import org.apache.spark.sql.internal.SQLConf
>>> > > > > >> >> scala> spark.conf.set(SQLConf.VARIANT_WRITE_SHREDDING_ENABLED.key, true)
>>> > > > > >> >> scala> spark.conf.set(SQLConf.VARIANT_ALLOW_READING_SHREDDED.key, true)
>>> > > > > >> >> scala> spark.conf.set(SQLConf.VARIANT_FORCE_SHREDDING_SCHEMA_FOR_TEST.key,
>>> > > > > >> >>          "a int, b string, c decimal(15, 1)")
>>> > > > > >> >> scala> val df = spark.sql(
>>> > > > > >> >>      | """
>>> > > > > >> >>      | | select case
>>> > > > > >> >>      | |   when id = 0 then parse_json('{"a": 1, "b": "2", "c": 3.3, "d": 4.4}')
>>> > > > > >> >>      | |   when id = 1 then parse_json('{"a": [1,2,3], "b": "hello", "c": {"x": 0}}')
>>> > > > > >> >>      | |   when id = 2 then parse_json('{"A": 1, "c": 1.23}')
>>> > > > > >> >>      | | end v from range(3)
>>> > > > > >> >>      | |""".stripMargin)
>>> > > > > >> >> scala> df.write.mode("overwrite").parquet("/tmp/shredded_test")
>>> > > > > >> >> scala> spark.read.parquet("/tmp/shredded_test").show
>>> > > > > >> >> +--------------------+
>>> > > > > >> >> |                   v|
>>> > > > > >> >> +--------------------+
>>> > > > > >> >> |{"a":1,"b":"2","c...|
>>> > > > > >> >> |{"a":[1,2,3],"b":...|
>>> > > > > >> >> |    {"A":1,"c":1.23}|
>>> > > > > >> >> +--------------------+
>>> > > > > >> >>
>>> > > > > >> >>
>>> > > > > >> >> On Sun, Apr 6, 2025 at 2:54 PM Andrew Lamb
>>> > > > > >> >> <andrewlam...@gmail.com> wrote:
>>> > > > > >> >> >
>>> > > > > >> >> > Can someone (pretty pretty) please give us some binary
>>> > > > > >> >> > examples so we can make faster progress on the Rust
>>> > > > > >> >> > implementation?
>>> > > > > >> >> >
>>> > > > > >> >> > We recently got exciting news[1] that folks from the
>>> > > > > >> >> > CMU database group have started working on the Rust
>>> > > > > >> >> > implementation of variant, and I would very much like
>>> > > > > >> >> > to encourage and support their work.
>>> > > > > >> >> >
>>> > > > > >> >> > I am willing to do some legwork (make a PR to
>>> > > > > >> >> > parquet-testing, for example) if someone can point me
>>> > > > > >> >> > to the files (or instructions on how to use some
>>> > > > > >> >> > system to create variants).
>>> > > > > >> >> >
>>> > > > > >> >> > I was hoping that since the VARIANT format[2] and
>>> > > > > >> >> > draft shredding spec[3] have been in the repo for 6
>>> > > > > >> >> > months (since October 2024), it would be
>>> > > > > >> >> > straightforward to provide some examples. Do we know
>>> > > > > >> >> > anything that is blocking the creation of examples?
>>> > > > > >> >> >
>>> > > > > >> >> > Andrew
>>> > > > > >> >> >
>>> > > > > >> >> > [1]: https://github.com/apache/arrow-rs/issues/6736#issuecomment-2781556103
>>> > > > > >> >> > [2]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
>>> > > > > >> >> > [3]: https://github.com/apache/parquet-format/blob/master/VariantShredding.md
>>> > > > > >> >> >
>>> > > > > >> >> > On Wed, Mar 5, 2025 at 3:58 PM Julien Le Dem
>>> > > > > >> >> > <jul...@apache.org> wrote:
>>> > > > > >> >> >
>>> > > > > >> >> > > That sounds like a great suggestion to me.
>>> > > > > >> >> > >
>>> > > > > >> >> > > On Wed, Mar 5, 2025 at 12:41 PM Andrew Lamb
>>> > > > > >> >> > > <andrewlam...@gmail.com> wrote:
>>> > > > > >> >> > >
>>> > > > > >> >> > > > I would like to request, before the VARIANT spec
>>> > > > > >> >> > > > changes are finalized, that we have example data
>>> > > > > >> >> > > > in parquet-testing.
>>> > > > > >> >> > > >
>>> > > > > >> >> > > > This topic came up (well, I brought it up) on the
>>> > > > > >> >> > > > sync call today.
>>> > > > > >> >> > > >
>>> > > > > >> >> > > > In my opinion, having example files would reduce
>>> > > > > >> >> > > > the overhead of new implementations dramatically.
>>> > > > > >> >> > > > At a minimum there should be examples of:
>>> > > > > >> >> > > > * variant columns (no shredding)
>>> > > > > >> >> > > > * variant columns with shredding
>>> > > > > >> >> > > >
>>> > > > > >> >> > > > And some description of what those files contain
>>> > > > > >> >> > > > ("expected contents"). For prior art, here is what
>>> > > > > >> >> > > > Dewey did for the geometry type[1][2].
>>> > > > > >> >> > > >
>>> > > > > >> >> > > > When looking for prior discussions, I found a
>>> > > > > >> >> > > > great quote from Gang Wu[3] on this topic:
>>> > > > > >> >> > > >
>>> > > > > >> >> > > > > I'd say that a lesson learned is that we should
>>> > > > > >> >> > > > > publish example files for any new feature to the
>>> > > > > >> >> > > > > parquet-testing [1] repo for interoperability
>>> > > > > >> >> > > > > tests.
>>> > > > > >> >> > > >
>>> > > > > >> >> > > > Thank you for your consideration,
>>> > > > > >> >> > > > Andrew
>>> > > > > >> >> > > >
>>> > > > > >> >> > > > [1]: https://github.com/apache/parquet-testing/pull/70
>>> > > > > >> >> > > > [2]: https://github.com/geoarrow/geoarrow-data
>>> > > > > >> >> > > > [3]: https://lists.apache.org/thread/71d7p9lprhf514jnt5dgnw4wfmn8ykzt
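[Editor's note] For readers who want to sanity-check the example variant values by hand, here is a minimal Rust sketch of two unshredded Variant byte encodings. It is based on one reading of VariantEncoding.md, not on any shipped library; the header-bit layout and primitive type IDs below should be verified against the spec before relying on them.

```rust
// Hedged sketch of the unshredded Variant binary layout, per one reading
// of VariantEncoding.md -- verify the bit layout against the spec.

/// Metadata with an empty dictionary: header byte 0x01 (version = 1,
/// one-byte offsets), dictionary_size = 0, and a single offset of 0.
fn empty_metadata() -> Vec<u8> {
    vec![0x01, 0x00, 0x00]
}

/// Primitive int8: basic_type = 0 (primitive) in the low two bits,
/// primitive type id 3 (int8) in the upper six bits, then the value byte.
fn int8_value(v: i8) -> Vec<u8> {
    vec![3 << 2, v as u8]
}

/// Short string (fewer than 64 UTF-8 bytes): basic_type = 1, with the
/// byte length stored in the upper six bits of the header byte.
fn short_string_value(s: &str) -> Vec<u8> {
    let utf8 = s.as_bytes();
    assert!(utf8.len() < 64, "short strings must be < 64 bytes");
    let mut out = vec![((utf8.len() as u8) << 2) | 1];
    out.extend_from_slice(utf8);
    out
}

fn main() {
    // A Variant column value is a (metadata, value) pair; with no object
    // field names involved, the metadata dictionary is empty.
    println!("{:02x?}", empty_metadata());
    println!("{:02x?}", int8_value(34));
    println!("{:02x?}", short_string_value("hi"));
}
```

Under these assumptions, `int8_value(34)` is the two bytes `0x0c 0x22` and the metadata for any value without object field names is the three bytes `0x01 0x00 0x00`, which is a quick cross-check to run against whatever the example files in parquet-testing end up containing.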