Since we haven't released any Parquet version with the Variant logical type, I don't think any engine would write it yet.
Are you getting blocked by this? Can you work around the issue for now?

Thanks,
Aihua

On May 8, 2025, at 12:57 PM, Andrew Lamb <andrewlam...@gmail.com> wrote:

Update here: the initial example files[1] were merged (thanks Micah!)

However, Spark 4.0 (and what is on the main branch) does not appear to write Parquet files with the logical type annotations yet (it seems to use its own metadata) -- you can see what I tried in [2].

**Does anyone know of a system that can write Variant values with the proper Parquet logical type?**

Thanks,
Andrew

[1]: https://github.com/apache/parquet-testing/pull/76
[2]: https://github.com/apache/parquet-testing/issues/75#issuecomment-2847815424

On Fri, May 2, 2025 at 2:22 PM Andrew Lamb <andrewlam...@gmail.com> wrote:

Thanks to Micah, I think we have the first PR with example Variant values[1] almost ready to merge.

Next up will be figuring out how to create Parquet files with the proper logical annotations.

Andrew

p.s. In case it isn't obvious, I would like introducing Variant into Parquet to be a model of how to extend the spec and get wide adoption across the ecosystem quickly, for two reasons:
1. The actual Variant functionality.
2. To counteract the narrative that Parquet is ossified and not possible to change.

I personally think adding the binary examples is critical to helping other language implementations.

[1]: https://github.com/apache/parquet-testing/pull/76

On Wed, Apr 16, 2025 at 10:59 AM Andrew Lamb <andrewlam...@gmail.com> wrote:

The update here is that I have created a PR with example Variant values (not yet Parquet files, just the Variant values)[1].

Since Spark seems to be the only open source software capable of creating Variants at this time, I generated the examples using Spark.

Please check it out and let me know what you think.
If it is acceptable, I can work on PRs (based on Aihua's example) for actual Parquet files with encoded values.

Andrew

[1]: https://github.com/apache/parquet-testing/pull/76

On Tue, Apr 8, 2025 at 5:46 AM Andrew Lamb <andrewlam...@gmail.com> wrote:

Thank you very much.

On Mon, Apr 7, 2025 at 5:10 PM Aihua Xu <aihu...@gmail.com> wrote:

I attached them in https://github.com/apache/parquet-testing/issues/75. Please rename them back to *.parquet so you can use Parquet tools to view them.

I captured them when working on the Iceberg tests in https://github.com/apache/iceberg/blob/main/parquet/src/test/java/org/apache/iceberg/parquet/TestVariantWriters.java#L172.

You can change to `OutputFile outputFile = Files.localOutput("primitive.parquet");` to capture them, but you can probably follow what David mentioned.

On Mon, Apr 7, 2025 at 12:26 PM Andrew Lamb <andrewlam...@gmail.com> wrote:

Attaching them on the ticket[1] would also be a way to share them.

It would also be super helpful to share the commands you ran.

Andrew

[1]: https://github.com/apache/parquet-testing/issues/75

On Mon, Apr 7, 2025 at 2:46 PM Adrian Garcia Badaracco <adr...@pydantic.dev.invalid> wrote:

Yes, I am not able to see them. Could you make a PR to the repo, or upload them somewhere so we can make a PR? Even if it doesn't get merged immediately, we can pull them from the PR. Thanks!

On Mon, Apr 7, 2025 at 1:44 PM Aihua Xu <aihu...@gmail.com> wrote:

Hi Adrian,

I attached them to my reply and I'm not sure if the files got filtered. Let me know if you still can't see them. Maybe I should push them to the repo instead.
On Mon, Apr 7, 2025 at 10:58 AM Adrian Garcia Badaracco <adr...@pydantic.dev.invalid> wrote:

Amazing, Aihua, thanks so much!

Sorry if I just missed it, but... where are the files you created? I don't see them in the repo / issue / this thread.

On Mon, Apr 7, 2025 at 12:54 PM Aihua Xu <aihu...@gmail.com> wrote:

I created some test files (attached) during development: one without shredding and one with shredding.

As David pointed out, they are missing the Variant logical type, but you can use them as a reference and a starting point.

On Mon, Apr 7, 2025 at 3:50 AM Andrew Lamb <andrewlam...@gmail.com> wrote:

I filed a ticket to track this work[1], and also perhaps to gather some additional help / collaboration.

[1]: https://github.com/apache/parquet-testing/issues/75

On Mon, Apr 7, 2025 at 6:02 AM Andrew Lamb <andrewlam...@gmail.com> wrote:

Thank you very much, David.

I will try to create some examples this week and report back.

Andrew

On Sun, Apr 6, 2025 at 4:48 PM David Cashman <david.cash...@databricks.com.invalid> wrote:

Hi Andrew, you should be able to create shredded files using OSS Spark 4.0. I think the only issue is that it doesn't have the logical type annotation yet, so readers wouldn't be able to distinguish it from a non-variant struct that happens to have the same schema.
(Spark is able to infer that it is a Variant from the `org.apache.spark.sql.parquet.row.metadata` metadata.)

The ParquetVariantShreddingSuite in Spark has some tests that write and read shredded Parquet files. Below is an example that translates the first test into code that runs in spark-shell and writes a Parquet file. The shredding schema is set via conf. If you want to test types that Spark doesn't infer in parse_json (e.g. timestamp, binary), you can use `to_variant_object` to cast structured values to Variant.

I won't have time to work on this in the next couple of weeks, but am happy to answer any questions.

Thanks,

David

scala> import org.apache.spark.sql.internal.SQLConf
scala> spark.conf.set(SQLConf.VARIANT_WRITE_SHREDDING_ENABLED.key, true)
scala> spark.conf.set(SQLConf.VARIANT_ALLOW_READING_SHREDDED.key, true)
scala> spark.conf.set(SQLConf.VARIANT_FORCE_SHREDDING_SCHEMA_FOR_TEST.key, "a int, b string, c decimal(15, 1)")
scala> val df = spark.sql(
     |   """
     |     | select case
     |     |   when id = 0 then parse_json('{"a": 1, "b": "2", "c": 3.3, "d": 4.4}')
     |     |   when id = 1 then parse_json('{"a": [1,2,3], "b": "hello", "c": {"x": 0}}')
     |     |   when id = 2 then parse_json('{"A": 1, "c": 1.23}')
     |     | end v from range(3)
     |     |""".stripMargin)
scala> df.write.mode("overwrite").parquet("/tmp/shredded_test")
scala> spark.read.parquet("/tmp/shredded_test").show
+--------------------+
|                   v|
+--------------------+
|{"a":1,"b":"2","c...|
|{"a":[1,2,3],"b":...|
|    {"A":1,"c":1.23}|
+--------------------+

On Sun, Apr 6, 2025 at 2:54 PM Andrew Lamb <andrewlam...@gmail.com> wrote:

Can someone (pretty pretty) please give us some binary examples so we can make faster progress on the Rust implementation?

We recently got exciting news[1] that folks from the CMU database group have started working on the Rust implementation of Variant, and I would very much like to encourage and support their work.

I am willing to do some legwork (make a PR to parquet-testing, for example) if someone can point me to the files (or instructions on how to use some system to create Variants).

I was hoping that since the VARIANT format[2] and draft shredding spec[3] have been in the repo for 6 months (since October 2024), it would be straightforward to provide some examples. Do we know of anything that is blocking the creation of examples?
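The kind of decoding work those binary examples would exercise can be sketched briefly. The following is an unofficial Python sketch based on one reading of VariantEncoding.md; the header bit layout, the int8 primitive type ID, and the hand-built byte strings are assumptions to double-check against the spec, not a reference implementation.

```python
# Unofficial sketch of decoding Variant binary, based on one reading of
# VariantEncoding.md. Bit positions and type IDs below are assumptions.

def decode_metadata(buf: bytes) -> list[str]:
    """Decode the metadata dictionary (the list of field-name strings)."""
    header = buf[0]
    version = header & 0x0F                   # assumed: low 4 bits, must be 1
    offset_size = ((header >> 6) & 0x03) + 1  # assumed: 1..4 bytes per offset
    assert version == 1
    pos = 1

    def read_uint(p: int) -> int:
        return int.from_bytes(buf[p:p + offset_size], "little")

    dict_size = read_uint(pos)
    pos += offset_size
    offsets = [read_uint(pos + i * offset_size) for i in range(dict_size + 1)]
    start = pos + (dict_size + 1) * offset_size
    return [buf[start + offsets[i]:start + offsets[i + 1]].decode()
            for i in range(dict_size)]

def decode_scalar(buf: bytes):
    """Decode a short-string or int8 primitive value (tiny subset only)."""
    basic_type = buf[0] & 0x03  # assumed: low 2 bits; 0=primitive, 1=short str
    type_info = buf[0] >> 2
    if basic_type == 1:                        # short string: type_info = length
        return buf[1:1 + type_info].decode()
    if basic_type == 0 and type_info == 3:     # primitive int8 (assumed ID)
        return int.from_bytes(buf[1:2], "little", signed=True)
    raise NotImplementedError("only a tiny subset is sketched here")

# Hand-built metadata with one dictionary entry, "a":
# header 0x01 (version 1, 1-byte offsets), size 1, offsets [0, 1], bytes b"a"
print(decode_metadata(bytes([0x01, 0x01, 0x00, 0x01]) + b"a"))  # ['a']
print(decode_scalar(bytes([0x15]) + b"hello"))                  # 'hello'
print(decode_scalar(bytes([0x0C, 0xFF])))                       # -1
```

A real decoder would cover all primitive types plus objects and arrays; the point here is only how small the per-value framing is, and why a handful of known-good binary files would pin these details down for every implementation.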
Andrew

[1]: https://github.com/apache/arrow-rs/issues/6736#issuecomment-2781556103
[2]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
[3]: https://github.com/apache/parquet-format/blob/master/VariantShredding.md

On Wed, Mar 5, 2025 at 3:58 PM Julien Le Dem <jul...@apache.org> wrote:

That sounds like a great suggestion to me.

On Wed, Mar 5, 2025 at 12:41 PM Andrew Lamb <andrewlam...@gmail.com> wrote:

I would like to request that, before the VARIANT spec changes are finalized, we have example data in parquet-testing.

This topic came up (well, I brought it up) on the sync call today.

In my opinion, having example files would reduce the overhead of new implementations dramatically. At a minimum there should be examples of:
* variant columns (no shredding)
* variant columns with shredding

along with some description of what those files contain ("expected contents"). For prior art, here is what Dewey did for the geometry type[1][2].
When looking for prior discussions, I found a great quote from Gang Wu[3] on this topic:

> I'd say that a lesson learned is that we should publish example files for any new feature to the parquet-testing [1] repo for interoperability tests.

Thank you for your consideration,
Andrew

[1]: https://github.com/apache/parquet-testing/pull/70
[2]: https://github.com/geoarrow/geoarrow-data
[3]: https://lists.apache.org/thread/71d7p9lprhf514jnt5dgnw4wfmn8ykzt
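The shredded files discussed in this thread are easiest to reason about through their reassembly rule: per VariantShredding.md, each shredded field carries an optional strongly typed `typed_value` alongside an optional residual Variant `value`, and a reader merges the two back into one object. Below is an unofficial Python sketch of that merge; the dict-based rows and field names stand in for decoded Parquet columns and are illustrative assumptions, not the spec's physical layout.

```python
# Unofficial sketch of the shredding reassembly rule from VariantShredding.md.
# Plain Python dicts stand in for decoded Parquet columns (an assumption for
# illustration); a real reader works on column chunks and Variant bytes.

def reassemble_object(value, typed_fields):
    """Merge the residual `value` (fields not in the shredding schema)
    with the shredded `typed_value` fields back into one object."""
    result = dict(value or {})
    for name, field in typed_fields.items():
        if field.get("typed_value") is not None:
            result[name] = field["typed_value"]  # shredded into a typed column
        elif field.get("value") is not None:
            result[name] = field["value"]        # fell back to Variant encoding
        # else: the field is absent from this row
    return result

# Row 0 of David's example with schema "a int, b string, c decimal(15, 1)":
# a, b, and c shred cleanly; d is not in the schema, so it stays in the
# residual value.
row = reassemble_object(
    {"d": 4.4},
    {"a": {"typed_value": 1},
     "b": {"typed_value": "2"},
     "c": {"typed_value": 3.3}},
)
print(row)  # {'d': 4.4, 'a': 1, 'b': '2', 'c': 3.3}
```

Spelling this rule out in the example files' "expected contents" notes would let every implementation check its reader against the same merge semantics.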