Thanks Aihua,

We are not (yet) blocked on development by the logical type annotation,
though I am blocked from creating example parquet files.
Do you think it would be possible to create annotated parquet files using
arrow-cpp now that Neil's PR[1] has been merged? I haven't really tried yet.

Andrew

[1]: https://github.com/apache/arrow/pull/45375

On Thu, May 8, 2025 at 9:59 PM Aihua Xu <aihu...@gmail.com> wrote:

> Since we haven't released any parquet version with variant logical type, I
> don't think any engine would write that yet.
>
> Are you getting blocked by this? Can you work around the issue for now?
>
> Thanks
> Aihua
>
> > On May 8, 2025, at 12:57 PM, Andrew Lamb <andrewlam...@gmail.com> wrote:
> >
> > Update here: the initial example files[1] were merged (thanks Micah!)
> >
> > However, Spark 4.0 (and what is on the main branch) does not appear to
> > write parquet files with the logical annotations yet (it seems to use
> > its own metadata) -- you can see what I tried in [2].
> >
> > **Does anyone know of a system that can write Variant values with the
> > proper Parquet logical type?**
> >
> > Thanks,
> > Andrew
> >
> > [1]: https://github.com/apache/parquet-testing/pull/76
> > [2]: https://github.com/apache/parquet-testing/issues/75#issuecomment-2847815424
> >
> >> On Fri, May 2, 2025 at 2:22 PM Andrew Lamb <andrewlam...@gmail.com> wrote:
> >>
> >> Thanks to Micah, I think we have the first PR with example Variant
> >> values[1] almost ready to merge.
> >>
> >> Next up will be figuring out how to create Parquet files with the
> >> proper logical annotations.
> >>
> >> Andrew
> >>
> >> p.s. In case it isn't obvious, I would like introducing Variant into
> >> Parquet to be a model of how to extend the spec and get wide adoption
> >> across the ecosystem quickly, for two reasons:
> >> 1. The actual Variant functionality
> >> 2. To counteract the narrative that Parquet is ossified and not
> >> possible to change.
> >>
> >> I personally think adding the binary examples is critical to
> >> helping other language implementations.
> >>
> >> [1]: https://github.com/apache/parquet-testing/pull/76
> >>
> >> On Wed, Apr 16, 2025 at 10:59 AM Andrew Lamb <andrewlam...@gmail.com>
> >> wrote:
> >>
> >>> Update here is I have created a PR with example variant values (not
> >>> yet parquet files, just the variant values)[1].
> >>>
> >>> Since Spark seems to be the only open source software capable of
> >>> creating variants at this time, I generated the examples using Spark.
> >>>
> >>> Please check it out and let me know what you think. If it is
> >>> acceptable I can work on PRs (based on Aihua's example) for actual
> >>> parquet files with encoded values.
> >>>
> >>> Andrew
> >>>
> >>> [1]: https://github.com/apache/parquet-testing/pull/76
> >>>
> >>> On Tue, Apr 8, 2025 at 5:46 AM Andrew Lamb <andrewlam...@gmail.com>
> >>> wrote:
> >>>
> >>>> Thank you very much
> >>>>
> >>>> On Mon, Apr 7, 2025 at 5:10 PM Aihua Xu <aihu...@gmail.com> wrote:
> >>>>
> >>>>> I attached them in https://github.com/apache/parquet-testing/issues/75.
> >>>>> Please rename them back to *.parquet so you can use parquet tools
> >>>>> to view them.
> >>>>>
> >>>>> I captured them when working on the Iceberg tests in
> >>>>> https://github.com/apache/iceberg/blob/main/parquet/src/test/java/org/apache/iceberg/parquet/TestVariantWriters.java#L172.
> >>>>>
> >>>>> You can change the output to OutputFile outputFile =
> >>>>> Files.localOutput("primitive.parquet"); to capture them, but you
> >>>>> probably can follow what David mentioned.
> >>>>>
> >>>>> On Mon, Apr 7, 2025 at 12:26 PM Andrew Lamb <andrewlam...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Attaching them on the ticket[1] would also be a way to share them.
> >>>>>>
> >>>>>> It would also be super helpful to share the commands you ran.
> >>>>>>
> >>>>>> Andrew
> >>>>>>
> >>>>>> [1]: https://github.com/apache/parquet-testing/issues/75
> >>>>>>
> >>>>>> On Mon, Apr 7, 2025 at 2:46 PM Adrian Garcia Badaracco
> >>>>>> <adr...@pydantic.dev.invalid> wrote:
> >>>>>>
> >>>>>>> Yes, I am not able to see them. Could you make a PR to the repo,
> >>>>>>> or upload them somewhere so we can make a PR? Even if it doesn't
> >>>>>>> get merged immediately we can pull them from the PR. Thanks!
> >>>>>>>
> >>>>>>> On Mon, Apr 7, 2025 at 1:44 PM Aihua Xu <aihu...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Hi Adrian,
> >>>>>>>>
> >>>>>>>> I attached them to my reply and I'm not sure if the files got
> >>>>>>>> filtered. Let me know if you still can't see them. Maybe I
> >>>>>>>> should push to the repo instead.
> >>>>>>>>
> >>>>>>>> On Mon, Apr 7, 2025 at 10:58 AM Adrian Garcia Badaracco
> >>>>>>>> <adr...@pydantic.dev.invalid> wrote:
> >>>>>>>>
> >>>>>>>>> Amazing Aihua, thanks so much!
> >>>>>>>>>
> >>>>>>>>> Sorry if I just missed it but... where are the files you
> >>>>>>>>> created? I don't see them in the repo / issue / this thread.
> >>>>>>>>>
> >>>>>>>>> On Mon, Apr 7, 2025 at 12:54 PM Aihua Xu <aihu...@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> I have created some test files (attached) during the
> >>>>>>>>>> development: one is without shredding and one is with
> >>>>>>>>>> shredding.
> >>>>>>>>>>
> >>>>>>>>>> As David pointed out, they are missing the Variant logical
> >>>>>>>>>> type, but you can use them as a reference and as a start.
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Apr 7, 2025 at 3:50 AM Andrew Lamb
> >>>>>>>>>> <andrewlam...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> I filed a ticket to track this work[1] and also perhaps to
> >>>>>>>>>>> gather some additional help / collaboration.
> >>>>>>>>>>>
> >>>>>>>>>>> [1]: https://github.com/apache/parquet-testing/issues/75
> >>>>>>>>>>>
> >>>>>>>>>>> On Mon, Apr 7, 2025 at 6:02 AM Andrew Lamb
> >>>>>>>>>>> <andrewlam...@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Thank you very much David
> >>>>>>>>>>>>
> >>>>>>>>>>>> I will try to create some examples this week and report back.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Andrew
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Sun, Apr 6, 2025 at 4:48 PM David Cashman
> >>>>>>>>>>>> <david.cash...@databricks.com.invalid> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Andrew, you should be able to create shredded files using
> >>>>>>>>>>>>> OSS Spark 4.0. I think the only issue is that it doesn't
> >>>>>>>>>>>>> have the logical type annotation yet, so readers wouldn't be
> >>>>>>>>>>>>> able to distinguish it from a non-variant struct that happens
> >>>>>>>>>>>>> to have the same schema. (Spark is able to infer that it is a
> >>>>>>>>>>>>> Variant from the
> >>>>>>>>>>>>> `org.apache.spark.sql.parquet.row.metadata` metadata.)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The ParquetVariantShreddingSuite in Spark has some tests that
> >>>>>>>>>>>>> write and read shredded parquet files. Below is an example
> >>>>>>>>>>>>> that translates the first test into code that runs in
> >>>>>>>>>>>>> spark-shell and writes a Parquet file. The shredding schema
> >>>>>>>>>>>>> is set via conf. If you want to test types that Spark
> >>>>>>>>>>>>> doesn't infer in parse_json (e.g. timestamp, binary), you
> >>>>>>>>>>>>> can use `to_variant_object` to cast structured values to
> >>>>>>>>>>>>> Variant.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I won't have time to work on this in the next couple of
> >>>>>>>>>>>>> weeks, but am happy to answer any questions.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> David
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> scala> import org.apache.spark.sql.internal.SQLConf
> >>>>>>>>>>>>> scala> spark.conf.set(SQLConf.VARIANT_WRITE_SHREDDING_ENABLED.key, true)
> >>>>>>>>>>>>> scala> spark.conf.set(SQLConf.VARIANT_ALLOW_READING_SHREDDED.key, true)
> >>>>>>>>>>>>> scala> spark.conf.set(SQLConf.VARIANT_FORCE_SHREDDING_SCHEMA_FOR_TEST.key,
> >>>>>>>>>>>>>          "a int, b string, c decimal(15, 1)")
> >>>>>>>>>>>>> scala> val df = spark.sql(
> >>>>>>>>>>>>>      | """
> >>>>>>>>>>>>>      |   | select case
> >>>>>>>>>>>>>      |   |   when id = 0 then parse_json('{"a": 1, "b": "2", "c": 3.3, "d": 4.4}')
> >>>>>>>>>>>>>      |   |   when id = 1 then parse_json('{"a": [1,2,3], "b": "hello", "c": {"x": 0}}')
> >>>>>>>>>>>>>      |   |   when id = 2 then parse_json('{"A": 1, "c": 1.23}')
> >>>>>>>>>>>>>      |   | end v from range(3)
> >>>>>>>>>>>>>      |   |""".stripMargin)
> >>>>>>>>>>>>> scala> df.write.mode("overwrite").parquet("/tmp/shredded_test")
> >>>>>>>>>>>>> scala> spark.read.parquet("/tmp/shredded_test").show
> >>>>>>>>>>>>> +--------------------+
> >>>>>>>>>>>>> |                   v|
> >>>>>>>>>>>>> +--------------------+
> >>>>>>>>>>>>> |{"a":1,"b":"2","c...|
> >>>>>>>>>>>>> |{"a":[1,2,3],"b":...|
> >>>>>>>>>>>>> |    {"A":1,"c":1.23}|
> >>>>>>>>>>>>> +--------------------+
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Sun, Apr 6, 2025 at 2:54 PM Andrew Lamb
> >>>>>>>>>>>>> <andrewlam...@gmail.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Can someone (pretty pretty) please give us some binary
> >>>>>>>>>>>>>> examples so we can make faster progress on the Rust
> >>>>>>>>>>>>>> implementation?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> We recently got exciting news[1] that folks from the CMU
> >>>>>>>>>>>>>> database group have started working on the Rust
> >>>>>>>>>>>>>> implementation of variant, and I would very much like to
> >>>>>>>>>>>>>> encourage and support their work.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I am willing to do some legwork (make a PR to
> >>>>>>>>>>>>>> parquet-testing for example) if someone can point me to the
> >>>>>>>>>>>>>> files (or instructions on how to use some system to create
> >>>>>>>>>>>>>> variants).
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I was hoping that since the VARIANT format[2] and draft
> >>>>>>>>>>>>>> shredding spec[3] have been in the repo for 6 months (since
> >>>>>>>>>>>>>> October 2024), it would be straightforward to provide some
> >>>>>>>>>>>>>> examples. Do we know anything that is blocking the creation
> >>>>>>>>>>>>>> of examples?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Andrew
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> [1]: https://github.com/apache/arrow-rs/issues/6736#issuecomment-2781556103
> >>>>>>>>>>>>>> [2]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
> >>>>>>>>>>>>>> [3]: https://github.com/apache/parquet-format/blob/master/VariantShredding.md
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Wed, Mar 5, 2025 at 3:58 PM Julien Le Dem
> >>>>>>>>>>>>>> <jul...@apache.org> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> That sounds like a great suggestion to me.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Wed, Mar 5, 2025 at 12:41 PM Andrew Lamb
> >>>>>>>>>>>>>>> <andrewlam...@gmail.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I would like to request, before the VARIANT spec changes
> >>>>>>>>>>>>>>>> are finalized, that we have example data in
> >>>>>>>>>>>>>>>> parquet-testing.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> This topic came up (well, I brought it up) on the sync
> >>>>>>>>>>>>>>>> call today.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> In my opinion, having example files would reduce the
> >>>>>>>>>>>>>>>> overhead of new implementations dramatically. At least
> >>>>>>>>>>>>>>>> there should be examples of:
> >>>>>>>>>>>>>>>> * variant columns (no shredding)
> >>>>>>>>>>>>>>>> * variant columns with shredding
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> plus some description of what those files contain
> >>>>>>>>>>>>>>>> ("expected contents"). For prior art, here is what Dewey
> >>>>>>>>>>>>>>>> did for the geometry type[1][2].
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> When looking for prior discussions, I found a great quote
> >>>>>>>>>>>>>>>> from Gang Wu[3] on this topic:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I'd say that a lesson learned is that we should publish
> >>>>>>>>>>>>>>>>> example files for any new feature to the
> >>>>>>>>>>>>>>>>> parquet-testing [1] repo for interoperability tests.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thank you for your consideration,
> >>>>>>>>>>>>>>>> Andrew
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> [1]: https://github.com/apache/parquet-testing/pull/70
> >>>>>>>>>>>>>>>> [2]: https://github.com/geoarrow/geoarrow-data
> >>>>>>>>>>>>>>>> [3]: https://lists.apache.org/thread/71d7p9lprhf514jnt5dgnw4wfmn8ykzt