Amazing Aihua, thanks so much! Sorry if I just missed it but... where are the files you created? I don't see them in the repo / issue / this thread.
On Mon, Apr 7, 2025 at 12:54 PM Aihua Xu <aihu...@gmail.com> wrote:

> I have created some test files during development and attached them: one
> is without shredding and one is with shredding.
>
> As David pointed out, they are missing the Variant logical type, but you
> can use them as a reference and a starting point.
>
> On Mon, Apr 7, 2025 at 3:50 AM Andrew Lamb <andrewlam...@gmail.com> wrote:
>
>> I filed a ticket to track this work[1] and perhaps to gather some
>> additional help / collaboration.
>>
>> [1]: https://github.com/apache/parquet-testing/issues/75
>>
>> On Mon, Apr 7, 2025 at 6:02 AM Andrew Lamb <andrewlam...@gmail.com>
>> wrote:
>>
>> > Thank you very much, David.
>> >
>> > I will try to create some examples this week and report back.
>> >
>> > Andrew
>> >
>> > On Sun, Apr 6, 2025 at 4:48 PM David Cashman
>> > <david.cash...@databricks.com.invalid> wrote:
>> >
>> >> Hi Andrew, you should be able to create shredded files using OSS Spark
>> >> 4.0. I think the only issue is that it doesn't have the logical type
>> >> annotation yet, so readers wouldn't be able to distinguish it from a
>> >> non-variant struct that happens to have the same schema. (Spark is
>> >> able to infer that it is a Variant from the
>> >> `org.apache.spark.sql.parquet.row.metadata` metadata.)
>> >>
>> >> The ParquetVariantShreddingSuite in Spark has some tests that write
>> >> and read shredded Parquet files. Below is an example that translates
>> >> the first test into code that runs in spark-shell and writes a Parquet
>> >> file. The shredding schema is set via conf. If you want to test types
>> >> that Spark doesn't infer in parse_json (e.g. timestamp, binary), you
>> >> can use `to_variant_object` to cast structured values to Variant.
>> >>
>> >> I won't have time to work on this in the next couple of weeks, but am
>> >> happy to answer any questions.
>> >>
>> >> Thanks,
>> >>
>> >> David
>> >>
>> >> scala> import org.apache.spark.sql.internal.SQLConf
>> >> scala> spark.conf.set(SQLConf.VARIANT_WRITE_SHREDDING_ENABLED.key, true)
>> >> scala> spark.conf.set(SQLConf.VARIANT_ALLOW_READING_SHREDDED.key, true)
>> >> scala> spark.conf.set(SQLConf.VARIANT_FORCE_SHREDDING_SCHEMA_FOR_TEST.key,
>> >>          "a int, b string, c decimal(15, 1)")
>> >> scala> val df = spark.sql(
>> >>      | """
>> >>      | | select case
>> >>      | |   when id = 0 then parse_json('{"a": 1, "b": "2", "c": 3.3, "d": 4.4}')
>> >>      | |   when id = 1 then parse_json('{"a": [1,2,3], "b": "hello", "c": {"x": 0}}')
>> >>      | |   when id = 2 then parse_json('{"A": 1, "c": 1.23}')
>> >>      | | end v from range(3)
>> >>      | |""".stripMargin)
>> >> scala> df.write.mode("overwrite").parquet("/tmp/shredded_test")
>> >> scala> spark.read.parquet("/tmp/shredded_test").show
>> >> +--------------------+
>> >> |                   v|
>> >> +--------------------+
>> >> |{"a":1,"b":"2","c...|
>> >> |{"a":[1,2,3],"b":...|
>> >> |    {"A":1,"c":1.23}|
>> >> +--------------------+
>> >>
>> >> On Sun, Apr 6, 2025 at 2:54 PM Andrew Lamb <andrewlam...@gmail.com>
>> >> wrote:
>> >> >
>> >> > Can someone (pretty pretty) please give us some binary examples so
>> >> > we can make faster progress on the Rust implementation?
>> >> >
>> >> > We recently got exciting news[1] that folks from the CMU database
>> >> > group have started working on the Rust implementation of variant,
>> >> > and I would very much like to encourage and support their work.
>> >> >
>> >> > I am willing to do some legwork (make a PR to parquet-testing, for
>> >> > example) if someone can point me to the files (or instructions on
>> >> > how to use some system to create variants).
>> >> >
>> >> > I was hoping that since the VARIANT format[2] and draft shredding
>> >> > spec[3] have been in the repo for six months (since October 2024),
>> >> > it would be straightforward to provide some examples. Do we know of
>> >> > anything that is blocking the creation of examples?
>> >> >
>> >> > Andrew
>> >> >
>> >> > [1]: https://github.com/apache/arrow-rs/issues/6736#issuecomment-2781556103
>> >> > [2]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
>> >> > [3]: https://github.com/apache/parquet-format/blob/master/VariantShredding.md
>> >> >
>> >> > On Wed, Mar 5, 2025 at 3:58 PM Julien Le Dem <jul...@apache.org> wrote:
>> >> >
>> >> > > That sounds like a great suggestion to me.
>> >> > >
>> >> > > On Wed, Mar 5, 2025 at 12:41 PM Andrew Lamb <andrewlam...@gmail.com>
>> >> > > wrote:
>> >> > >
>> >> > > > I would like to request that, before the VARIANT spec changes
>> >> > > > are finalized, we have example data in parquet-testing.
>> >> > > >
>> >> > > > This topic came up (well, I brought it up) on the sync call today.
>> >> > > >
>> >> > > > In my opinion, having example files would dramatically reduce
>> >> > > > the overhead for new implementations. At a minimum there should
>> >> > > > be examples of:
>> >> > > > * variant columns (no shredding)
>> >> > > > * variant columns with shredding
>> >> > > >
>> >> > > > These should come with some description of what the files
>> >> > > > contain ("expected contents"). For prior art, here is what Dewey
>> >> > > > did for the geometry type[1][2].
>> >> > > >
>> >> > > > When looking for prior discussions, I found a great quote from
>> >> > > > Gang Wu[3] on this topic:
>> >> > > >
>> >> > > > > I'd say that a lesson learned is that we should publish
>> >> > > > > example files for any new feature to the parquet-testing [1]
>> >> > > > > repo for interoperability tests.
>> >> > > >
>> >> > > > Thank you for your consideration,
>> >> > > > Andrew
>> >> > > >
>> >> > > > [1]: https://github.com/apache/parquet-testing/pull/70
>> >> > > > [2]: https://github.com/geoarrow/geoarrow-data
>> >> > > > [3]: https://lists.apache.org/thread/71d7p9lprhf514jnt5dgnw4wfmn8ykzt
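P.S. For anyone who wants to try the `to_variant_object` route David
describes above for types that parse_json won't infer (timestamp, binary),
here is a rough, untested sketch of what that might look like in
spark-shell. The shredding schema string, the named_struct field names, and
the output path are made up for illustration, so treat this as a starting
point rather than a recipe:

scala> import org.apache.spark.sql.internal.SQLConf
scala> spark.conf.set(SQLConf.VARIANT_WRITE_SHREDDING_ENABLED.key, true)
scala> // Hypothetical shredding schema covering types parse_json can't express.
scala> spark.conf.set(SQLConf.VARIANT_FORCE_SHREDDING_SCHEMA_FOR_TEST.key,
         "ts timestamp, bin binary")
scala> // to_variant_object casts a structured value (here a named_struct) to
scala> // Variant, so the timestamp and binary fields keep their types rather
scala> // than being inferred from JSON text.
scala> val df = spark.sql(
     | """
     | | select to_variant_object(named_struct(
     | |   'ts', timestamp'2025-04-07 12:00:00',
     | |   'bin', cast('hello' as binary))) v
     | | from range(1)
     | |""".stripMargin)
scala> df.write.mode("overwrite").parquet("/tmp/shredded_typed_test")

Reading the result back with spark.read.parquet(...).show should print the
values as JSON, the same way David's example does; if the binary column
turns out not to be shreddable, dropping `bin` from the conf string should
still exercise a shredded timestamp.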