I attached them in https://github.com/apache/parquet-testing/issues/75. Please rename them back to *.parquet so you can use parquet tools to view them.
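For anyone picking up the attachments, here is a minimal sketch of the rename step; the /tmp/variant_samples directory and the .bin extension are hypothetical placeholders for wherever your downloads land and whatever extension they arrive with:

```python
# Hypothetical sketch: attachments sometimes arrive with a stripped or
# generic extension; rename them back to *.parquet before inspecting.
from pathlib import Path

d = Path("/tmp/variant_samples")
d.mkdir(exist_ok=True)
(d / "primitive.bin").touch()  # stand-in for a downloaded attachment

for f in d.glob("*.bin"):
    f.rename(f.with_suffix(".parquet"))

print(sorted(p.name for p in d.glob("*.parquet")))
```

After renaming, `parquet-tools` (or any Parquet reader) should be able to open the files.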
I captured them when working on the Iceberg tests in
https://github.com/apache/iceberg/blob/main/parquet/src/test/java/org/apache/iceberg/parquet/TestVariantWriters.java#L172 .
You can change it to OutputFile outputFile =
Files.localOutput("primitive.parquet"); to capture them, but you can
probably follow what David mentioned instead.

On Mon, Apr 7, 2025 at 12:26 PM Andrew Lamb <andrewlam...@gmail.com> wrote:

> Attaching them on the ticket[1] would also be a way to share them.
>
> It would also be super helpful to share the commands you ran.
>
> Andrew
>
> [1]: https://github.com/apache/parquet-testing/issues/75
>
> On Mon, Apr 7, 2025 at 2:46 PM Adrian Garcia Badaracco
> <adr...@pydantic.dev.invalid> wrote:
>
> > Yes, I am not able to see them. Could you make a PR to the repo, or
> > upload them somewhere so we can make a PR? Even if it doesn't get
> > merged immediately we can pull them from the PR. Thanks!
> >
> > On Mon, Apr 7, 2025 at 1:44 PM Aihua Xu <aihu...@gmail.com> wrote:
> >
> > > Hi Adrian,
> > >
> > > I attached them to my reply and I'm not sure if the files got
> > > filtered. Let me know if you still can't see them. Maybe I should
> > > push them to the repo instead.
> > >
> > > On Mon, Apr 7, 2025 at 10:58 AM Adrian Garcia Badaracco
> > > <adr...@pydantic.dev.invalid> wrote:
> > >
> > > > Amazing Aihua, thanks so much!
> > > >
> > > > Sorry if I just missed it but... where are the files you created?
> > > > I don't see them in the repo / issue / this thread.
> > > >
> > > > On Mon, Apr 7, 2025 at 12:54 PM Aihua Xu <aihu...@gmail.com> wrote:
> > > >
> > > > > I have created some test files (attached) during the
> > > > > development: one without shredding and one with shredding.
> > > > >
> > > > > As David pointed out, they are missing the Variant logical
> > > > > type, but you can use them as a reference and a starting point.
> > > > >
> > > > > On Mon, Apr 7, 2025 at 3:50 AM Andrew Lamb
> > > > > <andrewlam...@gmail.com> wrote:
> > > > >
> > > > >> I filed a ticket to track this work[1] and also perhaps to
> > > > >> gather some additional help / collaboration.
> > > > >>
> > > > >> [1]: https://github.com/apache/parquet-testing/issues/75
> > > > >>
> > > > >> On Mon, Apr 7, 2025 at 6:02 AM Andrew Lamb
> > > > >> <andrewlam...@gmail.com> wrote:
> > > > >>
> > > > >> > Thank you very much, David.
> > > > >> >
> > > > >> > I will try to create some examples this week and report back.
> > > > >> >
> > > > >> > Andrew
> > > > >> >
> > > > >> > On Sun, Apr 6, 2025 at 4:48 PM David Cashman
> > > > >> > <david.cash...@databricks.com.invalid> wrote:
> > > > >> >
> > > > >> >> Hi Andrew, you should be able to create shredded files using
> > > > >> >> OSS Spark 4.0. I think the only issue is that it doesn't
> > > > >> >> have the logical type annotation yet, so readers wouldn't be
> > > > >> >> able to distinguish it from a non-variant struct that
> > > > >> >> happens to have the same schema. (Spark is able to infer
> > > > >> >> that it is a Variant from the
> > > > >> >> `org.apache.spark.sql.parquet.row.metadata` metadata.)
> > > > >> >>
> > > > >> >> The ParquetVariantShreddingSuite in Spark has some tests
> > > > >> >> that write and read shredded parquet files. Below is an
> > > > >> >> example that translates the first test into code that runs
> > > > >> >> in spark-shell and writes a Parquet file. The shredding
> > > > >> >> schema is set via conf. If you want to test types that Spark
> > > > >> >> doesn't infer in parse_json (e.g. timestamp, binary), you
> > > > >> >> can use `to_variant_object` to cast structured values to
> > > > >> >> Variant.
> > > > >> >>
> > > > >> >> I won't have time to work on this in the next couple of
> > > > >> >> weeks, but am happy to answer any questions.
> > > > >> >>
> > > > >> >> Thanks,
> > > > >> >>
> > > > >> >> David
> > > > >> >>
> > > > >> >> scala> import org.apache.spark.sql.internal.SQLConf
> > > > >> >> scala> spark.conf.set(SQLConf.VARIANT_WRITE_SHREDDING_ENABLED.key, true)
> > > > >> >> scala> spark.conf.set(SQLConf.VARIANT_ALLOW_READING_SHREDDED.key, true)
> > > > >> >> scala> spark.conf.set(SQLConf.VARIANT_FORCE_SHREDDING_SCHEMA_FOR_TEST.key,
> > > > >> >>          "a int, b string, c decimal(15, 1)")
> > > > >> >> scala> val df = spark.sql(
> > > > >> >>      | """
> > > > >> >>      | | select case
> > > > >> >>      | |   when id = 0 then parse_json('{"a": 1, "b": "2", "c": 3.3, "d": 4.4}')
> > > > >> >>      | |   when id = 1 then parse_json('{"a": [1,2,3], "b": "hello", "c": {"x": 0}}')
> > > > >> >>      | |   when id = 2 then parse_json('{"A": 1, "c": 1.23}')
> > > > >> >>      | | end v from range(3)
> > > > >> >>      | |""".stripMargin)
> > > > >> >> scala> df.write.mode("overwrite").parquet("/tmp/shredded_test")
> > > > >> >> scala> spark.read.parquet("/tmp/shredded_test").show
> > > > >> >> +--------------------+
> > > > >> >> |                   v|
> > > > >> >> +--------------------+
> > > > >> >> |{"a":1,"b":"2","c...|
> > > > >> >> |{"a":[1,2,3],"b":...|
> > > > >> >> |    {"A":1,"c":1.23}|
> > > > >> >> +--------------------+
> > > > >> >>
> > > > >> >>
> > > > >> >> On Sun, Apr 6, 2025 at 2:54 PM Andrew Lamb
> > > > >> >> <andrewlam...@gmail.com> wrote:
> > > > >> >>
> > > > >> >> > Can someone (pretty pretty) please give us some binary
> > > > >> >> > examples so we can make faster progress on the Rust
> > > > >> >> > implementation?
> > > > >> >> >
> > > > >> >> > We recently got exciting news[1] that folks from the CMU
> > > > >> >> > database group have started working on the Rust
> > > > >> >> > implementation of Variant, and I would very much like to
> > > > >> >> > encourage and support their work.
> > > > >> >> >
> > > > >> >> > I am willing to do some legwork (make a PR to
> > > > >> >> > parquet-testing, for example) if someone can point me to
> > > > >> >> > the files (or instructions on how to use some system to
> > > > >> >> > create variants).
> > > > >> >> >
> > > > >> >> > I was hoping that since the VARIANT format[2] and draft
> > > > >> >> > shredding spec[3] have been in the repo for 6 months
> > > > >> >> > (since October 2024), it would be straightforward to
> > > > >> >> > provide some examples. Do we know of anything that is
> > > > >> >> > blocking the creation of examples?
> > > > >> >> >
> > > > >> >> > Andrew
> > > > >> >> >
> > > > >> >> > [1]: https://github.com/apache/arrow-rs/issues/6736#issuecomment-2781556103
> > > > >> >> > [2]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
> > > > >> >> > [3]: https://github.com/apache/parquet-format/blob/master/VariantShredding.md
> > > > >> >> >
> > > > >> >> >
> > > > >> >> > On Wed, Mar 5, 2025 at 3:58 PM Julien Le Dem
> > > > >> >> > <jul...@apache.org> wrote:
> > > > >> >> >
> > > > >> >> > > That sounds like a great suggestion to me.
> > > > >> >> > >
> > > > >> >> > > On Wed, Mar 5, 2025 at 12:41 PM Andrew Lamb
> > > > >> >> > > <andrewlam...@gmail.com> wrote:
> > > > >> >> > >
> > > > >> >> > > > I would like to request, before the VARIANT spec
> > > > >> >> > > > changes are finalized, that we have example data in
> > > > >> >> > > > parquet-testing.
> > > > >> >> > > >
> > > > >> >> > > > This topic came up (well, I brought it up) on the
> > > > >> >> > > > sync call today.
> > > > >> >> > > >
> > > > >> >> > > > In my opinion, having example files would reduce the
> > > > >> >> > > > overhead of new implementations dramatically.
> > > > >> >> > > > At least there should be examples of:
> > > > >> >> > > > * variant columns (no shredding)
> > > > >> >> > > > * variant columns with shredding
> > > > >> >> > > >
> > > > >> >> > > > There should also be some description of what those
> > > > >> >> > > > files contain ("expected contents"). For prior art,
> > > > >> >> > > > here is what Dewey did for the geometry type[1][2].
> > > > >> >> > > >
> > > > >> >> > > > When looking for prior discussions, I found a great
> > > > >> >> > > > quote from Gang Wu[3] on this topic:
> > > > >> >> > > >
> > > > >> >> > > > > I'd say that a lesson learned is that we should
> > > > >> >> > > > > publish example files for any new feature to the
> > > > >> >> > > > > parquet-testing [1] repo for interoperability
> > > > >> >> > > > > tests.
> > > > >> >> > > >
> > > > >> >> > > > Thank you for your consideration,
> > > > >> >> > > > Andrew
> > > > >> >> > > >
> > > > >> >> > > > [1] https://github.com/apache/parquet-testing/pull/70
> > > > >> >> > > > [2] https://github.com/geoarrow/geoarrow-data
> > > > >> >> > > > [3]: https://lists.apache.org/thread/71d7p9lprhf514jnt5dgnw4wfmn8ykzt
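[For anyone starting on the unshredded side of the examples requested above, the variant metadata dictionary is small enough to hand-roll in tests. The sketch below builds and parses a tiny metadata buffer holding the keys "a", "b", "c"; the bit layout (version in the low 4 bits, sorted_strings at bit 4, offset_size_minus_one at bits 6-7, then dictionary_size, offsets, and the key bytes) follows my reading of VariantEncoding.md and should be double-checked against the spec before relying on it:]

```python
# Sketch of parsing a variant metadata buffer, per (my reading of)
# VariantEncoding.md; verify the bit layout against the spec itself.
def parse_metadata(buf: bytes):
    header = buf[0]
    version = header & 0x0F                      # low 4 bits
    sorted_strings = bool((header >> 4) & 0x1)   # bit 4
    offset_size = ((header >> 6) & 0x3) + 1      # offset_size_minus_one + 1
    pos = 1
    dict_size = int.from_bytes(buf[pos:pos + offset_size], "little")
    pos += offset_size
    offsets = [
        int.from_bytes(buf[pos + i * offset_size:pos + (i + 1) * offset_size], "little")
        for i in range(dict_size + 1)
    ]
    pos += (dict_size + 1) * offset_size
    data = buf[pos:]
    keys = [data[offsets[i]:offsets[i + 1]].decode("utf-8") for i in range(dict_size)]
    return version, sorted_strings, keys

# Hand-built buffer: header 0x11 = version 1 with the sorted_strings bit
# set and 1-byte offsets; dictionary of three single-byte keys.
meta = bytes([0x11, 3, 0, 1, 2, 3]) + b"abc"
print(parse_metadata(meta))  # -> (1, True, ['a', 'b', 'c'])
```

A round-trip check like this against the eventual parquet-testing files would make a good first interoperability test.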