Thank you very much, David. I will try to create some examples this week and report back.
Andrew

On Sun, Apr 6, 2025 at 4:48 PM David Cashman <david.cash...@databricks.com.invalid> wrote:
> Hi Andrew, you should be able to create shredded files using OSS Spark
> 4.0. I think the only issue is that it doesn't have the logical type
> annotation yet, so readers wouldn't be able to distinguish it from a
> non-variant struct that happens to have the same schema. (Spark is
> able to infer that it is a Variant from the
> `org.apache.spark.sql.parquet.row.metadata` metadata.)
>
> The ParquetVariantShreddingSuite in Spark has some tests that write
> and read shredded parquet files. Below is an example that translates
> the first test into code that runs in spark-shell and writes a Parquet
> file. The shredding schema is set via conf. If you want to test types
> that Spark doesn't infer in parse_json (e.g. timestamp, binary), you
> can use `to_variant_object` to cast structured values to Variant.
>
> I won't have time to work on this in the next couple of weeks, but am
> happy to answer any questions.
>
> Thanks,
>
> David
>
> scala> import org.apache.spark.sql.internal.SQLConf
> scala> spark.conf.set(SQLConf.VARIANT_WRITE_SHREDDING_ENABLED.key, true)
> scala> spark.conf.set(SQLConf.VARIANT_ALLOW_READING_SHREDDED.key, true)
> scala> spark.conf.set(SQLConf.VARIANT_FORCE_SHREDDING_SCHEMA_FOR_TEST.key, "a int, b string, c decimal(15, 1)")
> scala> val df = spark.sql(
>      |   """
>      |     | select case
>      |     | when id = 0 then parse_json('{"a": 1, "b": "2", "c": 3.3, "d": 4.4}')
>      |     | when id = 1 then parse_json('{"a": [1,2,3], "b": "hello", "c": {"x": 0}}')
>      |     | when id = 2 then parse_json('{"A": 1, "c": 1.23}')
>      |     | end v from range(3)
>      |     |""".stripMargin)
> scala> df.write.mode("overwrite").parquet("/tmp/shredded_test")
> scala> spark.read.parquet("/tmp/shredded_test").show
> +--------------------+
> |                   v|
> +--------------------+
> |{"a":1,"b":"2","c...|
> |{"a":[1,2,3],"b":...|
> |    {"A":1,"c":1.23}|
> +--------------------+
>
>
> On Sun, Apr 6, 2025 at 2:54 PM Andrew Lamb <andrewlam...@gmail.com> wrote:
> >
> > Can someone (pretty pretty) please give us some binary examples so we can
> > make faster progress on the Rust implementation?
> >
> > We recently got exciting news[1] that folks from the CMU database group
> > have started working on the Rust implementation of variant, and I would
> > very much like to encourage and support their work.
> >
> > I am willing to do some legwork (make a PR to parquet-testing for example)
> > if someone can point me to the files (or instructions on how to use some
> > system to create variants).
> >
> > I was hoping that since the VARIANT format[2] and draft shredding spec[3]
> > have been in the repo for 6 months (since October 2024), it would be
> > straightforward to provide some examples. Do we know anything that is
> > blocking the creation of examples?
> >
> > Andrew
> >
> > [1]: https://github.com/apache/arrow-rs/issues/6736#issuecomment-2781556103
> > [2]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
> > [3]: https://github.com/apache/parquet-format/blob/master/VariantShredding.md
> >
> >
> > On Wed, Mar 5, 2025 at 3:58 PM Julien Le Dem <jul...@apache.org> wrote:
> >
> > > That sounds like a great suggestion to me.
> > >
> > > On Wed, Mar 5, 2025 at 12:41 PM Andrew Lamb <andrewlam...@gmail.com>
> > > wrote:
> > >
> > > > I would like to request before the VARIANT spec changes are finalized
> > > > that we have example data in parquet-testing.
> > > >
> > > > This topic came up (well, I brought it up) on the sync call today.
> > > >
> > > > In my opinion, having example files would reduce the overhead of new
> > > > implementations dramatically. At least there should be examples of
> > > > * variant columns (no shredding)
> > > > * variant columns with shredding
> > > >
> > > > Some description of what those files contained ("expected contents"). For
> > > > prior art, here is what Dewey did for the geometry type[1][2].
> > > >
> > > > When looking for prior discussions, I found a great quote from Gang Wu[3]
> > > > on this topic:
> > > >
> > > > > I'd say that a lesson learned is that we should publish example files
> > > > > for any new feature to the parquet-testing [1] repo for
> > > > > interoperability tests.
> > > >
> > > > Thank you for your consideration,
> > > > Andrew
> > > >
> > > > [1] https://github.com/apache/parquet-testing/pull/70
> > > > [2] https://github.com/geoarrow/geoarrow-data
> > > > [3]: https://lists.apache.org/thread/71d7p9lprhf514jnt5dgnw4wfmn8ykzt