Re: [DISCUSS] Example VARIANT parquet files

David Cashman Sun, 06 Apr 2025 13:13:13 -0700

Hi Andrew, you should be able to create shredded files using OSS Spark
4.0. I think the only issue is that the files it writes don't carry
the Variant logical type annotation yet, so other readers wouldn't be
able to distinguish a Variant column from a non-variant struct that
happens to have the same schema. (Spark is able to infer that it is a
Variant from the `org.apache.spark.sql.parquet.row.metadata` metadata.)


The ParquetVariantShreddingSuite in Spark has some tests that write
and read shredded parquet files. Below is an example that translates
the first test into code that runs in spark-shell and writes a Parquet
file. The shredding schema is set via conf. If you want to test types
that Spark doesn't infer in parse_json (e.g. timestamp, binary), you
can use `to_variant_object` to cast structured values to Variant.
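
A minimal sketch of the `to_variant_object` route, separate from the main example further down and not taken from the Spark test suite. It assumes a Spark 4.0 spark-shell session; the field names, literal values, and the output path `/tmp/variant_typed_test` are placeholders chosen for illustration. Shredding is not configured here, so this only shows how to get timestamp, binary, and date values into a Variant column.

```scala
// Build a one-row DataFrame whose Variant holds types that parse_json
// would not produce from JSON text (timestamp, binary, date).
val typed = spark.sql(
  """select to_variant_object(named_struct(
    |  'ts',  timestamp'2025-04-06 13:13:13',
    |  'bin', X'CAFE',
    |  'd',   date'2025-04-06')) as v""".stripMargin)

// Print the schema Spark reports for the Variant value.
typed.selectExpr("schema_of_variant(v)").show(false)

// Write it out; the path is a placeholder.
typed.write.mode("overwrite").parquet("/tmp/variant_typed_test")
```

To shred these fields as well, the same confs used in the example below would need to be set first, with a shredding schema that names `ts`, `bin`, or `d`.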

I won't have time to work on this in the next couple of weeks, but am
happy to answer any questions.

Thanks,

David

scala> import org.apache.spark.sql.internal.SQLConf
scala> spark.conf.set(SQLConf.VARIANT_WRITE_SHREDDING_ENABLED.key, true)
scala> spark.conf.set(SQLConf.VARIANT_ALLOW_READING_SHREDDED.key, true)
scala> spark.conf.set(SQLConf.VARIANT_FORCE_SHREDDING_SCHEMA_FOR_TEST.key, "a int, b string, c decimal(15, 1)")
scala> val df = spark.sql(
     |       """
     |         | select case
     |         | when id = 0 then parse_json('{"a": 1, "b": "2", "c": 3.3, "d": 4.4}')
     |         | when id = 1 then parse_json('{"a": [1,2,3], "b": "hello", "c": {"x": 0}}')
     |         | when id = 2 then parse_json('{"A": 1, "c": 1.23}')
     |         | end v from range(3)
     |         |""".stripMargin)
scala> df.write.mode("overwrite").parquet("/tmp/shredded_test")
scala> spark.read.parquet("/tmp/shredded_test").show
+--------------------+
|                   v|
+--------------------+
|{"a":1,"b":"2","c...|
|{"a":[1,2,3],"b":...|
|    {"A":1,"c":1.23}|
+--------------------+
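
To check what actually landed on disk, one option is to read a part file's footer with the parquet-hadoop classes that ship with Spark. This is a sketch under the assumption that it runs in the same spark-shell session after the write above; the variable names are illustrative. It prints the physical Parquet schema (where the shredded fields should be visible) and the Spark row metadata entry mentioned earlier.

```scala
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

val hadoopConf = spark.sparkContext.hadoopConfiguration
val outDir = new Path("/tmp/shredded_test")

// Pick any one of the part files Spark wrote into the output directory.
val partFile = outDir.getFileSystem(hadoopConf)
  .listStatus(outDir)
  .map(_.getPath)
  .filter(_.getName.endsWith(".parquet"))
  .head

// Read only the footer metadata, then close the reader.
val reader = ParquetFileReader.open(HadoopInputFile.fromPath(partFile, hadoopConf))
val footerMeta = try reader.getFooter.getFileMetaData finally reader.close()

// Physical Parquet schema of the file.
println(footerMeta.getSchema)

// Spark's own schema JSON, stored as key-value metadata in the footer.
println(footerMeta.getKeyValueMetaData.get("org.apache.spark.sql.parquet.row.metadata"))
```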


On Sun, Apr 6, 2025 at 2:54 PM Andrew Lamb <andrewlam...@gmail.com> wrote:
>
> Can someone (pretty pretty) please give us some binary examples so we can
> make faster progress on the Rust implementation?
>
> We recently got exciting news[1] that folks from the CMU database group
> have started working on the Rust implementation of variant, and I would
> very much like to encourage and support their work.
>
> I am willing to do some legwork (make a PR to parquet-testing for example)
> if someone can point me to the files (or instructions on how to use some
> system to create variants).
>
> I was hoping that since the VARIANT format[2] and draft shredding spec[3]
> have been in the repo for 6 months (since October 2024), it would be
> straightforward to provide some examples. Do we know anything that is
> blocking the creation of examples?
>
> Andrew
>
> [1]: https://github.com/apache/arrow-rs/issues/6736#issuecomment-2781556103
> [2]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
> [3]: https://github.com/apache/parquet-format/blob/master/VariantShredding.md
>
>
> On Wed, Mar 5, 2025 at 3:58 PM Julien Le Dem <jul...@apache.org> wrote:
>
> > That sounds like a great suggestion to me.
> >
> > On Wed, Mar 5, 2025 at 12:41 PM Andrew Lamb <andrewlam...@gmail.com>
> > wrote:
> >
> > > I would like to request before the VARIANT spec changes are finalized
> > > that we have example data in parquet-testing.
> > >
> > > This topic came up (well, I brought it up) on the sync call today.
> > >
> > > In my opinion, having example files would reduce the overhead of new
> > > implementations dramatically. At a minimum there should be examples of
> > > * variant columns (no shredding)
> > > * variant columns with shredding
> > >
> > > along with some description of what those files contain ("expected
> > > contents"). For prior art, here is what Dewey did for the geometry
> > > type[1][2].
> > >
> > > When looking for prior discussions, I found a great quote from Gang Wu[3]
> > > on this topic:
> > >
> > > > > I'd say that a lesson learned is that we should publish example files
> > > > > for any new feature to the parquet-testing [1] repo for
> > > > > interoperability tests.
> > >
> > > Thank you for your consideration,
> > > Andrew
> > >
> > >
> > >
> > >
> > > > [1]: https://github.com/apache/parquet-testing/pull/70
> > > > [2]: https://github.com/geoarrow/geoarrow-data
> > > > [3]: https://lists.apache.org/thread/71d7p9lprhf514jnt5dgnw4wfmn8ykzt
> > >
> >
