I have created some test files during development, attached here: one without shredding and one with shredding.
As David pointed out, they are missing the Variant logical type, but you can
use them as a reference and a starting point.

On Mon, Apr 7, 2025 at 3:50 AM Andrew Lamb <andrewlam...@gmail.com> wrote:

> I filed a ticket to track this work[1] and also perhaps to gather some
> additional help / collaboration.
>
> [1]: https://github.com/apache/parquet-testing/issues/75
>
> On Mon, Apr 7, 2025 at 6:02 AM Andrew Lamb <andrewlam...@gmail.com> wrote:
>
> > Thank you very much, David.
> >
> > I will try to create some examples this week and report back.
> >
> > Andrew
> >
> > On Sun, Apr 6, 2025 at 4:48 PM David Cashman
> > <david.cash...@databricks.com.invalid> wrote:
> >
> >> Hi Andrew, you should be able to create shredded files using OSS Spark
> >> 4.0. I think the only issue is that it doesn't have the logical type
> >> annotation yet, so readers wouldn't be able to distinguish it from a
> >> non-variant struct that happens to have the same schema. (Spark is
> >> able to infer that it is a Variant from the
> >> `org.apache.spark.sql.parquet.row.metadata` metadata.)
> >>
> >> The ParquetVariantShreddingSuite in Spark has some tests that write
> >> and read shredded Parquet files. Below is an example that translates
> >> the first test into code that runs in spark-shell and writes a Parquet
> >> file. The shredding schema is set via conf. If you want to test types
> >> that Spark doesn't infer in parse_json (e.g. timestamp, binary), you
> >> can use `to_variant_object` to cast structured values to Variant.
> >>
> >> I won't have time to work on this in the next couple of weeks, but am
> >> happy to answer any questions.
> >>
> >> Thanks,
> >>
> >> David
> >>
> >> scala> import org.apache.spark.sql.internal.SQLConf
> >> scala> spark.conf.set(SQLConf.VARIANT_WRITE_SHREDDING_ENABLED.key, true)
> >> scala> spark.conf.set(SQLConf.VARIANT_ALLOW_READING_SHREDDED.key, true)
> >> scala> spark.conf.set(SQLConf.VARIANT_FORCE_SHREDDING_SCHEMA_FOR_TEST.key,
> >>      |   "a int, b string, c decimal(15, 1)")
> >> scala> val df = spark.sql(
> >>      |   """
> >>      |     | select case
> >>      |     |   when id = 0 then parse_json('{"a": 1, "b": "2", "c": 3.3, "d": 4.4}')
> >>      |     |   when id = 1 then parse_json('{"a": [1,2,3], "b": "hello", "c": {"x": 0}}')
> >>      |     |   when id = 2 then parse_json('{"A": 1, "c": 1.23}')
> >>      |     | end v from range(3)
> >>      |     |""".stripMargin)
> >> scala> df.write.mode("overwrite").parquet("/tmp/shredded_test")
> >> scala> spark.read.parquet("/tmp/shredded_test").show
> >> +--------------------+
> >> |                   v|
> >> +--------------------+
> >> |{"a":1,"b":"2","c...|
> >> |{"a":[1,2,3],"b":...|
> >> |    {"A":1,"c":1.23}|
> >> +--------------------+
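A minimal sketch of the `to_variant_object` approach David mentions above, for
types that parse_json does not produce (e.g. timestamp, binary). It assumes the
same spark-shell setup; the shredding schema, field names, and output path here
are illustrative, not taken from the thread:

scala> import org.apache.spark.sql.internal.SQLConf
scala> spark.conf.set(SQLConf.VARIANT_WRITE_SHREDDING_ENABLED.key, true)
scala> spark.conf.set(SQLConf.VARIANT_ALLOW_READING_SHREDDED.key, true)
scala> // hypothetical shredding schema with a timestamp and a binary field
scala> spark.conf.set(SQLConf.VARIANT_FORCE_SHREDDING_SCHEMA_FOR_TEST.key,
     |   "t timestamp, b binary")
scala> // parse_json would read these values as plain strings, so build a
scala> // typed struct instead and cast it to Variant with to_variant_object
scala> val df = spark.sql(
     |   """
     |     | select to_variant_object(named_struct(
     |     |   't', timestamp'2025-04-06 12:34:56',
     |     |   'b', X'CAFE'
     |     | )) v from range(1)
     |     |""".stripMargin)
scala> df.write.mode("overwrite").parquet("/tmp/shredded_typed_test")
scala> spark.read.parquet("/tmp/shredded_typed_test").show(false)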
> >> On Sun, Apr 6, 2025 at 2:54 PM Andrew Lamb <andrewlam...@gmail.com> wrote:
> >> >
> >> > Can someone (pretty pretty) please give us some binary examples so we
> >> > can make faster progress on the Rust implementation?
> >> >
> >> > We recently got exciting news[1] that folks from the CMU database group
> >> > have started working on the Rust implementation of variant, and I would
> >> > very much like to encourage and support their work.
> >> >
> >> > I am willing to do some legwork (make a PR to parquet-testing, for
> >> > example) if someone can point me to the files (or instructions on how
> >> > to use some system to create variants).
> >> >
> >> > I was hoping that, since the VARIANT format[2] and draft shredding
> >> > spec[3] have been in the repo for 6 months (since October 2024), it
> >> > would be straightforward to provide some examples. Do we know anything
> >> > that is blocking the creation of examples?
> >> >
> >> > Andrew
> >> >
> >> > [1]: https://github.com/apache/arrow-rs/issues/6736#issuecomment-2781556103
> >> > [2]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
> >> > [3]: https://github.com/apache/parquet-format/blob/master/VariantShredding.md
> >> >
> >> > On Wed, Mar 5, 2025 at 3:58 PM Julien Le Dem <jul...@apache.org> wrote:
> >> > >
> >> > > That sounds like a great suggestion to me.
> >> > >
> >> > > On Wed, Mar 5, 2025 at 12:41 PM Andrew Lamb <andrewlam...@gmail.com>
> >> > > wrote:
> >> > >
> >> > > > I would like to request that, before the VARIANT spec changes are
> >> > > > finalized, we have example data in parquet-testing.
> >> > > >
> >> > > > This topic came up (well, I brought it up) on the sync call today.
> >> > > >
> >> > > > In my opinion, having example files would dramatically reduce the
> >> > > > overhead for new implementations. At a minimum, there should be
> >> > > > examples of:
> >> > > > * variant columns (no shredding)
> >> > > > * variant columns with shredding
> >> > > >
> >> > > > Some description of what those files contain ("expected contents")
> >> > > > would also help. For prior art, here is what Dewey did for the
> >> > > > geometry type[1][2].
> >> > > >
> >> > > > When looking for prior discussions, I found a great quote from
> >> > > > Gang Wu[3] on this topic:
> >> > > >
> >> > > > > I'd say that a lesson learned is that we should publish example
> >> > > > > files for any new feature to the parquet-testing [1] repo for
> >> > > > > interoperability tests.
> >> > > >
> >> > > > Thank you for your consideration,
> >> > > > Andrew
> >> > > >
> >> > > > [1]: https://github.com/apache/parquet-testing/pull/70
> >> > > > [2]: https://github.com/geoarrow/geoarrow-data
> >> > > > [3]: https://lists.apache.org/thread/71d7p9lprhf514jnt5dgnw4wfmn8ykzt
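A postscript on David's point above that these shredded files do not yet carry
a Variant logical type annotation: one way to see what a reader actually gets
is to inspect the footer with parquet-java's ParquetFileReader, which is on the
classpath of a stock Spark build. The file-picking logic below is illustrative;
the directory is the one from David's example:

scala> import org.apache.hadoop.conf.Configuration
scala> import org.apache.hadoop.fs.Path
scala> import org.apache.parquet.hadoop.ParquetFileReader
scala> import org.apache.parquet.hadoop.util.HadoopInputFile
scala> // pick one of the part files Spark wrote under /tmp/shredded_test
scala> val part = new java.io.File("/tmp/shredded_test").listFiles().map(_.getPath).filter(_.endsWith(".parquet")).head
scala> val reader = ParquetFileReader.open(
     |   HadoopInputFile.fromPath(new Path(part), new Configuration()))
scala> // the physical schema is a plain struct, with no Variant annotation...
scala> println(reader.getFooter.getFileMetaData.getSchema)
scala> // ...so Spark relies on this key-value metadata entry instead
scala> println(reader.getFooter.getFileMetaData.getKeyValueMetaData
     |   .get("org.apache.spark.sql.parquet.row.metadata"))
scala> reader.close()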