Re: [DISCUSS] Example VARIANT parquet files

Adrian Garcia Badaracco Mon, 07 Apr 2025 11:47:54 -0700

Yes, I am not able to see them. Could you make a PR to the repo, or upload
them somewhere so we can make a PR? Even if it doesn’t get merged
immediately, we can pull them from the PR. Thanks!


On Mon, Apr 7, 2025 at 1:44 PM Aihua Xu <[email protected]> wrote:

> Hi Adrian,
>
> I attached them to my reply, but I'm not sure whether the files got
> filtered out. Let me know if you still can't see them. Maybe I should push
> them to the repo instead.
>
> On Mon, Apr 7, 2025 at 10:58 AM Adrian Garcia Badaracco
> <[email protected]> wrote:
>
> > Amazing Aihua, thanks so much!
> >
> > Sorry if I just missed it but... where are the files you created? I don't
> > see them in the repo / issue / this thread.
> >
> > On Mon, Apr 7, 2025 at 12:54 PM Aihua Xu <[email protected]> wrote:
> >
> > > > I created some test files during development and have attached them:
> > > > one without shredding and one with shredding.
> > > >
> > > > As David pointed out, they are missing the Variant logical type, but you
> > > > can use them as a reference and as a starting point.
> > >
> > > > On Mon, Apr 7, 2025 at 3:50 AM Andrew Lamb <[email protected]> wrote:
> > >
> > >> I filed a ticket to track this work[1] and also perhaps to gather some
> > >> additional help / collaboration.
> > >>
> > >> [1]: https://github.com/apache/parquet-testing/issues/75
> > >>
> > >> On Mon, Apr 7, 2025 at 6:02 AM Andrew Lamb <[email protected]> wrote:
> > >>
> > >> > Thank you very much David
> > >> >
> > >> > I will try to create some examples this week and report back.
> > >> >
> > >> > Andrew
> > >> >
> > >> > On Sun, Apr 6, 2025 at 4:48 PM David Cashman
> > >> > <[email protected]> wrote:
> > >> >
> > >> >> Hi Andrew, you should be able to create shredded files using OSS Spark
> > >> >> 4.0. I think the only issue is that it doesn't have the logical type
> > >> >> annotation yet, so readers wouldn't be able to distinguish it from a
> > >> >> non-variant struct that happens to have the same schema. (Spark is
> > >> >> able to infer that it is a Variant from the
> > >> >> `org.apache.spark.sql.parquet.row.metadata` metadata.)
> > >> >>
> > >> >> The ParquetVariantShreddingSuite in Spark has some tests that write
> > >> >> and read shredded parquet files. Below is an example that translates
> > >> >> the first test into code that runs in spark-shell and writes a Parquet
> > >> >> file. The shredding schema is set via conf. If you want to test types
> > >> >> that Spark doesn't infer in parse_json (e.g. timestamp, binary), you
> > >> >> can use `to_variant_object` to cast structured values to Variant.
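
The `to_variant_object` suggestion above could look roughly like the following in spark-shell. This is a minimal sketch under stated assumptions, not code from the thread: the field names `ts` and `bin`, the literal values, and the output path `/tmp/variant_typed_test` are all made up for illustration, and it assumes a Spark 4.0 session where `spark` is already defined.

```scala
// Run inside spark-shell (Spark 4.0), where `spark` is predefined.
// Hypothetical example: field names, values, and the output path are illustrative.
val typed = spark.sql(
  """
    |select to_variant_object(
    |  named_struct(
    |    'ts',  timestamp'2025-04-07 11:47:54',
    |    'bin', X'CAFEBABE'
    |  )
    |) as v
    |""".stripMargin)

// The column should be reported as a variant.
typed.printSchema()

// schema_of_variant describes the types stored inside the variant value,
// which helps confirm that timestamp and binary were preserved.
typed.selectExpr("schema_of_variant(v)").show(false)

// Write the result so it can be used as a test file.
typed.write.mode("overwrite").parquet("/tmp/variant_typed_test")
```

Unlike `parse_json`, which only produces the types that JSON text can express, this starts from typed SQL values, so the timestamp and binary fields keep their types in the variant encoding.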
> > >> >>
> > >> >> I won't have time to work on this in the next couple of weeks, but am
> > >> >> happy to answer any questions.
> > >> >>
> > >> >> Thanks,
> > >> >>
> > >> >> David
> > >> >>
> > >> >> scala> import org.apache.spark.sql.internal.SQLConf
> > >> >> scala> spark.conf.set(SQLConf.VARIANT_WRITE_SHREDDING_ENABLED.key, true)
> > >> >> scala> spark.conf.set(SQLConf.VARIANT_ALLOW_READING_SHREDDED.key, true)
> > >> >> scala> spark.conf.set(SQLConf.VARIANT_FORCE_SHREDDING_SCHEMA_FOR_TEST.key, "a int, b string, c decimal(15, 1)")
> > >> >> scala> val df = spark.sql(
> > >> >>      |       """
> > >> >>      |         | select case
> > >> >>      |         | when id = 0 then parse_json('{"a": 1, "b": "2", "c": 3.3, "d": 4.4}')
> > >> >>      |         | when id = 1 then parse_json('{"a": [1,2,3], "b": "hello", "c": {"x": 0}}')
> > >> >>      |         | when id = 2 then parse_json('{"A": 1, "c": 1.23}')
> > >> >>      |         | end v from range(3)
> > >> >>      |         |""".stripMargin)
> > >> >> scala> df.write.mode("overwrite").parquet("/tmp/shredded_test")
> > >> >> scala> spark.read.parquet("/tmp/shredded_test").show
> > >> >> +--------------------+
> > >> >> |                   v|
> > >> >> +--------------------+
> > >> >> |{"a":1,"b":"2","c...|
> > >> >> |{"a":[1,2,3],"b":...|
> > >> >> |    {"A":1,"c":1.23}|
> > >> >> +--------------------+
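
To see what a reader without Spark's inference would find in the file written above, the Parquet footer can be inspected from the same spark-shell session. This is a minimal sketch, assuming the parquet-hadoop classes bundled with Spark are on the classpath and that `/tmp/shredded_test` contains at least one `.parquet` part file; the variable names are illustrative.

```scala
// Run inside spark-shell after writing /tmp/shredded_test as shown above.
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

val hadoopConf = spark.sparkContext.hadoopConfiguration
val dir = new Path("/tmp/shredded_test")
val fs = dir.getFileSystem(hadoopConf)

// Spark writes a directory of part files; pick the first Parquet file in it.
val partFile = fs.listStatus(dir)
  .map(_.getPath)
  .filter(_.getName.endsWith(".parquet"))
  .head

val reader = ParquetFileReader.open(HadoopInputFile.fromPath(partFile, hadoopConf))
try {
  val fileMeta = reader.getFooter.getFileMetaData

  // Physical Parquet schema: the variant column appears as a plain group,
  // with no logical type annotation marking it as a Variant.
  println(fileMeta.getSchema)

  // Spark-specific key/value metadata that Spark uses to recover the Variant type.
  println(fileMeta.getKeyValueMetaData.get("org.apache.spark.sql.parquet.row.metadata"))
} finally {
  reader.close()
}
```

The first printout is the schema other implementations would see, and the second is the Spark-only hint mentioned earlier in this message.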
> > >> >>
> > >> >>
> > >> >> On Sun, Apr 6, 2025 at 2:54 PM Andrew Lamb <[email protected]> wrote:
> > >> >> >
> > >> >> > Can someone (pretty pretty) please give us some binary examples so we
> > >> >> > can make faster progress on the Rust implementation?
> > >> >> >
> > >> >> > We recently got exciting news[1] that folks from the CMU database
> > >> >> > group have started working on the Rust implementation of variant, and
> > >> >> > I would very much like to encourage and support their work.
> > >> >> >
> > >> >> > I am willing to do some legwork (make a PR to parquet-testing for
> > >> >> > example) if someone can point me to the files (or instructions on how
> > >> >> > to use some system to create variants).
> > >> >> >
> > >> >> > I was hoping that since the VARIANT format[2] and draft shredding
> > >> >> > spec[3] have been in the repo for 6 months (since October 2024), it
> > >> >> > would be straightforward to provide some examples. Do we know anything
> > >> >> > that is blocking the creation of examples?
> > >> >> >
> > >> >> > Andrew
> > >> >> >
> > >> >> > [1]: https://github.com/apache/arrow-rs/issues/6736#issuecomment-2781556103
> > >> >> > [2]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
> > >> >> > [3]: https://github.com/apache/parquet-format/blob/master/VariantShredding.md
> > >> >> >
> > >> >> >
> > >> >> > On Wed, Mar 5, 2025 at 3:58 PM Julien Le Dem <[email protected]> wrote:
> > >> >> >
> > >> >> > > That sounds like a great suggestion to me.
> > >> >> > >
> > >> >> > > On Wed, Mar 5, 2025 at 12:41 PM Andrew Lamb <[email protected]> wrote:
> > >> >> > >
> > >> >> > > > I would like to request that, before the VARIANT spec changes are
> > >> >> > > > finalized, we have example data in parquet-testing.
> > >> >> > > >
> > >> >> > > > This topic came up (well, I brought it up) on the sync call today.
> > >> >> > > >
> > >> >> > > > In my opinion, having example files would dramatically reduce the
> > >> >> > > > overhead of new implementations. At a minimum, there should be:
> > >> >> > > > * examples of variant columns (no shredding)
> > >> >> > > > * examples of variant columns with shredding
> > >> >> > > > * some description of what those files contain ("expected contents")
> > >> >> > > >
> > >> >> > > > For prior art, here is what Dewey did for the geometry type[1][2].
> > >> >> > > >
> > >> >> > > > When looking for prior discussions, I found a great quote from Gang
> > >> >> > > > Wu[3] on this topic:
> > >> >> > > >
> > >> >> > > > > I'd say that a lesson learned is that we should publish example
> > >> >> > > > > files for any new feature to the parquet-testing [1] repo for
> > >> >> > > > > interoperability tests.
> > >> >> > > >
> > >> >> > > > Thank you for your consideration,
> > >> >> > > > Andrew
> > >> >> > > >
> > >> >> > > >
> > >> >> > > >
> > >> >> > > >
> > >> >> > > > [1]: https://github.com/apache/parquet-testing/pull/70
> > >> >> > > > [2]: https://github.com/geoarrow/geoarrow-data
> > >> >> > > > [3]: https://lists.apache.org/thread/71d7p9lprhf514jnt5dgnw4wfmn8ykzt
> > >> >> > > >
> > >> >> > >
> > >> >>
> > >> >
> > >>
> > >
> >
>
