Amazing Aihua, thanks so much! Sorry if I just missed it but... where are the files you created? I don't see them in the repo / issue / this thread.
On Mon, Apr 7, 2025 at 12:54 PM Aihua Xu <aihu...@gmail.com> wrote:

> I have created some test files during development and attached them: one
> is without shredding and one is with shredding.
>
> As David pointed out, they are missing the Variant logical type, but you
> can use them as a reference and a starting point.
>
> On Mon, Apr 7, 2025 at 3:50 AM Andrew Lamb <andrewlam...@gmail.com> wrote:
>
>> I filed a ticket to track this work[1] and perhaps to gather some
>> additional help / collaboration.
>>
>> [1]: https://github.com/apache/parquet-testing/issues/75
>>
>> On Mon, Apr 7, 2025 at 6:02 AM Andrew Lamb <andrewlam...@gmail.com>
>> wrote:
>>
>> > Thank you very much, David.
>> >
>> > I will try to create some examples this week and report back.
>> >
>> > Andrew
>> >
>> > On Sun, Apr 6, 2025 at 4:48 PM David Cashman
>> > <david.cash...@databricks.com.invalid> wrote:
>> >
>> >> Hi Andrew, you should be able to create shredded files using OSS Spark
>> >> 4.0. I think the only issue is that it doesn't have the logical type
>> >> annotation yet, so readers wouldn't be able to distinguish it from a
>> >> non-variant struct that happens to have the same schema. (Spark is
>> >> able to infer that it is a Variant from the
>> >> `org.apache.spark.sql.parquet.row.metadata` metadata.)
>> >>
>> >> The ParquetVariantShreddingSuite in Spark has some tests that write
>> >> and read shredded Parquet files. Below is an example that translates
>> >> the first test into code that runs in spark-shell and writes a Parquet
>> >> file. The shredding schema is set via conf. If you want to test types
>> >> that Spark doesn't infer in parse_json (e.g. timestamp, binary), you
>> >> can use `to_variant_object` to cast structured values to Variant.
>> >>
>> >> I won't have time to work on this in the next couple of weeks, but am
>> >> happy to answer any questions.
>> >>
>> >> Thanks,
>> >>
>> >> David
>> >>
>> >> scala> import org.apache.spark.sql.internal.SQLConf
>> >> scala> spark.conf.set(SQLConf.VARIANT_WRITE_SHREDDING_ENABLED.key, true)
>> >> scala> spark.conf.set(SQLConf.VARIANT_ALLOW_READING_SHREDDED.key, true)
>> >> scala> spark.conf.set(SQLConf.VARIANT_FORCE_SHREDDING_SCHEMA_FOR_TEST.key,
>> >>          "a int, b string, c decimal(15, 1)")
>> >> scala> val df = spark.sql(
>> >>      | """
>> >>      | | select case
>> >>      | |   when id = 0 then parse_json('{"a": 1, "b": "2", "c": 3.3, "d": 4.4}')
>> >>      | |   when id = 1 then parse_json('{"a": [1,2,3], "b": "hello", "c": {"x": 0}}')
>> >>      | |   when id = 2 then parse_json('{"A": 1, "c": 1.23}')
>> >>      | | end v from range(3)
>> >>      | |""".stripMargin)
>> >> scala> df.write.mode("overwrite").parquet("/tmp/shredded_test")
>> >> scala> spark.read.parquet("/tmp/shredded_test").show
>> >> +--------------------+
>> >> |                   v|
>> >> +--------------------+
>> >> |{"a":1,"b":"2","c...|
>> >> |{"a":[1,2,3],"b":...|
>> >> |    {"A":1,"c":1.23}|
>> >> +--------------------+
>> >>
>> >> On Sun, Apr 6, 2025 at 2:54 PM Andrew Lamb <andrewlam...@gmail.com>
>> >> wrote:
>> >> >
>> >> > Can someone (pretty pretty) please give us some binary examples so
>> >> > we can make faster progress on the Rust implementation?
>> >> >
>> >> > We recently got exciting news[1] that folks from the CMU database
>> >> > group have started working on the Rust implementation of variant,
>> >> > and I would very much like to encourage and support their work.
>> >> >
>> >> > I am willing to do some legwork (make a PR to parquet-testing, for
>> >> > example) if someone can point me to the files (or instructions on
>> >> > how to use some system to create variants).
>> >> >
>> >> > I was hoping that since the VARIANT format[2] and draft shredding
>> >> > spec[3] have been in the repo for six months (since October 2024),
>> >> > it would be straightforward to provide some examples. Do we know of
>> >> > anything that is blocking the creation of examples?
>> >> >
>> >> > Andrew
>> >> >
>> >> > [1]: https://github.com/apache/arrow-rs/issues/6736#issuecomment-2781556103
>> >> > [2]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
>> >> > [3]: https://github.com/apache/parquet-format/blob/master/VariantShredding.md
>> >> >
>> >> > On Wed, Mar 5, 2025 at 3:58 PM Julien Le Dem <jul...@apache.org> wrote:
>> >> >
>> >> > > That sounds like a great suggestion to me.
>> >> > >
>> >> > > On Wed, Mar 5, 2025 at 12:41 PM Andrew Lamb <andrewlam...@gmail.com>
>> >> > > wrote:
>> >> > >
>> >> > > > I would like to request that, before the VARIANT spec changes
>> >> > > > are finalized, we have example data in parquet-testing.
>> >> > > >
>> >> > > > This topic came up (well, I brought it up) on the sync call today.
>> >> > > >
>> >> > > > In my opinion, having example files would dramatically reduce
>> >> > > > the overhead for new implementations. At a minimum there should
>> >> > > > be examples of:
>> >> > > > * variant columns (no shredding)
>> >> > > > * variant columns with shredding
>> >> > > >
>> >> > > > These should come with some description of what the files
>> >> > > > contain ("expected contents"). For prior art, here is what Dewey
>> >> > > > did for the geometry type[1][2].
>> >> > > >
>> >> > > > When looking for prior discussions, I found a great quote from
>> >> > > > Gang Wu[3] on this topic:
>> >> > > >
>> >> > > > > I'd say that a lesson learned is that we should publish
>> >> > > > > example files for any new feature to the parquet-testing [1]
>> >> > > > > repo for interoperability tests.
>> >> > > >
>> >> > > > Thank you for your consideration,
>> >> > > > Andrew
>> >> > > >
>> >> > > > [1]: https://github.com/apache/parquet-testing/pull/70
>> >> > > > [2]: https://github.com/geoarrow/geoarrow-data
>> >> > > > [3]: https://lists.apache.org/thread/71d7p9lprhf514jnt5dgnw4wfmn8ykzt
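P.S. For anyone who wants to try the `to_variant_object` route David
describes above for types that parse_json won't infer (timestamp, binary),
here is a rough, untested sketch of what that might look like in
spark-shell. The shredding schema string, the named_struct field names, and
the output path are made up for illustration, so treat this as a starting
point rather than a recipe:

scala> import org.apache.spark.sql.internal.SQLConf
scala> spark.conf.set(SQLConf.VARIANT_WRITE_SHREDDING_ENABLED.key, true)
scala> // Hypothetical shredding schema covering types parse_json can't express.
scala> spark.conf.set(SQLConf.VARIANT_FORCE_SHREDDING_SCHEMA_FOR_TEST.key,
         "ts timestamp, bin binary")
scala> // to_variant_object casts a structured value (here a named_struct) to
scala> // Variant, so the timestamp and binary fields keep their types rather
scala> // than being inferred from JSON text.
scala> val df = spark.sql(
     | """
     | | select to_variant_object(named_struct(
     | |   'ts', timestamp'2025-04-07 12:00:00',
     | |   'bin', cast('hello' as binary))) v
     | | from range(1)
     | |""".stripMargin)
scala> df.write.mode("overwrite").parquet("/tmp/shredded_typed_test")

Reading the result back with spark.read.parquet(...).show should print the
values as JSON, the same way David's example does; if the binary column
turns out not to be shreddable, dropping `bin` from the conf string should
still exercise a shredded timestamp.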