Re: [DISCUSS] Example VARIANT parquet files

Andrew Lamb Fri, 02 May 2025 11:22:47 -0700

Thanks to Micah, I think we have the first PR with example Variant
values[1] almost ready to merge.


Next up will be figuring out how to create Parquet files with the proper
logical annotations.
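
As a rough starting point, the spark-shell recipe David shared (quoted further
down) can be adapted to cover types that parse_json does not infer, using the
`to_variant_object` function he mentions. The following is only a sketch under
assumptions: it assumes a Spark 4.0 spark-shell session where `spark` is
already defined, the field names and the output path /tmp/variant_typed_test
are made up for illustration, and the resulting file would still lack the
Variant logical type annotation, as David notes.

```scala
// Sketch only: run inside spark-shell (Spark 4.0), where `spark` is predefined.
// Builds a struct holding a timestamp and a binary value, then converts the
// struct to a Variant with to_variant_object.
val typed = spark.sql(
  """
    |select to_variant_object(
    |  named_struct(
    |    'ts',  timestamp'2025-04-06 16:48:00',
    |    'bin', X'CAFE'
    |  )
    |) as v
    |""".stripMargin)

// Hypothetical output location; any writable path works.
typed.write.mode("overwrite").parquet("/tmp/variant_typed_test")

// Read the file back and print the schema Spark reports for the column.
spark.read.parquet("/tmp/variant_typed_test").printSchema()
```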

Andrew

p.s. In case it isn't obvious, I would like the introduction of Variant into
Parquet to be a model of how to extend the spec and get wide adoption across
the ecosystem quickly, for two reasons:
1. The actual Variant functionality.
2. To counteract the narrative that Parquet is ossified and not possible to
change.

I personally think adding the binary examples is critical to helping other
language implementations.

[1]: https://github.com/apache/parquet-testing/pull/76

On Wed, Apr 16, 2025 at 10:59 AM Andrew Lamb <andrewlam...@gmail.com> wrote:

> Update: I have created a PR with example variant values (not yet
> parquet files, just the variant values)[1].
>
> Since Spark seems to be the only open source software capable of creating
> variants at this time, I generated the examples using Spark.
>
> Please check it out and let me know what you think. If it is acceptable, I
> can work on PRs (based on Aihua's example) for actual parquet files with
> encoded values.
>
> Andrew
>
> [1]: https://github.com/apache/parquet-testing/pull/76
>
> On Tue, Apr 8, 2025 at 5:46 AM Andrew Lamb <andrewlam...@gmail.com> wrote:
>
>> Thank you very much
>>
>> On Mon, Apr 7, 2025 at 5:10 PM Aihua Xu <aihu...@gmail.com> wrote:
>>
>>> I attached them to https://github.com/apache/parquet-testing/issues/75.
>>> Please rename them back to *.parquet so you can use parquet tools to view
>>> them.
>>>
>>> I captured them when working on the Iceberg tests in
>>>
>>> https://github.com/apache/iceberg/blob/main/parquet/src/test/java/org/apache/iceberg/parquet/TestVariantWriters.java#L172
>>> .
>>>
>>> You can change the output file there to
>>> OutputFile outputFile = Files.localOutput("primitive.parquet");
>>> to capture them, but you can probably follow what David mentioned.
>>>
>>> On Mon, Apr 7, 2025 at 12:26 PM Andrew Lamb <andrewlam...@gmail.com>
>>> wrote:
>>>
>>> > Attaching them on the ticket[1] would also be a way to share them
>>> >
>>> > It would also be super helpful to share the commands you ran
>>> >
>>> > Andrew
>>> >
>>> > [1]: https://github.com/apache/parquet-testing/issues/75
>>> >
>>> > On Mon, Apr 7, 2025 at 2:46 PM Adrian Garcia Badaracco
>>> > <adr...@pydantic.dev.invalid> wrote:
>>> >
>>> > > Yes I am not able to see them. Could you make a PR to the repo, or
>>> > > upload them somewhere so we can make a PR? Even if it doesn’t get
>>> > > merged immediately we can pull them from the PR. Thanks!
>>> > >
>>> > > On Mon, Apr 7, 2025 at 1:44 PM Aihua Xu <aihu...@gmail.com> wrote:
>>> > >
>>> > > > Hi Adrian,
>>> > > >
>>> > > > I attached them to my reply and I'm not sure if the files get
>>> > > > filtered. Let me know if you still can't see them. Maybe I should
>>> > > > push to the repo instead.
>>> > > >
>>> > > > On Mon, Apr 7, 2025 at 10:58 AM Adrian Garcia Badaracco
>>> > > > <adr...@pydantic.dev.invalid> wrote:
>>> > > >
>>> > > > > Amazing Aihua, thanks so much!
>>> > > > >
>>> > > > > Sorry if I just missed it but... where are the files you created?
>>> > > > > I don't see them in the repo / issue / this thread.
>>> > > > >
>>> > > > > On Mon, Apr 7, 2025 at 12:54 PM Aihua Xu <aihu...@gmail.com>
>>> > > > > wrote:
>>> > > > >
>>> > > > > > I have attached some test files that I created during
>>> > > > > > development: one without shredding and one with shredding.
>>> > > > > >
>>> > > > > > As David pointed out, they are missing the Variant logical type,
>>> > > > > > but you can use them as a reference and as a start.
>>> > > > > >
>>> > > > > > On Mon, Apr 7, 2025 at 3:50 AM Andrew Lamb <andrewlam...@gmail.com>
>>> > > > > > wrote:
>>> > > > > >
>>> > > > > >> I filed a ticket to track this work[1] and also perhaps to
>>> > > > > >> gather some additional help / collaboration.
>>> > > > > >>
>>> > > > > >> [1]: https://github.com/apache/parquet-testing/issues/75
>>> > > > > >>
>>> > > > > >> On Mon, Apr 7, 2025 at 6:02 AM Andrew Lamb <andrewlam...@gmail.com>
>>> > > > > >> wrote:
>>> > > > > >>
>>> > > > > >> > Thank you very much David
>>> > > > > >> >
>>> > > > > >> > I will try to create some examples this week and report back.
>>> > > > > >> >
>>> > > > > >> > Andrew
>>> > > > > >> >
>>> > > > > >> > On Sun, Apr 6, 2025 at 4:48 PM David Cashman
>>> > > > > >> > <david.cash...@databricks.com.invalid> wrote:
>>> > > > > >> >
>>> > > > > >> >> Hi Andrew, you should be able to create shredded files using OSS
>>> > > > > >> >> Spark 4.0. I think the only issue is that it doesn't have the
>>> > > > > >> >> logical type annotation yet, so readers wouldn't be able to
>>> > > > > >> >> distinguish it from a non-variant struct that happens to have the
>>> > > > > >> >> same schema. (Spark is able to infer that it is a Variant from the
>>> > > > > >> >> `org.apache.spark.sql.parquet.row.metadata` metadata.)
>>> > > > > >> >>
>>> > > > > >> >> The ParquetVariantShreddingSuite in Spark has some tests that write
>>> > > > > >> >> and read shredded parquet files. Below is an example that translates
>>> > > > > >> >> the first test into code that runs in spark-shell and writes a
>>> > > > > >> >> Parquet file. The shredding schema is set via conf. If you want to
>>> > > > > >> >> test types that Spark doesn't infer in parse_json (e.g. timestamp,
>>> > > > > >> >> binary), you can use `to_variant_object` to cast structured values
>>> > > > > >> >> to Variant.
>>> > > > > >> >>
>>> > > > > >> >> I won't have time to work on this in the next couple of weeks, but am
>>> > > > > >> >> happy to answer any questions.
>>> > > > > >> >>
>>> > > > > >> >> Thanks,
>>> > > > > >> >>
>>> > > > > >> >> David
>>> > > > > >> >>
>>> > > > > >> >> scala> import org.apache.spark.sql.internal.SQLConf
>>> > > > > >> >> scala> spark.conf.set(SQLConf.VARIANT_WRITE_SHREDDING_ENABLED.key, true)
>>> > > > > >> >> scala> spark.conf.set(SQLConf.VARIANT_ALLOW_READING_SHREDDED.key, true)
>>> > > > > >> >> scala> spark.conf.set(SQLConf.VARIANT_FORCE_SHREDDING_SCHEMA_FOR_TEST.key, "a int, b string, c decimal(15, 1)")
>>> > > > > >> >> scala> val df = spark.sql(
>>> > > > > >> >>      |       """
>>> > > > > >> >>      |         | select case
>>> > > > > >> >>      |         | when id = 0 then parse_json('{"a": 1, "b": "2", "c": 3.3, "d": 4.4}')
>>> > > > > >> >>      |         | when id = 1 then parse_json('{"a": [1,2,3], "b": "hello", "c": {"x": 0}}')
>>> > > > > >> >>      |         | when id = 2 then parse_json('{"A": 1, "c": 1.23}')
>>> > > > > >> >>      |         | end v from range(3)
>>> > > > > >> >>      |         |""".stripMargin)
>>> > > > > >> >> scala> df.write.mode("overwrite").parquet("/tmp/shredded_test")
>>> > > > > >> >> scala> spark.read.parquet("/tmp/shredded_test").show
>>> > > > > >> >> +--------------------+
>>> > > > > >> >> |                   v|
>>> > > > > >> >> +--------------------+
>>> > > > > >> >> |{"a":1,"b":"2","c...|
>>> > > > > >> >> |{"a":[1,2,3],"b":...|
>>> > > > > >> >> |    {"A":1,"c":1.23}|
>>> > > > > >> >> +--------------------+
>>> > > > > >> >>
>>> > > > > >> >>
>>> > > > > >> >> On Sun, Apr 6, 2025 at 2:54 PM Andrew Lamb <andrewlam...@gmail.com>
>>> > > > > >> >> wrote:
>>> > > > > >> >> >
>>> > > > > >> >> > Can someone (pretty pretty) please give us some binary examples
>>> > > > > >> >> > so we can make faster progress on the Rust implementation?
>>> > > > > >> >> >
>>> > > > > >> >> > We recently got exciting news[1] that folks from the CMU
>>> > > > > >> >> > database group have started working on the Rust implementation
>>> > > > > >> >> > of variant, and I would very much like to encourage and support
>>> > > > > >> >> > their work.
>>> > > > > >> >> >
>>> > > > > >> >> > I am willing to do some legwork (make a PR to parquet-testing
>>> > > > > >> >> > for example) if someone can point me to the files (or
>>> > > > > >> >> > instructions on how to use some system to create variants).
>>> > > > > >> >> >
>>> > > > > >> >> > I was hoping that since the VARIANT format[2] and draft
>>> > > > > >> >> > shredding spec[3] have been in the repo for 6 months (since
>>> > > > > >> >> > October 2024), it would be straightforward to provide some
>>> > > > > >> >> > examples. Do we know anything that is blocking the creation of
>>> > > > > >> >> > examples?
>>> > > > > >> >> >
>>> > > > > >> >> > Andrew
>>> > > > > >> >> >
>>> > > > > >> >> > [1]: https://github.com/apache/arrow-rs/issues/6736#issuecomment-2781556103
>>> > > > > >> >> > [2]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
>>> > > > > >> >> > [3]: https://github.com/apache/parquet-format/blob/master/VariantShredding.md
>>> > > > > >> >> >
>>> > > > > >> >> >
>>> > > > > >> >> > On Wed, Mar 5, 2025 at 3:58 PM Julien Le Dem <jul...@apache.org>
>>> > > > > >> >> > wrote:
>>> > > > > >> >> >
>>> > > > > >> >> > > That sounds like a great suggestion to me.
>>> > > > > >> >> > >
>>> > > > > >> >> > > On Wed, Mar 5, 2025 at 12:41 PM Andrew Lamb <andrewlam...@gmail.com>
>>> > > > > >> >> > > wrote:
>>> > > > > >> >> > >
>>> > > > > >> >> > > > I would like to request that, before the VARIANT spec changes
>>> > > > > >> >> > > > are finalized, we have example data in parquet-testing.
>>> > > > > >> >> > > >
>>> > > > > >> >> > > > This topic came up (well, I brought it up) on the sync call
>>> > > > > >> >> > > > today.
>>> > > > > >> >> > > >
>>> > > > > >> >> > > > In my opinion, having example files would reduce the overhead
>>> > > > > >> >> > > > of new implementations dramatically. At least there should be
>>> > > > > >> >> > > > examples of
>>> > > > > >> >> > > > * variant columns (no shredding)
>>> > > > > >> >> > > > * variant columns with shredding
>>> > > > > >> >> > > >
>>> > > > > >> >> > > > There should also be some description of what those files
>>> > > > > >> >> > > > contain ("expected contents"). For prior art, here is what
>>> > > > > >> >> > > > Dewey did for the geometry type[1][2].
>>> > > > > >> >> > > >
>>> > > > > >> >> > > > When looking for prior discussions, I found a great quote from
>>> > > > > >> >> > > > Gang Wu[3] on this topic:
>>> > > > > >> >> > > >
>>> > > > > >> >> > > > > I'd say that a lesson learned is that we should publish
>>> > > > > >> >> > > > > example files for any new feature to the parquet-testing [1]
>>> > > > > >> >> > > > > repo for interoperability tests.
>>> > > > > >> >> > > >
>>> > > > > >> >> > > > Thank you for your consideration,
>>> > > > > >> >> > > > Andrew
>>> > > > > >> >> > > >
>>> > > > > >> >> > > > [1]: https://github.com/apache/parquet-testing/pull/70
>>> > > > > >> >> > > > [2]: https://github.com/geoarrow/geoarrow-data
>>> > > > > >> >> > > > [3]: https://lists.apache.org/thread/71d7p9lprhf514jnt5dgnw4wfmn8ykzt
>>> > > > > >> >> > > >
>>> > > > > >> >> > >
>>> > > > > >> >>
>>> > > > > >> >
>>> > > > > >>
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>
