Re: [DISCUSS] Example VARIANT parquet files

Andrew Lamb Wed, 16 Apr 2025 07:59:40 -0700

An update: I have created a PR with example variant values (not yet
Parquet files, just the variant values)[1].


Since Spark seems to be the only open source software capable of creating
variants at this time, I generated the examples using Spark.

Please check it out and let me know what you think. If it is acceptable, I
can work on PRs (based on Aihua's example) for actual Parquet files with
encoded values.

Andrew

[1]: https://github.com/apache/parquet-testing/pull/76

On Tue, Apr 8, 2025 at 5:46 AM Andrew Lamb <[email protected]> wrote:

> Thank you very much
>
> On Mon, Apr 7, 2025 at 5:10 PM Aihua Xu <[email protected]> wrote:
>
>> I attached them to https://github.com/apache/parquet-testing/issues/75.
>> Please rename them back to *.parquet so you can use parquet tools to view
>> them.
>>
>> I captured them when working on the Iceberg tests  in
>>
>> https://github.com/apache/iceberg/blob/main/parquet/src/test/java/org/apache/iceberg/parquet/TestVariantWriters.java#L172
>> .
>>
>> You can change the line to OutputFile outputFile =
>> Files.localOutput("primitive.parquet"); to capture them, but you can
>> probably follow what David mentioned.
>>
>> On Mon, Apr 7, 2025 at 12:26 PM Andrew Lamb <[email protected]>
>> wrote:
>>
>> > Attaching them on the ticket[1] would also be a way to share them
>> >
>> > It would also be super helpful to share the commands you ran
>> >
>> > Andrew
>> >
>> > [1]: https://github.com/apache/parquet-testing/issues/75
>> >
>> > On Mon, Apr 7, 2025 at 2:46 PM Adrian Garcia Badaracco
>> > <[email protected]> wrote:
>> >
>> > > Yes I am not able to see them. Could you make a PR to the repo, or
>> > > upload them somewhere so we can make a PR? Even if it doesn’t get
>> > > merged immediately we can pull them from the PR. Thanks!
>> > >
>> > > On Mon, Apr 7, 2025 at 1:44 PM Aihua Xu <[email protected]> wrote:
>> > >
>> > > > Hi Adrian,
>> > > >
>> > > > I attached them to my reply and I'm not sure if the files get
>> > > > filtered. Let me know if you still can't see them. Maybe I should
>> > > > push to the repo instead.
>> > > >
>> > > > On Mon, Apr 7, 2025 at 10:58 AM Adrian Garcia Badaracco
>> > > > <[email protected]> wrote:
>> > > >
>> > > > > Amazing Aihua, thanks so much!
>> > > > >
>> > > > > Sorry if I just missed it but... where are the files you created?
>> > > > > I don't see them in the repo / issue / this thread.
>> > > > >
>> > > > > On Mon, Apr 7, 2025 at 12:54 PM Aihua Xu <[email protected]> wrote:
>> > > > >
>> > > > > > I have attached some test files that I created during development:
>> > > > > > one is without shredding and one is with shredding.
>> > > > > >
>> > > > > > As David pointed out, they are missing the Variant logical type,
>> > > > > > but you can use them as a reference and as a start.
>> > > > > >
>> > > > > > On Mon, Apr 7, 2025 at 3:50 AM Andrew Lamb <[email protected]> wrote:
>> > > > > >
>> > > > > >> I filed a ticket to track this work[1] and also perhaps to gather
>> > > > > >> some additional help / collaboration.
>> > > > > >>
>> > > > > >> [1]: https://github.com/apache/parquet-testing/issues/75
>> > > > > >>
>> > > > > >> On Mon, Apr 7, 2025 at 6:02 AM Andrew Lamb <[email protected]> wrote:
>> > > > > >>
>> > > > > >> > Thank you very much David
>> > > > > >> >
>> > > > > >> > I will try to create some examples this week and report back.
>> > > > > >> >
>> > > > > >> > Andrew
>> > > > > >> >
>> > > > > >> > On Sun, Apr 6, 2025 at 4:48 PM David Cashman
>> > > > > >> > <[email protected]> wrote:
>> > > > > >> >
>> > > > > >> >> Hi Andrew, you should be able to create shredded files using OSS
>> > > > > >> >> Spark 4.0. I think the only issue is that it doesn't have the
>> > > > > >> >> logical type annotation yet, so readers wouldn't be able to
>> > > > > >> >> distinguish it from a non-variant struct that happens to have the
>> > > > > >> >> same schema. (Spark is able to infer that it is a Variant from the
>> > > > > >> >> `org.apache.spark.sql.parquet.row.metadata` metadata.)
>> > > > > >> >>
>> > > > > >> >> The ParquetVariantShreddingSuite in Spark has some tests that write
>> > > > > >> >> and read shredded parquet files. Below is an example that translates
>> > > > > >> >> the first test into code that runs in spark-shell and writes a
>> > > > > >> >> Parquet file. The shredding schema is set via conf. If you want to
>> > > > > >> >> test types that Spark doesn't infer in parse_json (e.g. timestamp,
>> > > > > >> >> binary), you can use `to_variant_object` to cast structured values
>> > > > > >> >> to Variant.
>> > > > > >> >>
>> > > > > >> >> I won't have time to work on this in the next couple of weeks, but
>> > > > > >> >> am happy to answer any questions.
>> > > > > >> >>
>> > > > > >> >> Thanks,
>> > > > > >> >>
>> > > > > >> >> David
>> > > > > >> >>
>> > > > > >> >> scala> import org.apache.spark.sql.internal.SQLConf
>> > > > > >> >> scala> spark.conf.set(SQLConf.VARIANT_WRITE_SHREDDING_ENABLED.key, true)
>> > > > > >> >> scala> spark.conf.set(SQLConf.VARIANT_ALLOW_READING_SHREDDED.key, true)
>> > > > > >> >> scala> spark.conf.set(SQLConf.VARIANT_FORCE_SHREDDING_SCHEMA_FOR_TEST.key, "a int, b string, c decimal(15, 1)")
>> > > > > >> >> scala> val df = spark.sql(
>> > > > > >> >>      |       """
>> > > > > >> >>      |         | select case
>> > > > > >> >>      |         | when id = 0 then parse_json('{"a": 1, "b": "2", "c": 3.3, "d": 4.4}')
>> > > > > >> >>      |         | when id = 1 then parse_json('{"a": [1,2,3], "b": "hello", "c": {"x": 0}}')
>> > > > > >> >>      |         | when id = 2 then parse_json('{"A": 1, "c": 1.23}')
>> > > > > >> >>      |         | end v from range(3)
>> > > > > >> >>      |         |""".stripMargin)
>> > > > > >> >> scala> df.write.mode("overwrite").parquet("/tmp/shredded_test")
>> > > > > >> >> scala> spark.read.parquet("/tmp/shredded_test").show
>> > > > > >> >> +--------------------+
>> > > > > >> >> |                   v|
>> > > > > >> >> +--------------------+
>> > > > > >> >> |{"a":1,"b":"2","c...|
>> > > > > >> >> |{"a":[1,2,3],"b":...|
>> > > > > >> >> |    {"A":1,"c":1.23}|
>> > > > > >> >> +--------------------+
>> > > > > >> >>
>> > > > > >> >>
>> > > > > >> >> On Sun, Apr 6, 2025 at 2:54 PM Andrew Lamb <[email protected]> wrote:
>> > > > > >> >> >
>> > > > > >> >> > Can someone (pretty pretty) please give us some binary examples so
>> > > > > >> >> > we can make faster progress on the Rust implementation?
>> > > > > >> >> >
>> > > > > >> >> > We recently got exciting news[1] that folks from the CMU database
>> > > > > >> >> > group have started working on the Rust implementation of variant,
>> > > > > >> >> > and I would very much like to encourage and support their work.
>> > > > > >> >> >
>> > > > > >> >> > I am willing to do some legwork (make a PR to parquet-testing for
>> > > > > >> >> > example) if someone can point me to the files (or instructions on
>> > > > > >> >> > how to use some system to create variants).
>> > > > > >> >> >
>> > > > > >> >> > I was hoping that since the VARIANT format[2] and draft shredding
>> > > > > >> >> > spec[3] have been in the repo for 6 months (since October 2024), it
>> > > > > >> >> > would be straightforward to provide some examples. Do we know
>> > > > > >> >> > anything that is blocking the creation of examples?
>> > > > > >> >> >
>> > > > > >> >> > Andrew
>> > > > > >> >> >
>> > > > > >> >> > [1]: https://github.com/apache/arrow-rs/issues/6736#issuecomment-2781556103
>> > > > > >> >> > [2]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
>> > > > > >> >> > [3]: https://github.com/apache/parquet-format/blob/master/VariantShredding.md
>> > > > > >> >> >
>> > > > > >> >> >
>> > > > > >> >> > On Wed, Mar 5, 2025 at 3:58 PM Julien Le Dem <[email protected]> wrote:
>> > > > > >> >> >
>> > > > > >> >> > > That sounds like a great suggestion to me.
>> > > > > >> >> > >
>> > > > > >> >> > > On Wed, Mar 5, 2025 at 12:41 PM Andrew Lamb <[email protected]> wrote:
>> > > > > >> >> > >
>> > > > > >> >> > > > I would like to request that, before the VARIANT spec changes
>> > > > > >> >> > > > are finalized, we have example data in parquet-testing.
>> > > > > >> >> > > >
>> > > > > >> >> > > > This topic came up (well, I brought it up) on the sync call
>> > > > > >> >> > > > today.
>> > > > > >> >> > > >
>> > > > > >> >> > > > In my opinion, having example files would reduce the overhead
>> > > > > >> >> > > > of new implementations dramatically. At least there should be
>> > > > > >> >> > > > examples of
>> > > > > >> >> > > > * variant columns (no shredding)
>> > > > > >> >> > > > * variant columns with shredding
>> > > > > >> >> > > >
>> > > > > >> >> > > > There should also be some description of what those files
>> > > > > >> >> > > > contain ("expected contents"). For prior art, here is what
>> > > > > >> >> > > > Dewey did for the geometry type[1][2].
>> > > > > >> >> > > >
>> > > > > >> >> > > > When looking for prior discussions, I found a great quote from
>> > > > > >> >> > > > Gang Wu[3] on this topic:
>> > > > > >> >> > > >
>> > > > > >> >> > > > >  I'd say that a lesson learned is that we should publish
>> > > > > >> >> > > > > example files for any new feature to the parquet-testing [1]
>> > > > > >> >> > > > > repo for interoperability tests.
>> > > > > >> >> > > >
>> > > > > >> >> > > > Thank you for your consideration,
>> > > > > >> >> > > > Andrew
>> > > > > >> >> > > >
>> > > > > >> >> > > >
>> > > > > >> >> > > >
>> > > > > >> >> > > >
>> > > > > >> >> > > > [1] https://github.com/apache/parquet-testing/pull/70
>> > > > > >> >> > > > [2] https://github.com/geoarrow/geoarrow-data
>> > > > > >> >> > > > [3]: https://lists.apache.org/thread/71d7p9lprhf514jnt5dgnw4wfmn8ykzt
>> > > > > >> >> > > >
>> > > > > >> >> > >
>> > > > > >> >>
>> > > > > >> >
>> > > > > >>
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
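
David's note in the quoted thread mentions `to_variant_object` as a way to
produce types that parse_json does not infer (e.g. timestamp, binary). A
minimal spark-shell sketch of that idea follows. It assumes Spark 4.0 with the
predefined `spark` session; the field names, literal values, and the output
path /tmp/variant_typed_test are hypothetical and do not come from the thread.

```scala
// Hypothetical output location, chosen only for this illustration.
val outputPath = "/tmp/variant_typed_test"

// Build a struct with a timestamp and a binary field, then convert the whole
// struct to a Variant object with to_variant_object (Spark 4.0 SQL function).
val typed = spark.sql(
  """
    |select to_variant_object(
    |  named_struct(
    |    'ts', timestamp '2025-04-16 12:34:56',
    |    'bin', cast('hello' as binary)
    |  )
    |) as v
    |""".stripMargin)

// Write the single-row result as Parquet and read it back for inspection.
typed.write.mode("overwrite").parquet(outputPath)
spark.read.parquet(outputPath).show(false)
```

Shredding is left at its default here; the shredding confs from David's
example can be set beforehand if a shredded file is wanted.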
