Thanks Aihua,

We are not (yet) blocked on development by the logical type annotation,
though I am blocked from creating example parquet files.
Do you think it would be possible to create annotated parquet files using
arrow-cpp now that Neil's PR[1] has been merged? I haven't really tried yet.

Andrew

[1]: https://github.com/apache/arrow/pull/45375

On Thu, May 8, 2025 at 9:59 PM Aihua Xu <aihu...@gmail.com> wrote:

> Since we haven't released any parquet version with variant logical type, I
> don't think any engine would write that yet.
>
> Are you getting blocked by this? Can you work around the issue for now?
>
> Thanks
> Aihua
>
> > On May 8, 2025, at 12:57 PM, Andrew Lamb <andrewlam...@gmail.com> wrote:
> >
> > Update here: the initial example files[1] were merged (thanks Micah!)
> >
> > However, Spark 4.0 (and what is on the main branch) does not appear to
> > write parquet files with the logical annotations yet (it seems to use
> > its own metadata) -- you can see what I tried in [2].
> >
> > **Does anyone know of a system that can write Variant values with the
> > proper Parquet logical type?**
> >
> > Thanks,
> > Andrew
> >
> > [1]: https://github.com/apache/parquet-testing/pull/76
> > [2]: https://github.com/apache/parquet-testing/issues/75#issuecomment-2847815424
> >
> >> On Fri, May 2, 2025 at 2:22 PM Andrew Lamb <andrewlam...@gmail.com> wrote:
> >>
> >> Thanks to Micah, I think we have the first PR with example Variant
> >> values[1] almost ready to merge.
> >>
> >> Next up will be figuring out how to create Parquet files with the
> >> proper logical annotations.
> >>
> >> Andrew
> >>
> >> p.s. In case it isn't obvious, I would like introducing Variant into
> >> Parquet to be a model of how to extend the spec and get wide adoption
> >> across the ecosystem quickly, for two reasons:
> >> 1. The actual Variant functionality
> >> 2. To counteract the narrative that Parquet is ossified and not
> >> possible to change.
> >>
> >> I personally think adding the binary examples is critical to
> >> helping other language implementations.
> >>
> >> [1]: https://github.com/apache/parquet-testing/pull/76
> >>
> >> On Wed, Apr 16, 2025 at 10:59 AM Andrew Lamb <andrewlam...@gmail.com>
> >> wrote:
> >>
> >>> Update here is I have created a PR with example variant values (not
> >>> yet parquet files, just the variant values)[1].
> >>>
> >>> Since Spark seems to be the only open source software capable of
> >>> creating variants at this time, I generated the examples using Spark.
> >>>
> >>> Please check it out and let me know what you think. If it is
> >>> acceptable I can work on PRs (based on Aihua's example) for actual
> >>> parquet files with encoded values.
> >>>
> >>> Andrew
> >>>
> >>> [1]: https://github.com/apache/parquet-testing/pull/76
> >>>
> >>> On Tue, Apr 8, 2025 at 5:46 AM Andrew Lamb <andrewlam...@gmail.com>
> >>> wrote:
> >>>
> >>>> Thank you very much
> >>>>
> >>>> On Mon, Apr 7, 2025 at 5:10 PM Aihua Xu <aihu...@gmail.com> wrote:
> >>>>
> >>>>> I attached them in https://github.com/apache/parquet-testing/issues/75.
> >>>>> Please rename them back to *.parquet so you can use parquet tools
> >>>>> to view them.
> >>>>>
> >>>>> I captured them when working on the Iceberg tests in
> >>>>> https://github.com/apache/iceberg/blob/main/parquet/src/test/java/org/apache/iceberg/parquet/TestVariantWriters.java#L172.
> >>>>>
> >>>>> You can change the output to OutputFile outputFile =
> >>>>> Files.localOutput("primitive.parquet"); to capture them, but you
> >>>>> probably can follow what David mentioned.
> >>>>>
> >>>>> On Mon, Apr 7, 2025 at 12:26 PM Andrew Lamb <andrewlam...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Attaching them on the ticket[1] would also be a way to share them.
> >>>>>>
> >>>>>> It would also be super helpful to share the commands you ran.
> >>>>>>
> >>>>>> Andrew
> >>>>>>
> >>>>>> [1]: https://github.com/apache/parquet-testing/issues/75
> >>>>>>
> >>>>>> On Mon, Apr 7, 2025 at 2:46 PM Adrian Garcia Badaracco
> >>>>>> <adr...@pydantic.dev.invalid> wrote:
> >>>>>>
> >>>>>>> Yes, I am not able to see them. Could you make a PR to the repo,
> >>>>>>> or upload them somewhere so we can make a PR? Even if it doesn't
> >>>>>>> get merged immediately we can pull them from the PR. Thanks!
> >>>>>>>
> >>>>>>> On Mon, Apr 7, 2025 at 1:44 PM Aihua Xu <aihu...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Hi Adrian,
> >>>>>>>>
> >>>>>>>> I attached them to my reply and I'm not sure if the files got
> >>>>>>>> filtered. Let me know if you still can't see them. Maybe I
> >>>>>>>> should push to the repo instead.
> >>>>>>>>
> >>>>>>>> On Mon, Apr 7, 2025 at 10:58 AM Adrian Garcia Badaracco
> >>>>>>>> <adr...@pydantic.dev.invalid> wrote:
> >>>>>>>>
> >>>>>>>>> Amazing Aihua, thanks so much!
> >>>>>>>>>
> >>>>>>>>> Sorry if I just missed it but... where are the files you
> >>>>>>>>> created? I don't see them in the repo / issue / this thread.
> >>>>>>>>>
> >>>>>>>>> On Mon, Apr 7, 2025 at 12:54 PM Aihua Xu <aihu...@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> I have created some test files (attached) during the
> >>>>>>>>>> development: one is without shredding and one is with
> >>>>>>>>>> shredding.
> >>>>>>>>>>
> >>>>>>>>>> As David pointed out, they are missing the Variant logical
> >>>>>>>>>> type, but you can use them as a reference and as a start.
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Apr 7, 2025 at 3:50 AM Andrew Lamb
> >>>>>>>>>> <andrewlam...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> I filed a ticket to track this work[1] and also perhaps to
> >>>>>>>>>>> gather some additional help / collaboration.
> >>>>>>>>>>>
> >>>>>>>>>>> [1]: https://github.com/apache/parquet-testing/issues/75
> >>>>>>>>>>>
> >>>>>>>>>>> On Mon, Apr 7, 2025 at 6:02 AM Andrew Lamb
> >>>>>>>>>>> <andrewlam...@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Thank you very much David
> >>>>>>>>>>>>
> >>>>>>>>>>>> I will try to create some examples this week and report back.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Andrew
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Sun, Apr 6, 2025 at 4:48 PM David Cashman
> >>>>>>>>>>>> <david.cash...@databricks.com.invalid> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Andrew, you should be able to create shredded files using
> >>>>>>>>>>>>> OSS Spark 4.0. I think the only issue is that it doesn't
> >>>>>>>>>>>>> have the logical type annotation yet, so readers wouldn't be
> >>>>>>>>>>>>> able to distinguish it from a non-variant struct that happens
> >>>>>>>>>>>>> to have the same schema. (Spark is able to infer that it is a
> >>>>>>>>>>>>> Variant from the
> >>>>>>>>>>>>> `org.apache.spark.sql.parquet.row.metadata` metadata.)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The ParquetVariantShreddingSuite in Spark has some tests that
> >>>>>>>>>>>>> write and read shredded parquet files. Below is an example
> >>>>>>>>>>>>> that translates the first test into code that runs in
> >>>>>>>>>>>>> spark-shell and writes a Parquet file. The shredding schema
> >>>>>>>>>>>>> is set via conf. If you want to test types that Spark
> >>>>>>>>>>>>> doesn't infer in parse_json (e.g. timestamp, binary), you
> >>>>>>>>>>>>> can use `to_variant_object` to cast structured values to
> >>>>>>>>>>>>> Variant.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I won't have time to work on this in the next couple of
> >>>>>>>>>>>>> weeks, but am happy to answer any questions.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> David
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> scala> import org.apache.spark.sql.internal.SQLConf
> >>>>>>>>>>>>> scala> spark.conf.set(SQLConf.VARIANT_WRITE_SHREDDING_ENABLED.key, true)
> >>>>>>>>>>>>> scala> spark.conf.set(SQLConf.VARIANT_ALLOW_READING_SHREDDED.key, true)
> >>>>>>>>>>>>> scala> spark.conf.set(SQLConf.VARIANT_FORCE_SHREDDING_SCHEMA_FOR_TEST.key,
> >>>>>>>>>>>>>          "a int, b string, c decimal(15, 1)")
> >>>>>>>>>>>>> scala> val df = spark.sql(
> >>>>>>>>>>>>>      | """
> >>>>>>>>>>>>>      |   | select case
> >>>>>>>>>>>>>      |   |   when id = 0 then parse_json('{"a": 1, "b": "2", "c": 3.3, "d": 4.4}')
> >>>>>>>>>>>>>      |   |   when id = 1 then parse_json('{"a": [1,2,3], "b": "hello", "c": {"x": 0}}')
> >>>>>>>>>>>>>      |   |   when id = 2 then parse_json('{"A": 1, "c": 1.23}')
> >>>>>>>>>>>>>      |   | end v from range(3)
> >>>>>>>>>>>>>      |   |""".stripMargin)
> >>>>>>>>>>>>> scala> df.write.mode("overwrite").parquet("/tmp/shredded_test")
> >>>>>>>>>>>>> scala> spark.read.parquet("/tmp/shredded_test").show
> >>>>>>>>>>>>> +--------------------+
> >>>>>>>>>>>>> |                   v|
> >>>>>>>>>>>>> +--------------------+
> >>>>>>>>>>>>> |{"a":1,"b":"2","c...|
> >>>>>>>>>>>>> |{"a":[1,2,3],"b":...|
> >>>>>>>>>>>>> |    {"A":1,"c":1.23}|
> >>>>>>>>>>>>> +--------------------+
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Sun, Apr 6, 2025 at 2:54 PM Andrew Lamb
> >>>>>>>>>>>>> <andrewlam...@gmail.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Can someone (pretty pretty) please give us some binary
> >>>>>>>>>>>>>> examples so we can make faster progress on the Rust
> >>>>>>>>>>>>>> implementation?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> We recently got exciting news[1] that folks from the CMU
> >>>>>>>>>>>>>> database group have started working on the Rust
> >>>>>>>>>>>>>> implementation of variant, and I would very much like to
> >>>>>>>>>>>>>> encourage and support their work.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I am willing to do some legwork (make a PR to
> >>>>>>>>>>>>>> parquet-testing for example) if someone can point me to the
> >>>>>>>>>>>>>> files (or instructions on how to use some system to create
> >>>>>>>>>>>>>> variants).
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I was hoping that since the VARIANT format[2] and draft
> >>>>>>>>>>>>>> shredding spec[3] have been in the repo for 6 months (since
> >>>>>>>>>>>>>> October 2024), it would be straightforward to provide some
> >>>>>>>>>>>>>> examples. Do we know anything that is blocking the creation
> >>>>>>>>>>>>>> of examples?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Andrew
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> [1]: https://github.com/apache/arrow-rs/issues/6736#issuecomment-2781556103
> >>>>>>>>>>>>>> [2]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
> >>>>>>>>>>>>>> [3]: https://github.com/apache/parquet-format/blob/master/VariantShredding.md
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Wed, Mar 5, 2025 at 3:58 PM Julien Le Dem
> >>>>>>>>>>>>>> <jul...@apache.org> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> That sounds like a great suggestion to me.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Wed, Mar 5, 2025 at 12:41 PM Andrew Lamb
> >>>>>>>>>>>>>>> <andrewlam...@gmail.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I would like to request, before the VARIANT spec changes
> >>>>>>>>>>>>>>>> are finalized, that we have example data in
> >>>>>>>>>>>>>>>> parquet-testing.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> This topic came up (well, I brought it up) on the sync
> >>>>>>>>>>>>>>>> call today.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> In my opinion, having example files would reduce the
> >>>>>>>>>>>>>>>> overhead of new implementations dramatically. At least
> >>>>>>>>>>>>>>>> there should be examples of:
> >>>>>>>>>>>>>>>> * variant columns (no shredding)
> >>>>>>>>>>>>>>>> * variant columns with shredding
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> plus some description of what those files contain
> >>>>>>>>>>>>>>>> ("expected contents"). For prior art, here is what Dewey
> >>>>>>>>>>>>>>>> did for the geometry type[1][2].
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> When looking for prior discussions, I found a great quote
> >>>>>>>>>>>>>>>> from Gang Wu[3] on this topic:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I'd say that a lesson learned is that we should publish
> >>>>>>>>>>>>>>>>> example files for any new feature to the
> >>>>>>>>>>>>>>>>> parquet-testing [1] repo for interoperability tests.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thank you for your consideration,
> >>>>>>>>>>>>>>>> Andrew
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> [1]: https://github.com/apache/parquet-testing/pull/70
> >>>>>>>>>>>>>>>> [2]: https://github.com/geoarrow/geoarrow-data
> >>>>>>>>>>>>>>>> [3]: https://lists.apache.org/thread/71d7p9lprhf514jnt5dgnw4wfmn8ykzt