Since we haven't released any Parquet version with the Variant logical type, I don't think any engine would write it yet.
Are you getting blocked by this? Can you work around the issue for now?

Thanks,
Aihua

On May 8, 2025, at 12:57 PM, Andrew Lamb <andrewlam...@gmail.com> wrote:

Update here: the initial example files[1] were merged (thanks Micah!)

However, Spark 4.0 (and what is on the main branch) does not appear to write Parquet files with the logical type annotations yet (it seems to use its own metadata) -- you can see what I tried in [2].

**Does anyone know of a system that can write Variant values with the proper Parquet logical type?**

Thanks,
Andrew

[1]: https://github.com/apache/parquet-testing/pull/76
[2]: https://github.com/apache/parquet-testing/issues/75#issuecomment-2847815424

On Fri, May 2, 2025 at 2:22 PM Andrew Lamb <andrewlam...@gmail.com> wrote:

Thanks to Micah, I think we have the first PR with example Variant values[1] almost ready to merge.

Next up will be figuring out how to create Parquet files with the proper logical annotations.

Andrew

p.s. In case it isn't obvious, I would like introducing Variant into Parquet to be a model of how to extend the spec and get wide adoption across the ecosystem quickly, for two reasons:
1. The actual Variant functionality.
2. To counteract the narrative that Parquet is ossified and not possible to change.

I personally think adding the binary examples is critical to helping other language implementations.

[1]: https://github.com/apache/parquet-testing/pull/76

On Wed, Apr 16, 2025 at 10:59 AM Andrew Lamb <andrewlam...@gmail.com> wrote:

The update here is that I have created a PR with example Variant values (not yet Parquet files, just the Variant values)[1].

Since Spark seems to be the only open source software capable of creating Variants at this time, I generated the examples using Spark.

Please check it out and let me know what you think.
If it is acceptable, I can work on PRs (based on Aihua's example) for actual Parquet files with encoded values.

Andrew

[1]: https://github.com/apache/parquet-testing/pull/76

On Tue, Apr 8, 2025 at 5:46 AM Andrew Lamb <andrewlam...@gmail.com> wrote:

Thank you very much.

On Mon, Apr 7, 2025 at 5:10 PM Aihua Xu <aihu...@gmail.com> wrote:

I attached them in https://github.com/apache/parquet-testing/issues/75. Please rename them back to *.parquet so you can use Parquet tools to view them.

I captured them when working on the Iceberg tests in https://github.com/apache/iceberg/blob/main/parquet/src/test/java/org/apache/iceberg/parquet/TestVariantWriters.java#L172.

You can change to `OutputFile outputFile = Files.localOutput("primitive.parquet");` to capture them, but you can probably follow what David mentioned.

On Mon, Apr 7, 2025 at 12:26 PM Andrew Lamb <andrewlam...@gmail.com> wrote:

Attaching them on the ticket[1] would also be a way to share them.

It would also be super helpful to share the commands you ran.

Andrew

[1]: https://github.com/apache/parquet-testing/issues/75

On Mon, Apr 7, 2025 at 2:46 PM Adrian Garcia Badaracco <adr...@pydantic.dev.invalid> wrote:

Yes, I am not able to see them. Could you make a PR to the repo, or upload them somewhere so we can make a PR? Even if it doesn't get merged immediately, we can pull them from the PR. Thanks!

On Mon, Apr 7, 2025 at 1:44 PM Aihua Xu <aihu...@gmail.com> wrote:

Hi Adrian,

I attached them to my reply and I'm not sure if the files got filtered. Let me know if you still can't see them. Maybe I should push them to the repo instead.
On Mon, Apr 7, 2025 at 10:58 AM Adrian Garcia Badaracco <adr...@pydantic.dev.invalid> wrote:

Amazing, Aihua, thanks so much!

Sorry if I just missed it, but... where are the files you created? I don't see them in the repo / issue / this thread.

On Mon, Apr 7, 2025 at 12:54 PM Aihua Xu <aihu...@gmail.com> wrote:

I created some test files (attached) during development: one without shredding and one with shredding.

As David pointed out, they are missing the Variant logical type, but you can use them as a reference and a starting point.

On Mon, Apr 7, 2025 at 3:50 AM Andrew Lamb <andrewlam...@gmail.com> wrote:

I filed a ticket to track this work[1], and also perhaps to gather some additional help / collaboration.

[1]: https://github.com/apache/parquet-testing/issues/75

On Mon, Apr 7, 2025 at 6:02 AM Andrew Lamb <andrewlam...@gmail.com> wrote:

Thank you very much, David.

I will try to create some examples this week and report back.

Andrew

On Sun, Apr 6, 2025 at 4:48 PM David Cashman <david.cash...@databricks.com.invalid> wrote:

Hi Andrew, you should be able to create shredded files using OSS Spark 4.0. I think the only issue is that it doesn't have the logical type annotation yet, so readers wouldn't be able to distinguish it from a non-variant struct that happens to have the same schema.
(Spark is able to infer that it is a Variant from the `org.apache.spark.sql.parquet.row.metadata` metadata.)

The ParquetVariantShreddingSuite in Spark has some tests that write and read shredded Parquet files. Below is an example that translates the first test into code that runs in spark-shell and writes a Parquet file. The shredding schema is set via conf. If you want to test types that Spark doesn't infer in parse_json (e.g. timestamp, binary), you can use `to_variant_object` to cast structured values to Variant.

I won't have time to work on this in the next couple of weeks, but am happy to answer any questions.

Thanks,

David

scala> import org.apache.spark.sql.internal.SQLConf
scala> spark.conf.set(SQLConf.VARIANT_WRITE_SHREDDING_ENABLED.key, true)
scala> spark.conf.set(SQLConf.VARIANT_ALLOW_READING_SHREDDED.key, true)
scala> spark.conf.set(SQLConf.VARIANT_FORCE_SHREDDING_SCHEMA_FOR_TEST.key, "a int, b string, c decimal(15, 1)")
scala> val df = spark.sql(
     |   """
     |     | select case
     |     |   when id = 0 then parse_json('{"a": 1, "b": "2", "c": 3.3, "d": 4.4}')
     |     |   when id = 1 then parse_json('{"a": [1,2,3], "b": "hello", "c": {"x": 0}}')
     |     |   when id = 2 then parse_json('{"A": 1, "c": 1.23}')
     |     | end v from range(3)
     |     |""".stripMargin)
scala> df.write.mode("overwrite").parquet("/tmp/shredded_test")
scala> spark.read.parquet("/tmp/shredded_test").show
+--------------------+
|                   v|
+--------------------+
|{"a":1,"b":"2","c...|
|{"a":[1,2,3],"b":...|
|    {"A":1,"c":1.23}|
+--------------------+

On Sun, Apr 6, 2025 at 2:54 PM Andrew Lamb <andrewlam...@gmail.com> wrote:

Can someone (pretty pretty) please give us some binary examples so we can make faster progress on the Rust implementation?

We recently got exciting news[1] that folks from the CMU database group have started working on the Rust implementation of Variant, and I would very much like to encourage and support their work.

I am willing to do some legwork (make a PR to parquet-testing, for example) if someone can point me to the files (or instructions on how to use some system to create Variants).

I was hoping that since the VARIANT format[2] and draft shredding spec[3] have been in the repo for 6 months (since October 2024), it would be straightforward to provide some examples. Do we know of anything that is blocking the creation of examples?
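The kind of decoding work those binary examples would exercise can be sketched briefly. The following is an unofficial Python sketch based on one reading of VariantEncoding.md; the header bit layout, the int8 primitive type ID, and the hand-built byte strings are assumptions to double-check against the spec, not a reference implementation.

```python
# Unofficial sketch of decoding Variant binary, based on one reading of
# VariantEncoding.md. Bit positions and type IDs below are assumptions.

def decode_metadata(buf: bytes) -> list[str]:
    """Decode the metadata dictionary (the list of field-name strings)."""
    header = buf[0]
    version = header & 0x0F                   # assumed: low 4 bits, must be 1
    offset_size = ((header >> 6) & 0x03) + 1  # assumed: 1..4 bytes per offset
    assert version == 1
    pos = 1

    def read_uint(p: int) -> int:
        return int.from_bytes(buf[p:p + offset_size], "little")

    dict_size = read_uint(pos)
    pos += offset_size
    offsets = [read_uint(pos + i * offset_size) for i in range(dict_size + 1)]
    start = pos + (dict_size + 1) * offset_size
    return [buf[start + offsets[i]:start + offsets[i + 1]].decode()
            for i in range(dict_size)]

def decode_scalar(buf: bytes):
    """Decode a short-string or int8 primitive value (tiny subset only)."""
    basic_type = buf[0] & 0x03  # assumed: low 2 bits; 0=primitive, 1=short str
    type_info = buf[0] >> 2
    if basic_type == 1:                        # short string: type_info = length
        return buf[1:1 + type_info].decode()
    if basic_type == 0 and type_info == 3:     # primitive int8 (assumed ID)
        return int.from_bytes(buf[1:2], "little", signed=True)
    raise NotImplementedError("only a tiny subset is sketched here")

# Hand-built metadata with one dictionary entry, "a":
# header 0x01 (version 1, 1-byte offsets), size 1, offsets [0, 1], bytes b"a"
print(decode_metadata(bytes([0x01, 0x01, 0x00, 0x01]) + b"a"))  # ['a']
print(decode_scalar(bytes([0x15]) + b"hello"))                  # 'hello'
print(decode_scalar(bytes([0x0C, 0xFF])))                       # -1
```

A real decoder would cover all primitive types plus objects and arrays; the point here is only how small the per-value framing is, and why a handful of known-good binary files would pin these details down for every implementation.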
Andrew

[1]: https://github.com/apache/arrow-rs/issues/6736#issuecomment-2781556103
[2]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
[3]: https://github.com/apache/parquet-format/blob/master/VariantShredding.md

On Wed, Mar 5, 2025 at 3:58 PM Julien Le Dem <jul...@apache.org> wrote:

That sounds like a great suggestion to me.

On Wed, Mar 5, 2025 at 12:41 PM Andrew Lamb <andrewlam...@gmail.com> wrote:

I would like to request that, before the VARIANT spec changes are finalized, we have example data in parquet-testing.

This topic came up (well, I brought it up) on the sync call today.

In my opinion, having example files would reduce the overhead of new implementations dramatically. At a minimum there should be examples of:
* variant columns (no shredding)
* variant columns with shredding

along with some description of what those files contain ("expected contents"). For prior art, here is what Dewey did for the geometry type[1][2].
When looking for prior discussions, I found a great quote from Gang Wu[3] on this topic:

> I'd say that a lesson learned is that we should publish example files for any new feature to the parquet-testing [1] repo for interoperability tests.

Thank you for your consideration,
Andrew

[1]: https://github.com/apache/parquet-testing/pull/70
[2]: https://github.com/geoarrow/geoarrow-data
[3]: https://lists.apache.org/thread/71d7p9lprhf514jnt5dgnw4wfmn8ykzt
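The shredded files discussed in this thread are easiest to reason about through their reassembly rule: per VariantShredding.md, each shredded field carries an optional strongly typed `typed_value` alongside an optional residual Variant `value`, and a reader merges the two back into one object. Below is an unofficial Python sketch of that merge; the dict-based rows and field names stand in for decoded Parquet columns and are illustrative assumptions, not the spec's physical layout.

```python
# Unofficial sketch of the shredding reassembly rule from VariantShredding.md.
# Plain Python dicts stand in for decoded Parquet columns (an assumption for
# illustration); a real reader works on column chunks and Variant bytes.

def reassemble_object(value, typed_fields):
    """Merge the residual `value` (fields not in the shredding schema)
    with the shredded `typed_value` fields back into one object."""
    result = dict(value or {})
    for name, field in typed_fields.items():
        if field.get("typed_value") is not None:
            result[name] = field["typed_value"]  # shredded into a typed column
        elif field.get("value") is not None:
            result[name] = field["value"]        # fell back to Variant encoding
        # else: the field is absent from this row
    return result

# Row 0 of David's example with schema "a int, b string, c decimal(15, 1)":
# a, b, and c shred cleanly; d is not in the schema, so it stays in the
# residual value.
row = reassemble_object(
    {"d": 4.4},
    {"a": {"typed_value": 1},
     "b": {"typed_value": "2"},
     "c": {"typed_value": 3.3}},
)
print(row)  # {'d': 4.4, 'a': 1, 'b': '2', 'c': 3.3}
```

Spelling this rule out in the example files' "expected contents" notes would let every implementation check its reader against the same merge semantics.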