I've been wanting to get a Parquet Java release out so that we can use the
annotation in Iceberg. I'd support releasing what we have, which includes
the encoding library now. We may want to wait to get David's reader
implementation in, though. It is here:
https://github.com/apache/parquet-java/pull/3212/files

On Fri, May 9, 2025 at 3:09 AM Andrew Lamb <andrewlam...@gmail.com> wrote:

> Thanks Aihua,
>
> We are not (yet) blocked on development by the logical type annotation,
> though I am blocked from creating example parquet files
>
> Do you think it would be possible to create annotated parquet files using
> arrow-cpp now that Neil's PR[1] has been merged? I haven't really tried yet
>
> Andrew
>
> [1]: https://github.com/apache/arrow/pull/45375
>
> On Thu, May 8, 2025 at 9:59 PM Aihua Xu <aihu...@gmail.com> wrote:
>
> > Since we haven’t released any parquet version with variant logical
> > type, I don’t think any engine would write that yet.
> >
> > Are you getting blocked by this? Can you work around the issue for now?
> >
> > Thanks
> > Aihua
> >
> >
> > > > On May 8, 2025, at 12:57 PM, Andrew Lamb <andrewlam...@gmail.com> wrote:
> > >
> > > Update here: the initial example files[1] were merged (thanks Micah!)
> > >
> > > However, Spark 4.0 (and what is on the main branch) does not appear to
> > > write parquet files with the Logical annotations yet (it uses its own
> > > metadata it seems) -- you can see what I tried in [2].
> > >
> > > **Does anyone know of a system that can write Variant values with the
> > > proper Parquet logical type?**
> > >
> > > Thanks,
> > > Andrew
> > >
> > >
> > >
> > > [1]: https://github.com/apache/parquet-testing/pull/76
> > > [2]:
> > >
> >
> https://github.com/apache/parquet-testing/issues/75#issuecomment-2847815424
> > >
> > >
> > >
> > > >> On Fri, May 2, 2025 at 2:22 PM Andrew Lamb <andrewlam...@gmail.com> wrote:
> > >>
> > >> Thanks to Micah, I think we have the first PR with example Variant
> > >> values[1] almost ready to merge.
> > >>
> > > >> Next up will be figuring out how to create Parquet files with the
> > > >> proper logical annotations.
> > >>
> > >> Andrew
> > >>
> > >> p.s. In case it isn't obvious I would like introducing Variant into
> > >> Parquet to be a model of how to extend the spec and get wide adoption
> > >> across the ecosystem quickly, for two reasons:
> > >> 1.  the actual Variant funtionality
> > >> 2. To counteract the narrative that Parquet is ossified and not
> possible
> > >> to change.
> > >>
> > >> I personally think adding the binary examples is critical to
> > >> helping other language implementations.
> > >>
> > >> [1]: https://github.com/apache/parquet-testing/pull/76
> > >>
> > >> On Wed, Apr 16, 2025 at 10:59 AM Andrew Lamb <andrewlam...@gmail.com>
> > >> wrote:
> > >>
> > > >>> Update here is I have created a PR with example variant values (not
> > > >>> yet parquet files, just the variant values)[1].
> > >>>
> > > >>> Since Spark seems to be the only open source software capable of
> > > >>> creating variants at this time, I generated the examples using Spark.
> > >>>
> > > >>> Please check it out and let me know what you think. If it is
> > > >>> acceptable I can work on PRs (based on Aihua's example) for actual
> > > >>> parquet files with encoded values
> > >>>
> > >>> Andrew
> > >>>
> > >>> [1]: https://github.com/apache/parquet-testing/pull/76
> > >>>
> > >>> On Tue, Apr 8, 2025 at 5:46 AM Andrew Lamb <andrewlam...@gmail.com>
> > >>> wrote:
> > >>>
> > >>>> Thank you very much
> > >>>>
> > >>>> On Mon, Apr 7, 2025 at 5:10 PM Aihua Xu <aihu...@gmail.com> wrote:
> > >>>>
> > > >>>>> I attached them in https://github.com/apache/parquet-testing/issues/75.
> > > >>>>> Please rename them back to *.parquet so you can use parquet tools to
> > > >>>>> view them.
> > >>>>>
> > > >>>>> I captured them when working on the Iceberg tests in
> > > >>>>> https://github.com/apache/iceberg/blob/main/parquet/src/test/java/org/apache/iceberg/parquet/TestVariantWriters.java#L172
> > >>>>> .
> > >>>>>
> > > >>>>> You can change that line to
> > > >>>>> OutputFile outputFile = Files.localOutput("primitive.parquet");
> > > >>>>> to capture them, but you can probably follow what David mentioned.
> > >>>>>
> > > >>>>> On Mon, Apr 7, 2025 at 12:26 PM Andrew Lamb <andrewlam...@gmail.com>
> > > >>>>> wrote:
> > >>>>>
> > >>>>>> Attaching them on the ticket[1] would also be a way to share them
> > >>>>>>
> > >>>>>> It would also be super helpful to share the commands you ran
> > >>>>>>
> > >>>>>> Andrew
> > >>>>>>
> > >>>>>> [1]: https://github.com/apache/parquet-testing/issues/75
> > >>>>>>
> > >>>>>> On Mon, Apr 7, 2025 at 2:46 PM Adrian Garcia Badaracco
> > >>>>>> <adr...@pydantic.dev.invalid> wrote:
> > >>>>>>
> > >>>>>>> Yes I am not able to see them. Could you make a PR to the repo,
> or
> > >>>>> upload
> > >>>>>>> them somewhere so we can make a PR? Even if it doesn’t get merged
> > >>>>>>> immediately we can pull them from the PR. Thanks!
> > >>>>>>>
> > >>>>>>> On Mon, Apr 7, 2025 at 1:44 PM Aihua Xu <aihu...@gmail.com>
> wrote:
> > >>>>>>>
> > >>>>>>>> Hi Adrian,
> > >>>>>>>>
> > >>>>>>>> I attached them to my reply and I'm not sure if the files get
> > >>>>> filtered.
> > >>>>>>> Let
> > >>>>>>>> me know if you still can't see them. Maybe I should push to the
> > >>>>> repo
> > >>>>>>>> instead.
> > >>>>>>>>
> > >>>>>>>> On Mon, Apr 7, 2025 at 10:58 AM Adrian Garcia Badaracco
> > >>>>>>>> <adr...@pydantic.dev.invalid> wrote:
> > >>>>>>>>
> > >>>>>>>>> Amazing Aihua, thanks so much!
> > >>>>>>>>>
> > >>>>>>>>> Sorry if I just missed it but... where are the files you
> > >>>>> created? I
> > >>>>>>> don't
> > >>>>>>>>> see them in the repo / issue / this thread.
> > >>>>>>>>>
> > >>>>>>>>> On Mon, Apr 7, 2025 at 12:54 PM Aihua Xu <aihu...@gmail.com>
> > >>>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> I have created some test files attached during the
> > >>>>> development, one
> > >>>>>>> is
> > >>>>>>>>>> without shredding and one is with shredding.
> > >>>>>>>>>>
> > >>>>>>>>>> As David pointed out,  it's missing the Variant logical type
> > >>>>> but
> > >>>>>> you
> > >>>>>>>> can
> > >>>>>>>>>> use that as reference and as a start.
> > >>>>>>>>>>
> > >>>>>>>>>> On Mon, Apr 7, 2025 at 3:50 AM Andrew Lamb <
> > >>>>> andrewlam...@gmail.com
> > >>>>>>>
> > >>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> I filed a ticket to track this work[1] and also perhaps to
> > >>>>> gather
> > >>>>>>> some
> > >>>>>>>>>>> additional help / collaboration.
> > >>>>>>>>>>>
> > >>>>>>>>>>> [1]: https://github.com/apache/parquet-testing/issues/75
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Mon, Apr 7, 2025 at 6:02 AM Andrew Lamb <
> > >>>>>> andrewlam...@gmail.com>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Thank you very much David
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I will try to create some examples this week and report
> > >>>>> back.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Andrew
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Sun, Apr 6, 2025 at 4:48 PM David Cashman
> > >>>>>>>>>>>> <david.cash...@databricks.com.invalid> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Hi Andrew, you should be able to create shredded files
> > >>>>> using
> > >>>>>> OSS
> > >>>>>>>>> Spark
> > >>>>>>>>>>>>> 4.0. I think the only issue is that it doesn't have the
> > >>>>> logical
> > >>>>>>>> type
> > >>>>>>>>>>>>> annotation yet, so readers wouldn't be able to
> > >>>>> distinguish it
> > >>>>>>> from
> > >>>>>>>> a
> > >>>>>>>>>>>>> non-variant struct that happens to have the same schema.
> > >>>>> (Spark
> > >>>>>>> is
> > >>>>>>>>>>>>> able to infer that it is a Variant from the
> > >>>>>>>>>>>>> `org.apache.spark.sql.parquet.row.metadata` metadata.)
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> The ParquetVariantShreddingSuite in Spark has some tests
> > >>>>> that
> > >>>>>>> write
> > >>>>>>>>>>>>> and read shredded parquet files. Below is an example that
> > >>>>>>>> translates
> > >>>>>>>>>>>>> the first test into code that runs in spark-shell and
> > >>>>> writes a
> > >>>>>>>>> Parquet
> > >>>>>>>>>>>>> file. The shredding schema is set via conf. If you want
> > >>>>> to test
> > >>>>>>>> types
> > >>>>>>>>>>>>> that Spark doesn't infer in parse_json (e.g. timestamp,
> > >>>>>> binary),
> > >>>>>>>> you
> > >>>>>>>>>>>>> can use `to_variant_object` to cast structured values to
> > >>>>>> Variant.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I won't have time to work on this in the next couple of
> > >>>>> weeks,
> > >>>>>>> but
> > >>>>>>>> am
> > >>>>>>>>>>>>> happy to answer any questions.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> David
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> scala> import org.apache.spark.sql.internal.SQLConf
> > >>>>>>>>>>>>> scala>
> > >>>>>>> spark.conf.set(SQLConf.VARIANT_WRITE_SHREDDING_ENABLED.key,
> > >>>>>>>>>>> true)
> > >>>>>>>>>>>>> scala>
> > >>>>>> spark.conf.set(SQLConf.VARIANT_ALLOW_READING_SHREDDED.key,
> > >>>>>>>>> true)
> > >>>>>>>>>>>>> scala>
> > >>>>>>>>>>>
> > >>>>>>
> spark.conf.set(SQLConf.VARIANT_FORCE_SHREDDING_SCHEMA_FOR_TEST.key,
> > >>>>>>>>>>>>> "a int, b string, c decimal(15, 1)")
> > >>>>>>>>>>>>> scala> val df = spark.sql(
> > >>>>>>>>>>>>>     |       """
> > >>>>>>>>>>>>>     |         | select case
> > >>>>>>>>>>>>>     |         | when id = 0 then parse_json('{"a": 1,
> > >>>>> "b":
> > >>>>>> "2",
> > >>>>>>>> "c":
> > >>>>>>>>>>>>> 3.3, "d": 4.4}')
> > >>>>>>>>>>>>>     |         | when id = 1 then parse_json('{"a":
> > >>>>> [1,2,3],
> > >>>>>> "b":
> > >>>>>>>>>>>>> "hello", "c": {"x": 0}}')
> > >>>>>>>>>>>>>     |         | when id = 2 then parse_json('{"A": 1,
> > >>>>> "c":
> > >>>>>>> 1.23}')
> > >>>>>>>>>>>>>     |         | end v from range(3)
> > >>>>>>>>>>>>>     |         |""".stripMargin)
> > >>>>>>>>>>>>> scala>
> > >>>>> df.write.mode("overwrite").parquet("/tmp/shredded_test")
> > >>>>>>>>>>>>> scala> spark.read.parquet("/tmp/shredded_test").show
> > >>>>>>>>>>>>> +--------------------+
> > >>>>>>>>>>>>> |                   v|
> > >>>>>>>>>>>>> +--------------------+
> > >>>>>>>>>>>>> |{"a":1,"b":"2","c...|
> > >>>>>>>>>>>>> |{"a":[1,2,3],"b":...|
> > >>>>>>>>>>>>> |    {"A":1,"c":1.23}|
> > >>>>>>>>>>>>> +--------------------+
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On Sun, Apr 6, 2025 at 2:54 PM Andrew Lamb <
> > >>>>>>> andrewlam...@gmail.com
> > >>>>>>>>>
> > >>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Can someone (pretty pretty) please give us some binary
> > >>>>>> examples
> > >>>>>>>> so
> > >>>>>>>>> we
> > >>>>>>>>>>>>> can
> > >>>>>>>>>>>>>> make faster progress on the Rust implementation?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> We recently got exciting news[1] that folks from the CMU
> > >>>>>>> database
> > >>>>>>>>>>> group
> > >>>>>>>>>>>>>> have started working on the Rust implementation of
> > >>>>> variant,
> > >>>>>>> and I
> > >>>>>>>>>>> would
> > >>>>>>>>>>>>>> very much like to encourage and support their work.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I am willing to do some legwork (make a PR to
> > >>>>> parquet-testing
> > >>>>>>> for
> > >>>>>>>>>>>>> example)
> > >>>>>>>>>>>>>> if someone can point me to the files (or instructions
> > >>>>> on how
> > >>>>>> to
> > >>>>>>>> use
> > >>>>>>>>>>> some
> > >>>>>>>>>>>>>> system to create variants).
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I was hoping that since the VARIANT format[2] and draft
> > >>>>>>> shredding
> > >>>>>>>>>>>>> spec[3]
> > >>>>>>>>>>>>>> have been in the repo for 6 months (since October 2024)
> > >>>>> , it
> > >>>>>>>> would
> > >>>>>>>>> be
> > >>>>>>>>>>>>>> straightforward to provide some examples. Do we know
> > >>>>> anything
> > >>>>>>>> that
> > >>>>>>>>> is
> > >>>>>>>>>>>>>> blocking the creation of examples?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Andrew
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> [1]:
> > >>>>>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>
> > >>>>>
> > https://github.com/apache/arrow-rs/issues/6736#issuecomment-2781556103
> > >>>>>>>>>>>>>> [2]:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>
> > >>>>>>
> > >>>>>
> > https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
> > >>>>>>>>>>>>>> [3]:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>
> > >>>>>
> > https://github.com/apache/parquet-format/blob/master/VariantShredding.md
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Wed, Mar 5, 2025 at 3:58 PM Julien Le Dem <
> > >>>>>>> jul...@apache.org>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> That sounds like a great suggestion to me.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Wed, Mar 5, 2025 at 12:41 PM Andrew Lamb <
> > >>>>>>>>>>> andrewlam...@gmail.com>
> > >>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> I would like to request before the VARIANT spec
> > >>>>> changes
> > >>>>>> are
> > >>>>>>>>>>>>> finalized
> > >>>>>>>>>>>>>>> that
> > >>>>>>>>>>>>>>>> we have example data in parquet-testing.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> This topic came up (well, I brought it up) on the
> > >>>>> sync
> > >>>>>> call
> > >>>>>>>>>>> today.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> In my opinion, having example files would reduce the
> > >>>>>>> overhead
> > >>>>>>>>> of
> > >>>>>>>>>>> new
> > >>>>>>>>>>>>>>>> implementations dramatically. At least there should
> > >>>>> be
> > >>>>>>>> example
> > >>>>>>>>> of
> > >>>>>>>>>>>>>>>> * variant columns (no shredding)
> > >>>>>>>>>>>>>>>> * variant columns with shredding
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Some description of what those files contained
> > >>>>> ("expected
> > >>>>>>>>>>>>> contents"). For
> > >>>>>>>>>>>>>>>> prior art, here is what Dewey did for the geometry
> > >>>>>>>> type[1][2].
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> When looking for prior discussions, I found a great
> > >>>>> quote
> > >>>>>>>> from
> > >>>>>>>>>>> Gang
> > >>>>>>>>>>>>> Wu[3]
> > >>>>>>>>>>>>>>>> on this topic:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> I'd say that a lesson learned is that we should
> > >>>>>> publish
> > >>>>>>>>>>> example
> > >>>>>>>>>>>>> files
> > >>>>>>>>>>>>>>>> for any
> > >>>>>>>>>>>>>>>>> new feature to the parquet-testing [1] repo for
> > >>>>>>>>>>> interoperability
> > >>>>>>>>>>>>> tests.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Thank you for your consideration,
> > >>>>>>>>>>>>>>>> Andrew
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> [1]
> > >>>>> https://github.com/apache/parquet-testing/pull/70
> > >>>>>>>>>>>>>>>> [2] https://github.com/geoarrow/geoarrow-data
> > >>>>>>>>>>>>>>>> [3]:
> > >>>>>>>>>>>>>
> > >>>>>> https://lists.apache.org/thread/71d7p9lprhf514jnt5dgnw4wfmn8ykzt
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> >
>
