Parquet Sync Notes 2026-04-08 (as Next parquet sync conflicts with the Iceberg summit)

Andrew Lamb Wed, 08 Apr 2026 13:30:12 -0700

Hello,

While we missed Julien's magnetic personality I do think we had a good sync
today. Notes are below.

I will also follow up by sending a note about the Variant JSON parser,
which seems not to have made it through to the list.

Andrew

Attendees:
Andrew Lamb - InfluxData - Listening In
Robert Kruszewski - Spiral - Listening In
Divjot Arora - Databricks - flatbuf footer
Alkis Evlogimenos - Databricks - flatbuf footer
Will Edwards - Spotify - Listening in
Martin Prammer - CMU - Listening in
Connor Tsui - Spiral/Vortex - Listening in
Jiayi Wang - Databricks - flatbuf footer
Steve Loughran, variant perf, and has to leave at 18:30
Gaurav Miglani: Zepto, Listening
Dusan Paripovic, RTE, listening in
Vivek Jhaver, Independent, Listening in

Agenda:
Variant Performance (Steve Loughran)
Using a fixed structure (structure does not vary between rows)
Has been benchmarking Parquet Variant in Java using Spark 4.x with Iceberg.
Found that Avro is 2x faster than Parquet variant
Asks: please help review some improvements that Steve has found
https://github.com/apache/iceberg/pull/15629
https://github.com/apache/parquet-java/pull/3452
Would help to have some real world example datasets of Variant to test with
(this benchmark was using synthetic datasets)
Slides:
https://github.com/user-attachments/files/26575558/2026-04-01-variant.reads.considered.suboptimal.2.pdf
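
One of the improvements, described in Steve's earlier message quoted further
down, removes a buffer -> string -> buffer conversion. The sketch below is
only an illustration using plain JDK calls, not parquet-java code: the class
name Utf8RoundTrip and the methods viaString and direct are hypothetical. It
shows that, for valid UTF-8 input, decoding to a String and re-encoding yields
the same bytes, so the intermediate String is pure overhead. For malformed
input the two paths can differ, because decoding substitutes replacement
characters.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8RoundTrip {
    // Mirrors the round trip: decode the bytes to a String, then re-encode.
    // This allocates a String plus a second byte array.
    static byte[] viaString(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
    }

    // Direct path: the bytes are already UTF-8, so they are used as they are.
    // A copy is made here only so both methods return a fresh array.
    static byte[] direct(byte[] utf8) {
        return Arrays.copyOf(utf8, utf8.length);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file ASCII-only.
        byte[] input = "h\u00e9llo, variant \u2713".getBytes(StandardCharsets.UTF_8);
        boolean same = Arrays.equals(viaString(input), direct(input));
        System.out.println("identical bytes: " + same);
    }
}
```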

Flatbuf Footer
Discussions on the mailing list about speeding up current footer rather
than a flat buffer footer.
Flatbuf is trying to solve:
O(1) lookup from a schema with a large number of columns
Remove some of the not so great decisions about information in the footer
the most egregious one is path_in_schema that has quadratic behavior for
nested schemas
Statistics
Converted types / logical types
An alternate proposal was to add some sort of optional index to find
where the column chunks start in O(1) time
One challenge of avoiding path_in_schema is that parquet-mr and fabric both
use parquet-mr so require that field to read parquet files
Question: How do we move forward?
Is it really feasible to stop adding path_in_schema, given that it will
take a while for the ecosystem to upgrade to readers that do not require it?
It is not clear that we have ever had a breaking change to the parquet
format so that path_in_schema.
** Action: Divjot is going to summarize the options we have for handling
many columns on the mailing list as a way to try and get consensus and push
it forward
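
The quadratic behavior of path_in_schema noted above can be illustrated with a
small sketch. It assumes a hypothetical worst-case schema (not taken from the
discussion) that nests one group inside another, with one leaf column at each
level. Because each column chunk stores its full path as a list of strings,
the leaf at level k carries k path elements, so the total is d(d+1)/2 for a
nesting depth of d. The class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

class PathInSchemaCost {
    // Builds the full path of every leaf in a hypothetical schema nested
    // `depth` levels deep: each level holds one leaf plus the next group.
    static List<List<String>> leafPaths(int depth) {
        List<List<String>> paths = new ArrayList<>();
        List<String> prefix = new ArrayList<>();
        for (int level = 1; level <= depth; level++) {
            List<String> leaf = new ArrayList<>(prefix);
            leaf.add("leaf" + level);      // the leaf column at this level
            paths.add(leaf);
            prefix.add("group" + level);   // descend into the nested group
        }
        return paths;
    }

    // Counts the path elements stored if every leaf repeats its full path.
    static long totalPathElements(int depth) {
        long total = 0;
        for (List<String> path : leafPaths(depth)) {
            total += path.size();
        }
        return total;
    }

    public static void main(String[] args) {
        for (int depth : new int[] {10, 100, 1000}) {
            System.out.println(depth + " leaf columns -> "
                    + totalPathElements(depth) + " path elements");
        }
    }
}
```

For depths of 10, 100 and 1000 the totals are 55, 5050 and 500500, and that
cost is paid again in every row group because the path is stored per column
chunk.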

PR flow:

waiting for a decision on dependencies for variant in:
https://github.com/apache/parquet-java/pull/3415,
Sent mail([email protected]) , Mail title: [DISCUSS] Where should
VariantJsonParser live? (GH-3414)
Can not seem to find the mail in archives
https://lists.apache.org/[email protected]
Message content is below, we’ll make sure it gets to the list

```
I'm working on PR #3415 (GH-3414) which adds a parseJson method to
VariantBuilder for JSON-to-Variant conversion. During review,
@steveloughran raised a good question about where the JSON parsing logic
should live, and suggested bringing it to the dev list.

Currently, VariantJsonParser is in the parquet-variant module. This
requires adding jackson-core as a compile-scope dependency (for
JsonFactory, JsonParser, StreamReadConstraints, etc.) alongside
parquet-jackson at runtime scope, the same pattern used by parquet-hadoop.

Steve suggested moving VariantJsonParser into parquet-jackson instead.

Option A: Keep in parquet-variant (current approach)
  - Keeps all variant logic in one module
  - Follows the existing parquet-hadoop pattern for Jackson dependencies
  - Requires jackson-core at compile scope + parquet-jackson at runtime
    scope

Option B: Move to parquet-jackson
  - Avoids adding Jackson compile dependency to parquet-variant
  - parquet-jackson already has unshaded Jackson internally, so no
    dependency issue
  - Requires parquet-jackson to depend on parquet-variant (for
    VariantBuilder, Variant, etc.)
  - No build cycle as far as I can tell: parquet-variant doesn't currently
    depend on parquet-jackson at compile scope

For reference:
  - PR: https://github.com/apache/parquet-java/pull/3415
  - Similar approach in Rust:
    https://github.com/apache/arrow-rs/blob/d3c79006f2595e144d539f56b3054fe916ab184b/parquet-variant-compute/src/from_json.rs#L47
```

Notes:
Many community members are at the Iceberg summit
Variant PRs
https://github.com/apache/iceberg/pull/15629
https://github.com/apache/parquet-java/pull/3452

On Wed, Apr 8, 2026 at 1:00 PM Andrew Lamb <[email protected]> wrote:

> The community sync is starting now. The URL to join us is:
> https://meet.google.com/gvu-yxxs-jvg?authuser=0
>
>
> On Wed, Apr 8, 2026 at 6:32 AM Steve Loughran <[email protected]> wrote:
>
>> I do have some (bad) news about parquet variant file read performance,
>> but have my own commitments.
>>
>> I will put up a detailed gist covering this. For now know: shredded
>> variant performance is really bad. I had hoped to talk about the
>> iceberg-level issues last week, hopefully I will get space on the agenda
>> next time.
>>
>> At the parquet level, here are some benchmarks comparing shredded
>> and unshredded files. Ignore the numbers, just look at the line lengths.
>>
>>
>>    1. graph 1: reading all the data in the variant. shredded is slower
>>    2. graph 2: reading some of the columns, using the parquet schema of
>>    the file. unshredded is faster
>>    3. graph 3: reading that same subset of columns, but now with a
>>    "lean" schema that explicitly asks for
>>
>>
>> [image: Screenshot 2026-04-01 at 16.38.33.png]
>>
>>
>> Schema for graph 2; the one used to create the file
>>   public static final String UNSHREDDED_SCHEMA = "message vschema {"
>>       + "required int64 id;"
>>       + "required int32 category;"
>>       + "optional group nested (VARIANT(1)) {"
>>       + "  required binary metadata;"
>>       + "  required binary value;"
>>       + "  }"
>>       + "}";
>>
>> Schema for graph 3, which explicitly expects the shredded values and
>> declares the typed_value struct with the single shredded field "varcategory"
>> which we want.
>>
>>   public static final String SELECT_SCHEMA = "message vschema {"
>>       + "required int64 id;"
>>       + "required int32 category;"
>>       + "optional group nested (VARIANT(1)) {"
>>       + "  required binary metadata;"
>>       + "  optional binary value;"
>>       + "  optional group typed_value {"
>>       + "    required group varcategory {"
>>       + "      optional binary value;"
>>       + "      optional int32 typed_value;"
>>       + "      }"
>>       + "    }"
>>       + "  }"
>>       + "}";
>>
>>
>> Like I said, I'll do a gist. I am now doing some profiling and should be
>> able to cut out a buffer -> string -> buffer conversion sequence which
>> takes place, simply by having VariantBuilder add a package private operation
>>
>>   void appendAsString(Binary binary) {
>>     onAppend();
>>     writeUTF8bytes(binary.getBytesUnsafe());
>>   }
>>
>> The current conversion, spread across two methods, is effectively
>>   binary.toStringUsingUTF8().getBytes(StandardCharsets.UTF_8);
>> This shows up on the profile flamegraphs because of the memory
>> operations. Assuming strings are common in variants, this should help.
>>
>> It'd be interesting to know
>>
>>    1. the structure of variants people are currently storing
>>    2. any queries which are being made of their contents, both filtering
>>    and projection.
>>
>>
>>
>> On Tue, 7 Apr 2026 at 21:53, Julien Le Dem <[email protected]> wrote:
>>
>>> Thank you!
>>>
>>> On Tue, Apr 7, 2026 at 12:23 PM Andrew Lamb <[email protected]>
>>> wrote:
>>>
>>> > I can help facilitate the meeting tomorrow.
>>> >
>>> > On Tue, Apr 7, 2026 at 3:13 PM Julien Le Dem <[email protected]>
>>> wrote:
>>> >
>>> > > Please reply by end of day to volunteer to facilitate the meeting
>>> > tomorrow.
>>> > > Otherwise, I'll cancel it.
>>> > >
>>> > > On Mon, Apr 6, 2026 at 8:55 AM Julien Le Dem <[email protected]>
>>> wrote:
>>> > >
>>> > > > Hello all,
>>> > > > The next Parquet sync on Wednesday is conflicting with the Iceberg
>>> > > summit.
>>> > > > (10am PT - 1pm ET - 7pm CET)
>>> > > > I will not be able to facilitate the meeting and I suspect some of
>>> the
>>> > > > regular attendees will be at the conference.
>>> > > > Is there a volunteer to facilitate the meeting? (basically, just
>>> some
>>> > > time
>>> > > > management and making sure notes are taken)
>>> > > > Otherwise, we can also skip this one and reconvene in 2 weeks.
>>> > > > Best,
>>> > > > Julien
>>> > > >
>>> > >
>>> >
>>>
>>
