Re: Next parquet sync conflicts with the Iceberg summit

Steve Loughran Wed, 08 Apr 2026 03:32:42 -0700

I do have some (bad) news about Parquet variant file read performance, but
I have my own commitments.

I will put up a detailed gist covering this. For now, know this: shredded
variant performance is really bad. I had hoped to talk about the
Iceberg-level issues last week; hopefully I will get space on the agenda
next time.

At the Parquet level, here are some benchmarks comparing shredded and
unshredded files. Ignore the numbers, just look at the line lengths.

   1. graph 1: reading all the data in the variant. shredded is slower
   2. graph 2: reading some of the columns, using the parquet schema of the
   file. unshredded is faster
   3. graph 3: reading that same subset of columns, but now with a "lean"
   schema that explicitly asks for

[image: Screenshot 2026-04-01 at 16.38.33.png]

Schema for graph 2, the one used to create the file:
  public static final String UNSHREDDED_SCHEMA = "message vschema {"
      + "required int64 id;"
      + "required int32 category;"
      + "optional group nested (VARIANT(1)) {"
      + "  required binary metadata;"
      + "  required binary value;"
      + "  }"
      + "}";

Schema for graph 3, which explicitly expects the shredded values and
declares the typed_value struct with the single shredded field "varcategory"
which we want:

  public static final String SELECT_SCHEMA = "message vschema {"
      + "required int64 id;"
      + "required int32 category;"
      + "optional group nested (VARIANT(1)) {"
      + "  required binary metadata;"
      + "  optional binary value;"
      + "  optional group typed_value {"
      + "    required group varcategory {"
      + "      optional binary value;"
      + "      optional int32 typed_value;"
      + "      }"
      + "    }"
      + "  }"
      + "}";

Like I said, I'll do a gist. I am now doing some profiling and should be
able to cut out a buffer -> string -> buffer conversion sequence which
takes place, simply by having VariantBuilder add a package-private
operation:

  void appendAsString(Binary binary) {
    onAppend();
    writeUTF8bytes(binary.getBytesUnsafe());
  }

The current conversion, spread across two methods, is effectively
  binary.toStringUsingUTF8().getBytes(StandardCharsets.UTF_8);
This shows up on the profile flamegraphs because of the memory operations.
Assuming strings are common in variants, this should help.
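
To illustrate why the round trip is redundant, here is a minimal
stand-alone sketch. It does not use the Parquet classes: the class name
Utf8CopySketch and the methods viaString() and direct() are hypothetical
stand-ins for the existing two-method path and the proposed
appendAsString() respectively. It assumes the input is already valid
UTF-8; for malformed input the decode step would substitute replacement
characters, so the two paths would no longer produce identical bytes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not the VariantBuilder implementation.
final class Utf8CopySketch {

  // Round trip: decode the UTF-8 bytes into a String, then encode the
  // String back into a fresh byte array. Two allocations plus transcoding.
  static byte[] viaString(byte[] utf8) {
    String decoded = new String(utf8, StandardCharsets.UTF_8);
    return decoded.getBytes(StandardCharsets.UTF_8);
  }

  // Direct path: append the bytes to the output buffer unchanged.
  static byte[] direct(byte[] utf8) {
    ByteArrayOutputStream out = new ByteArrayOutputStream(utf8.length);
    out.write(utf8, 0, utf8.length);
    return out.toByteArray();
  }

  public static void main(String[] args) {
    byte[] src = "variant string: héllo, wörld".getBytes(StandardCharsets.UTF_8);
    // For valid UTF-8 input both paths yield the same bytes.
    System.out.println(Arrays.equals(viaString(src), direct(src)));
  }
}
```

The point of the sketch is only that, for valid UTF-8, the intermediate
String contributes nothing to the output, so skipping it saves the decode,
the encode and the temporary allocations.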

It'd be interesting to know:

   1. the structure of variants people are currently storing
   2. any queries which are being made of their contents, both filtering
   and projection.

On Tue, 7 Apr 2026 at 21:53, Julien Le Dem <[email protected]> wrote:

> Thank you!
>
> On Tue, Apr 7, 2026 at 12:23 PM Andrew Lamb <[email protected]> wrote:
>
> > I can help facilitate the meeting tomorrow.
> >
> > On Tue, Apr 7, 2026 at 3:13 PM Julien Le Dem <[email protected]> wrote:
> >
> > > Please reply by end of day to volunteer to facilitate the meeting
> > > tomorrow.
> > > Otherwise, I'll cancel it.
> > >
> > > On Mon, Apr 6, 2026 at 8:55 AM Julien Le Dem <[email protected]> wrote:
> > >
> > > > Hello all,
> > > > The next Parquet sync on Wednesday is conflicting with the Iceberg
> > > > summit.
> > > > (10am PT - 1pm ET - 7pm CET)
> > > > I will not be able to facilitate the meeting and I suspect some of
> > > > the regular attendees will be at the conference.
> > > > Is there a volunteer to facilitate the meeting? (basically, just some
> > > > time management and making sure notes are taken)
> > > > Otherwise, we can also skip this one and reconvene in 2 weeks.
> > > > Best,
> > > > Julien
> > > >
> > >
> >
>
