Hi Divjot,

Thanks for the clear summary of options.

* A couple of thoughts on the framing are worth surfacing before the community converges on a direction:

1. Consistency in how we evaluate adoption risk

Option 2 rightly calls out that making path_in_schema optional is an incompatible change: parquet-java can't parse files without it today, and Spark, EMR, Fabric, etc., would need time to upgrade. That's a real concern. But I think we should apply the same lens equally to Options 3 and 4. A file whose footer metadata lives only in FlatBuffers is entirely unreadable by every Parquet reader deployed today, not just incompatible with one field. The adoption timeline for a completely new serialization format across the full ecosystem (parquet-java, arrow-rs, PyArrow, DuckDB, Trino, etc.) is significantly longer than for a single optional field. If we are weighing adoption cost as a factor against Option 2, it should weigh far more heavily against Options 3 and 4.

2. Format fragmentation and the double-write penalty

Options 3 and 4 both imply a multi-year transition window that risks format fragmentation. To maintain backward compatibility during this transition, writers will be forced to emit both the FlatBuffer and Thrift footers. For wide-table files, this double-write penalty will actually make the metadata bloat problem worse before it gets better.

Furthermore, this extended dual-footer period creates a two-tier ecosystem where some engines are "more equal" than others. Files could go from "works best in one stack" to "only works in one stack" as early adopters eventually drop Thrift writing altogether. Many organizations chose Parquet specifically to avoid vendor lock-in; we should be careful not to reintroduce it through a transition plan that structurally advantages early movers. Additionally, this imposes a significant ongoing maintenance burden on library authors, who will need to support both metadata paths indefinitely.

* How much of the problem is the format vs. the implementations?
The wide performance gap between existing Thrift footer parsers suggests there is significant headroom to be gained in code rather than in format changes. Before we commit to a format-level break, I think it's worth asking how much of that headroom we've actually exhausted. A jump table gives us O(1) access. Smarter writers (omitting useless statistics) reduce bloat at the source. Better parsers close the raw throughput gap. These are changes that benefit every file already in the wild today.

* Looking ahead

Regarding Option 3: If we have O(1) access into the Thrift footer via a jump table, the benefit of a minimal FlatBuffer alongside it is small (limited to queries that need to touch many or all columns when all the fields are in the FlatBuffer) and hard to justify against the cost of dual footers and the maintenance burden of a format transition.

I could imagine the future as: readers making path_in_schema optional soon, but writers continuing to emit it for the foreseeable future, because interoperability is not a nice-to-have. Parquet is the lingua franca of the data ecosystem precisely because any tool can read any file. That property is worth protecting, and I'm keen that we exhaust the available improvements within the current format before risking its fragmentation.

Cheers,
Will

On Fri, 10 Apr 2026 at 10:55, Divjot Arora via dev <[email protected]> wrote:

> Hi folks,
>
> I just realized the table did not render very well, apologies for that.
> Please ignore it, it's just a condensed version of the text.
>
> -- Divjot Arora
>
> On Thu, Apr 9, 2026 at 6:25 PM Divjot Arora <[email protected]> wrote:
>
> > Hi all,
> >
> > Following up from the previous mailing list thread [1] about alternative
> > options to the flatbuffer footer proposal [2].
> >
> > Goals: Improve performance and stability reading wide-schema Parquet files
> > (10K+ columns). This requires (1) faster access to column metadata in the
> > footer, and (2) reducing footer bloat.
> > For example, path_in_schema causes quadratic size blowup with deeply
> > nested schemas - we've seen production files with 300 MB+ footers, almost
> > 60% of which was path_in_schema alone (see the linked original flatbuf
> > proposal for an example).
> >
> > Background: PR #544 proposes a FlatBuffer-based footer written alongside
> > the existing Thrift one via the extension framework. A recent mailing list
> > thread proposes an alternative: leave the Thrift footer as-is and add an
> > optional "jump table" index for O(1) access to individual column chunks.
> >
> > Note: the jump table is complementary to any FlatBuffer approach - it
> > benefits existing files regardless of the path we take for new ones.
> >
> > Options:
> >
> > 1. Jump table only. Add an optional index into the existing Thrift footer
> > - Pros: Simplest approach, no incompatible changes, minimal file size
> > increase, solves faster access
> > - Cons: Does not address footer bloat. For huge footers, O(1) seek helps
> > but the entire footer must still be fetched
> >
> > 2. Jump table + targeted Thrift fixes. Add the jump table and fix the
> > worst bloat sources (e.g., make path_in_schema optional).
> > - Pros: Minimal incompatible changes that address both goals.
> > - Cons: parquet-java cannot parse files with empty path_in_schema. The
> > code change is easy, but this cannot be in effect immediately as
> > open-source Spark and downstream offerings (EMR, Fabric, etc.) would need
> > to upgrade.
> >
> > 3. Minimal FlatBuffer footer. New footer with just schema + column chunk
> > placement. Statistics, page indexes, etc. added as optional modules over
> > time and don’t necessarily need to live in the footer. For the
> > pathological footer case in the flatbuf proposal, the schema and column
> > chunk placement information account for only 3% of the full footer size.
> > Pros: Smallest incremental step toward a redesigned footer.
> > Addresses both goals long-term and allows for an incremental redesign of
> > all fields, not just the most obvious ones. Performance-sensitive engines
> > can leverage the new footer immediately.
> > Cons: Both footers written during transition, increasing file size.
> > Engines that need statistics can't drop the Thrift footer until those
> > modules ship, so near-term benefit is limited.
> >
> > 4. Full FlatBuffer footer. Finalize the FlatBuffer design with all fields
> > from the Thrift footer. The two evolve in lockstep until a format version
> > bump drops Thrift.
> > Pros: Addresses both goals and fully redesigns all footer components.
> > Cons: Largest scope. PR #544 has already generated extended design debate,
> > we risk stalling and preventing any win until the full proposal is agreed
> > upon.
> >
> > Summary Table:
> >
> > Columns:
> >   Option 1: Jump Table Only
> >   Option 2: Jump Table + Thrift Fixes
> >   Option 3: Minimal FlatBuffer
> >   Option 4: Full FlatBuffer
> >
> > What
> >   Option 1: Optional index for O(1) access into existing Thrift footer
> >   Option 2: Jump table + make worst bloat sources optional (e.g.
> >             path_in_schema)
> >   Option 3: New footer with schema + column placement only;
> >             stats/indexes added later
> >   Option 4: Complete FlatBuffer replacement for all Thrift footer fields
> >
> > Faster access
> >   Option 1: Yes
> >   Option 2: Yes
> >   Option 3: Yes
> >   Option 4: Yes
> >
> > Reduces bloat
> >   Option 1: No
> >   Option 2: Yes
> >   Option 3: Yes (long-term)
> >   Option 4: Yes
> >
> > Incompatible changes
> >   Option 1: None
> >   Option 2: Medium, is a breaking format change
> >   Option 3: Dual-write during transition
> >   Option 4: Dual-write, eventual format version bump
> >
> > File size impact
> >   Option 1: Minimal increase
> >   Option 2: Minimal increase
> >   Option 3: Increases (two footers) until Thrift dropped
> >   Option 4: Increases (two footers) until Thrift dropped
> >
> > Scope / risk
> >   Option 1: Simplest
> >   Option 2: Small
> >   Option 3: Medium
> >   Option 4: Largest - risk of stalling on design debate
> >
> > *Main downside*
> >   Option 1: Entire bloated footer still fetched
> >   Option 2: parquet-java can't parse empty path_in_schema; needs upstream
> >             upgrades across Spark/EMR/Fabric
> >   Option 3: Engines needing stats can't drop Thrift until stat modules
> >             ship
> >   Option 4: Extended design debate (PR #544) may block any near-term wins
> >
> > [1] https://lists.apache.org/thread/czm2bk45wwtkhhpqxqvmx9dk5wkwk1kt
> > [2] https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.0#heading=h.ccu4zzsy0tm5
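
P.S. For anyone who wants a feel for why path_in_schema grows quadratically with nesting depth, here is a rough back-of-the-envelope sketch in Python. It is not taken from any Parquet implementation. It assumes only that path_in_schema is a list of path components stored with every column chunk (so once per leaf column per row group), and it counts just the UTF-8 bytes of the component names, ignoring Thrift encoding overhead. The helper names and the "chain" schema shape are hypothetical, chosen as a simple worst case: one leaf at every nesting level.

```python
def chain_schema(depth):
    """Hypothetical worst-case schema: the leaf at level k sits under k-1 structs.

    Returns one path (a list of component names) per leaf column.
    """
    return [[f"s{i}" for i in range(k - 1)] + [f"leaf{k}"]
            for k in range(1, depth + 1)]


def path_in_schema_bytes(leaf_paths, num_row_groups=1):
    """Approximate path_in_schema payload: component bytes per leaf, per row group.

    Assumption: only UTF-8 name bytes are counted, no Thrift framing.
    """
    per_row_group = sum(len(part.encode("utf-8"))
                        for path in leaf_paths
                        for part in path)
    return per_row_group * num_row_groups


for depth in (10, 100, 1000):
    leaves = chain_schema(depth)
    # Total number of path components is depth * (depth + 1) / 2.
    components = sum(len(path) for path in leaves)
    print(depth, components, path_in_schema_bytes(leaves, num_row_groups=10))
```

The component count grows as depth * (depth + 1) / 2 in this shape, and the whole thing is multiplied again by the number of row groups, which is why dropping the field on the write side helps so much for deeply nested schemas.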
