> Goals: Improve performance and stability reading wide-schema Parquet files
> (10K+ columns). This requires (1) faster access to column metadata in the
> footer, and (2) reducing footer bloat.
I wonder if you can clarify what you mean by "footer bloat"?

As Raphael pointed out earlier, switching to Flatbuffers in the footer seems to increase footer bloat, in the sense that files require more bytes to store the same data. This is true in both cases:

1. In the period where two copies of the footer are present (Flatbuffers and Thrift)
2. Likely even for files that only use Flatbuffers, given that Thrift is a relatively compact encoding.

So my point is that maybe we shouldn't conflate the discussion about footer bloat with the Flatbuffers proposal, as it is already proving challenging to reach consensus.

Andrew

On Fri, Apr 10, 2026 at 10:48 AM Divjot Arora via dev <[email protected]> wrote:

> Hi Will,
>
> Thanks for the reply. Some thoughts:
>
> > Consistency in how we evaluate adoption risk
> > Format fragmentation and the double-write penalty
>
> The flatbuffer proposal uses the extension framework to write both footers
> during the transitionary time. Given this, I think option 3 carries less
> adoption risk than option 2: with this framework, readers that lack support
> just see a Thrift footer and ignore the rest. By contrast, option 2
> produces files that existing parquet-java readers cannot parse at all. I
> don’t feel the two-tier ecosystem is a major concern as there would be
> proper comms and a deprecation period for Thrift before a breaking upgrade
> to PAR3. There is no preference for “early adopters”; engines get locked
> out only if they don’t upgrade during this whole period.
>
> > How much of the problem is the format vs. the implementations?
>
> You’re correct that there is a wide gap in the existing Thrift parsers and
> there is likely room for improvement in raw parsing throughput for most/all
> of the implementations. However, the biggest win from the flatbuffer
> proposal comes from removing fields such as path_in_schema that cause
> massive blowup in footer size.
>
> Expanding on the example I mentioned in the previous message: we observed
> one footer in our production fleet that was 367 MB. With a jump table +
> highly optimized Thrift parser: fetching the footer from cloud storage (~50
> MB/s) takes ~7 seconds; even assuming 200 MB/s with aggressive prefetching,
> this is still almost 2 seconds. Assuming the jump table lookup and Thrift
> parsing are free, this is still a long delay before the engine can read
> data for the file. The path_in_schema field accounted for ~57% of the
> footer, so with that removed, the footer is 157 MB and requires 0.8 - 3
> seconds to fetch.
>
> With option 3 (minimal flatbuf): the schema + column chunk placement
> information account for ~11 MB of the total footer (~7% of the footer after
> path_in_schema is removed). This would be appended to the file after the
> Thrift footer, increasing file size by 3%. Fetching just this piece would
> take 220 ms, a 3-13x improvement over the Thrift option, even with
> path_in_schema removed.
>
> > Looking ahead
>
> I fully agree that we should pursue making path_in_schema optional for
> Thrift and the jump table approach as these will greatly improve the
> performance for existing workloads.
> However, when fetch time alone takes seconds, no amount of parsing
> optimization gets us where we need to be.
>
> Best,
> Div
>
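For reference, the fetch-time figures quoted above can be reproduced with a back-of-the-envelope calculation. The following is only a minimal sketch: it assumes fetch time is simply footer size divided by sustained throughput (no request latency, no parallel range reads, no parsing cost). The function name and constant names are illustrative, and the sizes and throughputs are the ones stated in Div's message.

```python
def fetch_seconds(size_mb: float, throughput_mb_s: float) -> float:
    """Seconds to fetch size_mb at a sustained throughput, ignoring latency."""
    return size_mb / throughput_mb_s


# Figures taken from the quoted message (all sizes in MB).
FULL_THRIFT_FOOTER_MB = 367.0     # observed production footer
TRIMMED_THRIFT_FOOTER_MB = 157.0  # quoted size after removing path_in_schema
MINIMAL_FLATBUF_MB = 11.0         # schema + column chunk placement only

SLOW_MB_S = 50.0   # typical cloud storage throughput quoted in the message
FAST_MB_S = 200.0  # throughput assumed with aggressive prefetching

scenarios = [
    ("Thrift footer (full)", FULL_THRIFT_FOOTER_MB),
    ("Thrift, no path_in_schema", TRIMMED_THRIFT_FOOTER_MB),
    ("Minimal flatbuf piece", MINIMAL_FLATBUF_MB),
]

for label, size_mb in scenarios:
    slow = fetch_seconds(size_mb, SLOW_MB_S)
    fast = fetch_seconds(size_mb, FAST_MB_S)
    print(f"{label:<28} {slow:6.2f} s @ 50 MB/s   {fast:6.2f} s @ 200 MB/s")
```

At 50 MB/s this gives roughly 7.3 s, 3.1 s, and 0.22 s for the three cases, in line with the numbers in the message. Under this model the gap between the options comes entirely from the number of bytes fetched, since parsing cost is treated as zero.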
