I'm excited about the proposal to switch to Parquet as the manifest format for v4 of Iceberg. This change, which would include supporting Avro manifests from v1-v3 for table upgrades, looks like a great move.
It aligns perfectly with the v4 column statistics proposal we discussed at today's community sync. Using Parquet also simplifies the v4 implementation and should lead to performance gains and a smaller metadata storage footprint.

Thanks, Russell, for leading this proposal and building the prototype!

Best,
Anoop

On Wed, Aug 6, 2025 at 12:51 AM Sreeram Garlapati <gsreeramku...@gmail.com> wrote:

> +1
> This will be a great progression for the Iceberg format, allowing efficient metadata pruning. Please count me in.
>
> On Tue, Jun 17, 2025 at 3:45 AM Jacky Lee <qcsd2...@gmail.com> wrote:
>
>> Count me in. This solution effectively addresses the small files issue caused by high-frequency writes in our scenario, and it also greatly benefits the generation of partition- and table-level statistics.
>>
>> <mlhsmode...@gmail.com> wrote on Sat, Jun 14, 2025 at 07:04:
>> >
>> > I'm interested in working on this change as well. I think it pairs nicely with the proposal for per column structs for statistics.
>> >
>> > Thanks,
>> > Harman
>> >
>> > On Thu, Jun 12, 2025 at 9:43 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>> >>
>> >> It’s not required at compile time, only at test runtime.
>> >>
>> >> On Thu, Jun 12, 2025 at 8:37 PM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>> >>>
>> >>> > All we have to do is add the parquet module as a test dependency, working on a poc now.
>> >>>
>> >>> This will be a circular dependency on the core module. That's why I suggested abstracting out the test cases and executing them in a parquet module. Partition stats writing (as parquet) from the core module uses `InternalData` and does the same now. So, I guess it will be a similar work (but on a larger scale due to testcase refactoring).
>> >>>
>> >>> Let me know the results of your POC and happy to collaborate on this work.
>> >>>
>> >>>
>> >>> - Ajantha
>> >>>
>> >>> On Fri, Jun 13, 2025 at 3:16 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>> >>>>
>> >>>> All we have to do is add the parquet module as a test dependency, working on a poc now. I don't think we really need to block on any other projects although I'll probably hold off on any work on manifest-list since I hope it won't be needed.
>> >>>>
>> >>>> On Thu, May 29, 2025 at 8:37 PM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>> >>>>>
>> >>>>> I am interested in working on this proposal.
>> >>>>> I would assume it is to use `InternalData` with the format as `parquet`. But the challenge will be the test cases, the core module cannot write the parquet metadata due to circular dependency. We need to abstract out the test cases in the core module and run them from the parquet module I guess.
>> >>>>>
>> >>>>> I can work on a design doc as well. So, add me as a collaborator for the document.
>> >>>>> But should this work be done after we complete the work on "single file commit in v4" ? because metadata structure can change?
>> >>>>>
>> >>>>> - Ajantha
>> >>>>>
>> >>>>> On Thu, May 29, 2025 at 11:37 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>> >>>>>>
>> >>>>>> Hi Y'all
>> >>>>>>
>> >>>>>> As discussed in the last community sync, we are beginning to gather up folks who are interested in various efforts for Iceberg V4. To that end, I'd like to use this thread as a gathering point for folks interested in the metadata file format shift to Parquet. I wrote a quick abstract to describe the purpose of this group.
>> >>>>>>
>> >>>>>> Following this I'll be working on a full design document or if someone has one in prod please let us know and we can start discussing/working on it there.
>> >>>>>>
>> >>>>>> Abstract: Parquet as Metadata File Format
>> >>>>>>
>> >>>>>> Currently the Iceberg SDK and Spec use Avro file format files for all Manifest Lists and Manifests. The row oriented format was selected because it was assumed that most metadata would be read in its entirety. This has turned out to seldom be the case and the ability to read single elements of the metrics would be very useful for query planning. To address this we propose switching the underlying manifest format from Avro to Parquet. In V4, Avro files would still be readable but all new metadata files would be written in Parquet instead of Avro.
>> >
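
For anyone new to the thread, the abstract's point about reading single elements of the metrics can be illustrated with a toy sketch. This is not Iceberg code: the field names are hypothetical, and JSON stands in for the real Avro and Parquet encodings. It only shows why a columnar layout lets a planner touch fewer bytes when it needs a single metric.

```python
import json

# Hypothetical manifest entries; field names are illustrative only,
# not the Iceberg manifest schema.
entries = [
    {
        "file_path": f"data/file-{i}.parquet",
        "record_count": 1000 + i,
        "lower_bound_id": i * 100,
        "upper_bound_id": i * 100 + 99,
    }
    for i in range(5)
]

# Row-oriented layout (Avro-like): each record is stored whole.
row_blocks = [json.dumps(e).encode() for e in entries]

# Column-oriented layout (Parquet-like): each field is stored contiguously.
column_chunks = {
    name: json.dumps([e[name] for e in entries]).encode()
    for name in entries[0]
}


def upper_bounds_from_rows(blocks):
    """Decode every full record just to extract one field."""
    bytes_read = sum(len(b) for b in blocks)
    values = [json.loads(b)["upper_bound_id"] for b in blocks]
    return values, bytes_read


def upper_bounds_from_columns(chunks):
    """Decode only the chunk that holds the requested field."""
    chunk = chunks["upper_bound_id"]
    return json.loads(chunk), len(chunk)


row_values, row_bytes = upper_bounds_from_rows(row_blocks)
col_values, col_bytes = upper_bounds_from_columns(column_chunks)
print(f"row layout read {row_bytes} bytes, column layout read {col_bytes} bytes")
```

Both paths return the same upper bounds, but the columnar path decodes only the one chunk it needs. Real Parquet adds row groups, encodings, and footer statistics on top of this idea, none of which the sketch tries to model.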