I'm interested in working on this change as well. I think it pairs nicely with the proposal for per-column structs for statistics.

Thanks,
Harman

On Thu, Jun 12, 2025 at 9:43 PM Russell Spitzer <russell.spit...@gmail.com> wrote:

> It’s not required at compile time, only at test runtime.
>
> On Thu, Jun 12, 2025 at 8:37 PM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>
>> > All we have to do is add the parquet module as a test dependency, working on a poc now.
>>
>> This will be a circular dependency on the core module. That's why I suggested abstracting out the test cases and executing them in a parquet module. Partition stats writing (as parquet) from the core module uses `InternalData` and does the same now. So, I guess it will be a similar work (but on a larger scale due to testcase refactoring).
>>
>> Let me know the results of your POC and happy to collaborate on this work.
>>
>> - Ajantha
>>
>> On Fri, Jun 13, 2025 at 3:16 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>
>>> All we have to do is add the parquet module as a test dependency, working on a poc now. I don't think we really need to block on any other projects although I'll probably hold off on any work on manifest-list since I hope it won't be needed.
>>>
>>> On Thu, May 29, 2025 at 8:37 PM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>
>>>> I am interested in working on this proposal.
>>>> I would assume it is to use `InternalData` with the format as `parquet`. But the challenge will be the test cases, the core module cannot write the parquet metadata due to circular dependency. We need to abstract out the test cases in the core module and run them from the parquet module I guess.
>>>>
>>>> I can work on a design doc as well. So, add me as a collaborator for the document.
>>>> But should this work be done after we complete the work on "single file commit in v4" ? because metadata structure can change?
>>>>
>>>> - Ajantha
>>>>
>>>> On Thu, May 29, 2025 at 11:37 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>
>>>>> Hi Y'all
>>>>>
>>>>> As discussed in the last community sync, we are beginning to gather up folks who are interested in various efforts for Iceberg V4. To that end, I'd like to use this thread as a gathering point for folks interested in the metadata file format shift to Parquet. I wrote a quick abstract to describe the purpose of this group.
>>>>>
>>>>> Following this I'll be working on a full design document or if someone has one in prod please let us know and we can start discussing/working on it there.
>>>>>
>>>>> *Abstract: Parquet as Metadata File Format*
>>>>>
>>>>> Currently the Iceberg SDK and Spec use Avro file format files for all Manifest Lists and Manifests. The row oriented format was selected because it was assumed that most metadata would be read in its entirety. This has turned out to seldom be the case and the ability to read single elements of the metrics would be very useful for query planning. To address this we propose switching the underlying manifest format from Avro to Parquet. In V4, Avro files would still be readable but all new metadata files would be written in Parquet instead of Avro.
>>>>>