I'm interested in working on this change as well. I think it pairs nicely
with the proposal for per column structs for statistics.

Thanks,
Harman

On Thu, Jun 12, 2025 at 9:43 PM Russell Spitzer <russell.spit...@gmail.com>
wrote:

> It’s not required at compile time, only at test runtime.
>
> On Thu, Jun 12, 2025 at 8:37 PM Ajantha Bhat <ajanthab...@gmail.com>
> wrote:
>
>> > All we have to do is add the parquet module as a test dependency,
>> > working on a poc now.
>>
>> This will be a circular dependency on the core module. That's why I
>> suggested abstracting out the test cases and executing them in a parquet
>> module. Partition stats writing (as parquet) from the core module uses
>> `InternalData` and does the same now. So, I guess it will be similar work
>> (but on a larger scale due to test case refactoring).
>>
>> Let me know the results of your POC; I'm happy to collaborate on this
>> work.
>>
>>
>> - Ajantha
>>
>> On Fri, Jun 13, 2025 at 3:16 AM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>>> All we have to do is add the parquet module as a test dependency,
>>> working on a poc now. I don't think we really need to block on any other
>>> projects although I'll probably hold off on any work on manifest-list since
>>> I hope it won't be needed.
>>>
>>> On Thu, May 29, 2025 at 8:37 PM Ajantha Bhat <ajanthab...@gmail.com>
>>> wrote:
>>>
>>>> I am interested in working on this proposal.
>>>> I would assume it is to use `InternalData` with the format as
>>>> `parquet`. But the challenge will be the test cases: the core module cannot
>>>> write the parquet metadata due to a circular dependency. I guess we need to
>>>> abstract out the test cases in the core module and run them from the
>>>> parquet module.
>>>>
>>>> I can work on a design doc as well, so please add me as a collaborator on
>>>> the document.
>>>> But should this work be done after we complete the work on "single file
>>>> commit in v4", because the metadata structure can change?
>>>>
>>>> - Ajantha
>>>>
>>>> On Thu, May 29, 2025 at 11:37 PM Russell Spitzer <
>>>> russell.spit...@gmail.com> wrote:
>>>>
>>>>> Hi Y'all
>>>>>
>>>>> As discussed in the last community sync, we are beginning to gather up
>>>>> folks who are interested in various efforts for Iceberg V4. To that end,
>>>>> I'd like to use this thread as a gathering point for folks interested
>>>>> in the metadata file format shift to Parquet. I wrote a quick abstract
>>>>> to describe the purpose of this group.
>>>>>
>>>>> Following this, I'll be working on a full design document, or if someone
>>>>> has one in prod, please let us know and we can start discussing/working on
>>>>> it there.
>>>>>
>>>>> *Abstract: Parquet as Metadata File Format*
>>>>>
>>>>> Currently the Iceberg SDK and Spec use Avro files for all Manifest
>>>>> Lists and Manifests. The row-oriented format was selected because it
>>>>> was assumed that most metadata would be read in its entirety. This has
>>>>> turned out to seldom be the case, and the ability to read single
>>>>> elements of the metrics would be very useful for query planning. To
>>>>> address this, we propose switching the underlying manifest format from
>>>>> Avro to Parquet. In V4, Avro files would still be readable, but all new
>>>>> metadata files would be written in Parquet instead of Avro.
>>>>>
>>>>
