Re: [DISCUSS] V4 - Parquet as Metadata File Format

Jacky Lee Mon, 16 Jun 2025 18:45:48 -0700

Count me in. This solution effectively addresses the small files issue
caused by high-frequency writes in our scenario, and it also greatly
benefits the generation of partition- and table-level statistics.


On Sat, Jun 14, 2025 at 07:04, <mlhsmode...@gmail.com> wrote:
>
> I'm interested in working on this change as well. I think it pairs nicely 
> with the proposal for per column structs for statistics.
>
> Thanks,
> Harman
>
> On Thu, Jun 12, 2025 at 9:43 PM Russell Spitzer <russell.spit...@gmail.com> 
> wrote:
>>
>> It’s not required at compile time, only at test runtime.
>>
>> On Thu, Jun 12, 2025 at 8:37 PM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>
>>> > All we have to do is add the parquet module as a test dependency, working 
>>> > on a poc now.
>>>
>>> This will be a circular dependency on the core module. That's why I
>>> suggested abstracting out the test cases and executing them in the parquet
>>> module. Partition stats writing (as parquet) from the core module uses
>>> `InternalData` and does the same now. So, I guess it will be similar work
>>> (but on a larger scale due to test case refactoring).
>>>
>>> Let me know the results of your POC and happy to collaborate on this work.
>>>
>>>
>>> - Ajantha
>>>
>>> On Fri, Jun 13, 2025 at 3:16 AM Russell Spitzer <russell.spit...@gmail.com> 
>>> wrote:
>>>>
>>>> All we have to do is add the parquet module as a test dependency, working 
>>>> on a poc now. I don't think we really need to block on any other projects 
>>>> although I'll probably hold off on any work on manifest-list since I hope 
>>>> it won't be needed.
>>>>
>>>> On Thu, May 29, 2025 at 8:37 PM Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>>>>
>>>>> I am interested in working on this proposal.
>>>>> I assume the approach is to use `InternalData` with `parquet` as the
>>>>> format. The challenge will be the test cases: the core module cannot
>>>>> write the parquet metadata because of the circular dependency. I guess
>>>>> we need to abstract out the test cases in the core module and run them
>>>>> from the parquet module.
>>>>>
>>>>> I can work on a design doc as well. So, add me as a collaborator for the 
>>>>> document.
>>>>> But should this work be done after we complete the work on "single file
>>>>> commit in v4", since the metadata structure can change?
>>>>>
>>>>> - Ajantha
>>>>>
>>>>> On Thu, May 29, 2025 at 11:37 PM Russell Spitzer 
>>>>> <russell.spit...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Y'all
>>>>>>
>>>>>> As discussed in the last community sync, we are beginning to gather up 
>>>>>> folks who are interested in various efforts for Iceberg V4. To that end,
>>>>>> I'd like to use this thread as a gathering point for folks interested in 
>>>>>> the metadata file format shift to Parquet. I wrote a quick abstract to
>>>>>> describe the purpose of this group.
>>>>>>
>>>>>> Following this, I'll be working on a full design document. If someone
>>>>>> has one in prod, please let us know and we can start discussing/working on
>>>>>> it there.
>>>>>>
>>>>>> Abstract: Parquet as Metadata File Format
>>>>>>
>>>>>> Currently, the Iceberg SDK and Spec use Avro files for all
>>>>>> Manifest Lists and Manifests. The row-oriented format was selected
>>>>>> because it was assumed that most metadata would be read in its entirety.
>>>>>> This has seldom turned out to be the case, and the ability to read
>>>>>> single elements of the metrics would be very useful for query planning.
>>>>>> To address this, we propose switching the underlying manifest format
>>>>>> from Avro to Parquet. In V4, Avro files would still be readable, but all
>>>>>> new metadata files would be written in Parquet instead of Avro.
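
To make the abstract's point about reading single elements of the metrics concrete, here is a minimal sketch in plain Python. It is not Iceberg or Parquet code: the field names (`file_path`, `record_count`, `lower_bound`, `upper_bound`) and the helper functions are hypothetical, and the "fields read" counter is only a rough stand-in for I/O cost. It compares planning over whole records (row-oriented, as with Avro) against planning over just the bound columns (column-oriented, as with Parquet).

```python
# Hypothetical manifest entries; field names are illustrative, not the Iceberg spec.
entries = [
    {"file_path": "a.parquet", "record_count": 100, "lower_bound": 1, "upper_bound": 50},
    {"file_path": "b.parquet", "record_count": 200, "lower_bound": 51, "upper_bound": 90},
    {"file_path": "c.parquet", "record_count": 150, "lower_bound": 91, "upper_bound": 140},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per field (a columnar layout)."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def plan_row_oriented(rows, value):
    """Decode whole records: every field of every entry is touched."""
    fields_read = 0
    matches = []
    for row in rows:
        fields_read += len(row)  # a row format hands back the full record
        if row["lower_bound"] <= value <= row["upper_bound"]:
            matches.append(row["file_path"])
    return matches, fields_read


def plan_column_oriented(columns, value):
    """Read only the two bound columns, then fetch paths for matching entries."""
    lows, highs = columns["lower_bound"], columns["upper_bound"]
    fields_read = len(lows) + len(highs)
    hits = [i for i, (lo, hi) in enumerate(zip(lows, highs)) if lo <= value <= hi]
    paths = [columns["file_path"][i] for i in hits]
    fields_read += len(paths)  # only matching paths are looked up
    return paths, fields_read


row_result = plan_row_oriented(entries, 60)
col_result = plan_column_oriented(to_columns(entries), 60)
print(row_result, col_result)
```

Both planners select the same file, but the columnar one touches fewer values. That gap widens as manifests gain more per-column statistics that a given query never needs.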
