The Parquet metadata proposal (linked by Fokko) mainly addresses read
performance degradation caused by bloated footer metadata.

What Peter described in the description seems useful for ML
feature-engineering workloads. When a new set of features/columns is added to
the table, Iceberg currently requires rewriting all data files to combine the
old and new columns (write amplification). Similarly, the community has
previously discussed the use case of updating a single column, which would
likewise require rewriting all data files.
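
To make the idea concrete, below is a minimal sketch (not the Iceberg spec;
the class and field names are invented for illustration) of how per-file
column-id metadata, along the lines of Peter's proposed `_file_path` /
`_file_column_ids` pair, could let a reader open only the physical files that
cover the projected columns, so adding new feature columns means writing new
column files rather than rewriting existing ones:

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only, not Iceberg's metadata model: one logical
// data file is split into several physical Parquet files, each holding a
// subset of the table's columns.
public class ColumnSplitPlanner {

  // One physical file and the column ids it contains (mirroring the proposed
  // `_file_path` / `_file_column_ids` pair).
  record FileSlice(String filePath, Set<Integer> columnIds) {}

  // Given the slices that make up one logical data file and the column ids a
  // query projects, return the files that must be opened and the ids each
  // one contributes.
  static Map<String, List<Integer>> planScan(List<FileSlice> slices, Set<Integer> projectedIds) {
    Map<String, List<Integer>> plan = new LinkedHashMap<>();
    for (FileSlice slice : slices) {
      List<Integer> needed = new ArrayList<>();
      for (int id : slice.columnIds()) {
        if (projectedIds.contains(id)) {
          needed.add(id);
        }
      }
      if (!needed.isEmpty()) {
        plan.put(slice.filePath(), needed);
      }
    }
    return plan;
  }

  public static void main(String[] args) {
    // The base file holds columns 1..3; a later feature-engineering job added
    // columns 4..5 as a separate file instead of rewriting the whole table.
    List<FileSlice> slices = List.of(
        new FileSlice("s3://bucket/table/data/part-0-base.parquet", Set.of(1, 2, 3)),
        new FileSlice("s3://bucket/table/data/part-0-features.parquet", Set.of(4, 5)));

    // A query projecting one old and one new column only touches the slices
    // that actually contain those ids; the reader would then zip the two
    // column streams back into a single row iterator.
    System.out.println(planScan(slices, Set.of(1, 5)));
  }
}

The actual merging of the per-file readers into one iterator would of course
live behind the new File Format API; the sketch only shows the planning step.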

On Mon, May 26, 2025 at 2:42 PM Péter Váry <peter.vary.apa...@gmail.com>
wrote:

> Do you have the link at hand for the thread where this was discussed on
> the Parquet list?
> The docs seem quite old, and the PR stale, so I would like to understand
> the situation better.
> If it is possible to do this in Parquet, that would be great, but Avro and
> ORC would still suffer.
>
> On Mon, May 26, 2025 at 22:07, Amogh Jahagirdar <2am...@gmail.com> wrote:
>
>> Hey Peter,
>>
>> Thanks for bringing this issue up. I think I agree with Fokko; the issue
>> of wide tables leading to Parquet metadata bloat and poor Thrift
>> deserialization performance is a long-standing one that I believe there is
>> motivation in the community to address. So to me it seems better to address
>> it in Parquet itself rather than have the Iceberg library facilitate a
>> pattern that works around the limitations.
>>
>> Thanks,
>> Amogh Jahagirdar
>>
>> On Mon, May 26, 2025 at 1:42 PM Fokko Driesprong <fo...@apache.org>
>> wrote:
>>
>>> Hi Peter,
>>>
>>> Thanks for bringing this up. Wouldn't it make more sense to fix this in
>>> Parquet itself? It has been a long-running issue in Parquet, and there is
>>> still active interest from the community. There is a PR to replace the
>>> footer with FlatBuffers, which dramatically improves performance
>>> <https://github.com/apache/arrow/pull/43793>. The underlying proposal
>>> can be found here
>>> <https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit?tab=t.0#heading=h.atbrz9ch6nfa>
>>> .
>>>
>>> Kind regards,
>>> Fokko
>>>
>>> On Mon, May 26, 2025 at 20:35, yun zou <yunzou.colost...@gmail.com> wrote:
>>>
>>>> +1, I am really interested in this topic. Performance has always been a
>>>> problem when dealing with wide tables, not just during read/write but also
>>>> during compilation. Most ML use cases typically exhibit a vectorized
>>>> read/write pattern, and I am also wondering if there is any way at the
>>>> metadata level to help the whole compilation and execution process. I do
>>>> not have any answer for this yet, but I would be really interested in
>>>> exploring this further.
>>>>
>>>> Best Regards,
>>>> Yun
>>>>
>>>> On Mon, May 26, 2025 at 9:14 AM Pucheng Yang
>>>> <py...@pinterest.com.invalid> wrote:
>>>>
>>>>> Hi Peter, I am interested in this proposal. What's more, I am curious
>>>>> whether there is a similar story on the write side as well (how to
>>>>> generate these split files) and, specifically, whether you are targeting
>>>>> feature backfill use cases in ML?
>>>>>
>>>>> On Mon, May 26, 2025 at 6:29 AM Péter Váry <
>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>
>>>>>> Hi Team,
>>>>>>
>>>>>> In machine learning use-cases, it's common to encounter tables with a
>>>>>> very high number of columns - sometimes even in the range of several
>>>>>> thousand. I've seen cases with up to 15,000 columns. Storing such wide
>>>>>> tables in a single Parquet file is often suboptimal, as Parquet can
>>>>>> become a bottleneck, even when only a subset of columns is queried.
>>>>>>
>>>>>> A common approach to mitigate this is to split the data across
>>>>>> multiple Parquet files. With the upcoming File Format API, we could
>>>>>> introduce a layer that combines these files into a single iterator,
>>>>>> enabling efficient reading of wide and very wide tables.
>>>>>>
>>>>>> To support this, we would need to revise the metadata specification.
>>>>>> Instead of the current `_file` column, we could introduce a `_files`
>>>>>> column containing:
>>>>>> - `_file_column_ids`: the column IDs present in each file
>>>>>> - `_file_path`: the path to the corresponding file
>>>>>>
>>>>>> Has there been any prior discussion around this idea?
>>>>>> Is anyone else interested in exploring this further?
>>>>>>
>>>>>> Best regards,
>>>>>> Peter
>>>>>>
>>>>>
