Hey Peter,

Thanks for bringing this issue up. I think I agree with Fokko; wide tables leading to Parquet metadata bloat and poor Thrift deserialization performance is a long-standing issue, and I believe there is motivation in the community to address it. So to me it seems better to address this in Parquet itself rather than have the Iceberg library facilitate a pattern that works around the limitations.
Thanks,
Amogh Jahagirdar

On Mon, May 26, 2025 at 1:42 PM Fokko Driesprong <fo...@apache.org> wrote:

> Hi Peter,
>
> Thanks for bringing this up. Wouldn't it make more sense to fix this in Parquet itself? It has been a long-running issue on Parquet, and there is still active interest from the community. There is a PR to replace the footer with FlatBuffers, which dramatically improves performance <https://github.com/apache/arrow/pull/43793>. The underlying proposal can be found here <https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit?tab=t.0#heading=h.atbrz9ch6nfa>.
>
> Kind regards,
> Fokko
>
> On Mon, May 26, 2025 at 8:35 PM yun zou <yunzou.colost...@gmail.com> wrote:
>
>> +1, I am really interested in this topic. Performance has always been a problem when dealing with wide tables, not just for reads and writes but also during compilation. Most ML use cases exhibit a vectorized read/write pattern, and I am also wondering whether there is anything we can do at the metadata level to help the compilation and execution process as a whole. I do not have an answer for this yet, but I would be really interested in exploring it further.
>>
>> Best Regards,
>> Yun
>>
>> On Mon, May 26, 2025 at 9:14 AM Pucheng Yang <py...@pinterest.com.invalid> wrote:
>>
>>> Hi Peter, I am interested in this proposal. What's more, I am curious whether there is a similar story on the write side as well (how to generate these split files), and specifically, are you targeting feature backfill use cases in ML?
>>>
>>> On Mon, May 26, 2025 at 6:29 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>
>>>> Hi Team,
>>>>
>>>> In machine learning use cases, it's common to encounter tables with a very high number of columns - sometimes even in the range of several thousand. I've seen cases with up to 15,000 columns. Storing such wide tables in a single Parquet file is often suboptimal, as Parquet can become a bottleneck even when only a subset of columns is queried.
>>>>
>>>> A common approach to mitigate this is to split the data across multiple Parquet files. With the upcoming File Format API, we could introduce a layer that combines these files into a single iterator, enabling efficient reading of wide and very wide tables.
>>>>
>>>> To support this, we would need to revise the metadata specification. Instead of the current `_file` column, we could introduce a `_files` column containing:
>>>> - `_file_column_ids`: the column IDs present in each file
>>>> - `_file_path`: the path to the corresponding file
>>>>
>>>> Has there been any prior discussion around this idea?
>>>> Is anyone else interested in exploring this further?
>>>>
>>>> Best regards,
>>>> Peter
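
[Editor's note] For readers following the proposal, below is a minimal, hypothetical Java sketch of the metadata shape Peter describes: a `_files` list replacing the single `_file` reference, where each element carries `_file_column_ids` and `_file_path`. The class and record names (SplitFileMetadataSketch, ColumnFileRef, SplitDataEntry) and the pruning helper are illustrative assumptions only, not existing Iceberg spec or API.

    // Hypothetical illustration only; these names are not part of the Iceberg spec or API.
    import java.util.List;

    public class SplitFileMetadataSketch {

      // One physical Parquet file holding a subset of the table's columns.
      record ColumnFileRef(
          List<Integer> fileColumnIds, // `_file_column_ids`: column IDs present in this file
          String filePath) {           // `_file_path`: path to the corresponding file
      }

      // The proposed `_files` list replacing the single `_file` reference.
      record SplitDataEntry(List<ColumnFileRef> files) {
      }

      // A reader layer could prune to only the files containing at least one
      // projected column ID before combining them into a single iterator.
      static List<ColumnFileRef> filesToOpen(SplitDataEntry entry, List<Integer> projectedColumnIds) {
        return entry.files().stream()
            .filter(ref -> ref.fileColumnIds().stream().anyMatch(projectedColumnIds::contains))
            .toList();
      }
    }

With a shape like this, a projection that touches only a handful of columns would open just the files whose `_file_column_ids` intersect the projected schema, which is the pruning opportunity the split-file layout is meant to enable.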