There's a `file_path` field in the parquet ColumnChunk structure, https://github.com/apache/parquet-format/blob/apache-parquet-format-2.11.0/src/main/thrift/parquet.thrift#L959-L962
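A quick way to see whether a given writer actually populates that field is to inspect the footer metadata; below is a minimal sketch using pyarrow (the file name "data.parquet" is only a placeholder):

```python
import pyarrow.parquet as pq

# Print the file_path recorded for every column chunk in the footer.
# For ordinary self-contained Parquet files this is empty; a non-empty
# value means the column data lives in a different file than the footer.
meta = pq.ParquetFile("data.parquet").metadata  # placeholder path
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        chunk = meta.row_group(rg).column(col)
        print(rg, chunk.path_in_schema, repr(chunk.file_path))
```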
I'm not sure what tooling actually supports this though. Could be interesting to see what the history of this is.
https://lists.apache.org/thread/rcv1cxndp113shjybfcldh6nq1t3lcq3
https://lists.apache.org/thread/k5nv310yp315fttcz213l8o0vmnd7vyw

On Tue, May 27, 2025 at 8:59 AM Russell Spitzer <russell.spit...@gmail.com> wrote:

> I have to agree that while there can be some fixes in Parquet, we fundamentally need a way to split a "row group", or something like that, between separate files. If that's something we can do in the Parquet project, that would be great, but it feels like we need to start exploring more drastic options than footer encoding.
>
> On Mon, May 26, 2025 at 8:42 PM Gang Wu <ust...@gmail.com> wrote:
>
>> I agree with Steven that there are limitations to what Parquet can do.
>>
>> In addition to adding new columns requiring all files to be rewritten, files of wide tables may suffer from poor performance, for example:
>> - Poor compression of row groups: with so many columns, even a small number of rows can reach the row group threshold.
>> - Dominating columns (e.g. blobs) may contribute 99% of a row group's size, leading to unbalanced column chunks and further degrading row group compression.
>> - Similar to adding new columns, a partial update also requires rewriting all columns of the affected rows.
>>
>> IIRC, some table formats already support splitting columns into different files:
>> - A Lance manifest splits a fragment [1] into one or more data files.
>> - Apache Hudi has the concept of column families [2].
>> - Apache Paimon supports sequence groups [3] for partial update.
>>
>> Although Parquet could introduce the concept of logical and physical files to manage the column-to-file mapping, that looks like yet another manifest file design, which duplicates the purpose of Iceberg. These might be something worth exploring in Iceberg.
>>
>> [1] https://lancedb.github.io/lance/format.html#fragments
>> [2] https://github.com/apache/hudi/blob/master/rfc/rfc-80/rfc-80.md
>> [3] https://paimon.apache.org/docs/0.9/primary-key-table/merge-engine/partial-update/#sequence-group
>>
>> Best,
>> Gang
>>
>> On Tue, May 27, 2025 at 7:03 AM Steven Wu <stevenz...@gmail.com> wrote:
>>
>>> The Parquet metadata proposal (linked by Fokko) mainly addresses read performance problems caused by bloated metadata.
>>>
>>> What Peter described in the description seems useful for some ML feature-engineering workloads. A new set of features/columns is added to the table, and currently Iceberg would require rewriting all data files to combine the old and new columns (write amplification). Similarly, the community has in the past discussed the use case of updating a single column, which would also require rewriting all data files.
>>>
>>> On Mon, May 26, 2025 at 2:42 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>
>>>> Do you have the link at hand for the thread where this was discussed on the Parquet list?
>>>> The docs seem quite old and the PR stale, so I would like to understand the situation better.
>>>> If it is possible to do this in Parquet, that would be great, but Avro and ORC would still suffer.
>>>>
>>>> On Mon, May 26, 2025 at 22:07, Amogh Jahagirdar <2am...@gmail.com> wrote:
>>>>
>>>>> Hey Peter,
>>>>>
>>>>> Thanks for bringing this issue up.
>>>>> I think I agree with Fokko; the issue of wide tables leading to Parquet metadata bloat and poor Thrift deserialization performance is a long-standing one that I believe there is motivation in the community to address. So to me it seems better to address it in Parquet itself rather than have the Iceberg library facilitate a pattern that works around the limitation.
>>>>>
>>>>> Thanks,
>>>>> Amogh Jahagirdar
>>>>>
>>>>> On Mon, May 26, 2025 at 1:42 PM Fokko Driesprong <fo...@apache.org> wrote:
>>>>>
>>>>>> Hi Peter,
>>>>>>
>>>>>> Thanks for bringing this up. Wouldn't it make more sense to fix this in Parquet itself? It has been a long-running issue on Parquet, and there is still active interest from the community. There is a PR to replace the footer with FlatBuffers, which dramatically improves performance <https://github.com/apache/arrow/pull/43793>. The underlying proposal can be found here <https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit?tab=t.0#heading=h.atbrz9ch6nfa>.
>>>>>>
>>>>>> Kind regards,
>>>>>> Fokko
>>>>>>
>>>>>> On Mon, May 26, 2025 at 20:35, yun zou <yunzou.colost...@gmail.com> wrote:
>>>>>>
>>>>>>> +1, I am really interested in this topic. Performance has always been a problem when dealing with wide tables, not just for reads and writes but also during query compilation. Most of the ML use cases typically exhibit a vectorized read/write pattern, and I am also wondering if there is any way at the metadata level to help the whole compilation and execution process. I do not have any answer for this yet, but I would be really interested in exploring it further.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Yun
>>>>>>>
>>>>>>> On Mon, May 26, 2025 at 9:14 AM Pucheng Yang <py...@pinterest.com.invalid> wrote:
>>>>>>>
>>>>>>>> Hi Peter, I am interested in this proposal. What's more, I am curious whether there is a similar story on the write side as well (how to generate these split files) and, specifically, whether you are targeting feature backfill use cases in ML.
>>>>>>>>
>>>>>>>> On Mon, May 26, 2025 at 6:29 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Team,
>>>>>>>>>
>>>>>>>>> In machine learning use cases, it's common to encounter tables with a very high number of columns - sometimes in the range of several thousand. I've seen cases with up to 15,000 columns. Storing such wide tables in a single Parquet file is often suboptimal, as Parquet can become a bottleneck, even when only a subset of columns is queried.
>>>>>>>>>
>>>>>>>>> A common approach to mitigate this is to split the data across multiple Parquet files. With the upcoming File Format API, we could introduce a layer that combines these files into a single iterator, enabling efficient reading of wide and very wide tables.
>>>>>>>>>
>>>>>>>>> To support this, we would need to revise the metadata specification. Instead of the current `_file` column, we could introduce a `_files` column containing:
>>>>>>>>> - `_file_column_ids`: the column IDs present in each file
>>>>>>>>> - `_file_path`: the path to the corresponding file
>>>>>>>>>
>>>>>>>>> Has there been any prior discussion around this idea?
>>>>>>>>> Is anyone else interested in exploring this further?
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Peter
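A rough sketch of what the proposed `_files` metadata could look like, and how a reader might use it to prune physical files for a projection. Only the names `_file_path` and `_file_column_ids` come from Peter's mail; `FileSlice`, `files_for_projection`, and the example paths are hypothetical:

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class FileSlice:
    """One physical file holding a subset of a logical data file's columns."""
    file_path: str               # the proposed `_file_path`: path to the physical Parquet file
    file_column_ids: List[int]   # the proposed `_file_column_ids`: field IDs stored in that file

def files_for_projection(slices: List[FileSlice], wanted_ids: Set[int]) -> List[FileSlice]:
    """Keep only the physical files that contain at least one requested column."""
    return [s for s in slices if wanted_ids.intersection(s.file_column_ids)]

# Example: one logical data file split into two physical files by column range.
row_files = [
    FileSlice("s3://bucket/table/data/part-0-cols-0001-0500.parquet", list(range(1, 501))),
    FileSlice("s3://bucket/table/data/part-0-cols-0501-1000.parquet", list(range(501, 1001))),
]
print([s.file_path for s in files_for_projection(row_files, {42, 137})])
# Only the first physical file needs to be read for columns 42 and 137.
```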