Point definitely taken. We really should probably POC some of these ideas and see what we are actually dealing with. (He said without volunteering to do the work :P)
On Tue, May 27, 2025 at 11:55 AM Selcuk Aya <selcuk....@snowflake.com.invalid> wrote:

> Yes, having to rewrite the whole file is not ideal, but I believe most of the cost of rewriting a file comes from decompression, encoding, stats calculation, etc. If you are adding new values for some columns but keeping the rest of the columns in the file the same, then a good chunk of the rewrite cost can be optimized away. I am not saying this is better than writing to a separate file; I am not sure how much worse it is, though.
>
> On Tue, May 27, 2025 at 9:40 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>
>> I think that "after the fact" modification is one of the requirements here, i.e. updating a single column without rewriting the whole file. If we have to write new metadata for the file, aren't we in the same boat as having to rewrite the whole file?
>>
>> On Tue, May 27, 2025 at 11:27 AM Selcuk Aya <selcuk....@snowflake.com.invalid> wrote:
>>
>>> If files represent column projections of a table rather than all of the columns in the table, then any read that spans these files needs to identify what constitutes a row. LanceDB, for example, has vertical partitioning across columns but also horizontal partitioning across rows, such that within each horizontal partition (fragment) the same number of rows exists in each vertical partition, which I think is necessary to make whole/partial row construction cheap. If this is the case, there is no reason not to achieve the same data layout inside a single columnar file with a lean header. I think the only valid argument for a separate file is adding a new set of columns to an existing table, but even then I am not sure a separate file is absolutely necessary for good performance.
>>>
>>> Selcuk
>>>
>>> On Tue, May 27, 2025 at 9:18 AM Devin Smith <devinsm...@deephaven.io.invalid> wrote:
>>>
>>>> There's a `file_path` field in the parquet ColumnChunk structure:
>>>> https://github.com/apache/parquet-format/blob/apache-parquet-format-2.11.0/src/main/thrift/parquet.thrift#L959-L962
>>>>
>>>> I'm not sure what tooling actually supports this, though. Could be interesting to see what the history of it is:
>>>> https://lists.apache.org/thread/rcv1cxndp113shjybfcldh6nq1t3lcq3
>>>> https://lists.apache.org/thread/k5nv310yp315fttcz213l8o0vmnd7vyw
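A quick inline note on Devin's question about tooling, since it ties into the POC idea: as far as I know, the main thing that exercises `file_path` today is the old Hadoop-style `_metadata` summary file, where a single footer references column chunks that live in other files. pyarrow can still produce one; here is a minimal sketch (file names and columns are made up, and this only covers the horizontal, same-schema case):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Two ordinary data files that share a schema.
table = pa.table({"id": [1, 2, 3], "feature_a": [0.1, 0.2, 0.3]})
pq.write_table(table, "part-0.parquet")
pq.write_table(table, "part-1.parquet")

# Collect their footers and record which file each row group's chunks live in.
collected = []
for path in ["part-0.parquet", "part-1.parquet"]:
    md = pq.read_metadata(path)
    md.set_file_path(path)  # stamps ColumnChunk.file_path on every chunk
    collected.append(md)

# A footer-only "_metadata" file whose column chunks all live elsewhere.
pq.write_metadata(table.schema, "_metadata", metadata_collector=collected)
```

That only spreads row groups with identical schemas across files, though; splitting the columns of one logical row group across files is the part Russell gets at below.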
>>>> On Tue, May 27, 2025 at 8:59 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>
>>>>> I have to agree that while there can be some fixes in Parquet, we fundamentally need a way to split a "row group" or something like that between separate files. If that's something we can do in the Parquet project, that would be great, but it feels like we need to start exploring more drastic options than footer encoding.
>>>>>
>>>>> On Mon, May 26, 2025 at 8:42 PM Gang Wu <ust...@gmail.com> wrote:
>>>>>
>>>>>> I agree with Steven that there are limits to what Parquet can do.
>>>>>>
>>>>>> In addition to adding new columns requiring a rewrite of all files, files of wide tables may suffer from poor performance such as the following:
>>>>>> - Poor compression of row groups, because with so many columns even a small number of rows can reach the row group size threshold.
>>>>>> - Dominating columns (e.g. blobs) may contribute 99% of a row group's size, leading to unbalanced column chunks and deteriorating row group compression.
>>>>>> - Similar to adding new columns, partial updates also require rewriting all columns of the affected rows.
>>>>>>
>>>>>> IIRC, some table formats already support splitting columns into different files:
>>>>>> - The Lance manifest splits a fragment [1] into one or more data files.
>>>>>> - Apache Hudi has the concept of column families [2].
>>>>>> - Apache Paimon supports sequence groups [3] for partial updates.
>>>>>>
>>>>>> Although Parquet could introduce the concept of logical and physical files to manage the column-to-file mapping, that looks like yet another manifest file design, which duplicates the purpose of Iceberg. This might be something worth exploring in Iceberg instead.
>>>>>>
>>>>>> [1] https://lancedb.github.io/lance/format.html#fragments
>>>>>> [2] https://github.com/apache/hudi/blob/master/rfc/rfc-80/rfc-80.md
>>>>>> [3] https://paimon.apache.org/docs/0.9/primary-key-table/merge-engine/partial-update/#sequence-group
>>>>>>
>>>>>> Best,
>>>>>> Gang
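Another inline aside, just to put rough numbers on Gang's first bullet (assuming parquet-mr-style defaults of a ~128 MiB row group target and ~1 MiB pages, which may not match every writer):

```python
# Back-of-the-envelope: how small column chunks get in a 15,000-column table.
row_group_bytes = 128 * 1024 * 1024          # typical total row group size target
num_columns = 15_000
chunk_bytes = row_group_bytes / num_columns
print(f"~{chunk_bytes / 1024:.1f} KiB per column chunk")   # ~8.7 KiB

avg_value_bytes = 8                          # assume plain 64-bit values everywhere
rows_per_group = chunk_bytes / avg_value_bytes
print(f"~{rows_per_group:.0f} rows before the group hits the target")  # ~1.1k rows
```

At roughly a thousand rows per group, dictionary/RLE encoding has very little data to work with in each chunk, and per-chunk metadata and page overhead start to add up, which is consistent with the poor compression Gang describes.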
>>>>>> On Tue, May 27, 2025 at 7:03 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>
>>>>>>> The Parquet metadata proposal (linked by Fokko) mainly addresses read performance problems caused by bloated metadata.
>>>>>>>
>>>>>>> What Peter described seems useful for some ML feature-engineering workloads, where a new set of features/columns is added to the table. Currently, Iceberg would require rewriting all data files to combine the old and new columns (write amplification). Similarly, the community has also discussed in the past the use case of updating a single column, which would likewise require rewriting all data files.
>>>>>>>
>>>>>>> On Mon, May 26, 2025 at 2:42 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Do you have the link at hand for the thread where this was discussed on the Parquet list? The docs seem quite old and the PR stale, so I would like to understand the situation better. If it is possible to do this in Parquet, that would be great, but Avro and ORC would still suffer.
>>>>>>>>
>>>>>>>> On Mon, May 26, 2025 at 22:07, Amogh Jahagirdar <2am...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hey Peter,
>>>>>>>>>
>>>>>>>>> Thanks for bringing this issue up. I think I agree with Fokko; wide tables leading to Parquet metadata bloat and poor Thrift deserialization performance is a long-standing issue, and I believe there is motivation in the community to address it. So to me it seems better to address it in Parquet itself rather than have the Iceberg library facilitate a pattern that works around the limitations.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>
>>>>>>>>> On Mon, May 26, 2025 at 1:42 PM Fokko Driesprong <fo...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Peter,
>>>>>>>>>>
>>>>>>>>>> Thanks for bringing this up. Wouldn't it make more sense to fix this in Parquet itself? It has been a long-running issue on Parquet, and there is still active interest from the community. There is a PR to replace the footer with FlatBuffers, which dramatically improves performance <https://github.com/apache/arrow/pull/43793>. The underlying proposal can be found here <https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit?tab=t.0#heading=h.atbrz9ch6nfa>.
>>>>>>>>>>
>>>>>>>>>> Kind regards,
>>>>>>>>>> Fokko
>>>>>>>>>>
>>>>>>>>>> On Mon, May 26, 2025 at 20:35, yun zou <yunzou.colost...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1, I am really interested in this topic. Performance has always been a problem when dealing with wide tables, not just for read/write but also during compilation. Most ML use cases typically exhibit a vectorized read/write pattern, and I am also wondering if there is any way, at the metadata level, to help the whole compilation and execution process. I do not have any answer for this yet, but I would be really interested in exploring it further.
>>>>>>>>>>>
>>>>>>>>>>> Best Regards,
>>>>>>>>>>> Yun
>>>>>>>>>>>
>>>>>>>>>>> On Mon, May 26, 2025 at 9:14 AM Pucheng Yang <py...@pinterest.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Peter, I am interested in this proposal. What's more, I am curious whether there is a similar story on the write side as well (how to generate these split files), and specifically, are you targeting feature backfill use cases in ML?
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, May 26, 2025 at 6:29 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Team,
>>>>>>>>>>>>>
>>>>>>>>>>>>> In machine learning use cases, it's common to encounter tables with a very high number of columns - sometimes in the range of several thousand. I've seen cases with up to 15,000 columns. Storing such wide tables in a single Parquet file is often suboptimal, as Parquet can become a bottleneck, even when only a subset of columns is queried.
>>>>>>>>>>>>>
>>>>>>>>>>>>> A common approach to mitigate this is to split the data across multiple Parquet files. With the upcoming File Format API, we could introduce a layer that combines these files into a single iterator, enabling efficient reading of wide and very wide tables.
>>>>>>>>>>>>>
>>>>>>>>>>>>> To support this, we would need to revise the metadata specification. Instead of the current `_file` column, we could introduce a `_files` column containing:
>>>>>>>>>>>>> - `_file_column_ids`: the column IDs present in each file
>>>>>>>>>>>>> - `_file_path`: the path to the corresponding file
>>>>>>>>>>>>>
>>>>>>>>>>>>> Has there been any prior discussion around this idea? Is anyone else interested in exploring this further?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> Peter
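And one last note, back on the POC question from the top of this reply: to make Peter's `_files` idea concrete for myself, here is roughly the shape I would expect the read path to take. Everything beyond `_file_path` and `_file_column_ids` is invented for illustration, there are no real Iceberg or File Format API calls here, and the stitching assumes every file in a group carries the same rows in the same order (Selcuk's alignment point):

```python
from dataclasses import dataclass
from typing import Dict, List

import pyarrow as pa
import pyarrow.parquet as pq


@dataclass
class SplitFile:
    file_path: str          # the proposed `_file_path`
    column_ids: List[int]   # the proposed `_file_column_ids`


def read_logical_file(files: List[SplitFile],
                      id_to_name: Dict[int, str],
                      projected_ids: List[int]) -> pa.Table:
    """Stitch column subsets from several physical files into one table."""
    pieces = []
    for f in files:
        wanted = [cid for cid in f.column_ids if cid in projected_ids]
        if not wanted:
            continue  # this file holds none of the projected columns
        names = [id_to_name[cid] for cid in wanted]
        pieces.append(pq.read_table(f.file_path, columns=names))

    # Purely positional stitch: all pieces must agree on row count and order.
    assert len({p.num_rows for p in pieces}) == 1, "files in a group must align on rows"
    combined = pieces[0]
    for piece in pieces[1:]:
        for i, name in enumerate(piece.column_names):
            combined = combined.append_column(name, piece.column(i))
    return combined
```

The write side would be the inverse: write each column group to its own file and record the (path, column IDs) pairs in the `_files` entry. If I am reading the proposal right, that is also what would make Pucheng's backfill case cheap: a new batch of features becomes one extra file per logical file plus a metadata commit, instead of a rewrite.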