On Fri, May 30, 2025 at 3:33 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
> One key advantage of introducing Physical Files is the flexibility to vary
> RowGroup sizes across columns. For instance, wide string columns could benefit
> from smaller RowGroups to reduce memory pressure, while numeric columns could
> use larger RowGroups to improve compression and scan efficiency. Rather than
> enforcing strict row group alignment across all columns, we can explore
> optimizing read split sizes and write-time RowGroup sizes independently,
> striking a balance between performance and storage cost for different data
> types and queries.

That actually sounds very complicated if you want to split file reads in a distributed system. If you want to read across column groups, then you always end up over-reading on one of them if they are not aligned. And aren't Parquet pages already providing these unaligned sizes?

On Fri, May 30, 2025 at 8:09, Gang Wu <ust...@gmail.com> wrote:

IMO, the main drawback for the view solution is the complexity of maintaining consistency across tables if we want to use features like time travel, incremental scan, branch & tag, encryption, etc.

On Fri, May 30, 2025 at 12:55 PM Bryan Keller <brya...@gmail.com> wrote:

Fewer commit conflicts meaning the tables representing column families are updated independently, rather than having to serialize commits to a single table. Perhaps with a wide table solution the commit logic could be enhanced to support things like concurrent overwrites to independent column families, but it seems like it would be fairly involved.

On May 29, 2025, at 7:16 PM, Steven Wu <stevenz...@gmail.com> wrote:

Bryan, interesting approach to split horizontally across multiple tables.

A few potential downsides:
* Operational overhead: tables need to be managed consistently and probably in some coordinated way.
* Complex reads.
* It may be fragile to enforce correctness (during the join). Enforcing the stitching correctness at the file-group level in the file reader and writer is more robust if built into the table format.

> fewer commit conflicts

Can you elaborate on this one? Are those tables populated by streaming or batch pipelines?

On Thu, May 29, 2025 at 5:03 PM Bryan Keller <brya...@gmail.com> wrote:

Hi everyone,

We have been investigating a wide table format internally for a similar use case, i.e. we have wide ML tables with features generated by different pipelines and teams but want a unified view of the data. We are comparing that against separate tables joined together using a shuffle-less join (e.g. storage partitioned join), along with a corresponding view.

The join/view approach seems to give us much of what we need, with some added benefits like splitting up the metadata, fewer commit conflicts, and the ability to share, nest, and swap "column families". The downsides are that table management is split across multiple tables, it requires engine support for shuffle-less joins for best performance, and even then, scans probably won't be as optimal.

I'm curious if anyone had further thoughts on the two?

-Bryan
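For illustration, a minimal PySpark sketch of the join/view setup Bryan describes, assuming an Iceberg-enabled Spark session with storage-partitioned joins; table, view, and column names are hypothetical, and config keys may differ across Spark/Iceberg versions:

    from pyspark.sql import SparkSession

    # Assumes an Iceberg-enabled Spark session; catalog configuration omitted for brevity.
    spark = SparkSession.builder.getOrCreate()

    # Storage-partitioned joins let Spark join co-bucketed Iceberg tables without a shuffle.
    spark.conf.set("spark.sql.sources.v2.bucketing.enabled", "true")
    spark.conf.set("spark.sql.iceberg.planning.preserve-data-grouping", "true")

    # Two "column family" tables bucketed identically on the row key.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS db.features_base (
            row_id BIGINT, ts TIMESTAMP, feat_a DOUBLE)
        USING iceberg PARTITIONED BY (bucket(64, row_id))""")
    spark.sql("""
        CREATE TABLE IF NOT EXISTS db.features_team_x (
            row_id BIGINT, feat_x1 DOUBLE, feat_x2 DOUBLE)
        USING iceberg PARTITIONED BY (bucket(64, row_id))""")

    # The unified "wide table" is just a view; each team commits to its own table.
    # (Persisting the view requires a view-enabled catalog; a TEMP VIEW works locally.)
    spark.sql("""
        CREATE OR REPLACE TEMP VIEW features_wide AS
        SELECT b.row_id, b.ts, b.feat_a, x.feat_x1, x.feat_x2
        FROM db.features_base b
        JOIN db.features_team_x x ON b.row_id = x.row_id""")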
On May 29, 2025, at 8:18 AM, Péter Váry <peter.vary.apa...@gmail.com> wrote:

I received feedback from Alkis regarding their Parquet optimization work. Their internal testing shows promising results for reducing metadata size and improving parsing performance. They plan to formalize a proposal for these Parquet enhancements in the near future.

Meanwhile, I'm putting together our horizontal sharding proposal as a complementary approach. Even with the Parquet metadata improvements, horizontal sharding would provide additional benefits for:

- More efficient column-level updates
- Streamlined column additions
- Better handling of dominant columns that can cause RowGroup size imbalances (placing these in separate files could significantly improve performance)

Thanks,
Peter

On Wed, May 28, 2025 at 15:39, Péter Váry <peter.vary.apa...@gmail.com> wrote:

I would be happy to put together a proposal based on the input gathered here.

Thanks everyone for your thoughts! I will try to incorporate all of this.

Thanks,
Peter

On Tue, May 27, 2025 at 20:07, Daniel Weeks <dwe...@apache.org> wrote:

I feel like we have two different issues we're talking about here that aren't necessarily tied (though solutions may address both): 1) wide tables, 2) adding columns.

Wide tables are definitely a problem where parquet has limitations. I'm optimistic about the ongoing work to help improve parquet footers/stats in this area that Fokko mentioned. There are always limitations in how this scales, since wide rows lead to small row groups and the cost to reconstitute a row gets more expensive, but for cases that are read-heavy and project subsets of columns it should significantly improve performance.

Adding columns to an existing dataset is something that comes up periodically, but there's a lot of complexity involved in this. Parquet does support referencing columns in separate files per the spec, but there's no implementation that takes advantage of this to my knowledge. This does allow for approaches where you separate/rewrite just the footers or various other tricks, but these approaches get complicated quickly, and the number of readers that can consume those representations would initially be very limited.

A larger problem for splitting columns across files is that there are a lot of assumptions about how data is laid out in both readers and writers. For example, aligning row groups and correctly handling split calculation is very complicated if you're trying to split rows across files. Other features are also impacted, like deletes, which reference the file to which they apply and would need to account for deletes applying to multiple files and needing to update those references if columns are added.

I believe there are a lot of interesting approaches to addressing these use cases, but we'd really need a thorough proposal that explores all of these scenarios. The last thing we would want is to introduce incompatibilities within the format that result in incompatible features.

-Dan
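A back-of-the-envelope sketch of Dan's point that wide rows force small row groups; the byte budget and per-value size below are assumptions for illustration, not measurements:

    # With a fixed row-group byte budget, rows per row group shrink roughly linearly
    # with column count, which is what hurts wide tables.
    row_group_bytes = 128 * 1024 * 1024   # a common writer target; actual defaults vary
    avg_bytes_per_value = 8               # crude average encoded size of one value

    for num_columns in (100, 1_000, 15_000):
        rows_per_group = row_group_bytes // (num_columns * avg_bytes_per_value)
        print(f"{num_columns:>6} columns -> ~{rows_per_group:,} rows per row group")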
On Tue, May 27, 2025 at 10:02 AM Russell Spitzer <russell.spit...@gmail.com> wrote:

Point definitely taken. We really should probably POC some of these ideas and see what we are actually dealing with. (He said without volunteering to do the work :P)

On Tue, May 27, 2025 at 11:55 AM Selcuk Aya <selcuk....@snowflake.com.invalid> wrote:

Yes, having to rewrite the whole file is not ideal, but I believe most of the cost of rewriting a file comes from decompression, encoding, stats calculations, etc. If you are adding new values for some columns but are keeping the rest of the columns the same in the file, then a bunch of the rewrite cost can be optimized away. I am not saying this is better than writing to a separate file; I am not sure how much worse it is, though.

On Tue, May 27, 2025 at 9:40 AM Russell Spitzer <russell.spit...@gmail.com> wrote:

I think that "after the fact" modification is one of the requirements here, i.e. updating a single column without rewriting the whole file. If we have to write new metadata for the file, aren't we in the same boat as having to rewrite the whole file?

On Tue, May 27, 2025 at 11:27 AM Selcuk Aya <selcuk....@snowflake.com.invalid> wrote:

If files represent column projections of a table rather than all the columns in the table, then any read that reads across these files needs to identify what constitutes a row. LanceDB, for example, has vertical partitioning across columns but also horizontal partitioning across rows, such that within each horizontal partition (fragment) the same number of rows exists in each vertical partition, which I think is necessary to make whole/partial row construction cheap. If this is the case, there is no reason not to achieve the same data layout inside a single columnar file with a lean header. I think the only valid argument for a separate file is adding a new set of columns to an existing table, but even then I am not sure a separate file is absolutely necessary for good performance.

Selcuk

On Tue, May 27, 2025 at 9:18 AM Devin Smith <devinsm...@deephaven.io.invalid> wrote:

There's a `file_path` field in the parquet ColumnChunk structure:
https://github.com/apache/parquet-format/blob/apache-parquet-format-2.11.0/src/main/thrift/parquet.thrift#L959-L962

I'm not sure what tooling actually supports this, though. Could be interesting to see what the history of this is:
https://lists.apache.org/thread/rcv1cxndp113shjybfcldh6nq1t3lcq3,
https://lists.apache.org/thread/k5nv310yp315fttcz213l8o0vmnd7vyw

On Tue, May 27, 2025 at 8:59 AM Russell Spitzer <russell.spit...@gmail.com> wrote:

I have to agree that while there can be some fixes in Parquet, we fundamentally need a way to split a "row group" or something like that between separate files. If that's something we can do in the parquet project that would be great, but it feels like we need to start exploring more drastic options than footer encoding.
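For reference, the ColumnChunk file_path field Devin points to can be inspected through pyarrow's footer metadata API; a minimal sketch, with a hypothetical file name:

    import pyarrow.parquet as pq

    # Walk the footer and print where each column chunk claims its data lives.
    # file_path is normally empty, meaning "this same file"; per the Parquet spec it may
    # reference another file, but reader support for that is essentially nonexistent.
    md = pq.ParquetFile("part-00000.parquet").metadata
    for rg in range(md.num_row_groups):
        for col in range(md.num_columns):
            chunk = md.row_group(rg).column(col)
            print(rg, chunk.path_in_schema, chunk.file_path or "<same file>")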
On Mon, May 26, 2025 at 8:42 PM Gang Wu <ust...@gmail.com> wrote:

I agree with Steven that there are limitations that Parquet cannot address.

In addition to adding new columns by rewriting all files, files of wide tables may suffer from bad performance like below:
- Poor compression of row groups, because there are so many columns that even a small number of rows can reach the row group threshold.
- Dominating columns (e.g. blobs) may contribute 99% of a row group's size, leading to unbalanced column chunks and deteriorating row group compression.
- Similar to adding new columns, partial updates also require rewriting all columns of the affected rows.

IIRC, some table formats already support splitting columns into different files:
- Lance manifest splits a fragment [1] into one or more data files.
- Apache Hudi has the concept of column family [2].
- Apache Paimon supports sequence groups [3] for partial update.

Although Parquet could introduce the concept of logical file and physical file to manage the columns-to-file mapping, this looks like yet another manifest file design which duplicates the purpose of Iceberg. These might be something worth exploring in Iceberg.

[1] https://lancedb.github.io/lance/format.html#fragments
[2] https://github.com/apache/hudi/blob/master/rfc/rfc-80/rfc-80.md
[3] https://paimon.apache.org/docs/0.9/primary-key-table/merge-engine/partial-update/#sequence-group

Best,
Gang

On Tue, May 27, 2025 at 7:03 AM Steven Wu <stevenz...@gmail.com> wrote:

The Parquet metadata proposal (linked by Fokko) mainly addresses read performance issues due to bloated metadata.

What Peter described in the description seems useful for some ML workloads of feature engineering. A new set of features/columns is added to the table. Currently, Iceberg would require rewriting all data files to combine old and new columns (write amplification). Similarly, in the past the community also talked about the use case of updating a single column, which would require rewriting all data files.

On Mon, May 26, 2025 at 2:42 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:

Do you have the link at hand for the thread where this was discussed on the Parquet list? The docs seem quite old, and the PR stale, so I would like to understand the situation better. If it is possible to do this in Parquet, that would be great, but Avro and ORC would still suffer.
On Mon, May 26, 2025 at 22:07, Amogh Jahagirdar <2am...@gmail.com> wrote:

Hey Peter,

Thanks for bringing this issue up. I think I agree with Fokko; the issue of wide tables leading to Parquet metadata bloat and poor Thrift deserialization performance is a long-standing one that I believe there's motivation in the community to address. So to me it seems better to address it in Parquet itself rather than have the Iceberg library facilitate a pattern that works around the limitations.

Thanks,
Amogh Jahagirdar

On Mon, May 26, 2025 at 1:42 PM Fokko Driesprong <fo...@apache.org> wrote:

Hi Peter,

Thanks for bringing this up. Wouldn't it make more sense to fix this in Parquet itself? It has been a long-running issue on Parquet, and there is still active interest from the community. There is a PR to replace the footer with FlatBuffers, which dramatically improves performance <https://github.com/apache/arrow/pull/43793>. The underlying proposal can be found here <https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit?tab=t.0#heading=h.atbrz9ch6nfa>.

Kind regards,
Fokko

On Mon, May 26, 2025 at 20:35, yun zou <yunzou.colost...@gmail.com> wrote:

+1, I am really interested in this topic. Performance has always been a problem when dealing with wide tables, not just read/write, but also during compilation. Most of the ML use cases typically exhibit a vectorized read/write pattern; I am also wondering if there is any way at the metadata level to help the whole compilation and execution process. I do not have any answer for this yet, but I would be really interested in exploring this further.

Best Regards,
Yun

On Mon, May 26, 2025 at 9:14 AM Pucheng Yang <py...@pinterest.com.invalid> wrote:

Hi Peter, I am interested in this proposal. What's more, I am curious if there is a similar story on the write side as well (how to generate these split files) and, specifically, are you targeting feature backfill use cases in ML?

On Mon, May 26, 2025 at 6:29 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:

Hi Team,

In machine learning use-cases, it's common to encounter tables with a very high number of columns - sometimes even in the range of several thousand.
I've seen cases with up to 15,000 columns. Storing such wide tables in a single Parquet file is often suboptimal, as Parquet can become a bottleneck, even when only a subset of columns is queried.

A common approach to mitigate this is to split the data across multiple Parquet files. With the upcoming File Format API, we could introduce a layer that combines these files into a single iterator, enabling efficient reading of wide and very wide tables.

To support this, we would need to revise the metadata specification. Instead of the current `_file` column, we could introduce a `_files` column containing:
- `_file_column_ids`: the column IDs present in each file
- `_file_path`: the path to the corresponding file

Has there been any prior discussion around this idea? Is anyone else interested in exploring this further?

Best regards,
Peter
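To make the proposed read path concrete: a minimal pyarrow sketch of combining per-column-family files into one table, assuming the per-file column lists come from the proposed `_file_column_ids`/`_file_path` metadata and that the files are strictly row-aligned; paths and column names are hypothetical:

    import pyarrow as pa
    import pyarrow.parquet as pq

    def read_stitched(file_columns: dict) -> pa.Table:
        """Combine column subsets stored in separate Parquet files into one logical table.

        file_columns maps a file path to the column names stored in it, i.e. the
        information the proposed _files metadata column would carry.
        """
        pieces = [pq.read_table(path, columns=cols) for path, cols in file_columns.items()]
        # The format would have to guarantee this invariant; here it is just asserted.
        assert len({p.num_rows for p in pieces}) == 1, "column files must be row-aligned"
        combined = pieces[0]
        for piece in pieces[1:]:
            for name in piece.column_names:
                # Same rows, disjoint column sets: stitch horizontally by position.
                combined = combined.append_column(name, piece.column(name))
        return combined

    # Hypothetical usage: base columns in one file, late-added ML features in another.
    # wide = read_stitched({
    #     "data/base-00001.parquet": ["id", "ts", "label"],
    #     "data/features-00001.parquet": ["feat_a", "feat_b"],
    # })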