On Fri, May 30, 2025 at 8:35 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
> Consider this example:
> Imagine a table with one large string column and many small numeric columns.
>
> Scenario 1: Single File
>
> - All columns are written into a single file.
> - The RowGroup size is small due to the large string column dominating the layout.

This is an assumption that may not be necessary. It should be quite possible to tune Parquet writers to write very large row groups when a large string column dominates. Such a string column would probably not get dictionary encoded anyway, so it would effectively end up with a couple of values per 1 MB Parquet page. The other columns would get decent-sized pages, and the overall row group size would be appropriate for getting good compression on those smaller columns.

What would be the downside of this approach?

- When you're only reading the integer columns, it is exactly the same as if those columns were in a file by themselves. You just don't read the large column chunk.
- It adds some complexity to distributed/parallel reading of the row groups when the large string column is included in the selected set of columns. You know that a row group is very large, so you might then shard it by row ranges. Each parallel reader would have to filter out the rows that weren't assigned to it. With Parquet page skipping, each reader could avoid reading the large-string-column pages for rows that weren't assigned to it.

Ultimately I think the parallel reading problem here is *nearly* the same regardless of whether you use one XL row group or separate files. You need to know the exact row group / page boundaries within each file in order to decide how to shard the read, and then you need to do row-index-range based skipping on at least *some* of the input columns.

- With XL row groups, in order to shard the row group into evenly sized chunks, you need to read the Parquet footer first, because you need to know the row group boundaries within each file, and ideally even the page boundaries within each row group, so that you can align your row ranges with those boundaries.
- If you use column-specific files, then you need to read the Parquet footers of *all the separate column files*. That's 2x the number of I/Os. These I/Os can be done in parallel, but they will contribute to throttling on cloud object stores.

So distributed read planning for XL row groups can be done in one I/O, while column-specific files require more I/Os. Either that, or you need to store *even more* information in the table-level metadata (namely, all of these boundaries). The column-specific files also require more I/Os to read later (because you end up having to read two footers), which adds up, especially when you read the large string column, since that means you parallelize the read into many small chunks.
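To make the planning cost concrete, here is a rough sketch of the XL-row-group sharding step (PyArrow, purely illustrative; the file name, the "payload" column name, and the target shard size are my own placeholders, and the page-index snapping is left out):

    import pyarrow.parquet as pq

    # A single footer read yields every row-group and column-chunk boundary.
    pf = pq.ParquetFile("wide_table.parquet")      # placeholder file name
    md = pf.metadata

    TARGET_SHARD_BYTES = 64 * 1024 * 1024          # illustrative shard size
    BIG_COLUMN = "payload"                         # hypothetical dominating column

    def plan_row_ranges(rg_index):
        """Split one XL row group into row ranges sized against the big column."""
        rg = md.row_group(rg_index)
        big_bytes = sum(
            rg.column(i).total_compressed_size
            for i in range(rg.num_columns)
            if rg.column(i).path_in_schema == BIG_COLUMN
        )
        shards = max(1, big_bytes // TARGET_SHARD_BYTES)
        rows_per_shard = -(-rg.num_rows // shards)  # ceiling division
        # A real planner would snap these boundaries to page boundaries taken
        # from the offset index, so each reader can skip whole pages of the
        # big column instead of filtering rows after decoding.
        return [(start, min(start + rows_per_shard, rg.num_rows))
                for start in range(0, rg.num_rows, rows_per_shard)]

    for i in range(md.num_row_groups):
        print(i, plan_row_ranges(i))

With column-specific files the same plan has to be computed per file, which is where the extra footer I/Os come from.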
> - The numeric columns are not compacted efficiently.
>
> Scenario 2: Column-Specific Files
>
> - One file is written for the string column, and another for the numeric columns.
> - The RowGroup size for the string column remains small, but the numeric columns benefit from optimal RowGroup sizing.

There's a third option, which is to use column-specific files (or groups of columns in a file) that form a single Parquet structure with cross-file references (which is already in the Parquet standard, albeit not implemented anywhere). This approach has several advantages over the other options:

1. All of the metadata required for distributed reads is in one place (one Parquet footer), so distributed read planning requires fewer I/Os, and there is less pressure to move all of that information into the table-level metadata as well.
2. Flexible structure. Different files can have a different distribution of columns over physical files, and you don't have to remember the per-file distribution in the table metadata.
3. More scalable: if your column sizes are wildly variable, you can have a file per column without bloating the table-level metadata with information about more files.
4. You can add or replace an entire column just by writing one extra file (with the new column contents, plus a new footer for the entire logical file that simply points to the old files for the existing data that wasn't modified).
5. Relatively simple to implement in existing Parquet readers, compared to "read multiple Parquet files and zip them together".

> Query Performance Impact:
>
> - If a query only reads one of the numeric columns:
>   - Scenario 1: Requires reading many small column chunks.
>   - Scenario 2: Reads a single, continuous column chunk - much more efficient.
>
> Queries only reading columns which are stored in a single file will have improvements. Cross-file queries will have over-reading, which might or might not be balanced out by reading bigger continuous chunks. Full table scans will definitely have a performance penalty, but that is not the goal here.
>
> > And aren't Parquet pages already providing these unaligned sizes?
>
> Parquet pages do offer some flexibility in size, but they operate at a lower level and are still bound by the RowGroup structure. What I'm proposing is a higher-level abstraction that allows us to group columns into independently optimized Physical Files, each with its own RowGroup sizing strategy. This could allow us to better optimize for queries where only a small number of columns are projected from a wide table.

I agree that it's an interesting idea, but it does add a lot of complexity, and I'm not convinced that it's better from a performance standpoint (metadata size increase, more I/Os). If we can get away with a better row group sizing policy, wouldn't that be preferable?
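On the cross-file option above: the hook for it already exists in the format as an optional file_path on each ColumnChunk, even though, as mentioned further down in this thread, no mainstream writer populates it today. A minimal sketch of where those references would live (PyArrow; the file name is a placeholder):

    import pyarrow.parquet as pq

    # One footer describes the whole logical file; per the spec, each column
    # chunk may point at a sibling file via the optional ColumnChunk.file_path.
    md = pq.ParquetFile("logical_table.parquet").metadata   # placeholder path

    for rg in range(md.num_row_groups):
        row_group = md.row_group(rg)
        for col in range(row_group.num_columns):
            chunk = row_group.column(col)
            # file_path is empty when the chunk lives in this same physical file.
            target = chunk.file_path or "<this file>"
            print(rg, chunk.path_in_schema, "->", target)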
> Bart Samwel <b...@databricks.com.invalid> wrote (on Fri, May 30, 2025 at 16:03):
>
>> On Fri, May 30, 2025 at 3:33 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>
>>> One key advantage of introducing Physical Files is the flexibility to vary RowGroup sizes across columns. For instance, wide string columns could benefit from smaller RowGroups to reduce memory pressure, while numeric columns could use larger RowGroups to improve compression and scan efficiency. Rather than enforcing strict row group alignment across all columns, we can explore optimizing read split sizes and write-time RowGroup sizes independently - striking a balance that maximizes performance and storage costs for different data types and queries.
>>
>> That actually sounds very complicated if you want to split file reads in a distributed system. If you want to read across column groups, then you always end up over-reading on one of them if they are not aligned.
>>
>> And aren't Parquet pages already providing these unaligned sizes?
>>
>>> Gang Wu <ust...@gmail.com> wrote (on Fri, May 30, 2025 at 8:09):
>>>
>>>> IMO, the main drawback for the view solution is the complexity of maintaining consistency across tables if we want to use features like time travel, incremental scan, branch & tag, encryption, etc.
>>>>
>>>> On Fri, May 30, 2025 at 12:55 PM Bryan Keller <brya...@gmail.com> wrote:
>>>>
>>>>> Fewer commit conflicts meaning the tables representing column families are updated independently, rather than having to serialize commits to a single table. Perhaps with a wide table solution the commit logic could be enhanced to support things like concurrent overwrites to independent column families, but it seems like it would be fairly involved.
>>>>>
>>>>> On May 29, 2025, at 7:16 PM, Steven Wu <stevenz...@gmail.com> wrote:
>>>>>
>>>>> Bryan, interesting approach to split horizontally across multiple tables.
>>>>>
>>>>> A few potential downsides:
>>>>> * operational overhead. tables need to be managed consistently and probably in some coordinated way
>>>>> * complex read
>>>>> * maybe fragile to enforce correctness (during join). It is robust to enforce the stitching correctness at file group level in the file reader and writer if built into the table format.
>>>>>
>>>>> > fewer commit conflicts
>>>>>
>>>>> Can you elaborate on this one? Are those tables populated by streaming or batch pipelines?
>>>>>
>>>>> On Thu, May 29, 2025 at 5:03 PM Bryan Keller <brya...@gmail.com> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> We have been investigating a wide table format internally for a similar use case, i.e. we have wide ML tables with features generated by different pipelines and teams but want a unified view of the data. We are comparing that against separate tables joined together using a shuffle-less join (e.g. storage partition join), along with a corresponding view.
>>>>>>
>>>>>> The join/view approach seems to give us much of what we need, with some added benefits like splitting up the metadata, fewer commit conflicts, and the ability to share, nest, and swap "column families". The downsides are that table management is split across multiple tables, it requires engine support of shuffle-less joins for best performance, and even then, scans probably won't be as optimal.
>>>>>>
>>>>>> I'm curious if anyone had further thoughts on the two?
>>>>>>
>>>>>> -Bryan
>>>>>>
>>>>>> On May 29, 2025, at 8:18 AM, Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>
>>>>>> I received feedback from Alkis regarding their Parquet optimization work. Their internal testing shows promising results for reducing metadata size and improving parsing performance. They plan to formalize a proposal for these Parquet enhancements in the near future.
>>>>>>
>>>>>> Meanwhile, I'm putting together our horizontal sharding proposal as a complementary approach. Even with the Parquet metadata improvements, horizontal sharding would provide additional benefits for:
>>>>>>
>>>>>> - More efficient column-level updates
>>>>>> - Streamlined column additions
>>>>>> - Better handling of dominant columns that can cause RowGroup size imbalances (placing these in separate files could significantly improve performance)
>>>>>>
>>>>>> Thanks, Peter
>>>>>>
>>>>>> Péter Váry <peter.vary.apa...@gmail.com> wrote (on Wed, May 28, 2025 at 15:39):
>>>>>>>
>>>>>>> I would be happy to put together a proposal based on the inputs I got here.
>>>>>>>
>>>>>>> Thanks everyone for your thoughts! I will try to incorporate all of this.
>>>>>>>
>>>>>>> Thanks, Peter
>>>>>>>
>>>>>>> Daniel Weeks <dwe...@apache.org> wrote (on Tue, May 27, 2025 at 20:07):
>>>>>>>
>>>>>>>> I feel like we have two different issues we're talking about here that aren't necessarily tied (though solutions may address both): 1) wide tables, 2) adding columns.
>>>>>>>>
>>>>>>>> Wide tables are definitely a problem where Parquet has limitations. I'm optimistic about the ongoing work to help improve Parquet footers/stats in this area that Fokko mentioned. There are always limitations in how this scales, as wide rows lead to small row groups and the cost to reconstitute a row gets more expensive, but cases that are read heavy and project subsets of columns should see significantly improved performance.
>>>>>>>>
>>>>>>>> Adding columns to an existing dataset is something that comes up periodically, but there's a lot of complexity involved in this. Parquet does support referencing columns in separate files per the spec, but there's no implementation that takes advantage of this to my knowledge. This does allow for approaches where you separate/rewrite just the footers or various other tricks, but these approaches get complicated quickly and the number of readers that can consume those representations would initially be very limited.
>>>>>>>>
>>>>>>>> A larger problem for splitting columns across files is that there are a lot of assumptions about how data is laid out in both readers and writers. For example, aligning row groups and correctly handling split calculation is very complicated if you're trying to split rows across files. Other features are also impacted, like deletes, which reference the file to which they apply and would need to account for deletes applying to multiple files and needing to update those references if columns are added.
>>>>>>>>
>>>>>>>> I believe there are a lot of interesting approaches to addressing these use cases, but we'd really need a thorough proposal that explores all of these scenarios. The last thing we would want is to introduce incompatibilities within the format that result in incompatible features.
>>>>>>>>
>>>>>>>> -Dan
>>>>>>>>
>>>>>>>> On Tue, May 27, 2025 at 10:02 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Point definitely taken. We really should probably POC some of these ideas and see what we are actually dealing with. (He said without volunteering to do the work :P)
>>>>>>>>>
>>>>>>>>> On Tue, May 27, 2025 at 11:55 AM Selcuk Aya <selcuk....@snowflake.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> Yes, having to rewrite the whole file is not ideal, but I believe most of the cost of rewriting a file comes from decompression, encoding, stats calculations, etc. If you are adding new values for some columns but are keeping the rest of the columns the same in the file, then a bunch of the rewrite cost can be optimized away. I am not saying this is better than writing to a separate file; I am not sure how much worse it is, though.
I am not saying this is better >>>>>>>>>> than >>>>>>>>>> writing to a separate file, I am not sure how much worse it is >>>>>>>>>> though. >>>>>>>>>> >>>>>>>>>> On Tue, May 27, 2025 at 9:40 AM Russell Spitzer < >>>>>>>>>> russell.spit...@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> I think that "after the fact" modification is one of the >>>>>>>>>>> requirements here, IE: Updating a single column without rewriting >>>>>>>>>>> the whole >>>>>>>>>>> file. >>>>>>>>>>> If we have to write new metadata for the file aren't we in the >>>>>>>>>>> same boat as having to rewrite the whole file? >>>>>>>>>>> >>>>>>>>>>> On Tue, May 27, 2025 at 11:27 AM Selcuk Aya >>>>>>>>>>> <selcuk....@snowflake.com.invalid> wrote: >>>>>>>>>>> >>>>>>>>>>>> If files represent column projections of a table rather than >>>>>>>>>>>> the whole columns in the table, then any read that reads across >>>>>>>>>>>> these files >>>>>>>>>>>> needs to identify what constitutes a row. Lance DB for example has >>>>>>>>>>>> vertical >>>>>>>>>>>> partitioning across columns but also horizontal partitioning >>>>>>>>>>>> across rows >>>>>>>>>>>> such that in each horizontal partitioning(fragment), the same >>>>>>>>>>>> number of >>>>>>>>>>>> rows exist in each vertical partition, which I think is necessary >>>>>>>>>>>> to make >>>>>>>>>>>> whole/partial row construction cheap. If this is the case, there >>>>>>>>>>>> is no >>>>>>>>>>>> reason not to achieve the same data layout inside a single >>>>>>>>>>>> columnar file >>>>>>>>>>>> with a lean header. I think the only valid argument for a separate >>>>>>>>>>>> file is >>>>>>>>>>>> adding a new set of columns to an existing table, but even then I >>>>>>>>>>>> am not >>>>>>>>>>>> sure a separate file is absolutely necessary for good performance. >>>>>>>>>>>> >>>>>>>>>>>> Selcuk >>>>>>>>>>>> >>>>>>>>>>>> On Tue, May 27, 2025 at 9:18 AM Devin Smith >>>>>>>>>>>> <devinsm...@deephaven.io.invalid> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> There's a `file_path` field in the parquet ColumnChunk >>>>>>>>>>>>> structure, >>>>>>>>>>>>> https://github.com/apache/parquet-format/blob/apache-parquet-format-2.11.0/src/main/thrift/parquet.thrift#L959-L962 >>>>>>>>>>>>> >>>>>>>>>>>>> I'm not sure what tooling actually supports this though. Could >>>>>>>>>>>>> be interesting to see what the history of this is. >>>>>>>>>>>>> https://lists.apache.org/thread/rcv1cxndp113shjybfcldh6nq1t3lcq3, >>>>>>>>>>>>> >>>>>>>>>>>>> https://lists.apache.org/thread/k5nv310yp315fttcz213l8o0vmnd7vyw >>>>>>>>>>>>> >>>>>>>>>>>>> On Tue, May 27, 2025 at 8:59 AM Russell Spitzer < >>>>>>>>>>>>> russell.spit...@gmail.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> I have to agree that while there can be some fixes in >>>>>>>>>>>>>> Parquet, we fundamentally need a way to split a "row group" >>>>>>>>>>>>>> or something like that between separate files. If that's >>>>>>>>>>>>>> something we can do in the parquet project that would be great >>>>>>>>>>>>>> but it feels like we need to start exploring more drastic >>>>>>>>>>>>>> options than footer encoding. >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Mon, May 26, 2025 at 8:42 PM Gang Wu <ust...@gmail.com> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> I agree with Steven that there are limitations that Parquet >>>>>>>>>>>>>>> cannot do. 
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In addition to adding new columns by rewriting all files, files of wide tables may suffer from bad performance like below:
>>>>>>>>>>>>>>> - Poor compression of row groups, because there are so many columns that even a small number of rows can reach the row group threshold.
>>>>>>>>>>>>>>> - Dominating columns (e.g. blobs) may contribute 99% of the size of a row group, leading to unbalanced column chunks and deteriorating row group compression.
>>>>>>>>>>>>>>> - Similar to adding new columns, partial update also requires rewriting all columns of the affected rows.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> IIRC, some table formats already support splitting columns into different files:
>>>>>>>>>>>>>>> - Lance manifest splits a fragment [1] into one or more data files.
>>>>>>>>>>>>>>> - Apache Hudi has the concept of column family [2].
>>>>>>>>>>>>>>> - Apache Paimon supports sequence groups [3] for partial update.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Although Parquet can introduce the concept of logical file and physical file to manage the column-to-file mapping, this looks like yet another manifest file design, which duplicates the purpose of Iceberg. These might be something worth exploring in Iceberg.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1] https://lancedb.github.io/lance/format.html#fragments
>>>>>>>>>>>>>>> [2] https://github.com/apache/hudi/blob/master/rfc/rfc-80/rfc-80.md
>>>>>>>>>>>>>>> [3] https://paimon.apache.org/docs/0.9/primary-key-table/merge-engine/partial-update/#sequence-group
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best, Gang
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, May 27, 2025 at 7:03 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The Parquet metadata proposal (linked by Fokko) is mainly addressing the read performance issues due to bloated metadata.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What Peter described in the description seems useful for some ML feature engineering workloads. A new set of features/columns is added to the table. Currently, Iceberg would require rewriting all data files to combine old and new columns (write amplification). Similarly, in the past the community also talked about the use case of updating a single column, which would likewise require rewriting all data files.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 2:42 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Do you have the link at hand for the thread where this was discussed on the Parquet list? The docs seem quite old, and the PR stale, so I would like to understand the situation better. If it is possible to do this in Parquet, that would be great, but Avro and ORC would still suffer.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Amogh Jahagirdar <2am...@gmail.com> wrote (on Mon, May 26, 2025 at 22:07):
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hey Peter,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks for bringing this issue up. I think I agree with Fokko; the issue of wide tables leading to Parquet metadata bloat and poor Thrift deserialization performance is a long-standing issue that I believe there's motivation in the community to address. So to me it seems better to address it in Parquet itself rather than have the Iceberg library facilitate a pattern which works around the limitations.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks, Amogh Jahagirdar
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 1:42 PM Fokko Driesprong <fo...@apache.org> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks for bringing this up. Wouldn't it make more sense to fix this in Parquet itself? It has been a long-running issue on Parquet, and there is still active interest from the community. There is a PR to replace the footer with FlatBuffers, which dramatically improves performance <https://github.com/apache/arrow/pull/43793>. The underlying proposal can be found here <https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit?tab=t.0#heading=h.atbrz9ch6nfa>.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Kind regards, Fokko
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 20:35, yun zou <yunzou.colost...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> +1, I am really interested in this topic. Performance has always been a problem when dealing with wide tables, not just read/write, but also during compilation. Most of the ML use cases typically exhibit a vectorized read/write pattern. I am also wondering if there is any way at the metadata level to help the whole compilation and execution process. I do not have any answer for this yet, but I would be really interested in exploring this further.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Best Regards, Yun
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 9:14 AM Pucheng Yang <py...@pinterest.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Peter, I am interested in this proposal. What's more, I am curious if there is a similar story on the write side as well (how to generate these split files), and specifically, are you targeting feature backfill use cases in ML?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 6:29 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Team,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> In machine learning use-cases, it's common to encounter tables with a very high number of columns - sometimes even in the range of several thousand. I've seen cases with up to 15,000 columns. Storing such wide tables in a single Parquet file is often suboptimal, as Parquet can become a bottleneck, even when only a subset of columns is queried.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> A common approach to mitigate this is to split the data across multiple Parquet files. With the upcoming File Format API, we could introduce a layer that combines these files into a single iterator, enabling efficient reading of wide and very wide tables.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> To support this, we would need to revise the metadata specification. Instead of the current `_file` column, we could introduce a `_files` column containing:
>>>>>>>>>>>>>>>>>>>>>> - `_file_column_ids`: the column IDs present in each file
>>>>>>>>>>>>>>>>>>>>>> - `_file_path`: the path to the corresponding file
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Has there been any prior discussion around this idea? Is anyone else interested in exploring this further?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>>>> Peter
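For what it's worth, the "combines these files into a single iterator" part of the original proposal quoted above can be prototyped roughly like this (a PyArrow sketch; the file names, column names, and batch size are made up, and it assumes the per-column files contain the same rows in the same order with aligned batches):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Hypothetical layout: one file holds the small numeric columns, another holds
    # the large string column; both were written with the same rows in the same
    # order (the proposed _file_column_ids / _file_path metadata would record the mapping).
    COLUMN_FILES = {
        "numeric_columns.parquet": ["f1", "f2", "f3"],   # made-up names
        "payload_column.parquet": ["payload"],
    }

    def zipped_batches(batch_size=8192):
        """Yield row-aligned RecordBatches stitched together from per-column files."""
        readers = [
            pq.ParquetFile(path).iter_batches(batch_size=batch_size, columns=cols)
            for path, cols in COLUMN_FILES.items()
        ]
        for parts in zip(*readers):
            # The sketch assumes batch boundaries line up across files; a real
            # reader would have to guarantee or realign this rather than assert it.
            assert len({p.num_rows for p in parts}) == 1, "files are not row-aligned"
            arrays, names = [], []
            for batch in parts:
                arrays.extend(batch.columns)
                names.extend(batch.schema.names)
            yield pa.RecordBatch.from_arrays(arrays, names=names)

    for batch in zipped_batches():
        print(batch.num_rows, batch.schema.names)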