Hi everyone,

We have been investigating a wide table format internally for a similar use 
case, i.e. we have wide ML tables with features generated by different 
pipelines and teams but want a unified view of the data. We are comparing that 
against separate tables joined together using a shuffle-less join (e.g. storage 
partition join), along with a corresponding view.

The join/view approach seems to give us much of what we need, with some added 
benefits like splitting up the metadata, fewer commit conflicts, and the ability 
to share, nest, and swap "column families". The downsides are that table 
management is split across multiple tables, that it requires engine support for 
shuffle-less joins for best performance, and that even then, scans probably 
won't be as efficient.
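
For concreteness, here is a rough sketch of the join/view setup we have been 
testing. It is only a sketch: the catalog name ("demo"), table and column names, 
and the bucket count are made up, and it assumes Spark 3.4+ with an Iceberg 
catalog that supports views and storage-partitioned joins:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # commonly cited storage-partitioned join settings (check your Spark/Iceberg versions)
    spark.conf.set("spark.sql.sources.v2.bucketing.enabled", "true")
    spark.conf.set("spark.sql.sources.v2.bucketing.pushPartValues.enabled", "true")
    spark.conf.set("spark.sql.requireAllClusterKeysForCoPartition", "false")

    # two "column family" tables bucketed identically on the row key
    spark.sql("""CREATE TABLE IF NOT EXISTS demo.db.features_a (id BIGINT, f1 DOUBLE, f2 DOUBLE)
                 USING iceberg PARTITIONED BY (bucket(128, id))""")
    spark.sql("""CREATE TABLE IF NOT EXISTS demo.db.features_b (id BIGINT, f3 DOUBLE, f4 DOUBLE)
                 USING iceberg PARTITIONED BY (bucket(128, id))""")

    # the unified "wide" view that producers and consumers actually use
    spark.sql("""CREATE OR REPLACE VIEW demo.db.features AS
                 SELECT a.id, a.f1, a.f2, b.f3, b.f4
                 FROM demo.db.features_a a JOIN demo.db.features_b b ON a.id = b.id""")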

I'm curious whether anyone has further thoughts on the two approaches.

-Bryan



> On May 29, 2025, at 8:18 AM, Péter Váry <peter.vary.apa...@gmail.com> wrote:
> 
> I received feedback from Alkis regarding their Parquet optimization work. 
> Their internal testing shows promising results for reducing metadata size and 
> improving parsing performance. They plan to formalize a proposal for these 
> Parquet enhancements in the near future.
> 
> Meanwhile, I'm putting together our horizontal sharding proposal as a 
> complementary approach. Even with the Parquet metadata improvements, 
> horizontal sharding would provide additional benefits for:
> - More efficient column-level updates
> - Streamlined column additions
> - Better handling of dominant columns that can cause RowGroup size imbalances 
>   (placing these in separate files could significantly improve performance)
> 
> Thanks, Peter
> 
> 
> 
> On Wed, May 28, 2025 at 3:39 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>> I would be happy to put together a proposal based on the input gathered here.
>> 
>> Thanks everyone for your thoughts!
>> I will try to incorporate all of this.
>> 
>> Thanks, Peter 
>> 
>> On Tue, May 27, 2025 at 8:07 PM Daniel Weeks <dwe...@apache.org> wrote:
>>> I feel like we have two different issues we're talking about here that 
>>> aren't necessarily tied (though solutions may address both): 1) wide 
>>> tables, 2) adding columns
>>> 
>>> Wide tables are definitely a problem where Parquet has limitations. I'm 
>>> optimistic about the ongoing work that Fokko mentioned to improve Parquet 
>>> footers/stats in this area.  There are always limits to how this scales, as 
>>> wide rows lead to small row groups and the cost to reconstitute a row gets 
>>> more expensive, but for read-heavy cases that project subsets of columns it 
>>> should significantly improve performance.
>>> 
>>> Adding columns to an existing dataset is something that comes up 
>>> periodically, but there's a lot of complexity involved in this.  Parquet 
>>> does support referencing columns in separate files per the spec, but 
>>> there's no implementation that takes advantage of this to my knowledge.  
>>> This does allow for approaches where you separate/rewrite just the footers 
>>> or various other tricks, but these approaches get complicated quickly and 
>>> the number of readers that can consume those representations would 
>>> initially be very limited.
>>> 
>>> A larger problem for splitting columns across files is that there are a lot 
>>> of assumptions about how data is laid out in both readers and writers.  For 
>>> example, aligning row groups and correctly handling split calculation is 
>>> very complicated if you're trying to split rows across files.  Other features 
>>> are also impacted, like deletes, which reference the file they apply to: they 
>>> would need to account for deletes applying to multiple files, and those 
>>> references would need updating if columns are added.
>>> 
>>> I believe there are a lot of interesting approaches to addressing these use 
>>> cases, but we'd really need a thorough proposal that explores all of these 
>>> scenarios.  The last thing we would want is to introduce incompatibilities 
>>> within the format that result in incompatible features.
>>> 
>>> -Dan
>>> 
>>> On Tue, May 27, 2025 at 10:02 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>> Point definitely taken. We really should probably POC some of these ideas 
>>>> and see what we are actually dealing with. (He said without volunteering 
>>>> to do the work :P)
>>>> 
>>>> On Tue, May 27, 2025 at 11:55 AM Selcuk Aya 
>>>> <selcuk....@snowflake.com.invalid> wrote:
>>>>> Yes, having to rewrite the whole file is not ideal, but I believe most of 
>>>>> the cost of rewriting a file comes from decompression, encoding, stats 
>>>>> calculations, etc. If you are adding new values for some columns but are 
>>>>> keeping the rest of the columns the same in the file, then a bunch of the 
>>>>> rewrite cost can be optimized away. I am not saying this is better than 
>>>>> writing to a separate file; I am just not sure how much worse it is.
>>>>> 
>>>>> On Tue, May 27, 2025 at 9:40 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>> I think that "after the fact" modification is one of the requirements 
>>>>>> here, i.e. updating a single column without rewriting the whole file. 
>>>>>> If we have to write new metadata for the file, aren't we in the same boat 
>>>>>> as having to rewrite the whole file?
>>>>>> 
>>>>>> On Tue, May 27, 2025 at 11:27 AM Selcuk Aya 
>>>>>> <selcuk....@snowflake.com.invalid> wrote:
>>>>>>> If files represent column projections of a table rather than all of the 
>>>>>>> columns in the table, then any read that reads across these files needs 
>>>>>>> to identify what constitutes a row. LanceDB, for example, has vertical 
>>>>>>> partitioning across columns but also horizontal partitioning across 
>>>>>>> rows, such that within each horizontal partition (fragment) the same 
>>>>>>> number of rows exists in each vertical partition, which I think is 
>>>>>>> necessary to make whole/partial row reconstruction cheap. If this is the 
>>>>>>> case, there is no reason not to achieve the same data layout inside a 
>>>>>>> single columnar file with a lean header. I think the only valid 
>>>>>>> argument for a separate file is adding a new set of columns to an 
>>>>>>> existing table, but even then I am not sure a separate file is 
>>>>>>> absolutely necessary for good performance.
>>>>>>> 
>>>>>>> Selcuk
>>>>>>> 
>>>>>>> On Tue, May 27, 2025 at 9:18 AM Devin Smith 
>>>>>>> <devinsm...@deephaven.io.invalid> wrote:
>>>>>>>> There's a `file_path` field in the parquet ColumnChunk structure, 
>>>>>>>> https://github.com/apache/parquet-format/blob/apache-parquet-format-2.11.0/src/main/thrift/parquet.thrift#L959-L962
>>>>>>>> 
>>>>>>>> I'm not sure what tooling actually supports this though. Could be 
>>>>>>>> interesting to see what the history of this is. 
>>>>>>>> https://lists.apache.org/thread/rcv1cxndp113shjybfcldh6nq1t3lcq3, 
>>>>>>>> https://lists.apache.org/thread/k5nv310yp315fttcz213l8o0vmnd7vyw
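>>>>>>>> 
>>>>>>>> A minimal sketch for poking at that field with pyarrow (assuming pyarrow 
>>>>>>>> is installed; the path below is hypothetical, and in files produced by 
>>>>>>>> common writers the field is typically empty, meaning "same file"):
>>>>>>>> 
>>>>>>>>     import pyarrow.parquet as pq
>>>>>>>> 
>>>>>>>>     # dump the per-column-chunk file_path for every row group
>>>>>>>>     md = pq.ParquetFile("part-00000.parquet").metadata
>>>>>>>>     for rg in range(md.num_row_groups):
>>>>>>>>         for col in range(md.num_columns):
>>>>>>>>             chunk = md.row_group(rg).column(col)
>>>>>>>>             print(rg, chunk.path_in_schema, repr(chunk.file_path))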
>>>>>>>> 
>>>>>>>> On Tue, May 27, 2025 at 8:59 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>>>>>>>> I have to agree that while there can be some fixes in Parquet, we 
>>>>>>>>> fundamentally need a way to split a "row group" 
>>>>>>>>> or something like that between separate files. If that's something we 
>>>>>>>>> can do in the parquet project that would be great
>>>>>>>>> but it feels like we need to start exploring more drastic options 
>>>>>>>>> than footer encoding.
>>>>>>>>> 
>>>>>>>>> On Mon, May 26, 2025 at 8:42 PM Gang Wu <ust...@gmail.com> wrote:
>>>>>>>>>> I agree with Steven that there are limits to what Parquet can do.
>>>>>>>>>> 
>>>>>>>>>> In addition to adding new columns requiring a rewrite of all files, 
>>>>>>>>>> files of wide tables may suffer from performance problems like the 
>>>>>>>>>> ones below:
>>>>>>>>>> - Poor compression of row groups, because there are so many columns 
>>>>>>>>>> that even a small number of rows can reach the row group threshold.
>>>>>>>>>> - Dominating columns (e.g. blobs) may contribute 99% of the size of a 
>>>>>>>>>> row group, leading to unbalanced column chunks and deteriorating 
>>>>>>>>>> row group compression.
>>>>>>>>>> - Similar to adding new columns, a partial update also requires 
>>>>>>>>>> rewriting all columns of the affected rows.
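>>>>>>>>>> 
>>>>>>>>>> To put rough numbers on the first point, a back-of-the-envelope sketch 
>>>>>>>>>> (purely illustrative, reusing the 15,000-column figure from Peter's 
>>>>>>>>>> original mail):
>>>>>>>>>> 
>>>>>>>>>>     # why very wide rows force tiny row groups
>>>>>>>>>>     num_columns = 15_000                  # columns in the wide table
>>>>>>>>>>     bytes_per_value = 8                   # plain 64-bit values, ignoring compression
>>>>>>>>>>     row_group_target = 128 * 1024 * 1024  # a common 128 MB row group size target
>>>>>>>>>> 
>>>>>>>>>>     rows_per_group = row_group_target // (num_columns * bytes_per_value)
>>>>>>>>>>     print(rows_per_group)                 # ~1118 rows, so every column chunk stays tiny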
>>>>>>>>>> 
>>>>>>>>>> IIRC, some table formats already support splitting columns into 
>>>>>>>>>> different files:
>>>>>>>>>> - Lance manifest splits a fragment [1] into one or more data files.
>>>>>>>>>> - Apache Hudi has the concept of column family [2].
>>>>>>>>>> - Apache Paimon supports sequence groups [3] for partial update.
>>>>>>>>>> 
>>>>>>>>>> Although Parquet can introduce the concepts of a logical file and 
>>>>>>>>>> physical files to manage the column-to-file mapping, this looks like 
>>>>>>>>>> yet another manifest file design, which duplicates the purpose of 
>>>>>>>>>> Iceberg. 
>>>>>>>>>> These might be worth exploring in Iceberg.
>>>>>>>>>> 
>>>>>>>>>> [1] https://lancedb.github.io/lance/format.html#fragments
>>>>>>>>>> [2] https://github.com/apache/hudi/blob/master/rfc/rfc-80/rfc-80.md
>>>>>>>>>> [3] 
>>>>>>>>>> https://paimon.apache.org/docs/0.9/primary-key-table/merge-engine/partial-update/#sequence-group
>>>>>>>>>> 
>>>>>>>>>> Best,
>>>>>>>>>> Gang
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Tue, May 27, 2025 at 7:03 AM Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>>>>>> The Parquet metadata proposal (linked by Fokko) mainly addresses the 
>>>>>>>>>>> read-performance problems caused by bloated metadata. 
>>>>>>>>>>> 
>>>>>>>>>>> What Peter described in the description seems useful for some ML 
>>>>>>>>>>> feature-engineering workloads, where a new set of features/columns is 
>>>>>>>>>>> added to the table. Currently, Iceberg would require rewriting all 
>>>>>>>>>>> data files to combine the old and new columns (write amplification). 
>>>>>>>>>>> Similarly, in the past the community has also talked about the use 
>>>>>>>>>>> case of updating a single column, which would likewise require 
>>>>>>>>>>> rewriting all data files.
>>>>>>>>>>> 
>>>>>>>>>>> On Mon, May 26, 2025 at 2:42 PM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>> Do you have the link at hand for the thread where this was 
>>>>>>>>>>>> discussed on the Parquet list?
>>>>>>>>>>>> The docs seem quite old, and the PR stale, so I would like to 
>>>>>>>>>>>> understand the situation better.
>>>>>>>>>>>> If it is possible to do this in Parquet, that would be great, but 
>>>>>>>>>>>> Avro and ORC would still suffer.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, May 26, 2025 at 10:07 PM Amogh Jahagirdar <2am...@gmail.com> wrote:
>>>>>>>>>>>>> Hey Peter,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks for bringing this issue up. I think I agree with Fokko: wide 
>>>>>>>>>>>>> tables leading to Parquet metadata bloat and poor Thrift 
>>>>>>>>>>>>> deserialization performance is a long-standing issue, and I believe 
>>>>>>>>>>>>> there is motivation in the community to address it. So to me it 
>>>>>>>>>>>>> seems better to address it in Parquet itself rather than have the 
>>>>>>>>>>>>> Iceberg library facilitate a pattern that works around the 
>>>>>>>>>>>>> limitations.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Mon, May 26, 2025 at 1:42 PM Fokko Driesprong <fo...@apache.org> wrote:
>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks for bringing this up. Wouldn't it make more sense to fix 
>>>>>>>>>>>>>> this in Parquet itself? It has been a long-running issue on 
>>>>>>>>>>>>>> Parquet, and there is still active interest from the community. 
>>>>>>>>>>>>>> There is a PR to replace the footer with FlatBuffers, which 
>>>>>>>>>>>>>> dramatically improves performance 
>>>>>>>>>>>>>> <https://github.com/apache/arrow/pull/43793>. The underlying 
>>>>>>>>>>>>>> proposal can be found here 
>>>>>>>>>>>>>> <https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit?tab=t.0#heading=h.atbrz9ch6nfa>.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>> Fokko
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Mon, May 26, 2025 at 8:35 PM yun zou <yunzou.colost...@gmail.com> wrote:
>>>>>>>>>>>>>>> +1, I am really interested in this topic. Performance has 
>>>>>>>>>>>>>>> always been a problem when dealing with wide tables, not just for 
>>>>>>>>>>>>>>> reads/writes but also during compilation. Most ML use cases 
>>>>>>>>>>>>>>> typically exhibit a vectorized read/write pattern, and I am also 
>>>>>>>>>>>>>>> wondering whether there is any way, at the metadata level, to 
>>>>>>>>>>>>>>> help the whole compilation and execution process. I do not have 
>>>>>>>>>>>>>>> an answer for this yet, but I would be really interested in 
>>>>>>>>>>>>>>> exploring it further. 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>> Yun
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 9:14 AM Pucheng Yang 
>>>>>>>>>>>>>>> <py...@pinterest.com.invalid> wrote:
>>>>>>>>>>>>>>>> Hi Peter, I am interested in this proposal. What's more, I am 
>>>>>>>>>>>>>>>> curious whether there is a similar story on the write side as 
>>>>>>>>>>>>>>>> well (how to generate these split files), and specifically, are 
>>>>>>>>>>>>>>>> you targeting feature-backfill use cases in ML?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 6:29 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>>>> Hi Team,
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> In machine learning use-cases, it's common to encounter 
>>>>>>>>>>>>>>>>> tables with a very high number of columns - sometimes even in 
>>>>>>>>>>>>>>>>> the range of several thousand. I've seen cases with up to 
>>>>>>>>>>>>>>>>> 15,000 columns. Storing such wide tables in a single Parquet 
>>>>>>>>>>>>>>>>> file is often suboptimal, as Parquet can become a bottleneck, 
>>>>>>>>>>>>>>>>> even when only a subset of columns is queried.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> A common approach to mitigate this is to split the data 
>>>>>>>>>>>>>>>>> across multiple Parquet files. With the upcoming File Format 
>>>>>>>>>>>>>>>>> API, we could introduce a layer that combines these files 
>>>>>>>>>>>>>>>>> into a single iterator, enabling efficient reading of wide 
>>>>>>>>>>>>>>>>> and very wide tables.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> To support this, we would need to revise the metadata 
>>>>>>>>>>>>>>>>> specification. Instead of the current `_file` column, we 
>>>>>>>>>>>>>>>>> could introduce a `_files` column containing:
>>>>>>>>>>>>>>>>> - `_file_column_ids`: the column IDs present in each file
>>>>>>>>>>>>>>>>> - `_file_path`: the path to the corresponding file
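>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> To illustrate the reader-side idea, here is a hedged sketch in 
>>>>>>>>>>>>>>>>> pyarrow of how such entries could be stitched back together. The 
>>>>>>>>>>>>>>>>> entries and paths below are hypothetical (this is the proposed 
>>>>>>>>>>>>>>>>> metadata, not anything in the current spec), projection is by 
>>>>>>>>>>>>>>>>> column name for brevity rather than by field ID, and it assumes 
>>>>>>>>>>>>>>>>> every listed file holds the same rows in the same order:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>     import pyarrow.parquet as pq
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>     # one logical data file split into column-subset physical files
>>>>>>>>>>>>>>>>>     files = [
>>>>>>>>>>>>>>>>>         {"_file_path": "data/base_columns.parquet", "_file_column_ids": [1, 2]},
>>>>>>>>>>>>>>>>>         {"_file_path": "data/new_features.parquet", "_file_column_ids": [3, 4]},
>>>>>>>>>>>>>>>>>     ]
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>     def read_wide(entries, projected):
>>>>>>>>>>>>>>>>>         """Combine the column-subset files into a single table."""
>>>>>>>>>>>>>>>>>         pieces = []
>>>>>>>>>>>>>>>>>         for entry in entries:
>>>>>>>>>>>>>>>>>             table = pq.read_table(entry["_file_path"])
>>>>>>>>>>>>>>>>>             wanted = [c for c in table.column_names if c in projected]
>>>>>>>>>>>>>>>>>             if wanted:
>>>>>>>>>>>>>>>>>                 pieces.append(table.select(wanted))
>>>>>>>>>>>>>>>>>         # zip the column sets back together; row counts must match
>>>>>>>>>>>>>>>>>         combined = pieces[0]
>>>>>>>>>>>>>>>>>         for piece in pieces[1:]:
>>>>>>>>>>>>>>>>>             for name in piece.column_names:
>>>>>>>>>>>>>>>>>                 combined = combined.append_column(name, piece.column(name))
>>>>>>>>>>>>>>>>>         return combined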
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Has there been any prior discussion around this idea?
>>>>>>>>>>>>>>>>> Is anyone else interested in exploring this further?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>> Peter
