On Fri, May 30, 2025 at 3:33 PM Péter Váry <peter.vary.apa...@gmail.com>
wrote:

> One key advantage of introducing Physical Files is the flexibility to vary
> RowGroup sizes across columns. For instance, wide string columns could
> benefit from smaller RowGroups to reduce memory pressure, while numeric
> columns could use larger RowGroups to improve compression and scan
> efficiency. Rather than enforcing strict row group alignment across all
> columns, we can explore optimizing read split sizes and write-time RowGroup
> sizes independently - striking a balance between performance and storage
> cost for different data types and queries.
>

That actually sounds very complicated if you want to split file reads in a
distributed system. If you want to read across column groups, then you
always end up over-reading on one of them if they are not aligned.
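
To make the over-read concrete, here is a minimal sketch (pyarrow, with
made-up paths and sizes, purely for illustration): the same rows written as
two single-column files with different row-group sizes produce boundaries
that no longer line up, so a split planned against one file always over-reads
in the other.

import pyarrow as pa
import pyarrow.parquet as pq

rows = 100_000
table = pa.table({
    "text": ["x" * 100] * rows,   # wide string column
    "num": list(range(rows)),     # numeric column
})

# Hypothetical split: one physical file per column, different row-group sizes.
pq.write_table(table.select(["text"]), "/tmp/text.parquet", row_group_size=5_000)
pq.write_table(table.select(["num"]), "/tmp/num.parquet", row_group_size=50_000)

# Row-group boundaries no longer line up, so a split planned against one
# file over-reads (or re-reads) row groups in the other.
for path in ("/tmp/text.parquet", "/tmp/num.parquet"):
    md = pq.ParquetFile(path).metadata
    print(path, [md.row_group(i).num_rows for i in range(md.num_row_groups)])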

And aren't Parquet pages already providing these unaligned sizes?

On Fri, May 30, 2025 at 8:09 AM Gang Wu <ust...@gmail.com> wrote:
>
>> IMO, the main drawback for the view solution is the complexity of
>> maintaining consistency across tables if we want to use features like time
>> travel, incremental scan, branch & tag, encryption, etc.
>>
>> On Fri, May 30, 2025 at 12:55 PM Bryan Keller <brya...@gmail.com> wrote:
>>
>>> Fewer commit conflicts, meaning the tables representing column families
>>> are updated independently rather than having to serialize commits to a
>>> single table. Perhaps with a wide-table solution the commit logic could be
>>> enhanced to support things like concurrent overwrites to independent column
>>> families, but it seems like it would be fairly involved.
>>>
>>>
>>> On May 29, 2025, at 7:16 PM, Steven Wu <stevenz...@gmail.com> wrote:
>>>
>>> Bryan, interesting approach to split horizontally across multiple
>>> tables.
>>>
>>> A few potential downsides:
>>> * operational overhead: the tables need to be managed consistently and
>>> probably in some coordinated way
>>> * more complex reads
>>> * potentially fragile correctness enforcement (during the join). It is more
>>> robust to enforce the stitching correctness at the file-group level in the
>>> file reader and writer if it is built into the table format.
>>>
>>> > fewer commit conflicts
>>>
>>> Can you elaborate on this one? Are those tables populated by streaming
>>> or batch pipelines?
>>>
>>> On Thu, May 29, 2025 at 5:03 PM Bryan Keller <brya...@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> We have been investigating a wide table format internally for a similar
>>>> use case, i.e. we have wide ML tables with features generated by different
>>>> pipelines and teams but want a unified view of the data. We are comparing
>>>> that against separate tables joined together using a shuffle-less join
>>>> (e.g. storage partition join), along with a corresponding view.
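>>>>
>>>> Concretely, the shape of that comparison looks roughly like the sketch
>>>> below (PySpark; the table, column, and view names are invented, and the
>>>> storage-partitioned-join config may differ by Spark version):
>>>>
>>>> from pyspark.sql import SparkSession
>>>>
>>>> spark = SparkSession.builder.getOrCreate()
>>>>
>>>> # Two "column family" tables bucketed the same way on the row key.
>>>> spark.sql("""CREATE TABLE db.base_features (id BIGINT, f1 DOUBLE, f2 DOUBLE)
>>>>              USING iceberg PARTITIONED BY (bucket(64, id))""")
>>>> spark.sql("""CREATE TABLE db.new_features (id BIGINT, f3 DOUBLE)
>>>>              USING iceberg PARTITIONED BY (bucket(64, id))""")
>>>>
>>>> # Let Spark use the matching bucketing to avoid a shuffle on the join.
>>>> spark.conf.set("spark.sql.sources.v2.bucketing.enabled", "true")
>>>>
>>>> # The unified "wide table" is just a view over the join.
>>>> spark.sql("""CREATE VIEW db.wide_features AS
>>>>              SELECT b.id, b.f1, b.f2, n.f3
>>>>              FROM db.base_features b JOIN db.new_features n ON b.id = n.id""")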
>>>>
>>>> The join/view approach seems to give us much of what we need, with some
>>>> added benefits like splitting up the metadata, fewer commit conflicts, and
>>>> the ability to share, nest, and swap "column families". The downsides are
>>>> that table management is split across multiple tables, it requires engine
>>>> support for shuffle-less joins for best performance, and even then, scans
>>>> probably won't be as optimal.
>>>>
>>>> I'm curious if anyone had further thoughts on the two?
>>>>
>>>> -Bryan
>>>>
>>>>
>>>>
>>>> On May 29, 2025, at 8:18 AM, Péter Váry <peter.vary.apa...@gmail.com>
>>>> wrote:
>>>>
>>>> I received feedback from Alkis regarding their Parquet optimization
>>>> work. Their internal testing shows promising results for reducing metadata
>>>> size and improving parsing performance. They plan to formalize a proposal
>>>> for these Parquet enhancements in the near future.
>>>>
>>>> Meanwhile, I'm putting together our horizontal sharding proposal as a
>>>> complementary approach. Even with the Parquet metadata improvements,
>>>> horizontal sharding would provide additional benefits for:
>>>>
>>>>    - More efficient column-level updates
>>>>    - Streamlined column additions
>>>>    - Better handling of dominant columns that can cause RowGroup size
>>>>    imbalances (placing these in separate files could significantly improve
>>>>    performance)
>>>>
>>>> Thanks, Peter
>>>>
>>>>
>>>>
>>>> On Wed, May 28, 2025 at 3:39 PM Péter Váry <peter.vary.apa...@gmail.com>
>>>> wrote:
>>>>
>>>>> I would be happy to put together a proposal based on the input gathered
>>>>> here.
>>>>>
>>>>> Thanks everyone for your thoughts!
>>>>> I will try to incorporate all of this.
>>>>>
>>>>> Thanks, Peter
>>>>>
>>>>> On Tue, May 27, 2025 at 8:07 PM Daniel Weeks <dwe...@apache.org> wrote:
>>>>>
>>>>>> I feel like we have two different issues we're talking about here
>>>>>> that aren't necessarily tied (though solutions may address both): 1) wide
>>>>>> tables, 2) adding columns
>>>>>>
>>>>>> Wide tables are definitely a problem where Parquet has limitations.
>>>>>> I'm optimistic about the ongoing work that Fokko mentioned to improve
>>>>>> Parquet footers/stats in this area.  There are always limits to how this
>>>>>> scales, since wide rows lead to small row groups and the cost to
>>>>>> reconstitute a row gets more expensive, but for cases that are read-heavy
>>>>>> and project subsets of columns it should significantly improve
>>>>>> performance.
>>>>>>
>>>>>> Adding columns to an existing dataset is something that comes up
>>>>>> periodically, but there's a lot of complexity involved in this.  Parquet
>>>>>> does support referencing columns in separate files per the spec, but
>>>>>> there's no implementation that takes advantage of this to my knowledge.
>>>>>> This does allow for approaches where you separate/rewrite just the
>>>>>> footers or various other tricks, but these approaches get complicated
>>>>>> quickly and the number of readers that can consume those representations
>>>>>> would initially be very limited.
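>>>>>>
>>>>>> (As a side note, the spec hook for this is the ColumnChunk file_path
>>>>>> field; a small pyarrow sketch, with a made-up file name, shows it is
>>>>>> exposed but left unset by mainstream writers today:)
>>>>>>
>>>>>> import pyarrow.parquet as pq
>>>>>>
>>>>>> md = pq.ParquetFile("/tmp/example.parquet").metadata
>>>>>> for rg in range(md.num_row_groups):
>>>>>>     for col in range(md.num_columns):
>>>>>>         chunk = md.row_group(rg).column(col)
>>>>>>         # file_path may point at another file per the spec; in practice
>>>>>>         # it is empty, meaning the chunk lives in this same file.
>>>>>>         print(chunk.path_in_schema, repr(chunk.file_path))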
>>>>>>
>>>>>> A larger problem for splitting columns across files is that there are
>>>>>> a lot of assumptions about how data is laid out in both readers and
>>>>>> writers.  For example, aligning row groups and correctly handling split
>>>>>> calculation is very complicated if you're trying to split rows across
>>>>>> files.  Other features are also impacted, like deletes, which reference
>>>>>> the file to which they apply: they would need to account for deletes
>>>>>> applying to multiple files, and those references would need to be updated
>>>>>> if columns are added.
>>>>>>
>>>>>> I believe there are a lot of interesting approaches to addressing
>>>>>> these use cases, but we'd really need a thorough proposal that explores 
>>>>>> all
>>>>>> of these scenarios.  The last thing we would want is to introduce
>>>>>> incompatibilities within the format that result in incompatible features.
>>>>>>
>>>>>> -Dan
>>>>>>
>>>>>> On Tue, May 27, 2025 at 10:02 AM Russell Spitzer <
>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>
>>>>>>> Point definitely taken. We really should probably POC some of
>>>>>>> these ideas and see what we are actually dealing with. (He said without
>>>>>>> volunteering to do the work :P)
>>>>>>>
>>>>>>> On Tue, May 27, 2025 at 11:55 AM Selcuk Aya
>>>>>>> <selcuk....@snowflake.com.invalid> wrote:
>>>>>>>
>>>>>>>> Yes, having to rewrite the whole file is not ideal, but I believe most
>>>>>>>> of the cost of rewriting a file comes from decompression, encoding,
>>>>>>>> stats calculations, etc. If you are adding new values for some columns
>>>>>>>> but keeping the rest of the columns in the file the same, then much of
>>>>>>>> the rewrite cost can be optimized away. I am not saying this is better
>>>>>>>> than writing to a separate file, but I am not sure how much worse it is.
>>>>>>>>
>>>>>>>> On Tue, May 27, 2025 at 9:40 AM Russell Spitzer <
>>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I think that "after the fact" modification is one of the requirements
>>>>>>>>> here, i.e. updating a single column without rewriting the whole file.
>>>>>>>>> If we have to write new metadata for the file, aren't we in the same
>>>>>>>>> boat as having to rewrite the whole file?
>>>>>>>>>
>>>>>>>>> On Tue, May 27, 2025 at 11:27 AM Selcuk Aya
>>>>>>>>> <selcuk....@snowflake.com.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> If files represent column projections of a table rather than all of
>>>>>>>>>> the columns in the table, then any read that spans these files needs
>>>>>>>>>> to identify what constitutes a row. Lance DB, for example, has
>>>>>>>>>> vertical partitioning across columns but also horizontal partitioning
>>>>>>>>>> across rows, such that within each horizontal partition (fragment)
>>>>>>>>>> the same number of rows exists in each vertical partition, which I
>>>>>>>>>> think is necessary to make whole/partial row construction cheap. If
>>>>>>>>>> that is the case, there is no reason not to achieve the same data
>>>>>>>>>> layout inside a single columnar file with a lean header. I think the
>>>>>>>>>> only valid argument for a separate file is adding a new set of
>>>>>>>>>> columns to an existing table, but even then I am not sure a separate
>>>>>>>>>> file is absolutely necessary for good performance.
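>>>>>>>>>>
>>>>>>>>>> (To illustrate that row-alignment requirement, a tiny sketch assuming
>>>>>>>>>> pyarrow and invented file names: stitching two vertical partitions
>>>>>>>>>> back into rows is only a cheap zip because each holds the same rows
>>>>>>>>>> in the same order.)
>>>>>>>>>>
>>>>>>>>>> import pyarrow.parquet as pq
>>>>>>>>>>
>>>>>>>>>> # Two files holding different column subsets of the *same* rows.
>>>>>>>>>> left = pq.read_table("/tmp/fragment0_cols_a.parquet")
>>>>>>>>>> right = pq.read_table("/tmp/fragment0_cols_b.parquet")
>>>>>>>>>>
>>>>>>>>>> # Row reconstruction only works if row counts (and order) match.
>>>>>>>>>> assert left.num_rows == right.num_rows
>>>>>>>>>> stitched = left
>>>>>>>>>> for name in right.column_names:
>>>>>>>>>>     stitched = stitched.append_column(name, right.column(name))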
>>>>>>>>>>
>>>>>>>>>> Selcuk
>>>>>>>>>>
>>>>>>>>>> On Tue, May 27, 2025 at 9:18 AM Devin Smith
>>>>>>>>>> <devinsm...@deephaven.io.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> There's a `file_path` field in the parquet ColumnChunk
>>>>>>>>>>> structure,
>>>>>>>>>>> https://github.com/apache/parquet-format/blob/apache-parquet-format-2.11.0/src/main/thrift/parquet.thrift#L959-L962
>>>>>>>>>>>
>>>>>>>>>>> I'm not sure what tooling actually supports this though. Could
>>>>>>>>>>> be interesting to see what the history of this is.
>>>>>>>>>>> https://lists.apache.org/thread/rcv1cxndp113shjybfcldh6nq1t3lcq3,
>>>>>>>>>>> https://lists.apache.org/thread/k5nv310yp315fttcz213l8o0vmnd7vyw
>>>>>>>>>>>
>>>>>>>>>>> On Tue, May 27, 2025 at 8:59 AM Russell Spitzer <
>>>>>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I have to agree that while there can be some fixes in Parquet,
>>>>>>>>>>>> we fundamentally need a way to split a "row group"
>>>>>>>>>>>> or something like that between separate files. If that's
>>>>>>>>>>>> something we can do in the parquet project that would be great
>>>>>>>>>>>> but it feels like we need to start exploring more drastic
>>>>>>>>>>>> options than footer encoding.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, May 26, 2025 at 8:42 PM Gang Wu <ust...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I agree with Steven that there are limitations to what Parquet
>>>>>>>>>>>>> can do.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In addition to adding new columns requiring rewriting all files,
>>>>>>>>>>>>> files of wide tables may suffer from poor performance, for example:
>>>>>>>>>>>>> - Poor compression of row groups, because with too many columns
>>>>>>>>>>>>> even a small number of rows can reach the row group size threshold.
>>>>>>>>>>>>> - Dominating columns (e.g. blobs) may contribute 99% of the size of
>>>>>>>>>>>>> a row group, leading to unbalanced column chunks (see the sketch
>>>>>>>>>>>>> below) and further hurting row group compression.
>>>>>>>>>>>>> - Similar to adding new columns, a partial update also requires
>>>>>>>>>>>>> rewriting all columns of the affected rows.
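>>>>>>>>>>>>>
>>>>>>>>>>>>> (A sketch of how the imbalance shows up, using pyarrow against a
>>>>>>>>>>>>> made-up file: compare per-column chunk sizes within one row group.)
>>>>>>>>>>>>>
>>>>>>>>>>>>> import pyarrow.parquet as pq
>>>>>>>>>>>>>
>>>>>>>>>>>>> rg = pq.ParquetFile("/tmp/wide.parquet").metadata.row_group(0)
>>>>>>>>>>>>> sizes = {rg.column(i).path_in_schema: rg.column(i).total_compressed_size
>>>>>>>>>>>>>          for i in range(rg.num_columns)}
>>>>>>>>>>>>> total = sum(sizes.values())
>>>>>>>>>>>>> # A dominating blob column shows up as one entry near 100% of the total.
>>>>>>>>>>>>> for name, size in sorted(sizes.items(), key=lambda kv: -kv[1])[:5]:
>>>>>>>>>>>>>     print(name, f"{100.0 * size / total:.1f}%")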
>>>>>>>>>>>>>
>>>>>>>>>>>>> IIRC, some table formats already support splitting columns
>>>>>>>>>>>>> into different files:
>>>>>>>>>>>>> - Lance manifest splits a fragment [1] into one or more data
>>>>>>>>>>>>> files.
>>>>>>>>>>>>> - Apache Hudi has the concept of column family [2].
>>>>>>>>>>>>> - Apache Paimon supports sequence groups [3] for partial
>>>>>>>>>>>>> update.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Although Parquet could introduce the concept of logical files and
>>>>>>>>>>>>> physical files to manage the column-to-file mapping, that looks
>>>>>>>>>>>>> like yet another manifest file design, which duplicates the purpose
>>>>>>>>>>>>> of Iceberg. This might be something worth exploring in Iceberg
>>>>>>>>>>>>> instead.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] https://lancedb.github.io/lance/format.html#fragments
>>>>>>>>>>>>> [2]
>>>>>>>>>>>>> https://github.com/apache/hudi/blob/master/rfc/rfc-80/rfc-80.md
>>>>>>>>>>>>> [3]
>>>>>>>>>>>>> https://paimon.apache.org/docs/0.9/primary-key-table/merge-engine/partial-update/#sequence-group
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Gang
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, May 27, 2025 at 7:03 AM Steven Wu <
>>>>>>>>>>>>> stevenz...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The Parquet metadata proposal (linked by Fokko) is mainly
>>>>>>>>>>>>>> addressing the read performance due to bloated metadata.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What Peter described in the description seems useful for some ML
>>>>>>>>>>>>>> feature-engineering workloads, where a new set of features/columns
>>>>>>>>>>>>>> is added to the table. Currently, Iceberg would require rewriting
>>>>>>>>>>>>>> all data files to combine old and new columns (write
>>>>>>>>>>>>>> amplification). Similarly, in the past the community has also
>>>>>>>>>>>>>> talked about the use case of updating a single column, which would
>>>>>>>>>>>>>> require rewriting all data files.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, May 26, 2025 at 2:42 PM Péter Váry <
>>>>>>>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Do you have the link at hand for the thread where this was
>>>>>>>>>>>>>>> discussed on the Parquet list?
>>>>>>>>>>>>>>> The docs seem quite old, and the PR stale, so I would like
>>>>>>>>>>>>>>> to understand the situation better.
>>>>>>>>>>>>>>> If it is possible to do this in Parquet, that would be
>>>>>>>>>>>>>>> great, but Avro, ORC would still suffer.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 10:07 PM Amogh Jahagirdar <2am...@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hey Peter,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for bringing this issue up. I think I agree with Fokko;
>>>>>>>>>>>>>>>> the issue of wide tables leading to Parquet metadata bloat and
>>>>>>>>>>>>>>>> poor Thrift deserialization performance is a long-standing issue
>>>>>>>>>>>>>>>> that I believe there's motivation in the community to address.
>>>>>>>>>>>>>>>> So to me it seems better to address it in Parquet itself rather
>>>>>>>>>>>>>>>> than have the Iceberg library facilitate a pattern that works
>>>>>>>>>>>>>>>> around the limitations.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 1:42 PM Fokko Driesprong <
>>>>>>>>>>>>>>>> fo...@apache.org> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Peter,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for bringing this up. Wouldn't it make more sense
>>>>>>>>>>>>>>>>> to fix this in Parquet itself? It has been a long-running 
>>>>>>>>>>>>>>>>> issue on Parquet,
>>>>>>>>>>>>>>>>> and there is still active interest from the community. There 
>>>>>>>>>>>>>>>>> is a PR to
>>>>>>>>>>>>>>>>> replace the footer with FlatBuffers, which dramatically
>>>>>>>>>>>>>>>>> improves performance
>>>>>>>>>>>>>>>>> <https://github.com/apache/arrow/pull/43793>. The
>>>>>>>>>>>>>>>>> underlying proposal can be found here
>>>>>>>>>>>>>>>>> <https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit?tab=t.0#heading=h.atbrz9ch6nfa>
>>>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Kind regards,
>>>>>>>>>>>>>>>>> Fokko
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 8:35 PM yun zou <yunzou.colost...@gmail.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> +1, I am really interested in this topic. Performance has
>>>>>>>>>>>>>>>>>> always been a problem when dealing with wide tables, not just
>>>>>>>>>>>>>>>>>> for reads/writes but also during query compilation. Most ML
>>>>>>>>>>>>>>>>>> use cases typically exhibit a vectorized read/write pattern,
>>>>>>>>>>>>>>>>>> so I am also wondering if there is any way at the metadata
>>>>>>>>>>>>>>>>>> level to help the whole compilation and execution process. I
>>>>>>>>>>>>>>>>>> do not have an answer for this yet, but I would be really
>>>>>>>>>>>>>>>>>> interested in exploring it further.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>>>> Yun
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 9:14 AM Pucheng Yang
>>>>>>>>>>>>>>>>>> <py...@pinterest.com.invalid> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Peter, I am interested in this proposal. What's more, I am
>>>>>>>>>>>>>>>>>>> curious whether there is a similar story on the write side as
>>>>>>>>>>>>>>>>>>> well (how to generate these split files), and specifically,
>>>>>>>>>>>>>>>>>>> are you targeting feature backfill use cases in ML?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Mon, May 26, 2025 at 6:29 AM Péter Váry <
>>>>>>>>>>>>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Team,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> In machine learning use-cases, it's common to encounter
>>>>>>>>>>>>>>>>>>>> tables with a very high number of columns - sometimes even 
>>>>>>>>>>>>>>>>>>>> in the range of
>>>>>>>>>>>>>>>>>>>> several thousand. I've seen cases with up to 15,000 
>>>>>>>>>>>>>>>>>>>> columns. Storing such
>>>>>>>>>>>>>>>>>>>> wide tables in a single Parquet file is often suboptimal, 
>>>>>>>>>>>>>>>>>>>> as Parquet can
>>>>>>>>>>>>>>>>>>>> become a bottleneck, even when only a subset of columns is 
>>>>>>>>>>>>>>>>>>>> queried.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> A common approach to mitigate this is to split the data
>>>>>>>>>>>>>>>>>>>> across multiple Parquet files. With the upcoming File 
>>>>>>>>>>>>>>>>>>>> Format API, we could
>>>>>>>>>>>>>>>>>>>> introduce a layer that combines these files into a single 
>>>>>>>>>>>>>>>>>>>> iterator,
>>>>>>>>>>>>>>>>>>>> enabling efficient reading of wide and very wide tables.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> To support this, we would need to revise the metadata
>>>>>>>>>>>>>>>>>>>> specification. Instead of the current `_file` column, we
>>>>>>>>>>>>>>>>>>>> could introduce a `_files` column containing:
>>>>>>>>>>>>>>>>>>>> - `_file_column_ids`: the column IDs present in each file
>>>>>>>>>>>>>>>>>>>> - `_file_path`: the path to the corresponding file
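>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> (Purely as a hypothetical sketch of how a reader could use
>>>>>>>>>>>>>>>>>>>> such entries - the field names follow the list above, and
>>>>>>>>>>>>>>>>>>>> everything else, including the paths, is invented:)
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> # Hypothetical `_files` entries describing one logical data file.
>>>>>>>>>>>>>>>>>>>> files = [
>>>>>>>>>>>>>>>>>>>>     {"_file_column_ids": [1, 2, 3], "_file_path": "s3://bucket/base.parquet"},
>>>>>>>>>>>>>>>>>>>>     {"_file_column_ids": [4, 5], "_file_path": "s3://bucket/new_cols.parquet"},
>>>>>>>>>>>>>>>>>>>> ]
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> def files_for_projection(projected_ids):
>>>>>>>>>>>>>>>>>>>>     """Physical files a scan must open for the projected column IDs."""
>>>>>>>>>>>>>>>>>>>>     return [f["_file_path"] for f in files
>>>>>>>>>>>>>>>>>>>>             if set(f["_file_column_ids"]) & set(projected_ids)]
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> # A query touching only the new columns never opens the base file.
>>>>>>>>>>>>>>>>>>>> print(files_for_projection([4]))   # ['s3://bucket/new_cols.parquet']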
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Has there been any prior discussion around this idea?
>>>>>>>>>>>>>>>>>>>> Is anyone else interested in exploring this further?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>
>>>
