Point definitely taken. We should probably POC some of these ideas and see
what we are actually dealing with. (He said without volunteering to do the
work :P)

On Tue, May 27, 2025 at 11:55 AM Selcuk Aya
<selcuk....@snowflake.com.invalid> wrote:

> Yes, having to rewrite the whole file is not ideal, but I believe most of
> the cost of rewriting a file comes from decompression, encoding, stats
> calculations, etc. If you are adding new values for some columns but
> keeping the rest of the columns in the file the same, then much of that
> rewrite cost can be optimized away. I am not saying this is better than
> writing to a separate file; I am just not sure how much worse it is.
>
> On Tue, May 27, 2025 at 9:40 AM Russell Spitzer <russell.spit...@gmail.com>
> wrote:
>
>> I think that "after the fact" modification is one of the requirements
>> here, i.e., updating a single column without rewriting the whole file.
>> If we have to write new metadata for the file, aren't we in the same
>> boat as having to rewrite the whole file?
>>
>> On Tue, May 27, 2025 at 11:27 AM Selcuk Aya
>> <selcuk....@snowflake.com.invalid> wrote:
>>
>>> If files represent column projections of a table rather than the full
>>> set of columns in the table, then any read that spans these files needs
>>> to identify what constitutes a row. Lance DB, for example, has vertical
>>> partitioning across columns but also horizontal partitioning across
>>> rows, such that within each horizontal partition (fragment) every
>>> vertical partition holds the same number of rows, which I think is
>>> necessary to make whole/partial row construction cheap. If that is the
>>> case, there is no reason not to achieve the same data layout inside a
>>> single columnar file with a lean header. I think the only valid
>>> argument for a separate file is adding a new set of columns to an
>>> existing table, but even then I am not sure a separate file is
>>> absolutely necessary for good performance.
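>>>
>>> To make that aligned-rows assumption concrete, here is a rough,
>>> untested pyarrow sketch (file names are hypothetical) of stitching full
>>> rows back together from two column-projection files that hold the same
>>> rows in the same order:
>>>
>>>   import pyarrow.parquet as pq
>>>
>>>   # Hypothetical files: each holds a different subset of columns for
>>>   # the same rows, written in the same order, so full rows can be
>>>   # reconstructed purely by position.
>>>   left = pq.read_table("cols_a_to_m.parquet")
>>>   right = pq.read_table("cols_n_to_z.parquet")
>>>   assert left.num_rows == right.num_rows  # the alignment invariant
>>>
>>>   combined = left
>>>   for name in right.column_names:
>>>       combined = combined.append_column(name, right.column(name))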
>>>
>>> Selcuk
>>>
>>> On Tue, May 27, 2025 at 9:18 AM Devin Smith
>>> <devinsm...@deephaven.io.invalid> wrote:
>>>
>>>> There's a `file_path` field in the Parquet ColumnChunk structure:
>>>> https://github.com/apache/parquet-format/blob/apache-parquet-format-2.11.0/src/main/thrift/parquet.thrift#L959-L962
>>>>
>>>> I'm not sure what tooling actually supports it, though. It could be
>>>> interesting to dig into the history:
>>>> https://lists.apache.org/thread/rcv1cxndp113shjybfcldh6nq1t3lcq3,
>>>> https://lists.apache.org/thread/k5nv310yp315fttcz213l8o0vmnd7vyw
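>>>>
>>>> For anyone who wants to poke at it, pyarrow exposes that field on its
>>>> column-chunk metadata, so an untested sketch along these lines (file
>>>> name is hypothetical) should show whether a given writer ever sets it:
>>>>
>>>>   import pyarrow.parquet as pq
>>>>
>>>>   # Print the file_path recorded for every column chunk; it is
>>>>   # normally empty, meaning the chunk lives in the same file as the
>>>>   # footer that references it.
>>>>   md = pq.ParquetFile("example.parquet").metadata
>>>>   for rg in range(md.num_row_groups):
>>>>       for col in range(md.num_columns):
>>>>           cc = md.row_group(rg).column(col)
>>>>           print(rg, cc.path_in_schema, cc.file_path)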
>>>>
>>>> On Tue, May 27, 2025 at 8:59 AM Russell Spitzer <
>>>> russell.spit...@gmail.com> wrote:
>>>>
>>>>> I have to agree that while there can be some fixes in Parquet, we
>>>>> fundamentally need a way to split a "row group" or something like
>>>>> that across separate files. If that's something we can do in the
>>>>> Parquet project, that would be great, but it feels like we need to
>>>>> start exploring more drastic options than footer encoding.
>>>>>
>>>>> On Mon, May 26, 2025 at 8:42 PM Gang Wu <ust...@gmail.com> wrote:
>>>>>
>>>>>> I agree with Steven that there are limitations that Parquet cannot
>>>>>> address on its own.
>>>>>>
>>>>>> In addition to requiring that all files be rewritten to add new
>>>>>> columns, files of wide tables may suffer from poor performance, for
>>>>>> example:
>>>>>> - Poor compression of row groups, because there are so many columns
>>>>>> that even a small number of rows reaches the row group threshold.
>>>>>> - Dominating columns (e.g. blobs) may contribute 99% of a row
>>>>>> group's size, leading to unbalanced column chunks and deteriorating
>>>>>> row group compression.
>>>>>> - Similar to adding new columns, a partial update also requires
>>>>>> rewriting all columns of the affected rows.
>>>>>>
>>>>>> IIRC, some table formats already support splitting columns into
>>>>>> different files:
>>>>>> - The Lance manifest splits a fragment [1] into one or more data files.
>>>>>> - Apache Hudi has the concept of column families [2].
>>>>>> - Apache Paimon supports sequence groups [3] for partial updates.
>>>>>>
>>>>>> Although Parquet could introduce the concepts of a logical file and
>>>>>> physical files to manage the column-to-file mapping, that looks like
>>>>>> yet another manifest file design, which duplicates the purpose of
>>>>>> Iceberg. This might be something worth exploring in Iceberg instead.
>>>>>>
>>>>>> [1] https://lancedb.github.io/lance/format.html#fragments
>>>>>> [2] https://github.com/apache/hudi/blob/master/rfc/rfc-80/rfc-80.md
>>>>>> [3]
>>>>>> https://paimon.apache.org/docs/0.9/primary-key-table/merge-engine/partial-update/#sequence-group
>>>>>>
>>>>>> Best,
>>>>>> Gang
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, May 27, 2025 at 7:03 AM Steven Wu <stevenz...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> The Parquet metadata proposal (linked by Fokko) mainly addresses
>>>>>>> read performance problems caused by bloated metadata.
>>>>>>>
>>>>>>> What Peter described in the description seems useful for some ML
>>>>>>> feature-engineering workloads. When a new set of features/columns
>>>>>>> is added to the table, Iceberg currently requires rewriting all
>>>>>>> data files to combine the old and new columns (write
>>>>>>> amplification). Similarly, the community has also discussed in the
>>>>>>> past the use case of updating a single column, which would likewise
>>>>>>> require rewriting all data files.
>>>>>>>
>>>>>>> On Mon, May 26, 2025 at 2:42 PM Péter Váry <
>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Do you have the link at hand for the thread where this was
>>>>>>>> discussed on the Parquet list?
>>>>>>>> The docs seem quite old and the PR stale, so I would like to
>>>>>>>> understand the situation better.
>>>>>>>> If it is possible to do this in Parquet, that would be great, but
>>>>>>>> Avro and ORC would still suffer.
>>>>>>>>
>>>>>>>> On Mon, May 26, 2025 at 10:07 PM, Amogh Jahagirdar <2am...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hey Peter,
>>>>>>>>>
>>>>>>>>> Thanks for bringing this issue up. I think I agree with Fokko:
>>>>>>>>> wide tables leading to Parquet metadata bloat and poor Thrift
>>>>>>>>> deserialization performance is a long-standing issue that I
>>>>>>>>> believe there is motivation in the community to address. So to me
>>>>>>>>> it seems better to address it in Parquet itself rather than have
>>>>>>>>> the Iceberg library facilitate a pattern that works around the
>>>>>>>>> limitations.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Amogh Jahagirdar
>>>>>>>>>
>>>>>>>>> On Mon, May 26, 2025 at 1:42 PM Fokko Driesprong <fo...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Peter,
>>>>>>>>>>
>>>>>>>>>> Thanks for bringing this up. Wouldn't it make more sense to fix
>>>>>>>>>> this in Parquet itself? It has been a long-running issue in
>>>>>>>>>> Parquet, and there is still active interest from the community.
>>>>>>>>>> There is a PR to replace the footer with FlatBuffers, which
>>>>>>>>>> dramatically improves performance
>>>>>>>>>> <https://github.com/apache/arrow/pull/43793>. The underlying
>>>>>>>>>> proposal can be found here
>>>>>>>>>> <https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit?tab=t.0#heading=h.atbrz9ch6nfa>.
>>>>>>>>>>
>>>>>>>>>> Kind regards,
>>>>>>>>>> Fokko
>>>>>>>>>>
>>>>>>>>>> On Mon, May 26, 2025 at 8:35 PM, yun zou <
>>>>>>>>>> yunzou.colost...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1, I am really interested in this topic. Performance has
>>>>>>>>>>> always been a problem when dealing with wide tables, not just
>>>>>>>>>>> for reads/writes but also during query compilation. Most ML use
>>>>>>>>>>> cases typically exhibit a vectorized read/write pattern, and I
>>>>>>>>>>> am also wondering whether there is any way, at the metadata
>>>>>>>>>>> level, to help the whole compilation and execution process. I
>>>>>>>>>>> do not have any answer for this yet, but I would be really
>>>>>>>>>>> interested in exploring it further.
>>>>>>>>>>>
>>>>>>>>>>> Best Regards,
>>>>>>>>>>> Yun
>>>>>>>>>>>
>>>>>>>>>>> On Mon, May 26, 2025 at 9:14 AM Pucheng Yang
>>>>>>>>>>> <py...@pinterest.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Peter, I am interested in this proposal. What's more, I am
>>>>>>>>>>>> curious whether there is a similar story on the write side as
>>>>>>>>>>>> well (how to generate these split files), and specifically,
>>>>>>>>>>>> are you targeting feature backfill use cases in ML?
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, May 26, 2025 at 6:29 AM Péter Váry <
>>>>>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Team,
>>>>>>>>>>>>>
>>>>>>>>>>>>> In machine learning use cases, it's common to encounter
>>>>>>>>>>>>> tables with a very high number of columns, sometimes even in
>>>>>>>>>>>>> the range of several thousand. I've seen cases with up to
>>>>>>>>>>>>> 15,000 columns. Storing such wide tables in a single Parquet
>>>>>>>>>>>>> file is often suboptimal, as Parquet can become a bottleneck,
>>>>>>>>>>>>> even when only a subset of columns is queried.
>>>>>>>>>>>>>
>>>>>>>>>>>>> A common approach to mitigate this is to split the data across
>>>>>>>>>>>>> multiple Parquet files. With the upcoming File Format API, we 
>>>>>>>>>>>>> could
>>>>>>>>>>>>> introduce a layer that combines these files into a single 
>>>>>>>>>>>>> iterator,
>>>>>>>>>>>>> enabling efficient reading of wide and very wide tables.
>>>>>>>>>>>>>
>>>>>>>>>>>>> To support this, we would need to revise the metadata
>>>>>>>>>>>>> specification. Instead of the current `_file` column, we
>>>>>>>>>>>>> could introduce a `_files` column containing:
>>>>>>>>>>>>> - `_file_column_ids`: the column IDs present in each file
>>>>>>>>>>>>> - `_file_path`: the path to the corresponding file
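>>>>>>>>>>>>>
>>>>>>>>>>>>> As a purely hypothetical illustration (only `_file_column_ids`
>>>>>>>>>>>>> and `_file_path` come from the proposal above; the paths, IDs,
>>>>>>>>>>>>> and surrounding layout are made up), a metadata entry might
>>>>>>>>>>>>> then carry something like:
>>>>>>>>>>>>>
>>>>>>>>>>>>>   # Hypothetical sketch, not a spec proposal.
>>>>>>>>>>>>>   entry = {
>>>>>>>>>>>>>       "_files": [
>>>>>>>>>>>>>           {"_file_column_ids": [1, 2, 3],
>>>>>>>>>>>>>            "_file_path": "s3://bucket/table/data/f1.parquet"},
>>>>>>>>>>>>>           {"_file_column_ids": [4, 5],
>>>>>>>>>>>>>            "_file_path": "s3://bucket/table/data/f2.parquet"},
>>>>>>>>>>>>>       ]
>>>>>>>>>>>>>   }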
>>>>>>>>>>>>>
>>>>>>>>>>>>> Has there been any prior discussion around this idea?
>>>>>>>>>>>>> Is anyone else interested in exploring this further?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>
>>>>>>>>>>>>
