There's a `file_path` field in the Parquet ColumnChunk structure:
https://github.com/apache/parquet-format/blob/apache-parquet-format-2.11.0/src/main/thrift/parquet.thrift#L959-L962

I'm not sure what tooling actually supports this though; the one use I know
of is Hadoop-style `_metadata` summary files (sketched below). Could be
interesting to see what the history of this is:
https://lists.apache.org/thread/rcv1cxndp113shjybfcldh6nq1t3lcq3,
https://lists.apache.org/thread/k5nv310yp315fttcz213l8o0vmnd7vyw
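
A minimal pyarrow sketch of that summary-file pattern (file names are
illustrative):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"a": [1, 2], "b": ["x", "y"]})

    # Collect the data file's metadata while writing it
    collector = []
    pq.write_table(table, "part-0.parquet", metadata_collector=collector)

    # file_path is resolved relative to the _metadata file's directory
    collector[-1].set_file_path("part-0.parquet")
    pq.write_metadata(table.schema, "_metadata", metadata_collector=collector)

    # Each ColumnChunk in the summary now points at its data file
    meta = pq.read_metadata("_metadata")
    print(meta.row_group(0).column(0).file_path)  # -> part-0.parquet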

On Tue, May 27, 2025 at 8:59 AM Russell Spitzer <russell.spit...@gmail.com>
wrote:

> I have to agree that while there can be some fixes in Parquet, we
> fundamentally need a way to split a "row group" or something like that
> between separate files. If that's something we can do in the Parquet
> project, that would be great, but it feels like we need to start exploring
> more drastic options than footer encoding.
>
> On Mon, May 26, 2025 at 8:42 PM Gang Wu <ust...@gmail.com> wrote:
>
>> I agree with Steven that there are limitations to what Parquet can do.
>>
>> Besides adding new columns requiring a rewrite of all files, wide tables
>> may suffer from other performance problems:
>> - Poor compression of row groups: with that many columns, even a small
>> number of rows reaches the row group size threshold (see the rough
>> arithmetic below).
>> - Dominating columns (e.g. blobs) may account for 99% of a row group's
>> size, leading to unbalanced column chunks and deteriorating row group
>> compression.
>> - Similar to adding new columns, a partial update also requires rewriting
>> all columns of the affected rows.
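>>
>> A back-of-envelope illustration of the first point (Python; the numbers
>> are made up):
>>
>>     row_group_bytes = 128 * 1024 * 1024   # a common row group size target
>>     num_columns = 15_000
>>     bytes_per_value = 8                   # e.g. 64-bit values
>>     rows_per_group = row_group_bytes // (num_columns * bytes_per_value)
>>     print(rows_per_group)                 # -> 1118 rows per row group
>>
>> With only ~1,100 rows per column chunk, encodings and compression have
>> very little data to work with.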
>>
>> IIRC, some table formats already support splitting columns into different
>> files:
>> - Lance manifest splits a fragment [1] into one or more data files.
>> - Apache Hudi has the concept of column family [2].
>> - Apache Paimon supports sequence groups [3] for partial update.
>>
>> Although Parquet could introduce the concept of logical and physical files
>> to manage the column-to-file mapping, that looks like yet another manifest
>> file design, which duplicates the purpose of Iceberg. This might be
>> something worth exploring in Iceberg instead.
>>
>> [1] https://lancedb.github.io/lance/format.html#fragments
>> [2] https://github.com/apache/hudi/blob/master/rfc/rfc-80/rfc-80.md
>> [3]
>> https://paimon.apache.org/docs/0.9/primary-key-table/merge-engine/partial-update/#sequence-group
>>
>> Best,
>> Gang
>>
>>
>>
>> On Tue, May 27, 2025 at 7:03 AM Steven Wu <stevenz...@gmail.com> wrote:
>>
>>> The Parquet metadata proposal (linked by Fokko) mainly addresses read
>>> performance problems caused by bloated metadata.
>>>
>>> What Peter described in the description seems useful for some ML
>>> feature-engineering workloads: a new set of features/columns is added to
>>> the table, and currently Iceberg would require rewriting all data files
>>> to combine the old and new columns (write amplification). Similarly, the
>>> community has also talked in the past about use cases that update a
>>> single column, which would likewise require rewriting all data files.
>>>
>>> On Mon, May 26, 2025 at 2:42 PM Péter Váry <peter.vary.apa...@gmail.com>
>>> wrote:
>>>
>>>> Do you have the link at hand for the thread where this was discussed on
>>>> the Parquet list?
>>>> The docs seem quite old and the PR stale, so I would like to understand
>>>> the situation better.
>>>> If it is possible to do this in Parquet, that would be great, but Avro
>>>> and ORC would still suffer.
>>>>
>>>> On Mon, May 26, 2025 at 10:07 PM Amogh Jahagirdar <2am...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hey Peter,
>>>>>
>>>>> Thanks for bringing this issue up. I think I agree with Fokko; wide
>>>>> tables leading to Parquet metadata bloat and poor Thrift
>>>>> deserialization performance is a long-standing issue that I believe
>>>>> there's motivation in the community to address. So to me it seems
>>>>> better to address it in Parquet itself rather than have the Iceberg
>>>>> library facilitate a pattern that works around the limitations.
>>>>>
>>>>> Thanks,
>>>>> Amogh Jahagirdar
>>>>>
>>>>> On Mon, May 26, 2025 at 1:42 PM Fokko Driesprong <fo...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hi Peter,
>>>>>>
>>>>>> Thanks for bringing this up. Wouldn't it make more sense to fix this
>>>>>> in Parquet itself? It has been a long-running issue in Parquet, and
>>>>>> there is still active interest from the community. There is a PR to
>>>>>> replace the footer with FlatBuffers, which dramatically improves
>>>>>> performance <https://github.com/apache/arrow/pull/43793>. The
>>>>>> underlying proposal can be found here
>>>>>> <https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit?tab=t.0#heading=h.atbrz9ch6nfa>.
>>>>>>
>>>>>> Kind regards,
>>>>>> Fokko
>>>>>>
>>>>>> On Mon, May 26, 2025 at 8:35 PM yun zou <
>>>>>> yunzou.colost...@gmail.com> wrote:
>>>>>>
>>>>>>> +1, I am really interested in this topic. Performance has always
>>>>>>> been a problem when dealing with wide tables, not just for
>>>>>>> reads/writes but also during query compilation. Most ML use cases
>>>>>>> exhibit a vectorized read/write pattern, and I am wondering if there
>>>>>>> is any way at the metadata level to help the whole compilation and
>>>>>>> execution process. I do not have any answer for this yet, but I
>>>>>>> would be really interested in exploring this further.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Yun
>>>>>>>
>>>>>>> On Mon, May 26, 2025 at 9:14 AM Pucheng Yang
>>>>>>> <py...@pinterest.com.invalid> wrote:
>>>>>>>
>>>>>>>> Hi Peter, I am interested in this proposal. What's more, I am
>>>>>>>> curious whether there is a similar story on the write side (how to
>>>>>>>> generate these split files), and specifically whether you are
>>>>>>>> targeting feature-backfill use cases in ML.
>>>>>>>>
>>>>>>>> On Mon, May 26, 2025 at 6:29 AM Péter Váry <
>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Team,
>>>>>>>>>
>>>>>>>>> In machine learning use cases, it's common to encounter tables
>>>>>>>>> with a very high number of columns - sometimes even in the range
>>>>>>>>> of several thousand. I've seen cases with up to 15,000 columns.
>>>>>>>>> Storing such wide tables in a single Parquet file is often
>>>>>>>>> suboptimal, as Parquet can become a bottleneck, even when only a
>>>>>>>>> subset of columns is queried.
>>>>>>>>>
>>>>>>>>> A common approach to mitigate this is to split the data across
>>>>>>>>> multiple Parquet files. With the upcoming File Format API, we could
>>>>>>>>> introduce a layer that combines these files into a single iterator,
>>>>>>>>> enabling efficient reading of wide and very wide tables.
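>>>>>>>>>
>>>>>>>>> A minimal sketch of such a combining layer (pyarrow; it assumes
>>>>>>>>> the files are row-aligned and hold disjoint column sets):
>>>>>>>>>
>>>>>>>>>     import pyarrow as pa
>>>>>>>>>     import pyarrow.parquet as pq
>>>>>>>>>
>>>>>>>>>     def iter_wide_batches(paths, batch_size=1024):
>>>>>>>>>         # One batch stream per physical file, advanced in lockstep
>>>>>>>>>         readers = [pq.ParquetFile(p).iter_batches(batch_size=batch_size)
>>>>>>>>>                    for p in paths]
>>>>>>>>>         for parts in zip(*readers):
>>>>>>>>>             arrays = [col for part in parts for col in part.columns]
>>>>>>>>>             names = [n for part in parts for n in part.schema.names]
>>>>>>>>>             # Stitch the column subsets back into one wide batch
>>>>>>>>>             yield pa.record_batch(arrays, names=names)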
>>>>>>>>>
>>>>>>>>> To support this, we would need to revise the metadata
>>>>>>>>> specification. Instead of the current `_file` column, we could
>>>>>>>>> introduce a `_files` column containing:
>>>>>>>>> - `_file_column_ids`: the column IDs present in each file
>>>>>>>>> - `_file_path`: the path to the corresponding file
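>>>>>>>>>
>>>>>>>>> A hypothetical entry could then look like this (names and layout
>>>>>>>>> purely illustrative, not a spec proposal):
>>>>>>>>>
>>>>>>>>>     entry = {
>>>>>>>>>         "_files": [
>>>>>>>>>             {"_file_path": "s3://bucket/tbl/data/f1-base.parquet",
>>>>>>>>>              "_file_column_ids": [1, 2, 3]},  # original columns
>>>>>>>>>             {"_file_path": "s3://bucket/tbl/data/f1-feats.parquet",
>>>>>>>>>              "_file_column_ids": [4, 5]},     # backfilled features
>>>>>>>>>         ]
>>>>>>>>>     }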
>>>>>>>>>
>>>>>>>>> Has there been any prior discussion around this idea?
>>>>>>>>> Is anyone else interested in exploring this further?
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>
