+1, I am really interested in this topic. Performance has always been a
problem when dealing with wide tables, not just for read/write but also during
compilation. Most ML use cases typically exhibit a vectorized read/write
pattern, so I am also wondering whether there is any way to help the whole
compilation and execution process at the metadata level. I do not have an
answer for this yet, but I would be really interested in exploring it
further.

Best Regards,
Yun

On Mon, May 26, 2025 at 9:14 AM Pucheng Yang <py...@pinterest.com.invalid>
wrote:

> Hi Peter, I am interested in this proposal. What's more, I am curious
> whether there is a similar story on the write side as well (how to generate
> these split files) and, specifically, whether you are targeting feature
> backfill use cases in ML?
>
> On Mon, May 26, 2025 at 6:29 AM Péter Váry <peter.vary.apa...@gmail.com>
> wrote:
>
>> Hi Team,
>>
>> In machine learning use-cases, it's common to encounter tables with a
>> very high number of columns - sometimes even in the range of several
>> thousand. I've seen cases with up to 15,000 columns. Storing such wide
>> tables in a single Parquet file is often suboptimal, as Parquet can become
>> a bottleneck, even when only a subset of columns is queried.
>>
>> A common approach to mitigate this is to split the data across multiple
>> Parquet files. With the upcoming File Format API, we could introduce a
>> layer that combines these files into a single iterator, enabling efficient
>> reading of wide and very wide tables.
>>
>> To support this, we would need to revise the metadata specification.
>> Instead of the current `_file` column, we could introduce a `_files` column
>> containing:
>> - `_file_column_ids`: the column IDs present in each file
>> - `_file_path`: the path to the corresponding file
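>>
>> To make this concrete, here is a rough Java sketch of how one entry of such
>> a `_files` column could be modelled and used for column pruning on the read
>> side (the class name, column IDs and file paths below are just placeholders
>> for illustration):
>>
>> import java.util.List;
>>
>> public class WideTableSplitSketch {
>>     // Placeholder for one entry of the proposed `_files` column: the column
>>     // IDs a physical Parquet file carries, plus where that file lives.
>>     record FileSlice(List<Integer> fileColumnIds, String filePath) {}
>>
>>     public static void main(String[] args) {
>>         // One logical data file split into two column-subset Parquet files.
>>         List<FileSlice> files = List.of(
>>             new FileSlice(List.of(1, 2, 3), "part-0-cols-0001-0003.parquet"),
>>             new FileSlice(List.of(4, 5, 6), "part-0-cols-0004-0006.parquet"));
>>
>>         // A reader that only needs column 5 would open just the second file.
>>         int wanted = 5;
>>         files.stream()
>>             .filter(f -> f.fileColumnIds().contains(wanted))
>>             .forEach(f -> System.out.println("read " + f.filePath()));
>>     }
>> }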
>>
>> Has there been any prior discussion around this idea?
>> Is anyone else interested in exploring this further?
>>
>> Best regards,
>> Peter
>>
>
