Hi Peter, I am interested in this proposal. What's more, I am curious if
there is a similar story on the write side as well (how to generate these
splitted files) and specifically, are you targeting feature backfill use
cases in ML use?

On Mon, May 26, 2025 at 6:29 AM Péter Váry <peter.vary.apa...@gmail.com>
wrote:

> Hi Team,
>
> In machine learning use-cases, it's common to encounter tables with a very
> high number of columns - sometimes even in the range of several thousand.
> I've seen cases with up to 15,000 columns. Storing such wide tables in a
> single Parquet file is often suboptimal, as Parquet can become a
> bottleneck, even when only a subset of columns is queried.
>
> A common approach to mitigate this is to split the data across multiple
> Parquet files. With the upcoming File Format API, we could introduce a
> layer that combines these files into a single iterator, enabling efficient
> reading of wide and very wide tables.
>
> To support this, we would need to revise the metadata specification.
> Instead of the current `_file` column, we could introduce a _files column
> containing:
> - `_file_column_ids`: the column IDs present in each file
> - `_file_path`: the path to the corresponding file
>
> Has there been any prior discussion around this idea?
> Is anyone else interested in exploring this further?
>
> Best regards,
> Peter
>

Reply via email to