Re: Interest in Parquet V3

Martin Loncaric Tue, 14 May 2024 12:20:55 -0700

I think Parquet's metadata and encoding/compression setup are problematic,
but I don't see a reason to make Parquet V3 if it's just going to be
another BtrBlocks or Nimble look-alike.

Some people in the thread have expressed the view that Parquet's metadata
is fine, and that people can achieve good performance with it. I disagree:

* First of all, many applications cannot persist an in-memory index of the
metadata. That's only really possible in data warehouses. ML training and
batch processing jobs need to start loading data cold. For large jobs, each
worker will process only a small fraction of the data, so caching will be
useless and hog memory. If your solution is for the user to produce a
leaner index file themselves, that's not really a solution.
* Lots of data stores are backed by SSDs instead of HDDs, and the website's
recommendation of 512-1024MiB per row group is not always appealing. For
many applications (again, ML workloads), people want fast random access to
smaller chunks of data. Think single-digit MB. Parsing the Thrift metadata
for the ENTIRE file every time you want to read a row group is going to be
way too expensive.
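
To make the second point concrete, here is a rough back-of-the-envelope
sketch. It assumes the footer holds one metadata entry per column chunk
(one per row group per column). The 10 GiB file, the 200 columns, and the
100 bytes per entry are illustrative assumptions, not measurements, and
footer_estimate is a hypothetical helper, not part of any Parquet library.

```python
# Back-of-the-envelope estimate. Every constant below is an assumption
# chosen for illustration, not a measured Parquet figure.

MIB = 1024 * 1024


def footer_estimate(file_bytes, row_group_bytes, num_columns, bytes_per_chunk_meta):
    """Return (row_group_count, estimated_footer_bytes).

    The footer carries one entry per column chunk, i.e. one per
    (row group, column) pair, so it grows with both counts.
    """
    row_groups = -(-file_bytes // row_group_bytes)  # ceiling division
    footer_bytes = row_groups * num_columns * bytes_per_chunk_meta
    return row_groups, footer_bytes


if __name__ == "__main__":
    file_bytes = 10 * 1024 * MIB  # assumed 10 GiB file
    for rg_mib in (512, 4):  # recommended size vs. single-digit-MB size
        groups, footer = footer_estimate(file_bytes, rg_mib * MIB, 200, 100)
        print(
            f"{rg_mib:>4} MiB row groups: {groups:>5} groups, "
            f"footer ~{footer / MIB:.1f} MiB"
        )
```

Under these assumptions, moving from 512 MiB to 4 MiB row groups takes
the footer from well under 1 MiB to about 49 MiB, which is roughly twelve
times the size of the single row group a worker actually wants to read.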

These new formats undeniably have a point, but what would the ecosystem
gain by adding a Parquet V3?

On Tue, May 14, 2024, 12:48 Julien Le Dem <jul...@apache.org> wrote:

> +1 on Micah starting a doc and following up by commenting in it.
>
> @Raphael, Wish Maple: agreed that changing the metadata representation is
> less important. Most engines can externalize and index metadata in some
> way. It is an option to propose a standard way to do it without changing
> the format. Adding new encodings or making existing encodings more
> parallelizable is something that needs to be in the format and is more
> useful.
>
> On Tue, May 14, 2024 at 9:26 AM Antoine Pitrou <anto...@python.org> wrote:
>
> > On Mon, 13 May 2024 16:10:24 +0100
> > Raphael Taylor-Davies
> > <r.taylordav...@googlemail.com.INVALID>
> > wrote:
> > >
> > > I guess I wonder if rather than having a parquet format version 2, or
> > > even a parquet format version 3, we could just document what features a
> > > given parquet implementation actually supports. I believe Andrew
> > > intends to pick up on where previous efforts here left off.
> >
> > I also believe documenting implementation status is strongly desirable,
> > regardless of whether the discussion on "V3" goes anywhere.
> >
> > Regards
> >
> > Antoine.
> >
> >
> >
>
