> I would hazard that simply storing statistics separately might
> be sufficient for the wide column use-cases, without requiring
> switching to something like flatbuffers?

I agree with Raphael. Column chunks and pages can be referenced by
offset and length. To avoid compatibility issues, we could duplicate the
ColumnMetaData and store the copy separately somewhere in the file to
facilitate random access.
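
To make the idea concrete, here is a rough sketch in Python (purely
illustrative; the structure and all names are made up, not a format
proposal) of a side index that duplicates per-column-chunk metadata so
a reader can seek straight to one chunk without decoding the whole footer:

from dataclasses import dataclass
from typing import BinaryIO, Dict, Tuple

# Hypothetical side index: duplicates per-column-chunk metadata so a reader
# can locate one column chunk without decoding the full Thrift footer.
@dataclass
class ColumnChunkRef:
    file_offset: int         # absolute offset of the column chunk in the file
    compressed_length: int   # total compressed size of the chunk in bytes
    column_metadata: bytes   # duplicated, Thrift-encoded ColumnMetaData

@dataclass
class SideIndex:
    # keyed by (row_group_index, column_index)
    chunks: Dict[Tuple[int, int], ColumnChunkRef]

def read_column_chunk(f: BinaryIO, index: SideIndex,
                      row_group: int, column: int) -> bytes:
    # Fetch the raw bytes of a single column chunk using only the side index.
    ref = index.chunks[(row_group, column)]
    f.seek(ref.file_offset)
    return f.read(ref.compressed_length)

Because each entry is self-contained, a reader could fetch only the
entries it actually needs.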

> For many applications (again ML workloads), people want
> fast random access to smaller chunks of data. Think
> single-digit MB. Parsing the Thrift metadata for the
> ENTIRE file every time you want to read a row group is
> going to be way too expensive.

Small row group sizes do lead to high overhead from decoding the Thrift
metadata. With the help of the page index, we can always have random
access to pages (provided page sizes and boundaries are well chosen).
We can also duplicate the column metadata in a splittable manner, as
mentioned above, to avoid parsing the full metadata.
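
To see how quickly the footer grows, one can write a wide table with
small row groups and check the serialized metadata size with pyarrow
(a rough illustration; the column count, row group size, and file name
are arbitrary):

import pyarrow as pa
import pyarrow.parquet as pq

# Write a wide table with deliberately small row groups, then inspect how
# large the Thrift footer becomes -- this is what a reader must decode even
# if it only needs a single row group.
table = pa.table({f"col{i}": list(range(10_000)) for i in range(200)})
pq.write_table(table, "example.parquet", row_group_size=1_000)

meta = pq.ParquetFile("example.parquet").metadata
print("row groups:", meta.num_row_groups)
print("columns:", meta.num_columns)
print("serialized footer size (bytes):", meta.serialized_size)

With 200 columns split into 10 row groups, the footer already carries
2,000 ColumnMetaData entries, and a reader pays for all of them even
when it needs only one row group.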

Best,
Gang

On Wed, May 15, 2024 at 3:20 AM Martin Loncaric <m.w.lonca...@gmail.com>
wrote:

> I think Parquet's metadata and encoding/compression setup are problematic,
> but I don't see a reason to make Parquet V3 if it's just going to be
> another BtrBlocks or Nimble look-alike.
>
> Some people in the thread have expressed the view that Parquet's metadata
> is fine, and that people can achieve good performance with it. I disagree:
>
> * First of all, many applications cannot persist an in-memory index of the
> metadata. That's only really possible in data warehouses. ML training and
> batch processing jobs need to start loading data cold. For large jobs, each
> worker will process only a small fraction of the data, so caching will be
> useless and hog memory. If your solution is for the user to produce a
> leaner index file themselves, that's not really a solution.
> * Lots of data stores are backed by SSD instead of HDDs, and the website
> recommendation of 512-1024MiB per row group is not always appealing. For
> many applications (again ML workloads), people want fast random access to
> smaller chunks of data. Think single-digit MB. Parsing the Thrift metadata
> for the ENTIRE file every time you want to read a row group is going to be
> way too expensive.
>
> These new formats undeniably have a point, but what would the ecosystem
> gain by adding a Parquet V3?
>
> On Tue, May 14, 2024, 12:48 Julien Le Dem <jul...@apache.org> wrote:
>
> > +1 on Micah starting a doc and following up by commenting in it.
> >
> > @Raphael, Wish Maple: agreed that changing the metadata representation is
> > less important. Most engines can externalize and index metadata in some
> > way. It is an option to propose a standard way to do it without changing
> > the format. Adding new encodings or making existing encodings more
> > parallelizable is something that needs to be in the format and is more
> > useful.
> >
> > > On Tue, May 14, 2024 at 9:26 AM Antoine Pitrou <anto...@python.org>
> > > wrote:
> >
> > > On Mon, 13 May 2024 16:10:24 +0100
> > > Raphael Taylor-Davies
> > > <r.taylordav...@googlemail.com.INVALID>
> > > wrote:
> > > >
> > > > I guess I wonder if rather than having a parquet format version 2, or
> > > > even a parquet format version 3, we could just document what features a
> > > > given parquet implementation actually supports. I believe Andrew intends
> > > > to pick up on where previous efforts here left off.
> > >
> > > I also believe documenting implementation status is strongly desirable,
> > > regardless of whether the discussion on "V3" goes anywhere.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > >
> >
>