I think Parquet's metadata and encoding/compression setup are problematic, but I don't see a reason to make Parquet V3 if it's just going to be another BtrBlocks or Nimble look-alike.
Some people in the thread have expressed the view that Parquet's metadata is fine, and that people can achieve good performance with it. I disagree:

* First of all, many applications cannot persist an in-memory index of the metadata. That's only really possible in data warehouses. ML training and batch processing jobs need to start loading data cold. For large jobs, each worker will process only a small fraction of the data, so caching will be useless and hog memory. If your solution is for the user to produce a leaner index file themselves, that's not really a solution.
* Lots of data stores are backed by SSDs instead of HDDs, and the website recommendation of 512-1024MiB per row group is not always appealing. For many applications (again ML workloads), people want fast random access to smaller chunks of data. Think single-digit MB. Parsing the Thrift metadata for the ENTIRE file every time you want to read a row group is going to be way too expensive.

These new formats undeniably have a point, but what would the ecosystem gain by adding a Parquet V3?

On Tue, May 14, 2024, 12:48 Julien Le Dem <jul...@apache.org> wrote:

> +1 on Micah starting a doc and following up by commenting in it.
>
> @Raphael, Wish Maple: agreed that changing the metadata representation is
> less important. Most engines can externalize and index metadata in some
> way. It is an option to propose a standard way to do it without changing
> the format. Adding new encodings or make existing encodings more
> parallelizable is something that needs to be in the format and more useful.
>
> On Tue, May 14, 2024 at 9:26 AM Antoine Pitrou <anto...@python.org> wrote:
>
> > On Mon, 13 May 2024 16:10:24 +0100
> > Raphael Taylor-Davies
> > <r.taylordav...@googlemail.com.INVALID>
> > wrote:
> >
> > > I guess I wonder if rather than having a parquet format version 2, or
> > > even a parquet format version 3, we could just document what features a
> > > given parquet implementation actually supports. I believe Andrew intends
> > > to pick up on where previous efforts here left off.
> >
> > I also believe documenting implementation status is strongly desirable,
> > regardless of whether the discussion on "V3" goes anywhere.
> >
> > Regards
> >
> > Antoine.
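
To make the footer-parsing cost in my second point a bit more concrete, here is a minimal Python sketch. It is not taken from any Parquet library, and the names `footer_length` and `footer_overhead` are made up for illustration. It relies only on the file layout, where an unencrypted file ends with a 4-byte little-endian footer length followed by the `PAR1` magic. It reports how many footer bytes a cold reader has to fetch and decode before it can locate any row group, relative to the size of the row group it actually wants.

```python
import os
import struct

MAGIC = b"PAR1"  # trailing magic of an unencrypted Parquet file


def footer_length(path):
    """Return the size in bytes of the Thrift-encoded footer of a Parquet file."""
    with open(path, "rb") as f:
        # The last 8 bytes are: 4-byte little-endian footer length, then the magic.
        f.seek(-8, os.SEEK_END)
        tail = f.read(8)
    if len(tail) != 8 or tail[4:] != MAGIC:
        raise ValueError("not an unencrypted Parquet file (missing trailing PAR1)")
    return struct.unpack("<I", tail[:4])[0]


def footer_overhead(path, row_group_bytes):
    """Footer bytes decoded per cold row-group read, as a fraction of the payload.

    row_group_bytes is an assumption supplied by the caller: the size of the
    single row group the reader actually wants.
    """
    return footer_length(path) / row_group_bytes
```

With large row groups this fraction is negligible. As row groups shrink toward single-digit MB, the footer describes more of them and the payload per read gets smaller, so the fraction grows from both sides.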