Hi all,

Just to add some of my perspective (I'd also like to write up some longer-form thoughts, since I've been collaborating and talking with the Nimble and Lance folks -- as a result I know a lot about the details of Nimble, BtrBlocks, and also the recent Bullion research format from UMD/ByteDance -- and I've been consulting/advising on some of the research work that's been referenced).
Firstly, I 100% agree that documenting implementation support, implementation details, and cross-compatibility is essential. It would have been better for Parquet to have integration tests between Impala and parquet-mr from day one, but this never happened, and so there was some initial impedance mismatch between the two halves of the early Parquet community. When I started working on Parquet in 2015, the motivation was mainly to fill the urgent need to be able to read these files from C++ for use in Python (and eventually R and other C++-consuming languages).

As far as the issues in Parquet:

- The all-or-nothing footer decoding for datasets with large schemas or many row groups has always been problematic (I've been asked to present quantitative evidence to support this "problematic" statement, so I will try to produce some!). So I think any work that does not make it much cheaper to read a single column from a single row group is very nearly dead on arrival. (There is a small sketch below illustrating this cost profile.) I am not sure how you fully make this problem go away in general without doing away with Thrift at the footer level, but at that point you are making such a disruptive change that why not try to fix some other problems as well? If you go down that rabbit hole, you have created a new file format that is no longer Parquet, and so calling it ParquetV3 is probably misleading.

- Parquet's data page format has worked well over time, but aside from fixing the metadata overhead issue, the data page itself needs to be extensible. There is DATA_PAGE_V2, but structurally it is the same as DATA_PAGE{_V1} with the repetition and definition levels kept outside of the compressed portion. You can think of Parquet's data page structure as one possible choice of options in a general-purpose nested encoding scheme (most implementations do dictionary+RLE and fall back on plain encoding when the dictionary exceeds a certain size; see the sketch below). We could create a DATA_PAGE_V3 that allows for a whole alternate -- and even pluggable -- encoding scheme without changing the metadata, and this would be valuable to the Parquet community, even if most mainstream Parquet users (e.g. Spark) opt not to use it for a period of some years for compatibility reasons.

- Another problem that I haven't seen mentioned (but maybe I just missed it) is that Parquet is very painful to decode on accelerators like GPUs. RAPIDS has created a CUDA implementation of Parquet decoding (including decoding the Thrift data page headers on the GPU), but there are two primary problems: 1) there is metadata within the ColumnChunk in the row group that is necessary for control flow on the host side, and 2) there are not sufficient memory preallocation hints, i.e. how much memory you need to allocate to fully decode a data page (a sketch of what such hints could look like is below). This is also discussed in https://github.com/facebookincubator/nimble/discussions/50

Personally, I struggle to see how the metadata issues are fixable -- at least in a satisfactory fashion where we could get behind calling something ParquetV3, when it would basically be a new file format masquerading as a major version of an existing file format. It also adds a lot of implementation complexity for anyone setting out to support "Parquet".

I think there is significant value in developing and researching accelerated "codecs" (basically, new data page formats -- think about how H.264 and H.265 have superseded MPEG-2 in video encoding) and finding a way to incorporate them into Parquet, e.g. with a new DATA_PAGE_V3 page type or similar.
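To make the footer point concrete, here is a minimal sketch using pyarrow (the file name and column name are made up): even a read of one column from one row group pays for deserializing the entire Thrift FileMetaData up front, and that cost scales with row groups times columns rather than with what you actually read.

    import pyarrow.parquet as pq

    # Parses the ENTIRE Thrift-encoded footer: every row group, every column.
    md = pq.read_metadata("wide_table.parquet")
    print(md.num_row_groups, md.num_columns, md.serialized_size)

    # Even this narrow read pays the full footer-decoding cost first.
    table = pq.ParquetFile("wide_table.parquet").read_row_group(0, columns=["x"])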
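And a rough, simplified sketch of the dictionary-with-fallback write path mentioned above. This is illustrative Python only, not any real writer's code; actual implementations track encoded byte sizes and emit RLE/bit-packed index runs rather than plain lists.

    def encode_column_chunk(values, max_dict_entries=65536):
        """Simplified model of the common write path: build a dictionary and
        emit indices, falling back to PLAIN when the dictionary gets too big."""
        dictionary = {}
        indices = []
        for v in values:
            if v not in dictionary:
                if len(dictionary) >= max_dict_entries:
                    # Dictionary exceeded its budget: real writers switch the
                    # remaining pages of the chunk to PLAIN encoding.
                    return ("PLAIN", list(values))
                dictionary[v] = len(dictionary)
            indices.append(dictionary[v])
        # Real writers RLE/bit-pack these indices into a DATA_PAGE and put the
        # dictionary values in a separate DICTIONARY_PAGE.
        return ("RLE_DICTIONARY", list(dictionary), indices)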
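Finally, a hypothetical sketch of the kind of per-page preallocation hints that would help GPU decoders. None of these fields exist in Parquet's page headers today except the value count; the names are made up for illustration.

    from dataclasses import dataclass

    @dataclass
    class PageDecodeHints:
        """Hypothetical per-page metadata that would let a GPU kernel size its
        output buffers before launching the decode, with no host round trip."""
        num_values: int                 # already present in today's page headers
        decoded_values_bytes: int       # missing today: exact decoded data size
        decoded_def_levels_bytes: int   # missing today: decoded definition levels size
        decoded_rep_levels_bytes: int   # missing today: decoded repetition levels size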
It would be ideal for Parquet and its implementations to continue to improve. That said, it's unclear that Parquet as a file container for encoded data can be evolved to satisfactorily resolve all of the above issues, and I don't think it needs to. It seems inevitable that we will end up with new file containers and implementations, but the ideal scenario would be to develop reusable "codec" libraries (like the nested encoding scheme in Nimble or in BtrBlocks -- they're very similar) and then use them in multiple places.

Anyway, it's good to see many opinions on this and I look forward to continued dialogue.

Thanks
Wes

On Wed, May 15, 2024 at 7:56 AM Steve Loughran <ste...@cloudera.com.invalid> wrote:

> On Tue, 14 May 2024 at 17:48, Julien Le Dem <jul...@apache.org> wrote:
>
> > +1 on Micah starting a doc and following up by commenting in it.
> > +maybe some conf call where people of interest can talk about it.
> >
> > @Raphael, Wish Maple: agreed that changing the metadata representation is
> > less important. Most engines can externalize and index metadata in some
> > way.
>
> works if queries against specific tables are always routed to those
> servers, the indices fit in memory and the servers stay up. once things
> become more agile that doesn't hold any more.
>
> This is why I've not investigated the idea of having the filesystem
> connector (s3a, abfs...) cache footers to local fs across multiple
> streams/between opening files, even as they now all move to support some
> form of footer caching to boost ORC/Parquet performance for apps which
> seek to the end repeatedly. The larger the worker pool: lower probability
> of reuse; the more files you have the more space any caching takes up.
>
> > It is an option to propose a standard way to do it without changing
> > the format.
>
> +1
>
> > Adding new encodings or make existing encodings more parallelizable is
> > something that needs to be in the format and more useful.
>
> One of the things I'd like to see from Micah's work is some list of what
> new data types and encodings people think are needed.
>
> > On Tue, May 14, 2024 at 9:26 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> > > On Mon, 13 May 2024 16:10:24 +0100
> > > Raphael Taylor-Davies <r.taylordav...@googlemail.com.INVALID> wrote:
> > > >
> > > > I guess I wonder if rather than having a parquet format version 2, or
> > > > even a parquet format version 3, we could just document what features
> > > > a given parquet implementation actually supports. I believe Andrew
> > > > intends to pick up on where previous efforts here left off.
> > >
> > > I also believe documenting implementation status is strongly desirable,
> > > regardless of whether the discussion on "V3" goes anywhere.
> > >
> > > Regards
> > >
> > > Antoine.