As the author of one of these new formats I'll chime in. The main issues I have with parquet are:

A. Pages in a column chunk must be contiguous (this is Lance's biggest issue with parquet)
B. Encodings should be extensible
C. Flexibility in what is considered data / metadata

I outline my reasoning for these in [1], so I'll avoid repeating it here. I think B has been discussed pretty thoroughly in this thread.

As for C, the format simply needs to be flexible, and then the choice is pretty straightforward. If a file is likely to be used for "search" (very selective filters, ability to cache, etc.), then lots of data should go in the column metadata. If the file is mostly for cold full scans, then almost nothing should go in the column metadata (either don't write the metadata at all or, I guess, put it in the data pages). The format shouldn't force a choice.

Personally, I am more excited about A than I am about B & C (though I do think both B & C should be addressed if we're going through the trouble of a new format). Addressing A lets us get rid of row groups, allows for APIs such as "array-at-a-time writing", lets us make large data pages, and generally leads to more foolproof files.
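To make "array-at-a-time writing" concrete, here is a rough sketch of what such a write path could look like once pages of one column no longer have to be contiguous. Everything in it is a hypothetical stand-in (the ArrayAtATimeWriter class, its trivial int64 encoding, and its footer layout); it is not the Lance v2 or Parquet writer API, just an illustration of why no row-group-sized buffer is needed:

import io
import json
import struct

class ArrayAtATimeWriter:
    """Hypothetical sketch only: assumes a layout where pages of one column
    need not be contiguous, so every array can be flushed as soon as it is
    handed to the writer instead of being buffered into a row group."""

    def __init__(self, sink):
        self.sink = sink   # any binary sink with write()/tell()
        self.pages = []    # (column name, file offset, byte length)

    def write_array(self, column, values):
        # Stand-in "encoding": plain little-endian int64s. A real format would
        # apply bit-packing, dictionary encoding, compression, statistics, etc.
        buf = struct.pack(f"<{len(values)}q", *values)
        offset = self.sink.tell()
        self.sink.write(buf)
        self.pages.append((column, offset, len(buf)))

    def finish(self):
        # Page locations (plus, in a real format, statistics and encodings) go
        # into a footer written once at the end, so readers can locate every
        # page even though pages of different columns are interleaved.
        footer = json.dumps(self.pages).encode()
        self.sink.write(footer)
        self.sink.write(struct.pack("<I", len(footer)))

# Usage: the second page of "x" lands after a page of "y"; a row-group-based
# writer would have had to hold both columns in memory until the group closed.
sink = io.BytesIO()
w = ArrayAtATimeWriter(sink)
w.write_array("x", [1, 2, 3])
w.write_array("y", [10, 20, 30])
w.write_array("x", [4, 5, 6])
w.finish()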
I agree with Andrew that any discussion of B & C is usually based on assumptions rather than concrete measurements of reader performance. In the scattered profiling I've done of parquet-cpp and parquet-rs, I've found that poor parquet reader performance typically has very little to do with B & C.

Actually, I would guess that the most widespread (though not necessarily most important) obstacle to parquet has been user knowledge. To get the best performance from a reader, users need to be familiar not just with the format but also with the features available in a particular reader. I think simplifying the user experience should be a secondary goal for any new changes.

At the risk of arrogant self-promotion, I would recommend people read [1] for inspiration if nothing else. I'm also hoping to detail the design decisions and tradeoffs that we come across (starting in [2] and continuing throughout the summer).

[1] https://blog.lancedb.com/lance-v2/
[2] https://blog.lancedb.com/file-readers-in-depth-parallelism-without-row-groups/

On Mon, May 20, 2024 at 11:06 AM Parth Chandra <par...@apache.org> wrote:

> Hi Parquet team,
>
> It is very exciting to see this effort. Thanks Micah for starting this.
>
> For most use cases that our team sees, the broad areas for improvement appear to be:
> 1) Optimizing for cloud storage (latency is high, seeks are expensive)
> 2) Optimized metadata reading - we've seen 30% (sometimes more) of Spark's scan operator time spent in reading footers.
> 3) Anything that improves support for data lakes.
>
> Also, I'll be happy to help wherever I can.
>
> Parth
>
> On Sun, May 19, 2024 at 10:59 AM Xinli shang <sha...@uber.com.invalid> wrote:
>
> > Sorry I am late to the party! It's great to see this discussion! Thank you everyone for the many good points, and thank you, Micah, for starting the discussion and putting it together into a document, which is very helpful! I agree with most of the points we discussed above, and we need to keep improving Parquet, and sometimes even speed up, to catch up with industry changes.
> >
> > With that said, we need people to work on it, as Julien mentioned. The document <https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit> that Micah created covers pretty much everything we discussed here.
> > I encourage all of us to contribute by raising questions, providing suggestions, adding missing functionality, etc. Once we reach a consensus on each topic, we can create different tracks and work streams to kick off the implementations.
> >
> > I believe continuously improving Parquet would benefit the industry more than creating a new format, which could add friction. These improvement ideas are exciting opportunities. If you, your team members, or friends have time and interest, please encourage them to contribute.
> >
> > Our Parquet community meeting is next week, on May 28, 2024. We can have discussions there if you can join. Currently, it is scheduled for 7:00 am PDT, but I can change it according to the majority's availability.
> >
> > On Fri, May 17, 2024 at 3:58 PM Rok Mihevc <rok.mih...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > I've discussed with my colleagues and we would dedicate two engineers for 4-6 months to tasks related to implementing the format changes. We're already active in the design discussions and can help with the C++, Rust and C# implementations. I thought it'd be good to state this explicitly, FWIW.
> > >
> > > Our main areas of interest are efficient reads for tables with wide schemas and faster random row group access [1].
> > >
> > > To work around the wide-schema issue we actually implemented an internal tool [2] for storing index information in a separate file, which allows for reading only the necessary subset of metadata. We would offer this for consideration as a possible approach to solving the wide-schema problem.
> > >
> > > [1] https://github.com/apache/arrow/issues/39676
> > > [2] https://github.com/G-Research/PalletJack
> > >
> > > Rok
> > >
> > > On Sun, May 12, 2024 at 12:59 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > >
> > > > Hi Parquet Dev,
> > > > I wanted to start a conversation within the community about working on a new revision of Parquet. For context, there have been a bunch of new formats [1][2][3] that show there is decent room for improvement across data encodings and how metadata is organized.
> > > >
> > > > Specifically, in a new format revision I think we should be thinking about the following areas for improvement:
> > > > 1. More efficient encodings that allow for data skipping and SIMD optimizations.
> > > > 2. More efficient metadata handling for deserialization and projection, to address cases where metadata deserialization time is not trivial [4].
> > > > 3. Possibly thinking about different encodings instead of repetition/definition levels for repeated and nested fields.
> > > > 4. Support for optimizing semi-structured data (e.g. JSON or Variant type) that can shred elements into individual columns (a recent thread in Iceberg mentions doing this at the metadata level [5]).
> > > >
> > > > I think the goals of V3 would be to provide existing API compatibility as broadly as possible (possibly with some performance loss) and expose new API surface areas where appropriate to make use of new elements. New encodings could be backported so they can be used without metadata changes. I think, unfortunately, that for points 2 and 3 we would want to break file-level compatibility.
> > > > More thought would be needed to consider whether 4 could be backported effectively.
> > > >
> > > > This is a non-trivial amount of work to get good coverage across implementations, so before putting together a more formal proposal it would be nice to know:
> > > >
> > > > 1. If there is an appetite in the general community to consider these changes
> > > > 2. If anybody from the community is interested in collaborating on proposals/implementation in this area.
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > [1] https://github.com/maxi-k/btrblocks
> > > > [2] https://github.com/facebookincubator/nimble
> > > > [3] https://blog.lancedb.com/lance-v2/
> > > > [4] https://github.com/apache/arrow/issues/39676
> > > > [5] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34
> > > >
> >
> > --
> > Xinli Shang
> >