> My point is that one of these (compression chunk size) is a file format concern and the other one (encoding size) is an encoding concern.
Slight typo :) I mean "page size" is a file format concern and "compression chunk size" is an encoding concern.

On Tue, May 21, 2024 at 6:40 PM Weston Pace <weston.p...@gmail.com> wrote:

> Thank you for your questions! I think your understanding is very solid.
>
> > Do I understand correctly that you basically replace row groups with files. Thus, the task for reading row groups in parallel boils down to reading files in parallel.
>
> Partly. I recommend files for inter-process parallelism (e.g. distributing a query across multiple worker processes).
>
> > Your post does *not* claim that the new format would be able to parallelize *inside* a row group/file, correct?
>
> We do parallelize inside a file. Imagine a simple file with 3 int32 columns, 2 million rows per column, and one 8MiB page per column.
>
> First we would issue 3 parallel I/O requests to read those 3 pages. Once those three requests complete we then create decode tasks. I tend to think about things in "query engine" terms. Query engines often process data in relatively small batches. 10k rows is a common batch size (probably because 10k TPC-H rows fit nicely inside CPU caches). For example, duckdb uses either 10k or 2k (I think 10k for parquet and 2k for its internal format, or maybe it breaks the 10k into 2k-sized chunks). Datafusion uses 8k chunks. Acero uses 32k, but that is probably too big.
>
> Since we have 2 million rows we can create 200 thread tasks, each with 10k rows, which gives us plenty of parallelism. The first thing these thread tasks do is decode the data. Larger files with more complex columns and many pages follow pretty much the same strategy (but it gets quite complicated).
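To make that scheduling concrete, here is a minimal sketch of the decode-based parallelism described above, using only the Rust standard library. read_page and decode_batch are stand-ins invented for illustration, not actual Lance or Parquet reader APIs, and a real reader would hand the tasks to a thread pool rather than spawning raw threads:

use std::sync::Arc;
use std::thread;

const ROWS_PER_COLUMN: usize = 2_000_000;
const BATCH_SIZE: usize = 10_000; // "query engine" sized batches -> 200 decode tasks

// Stand-in for reading one ~8MiB page of plain-encoded int32 values from storage.
fn read_page(_column: usize) -> Vec<u8> {
    vec![0u8; ROWS_PER_COLUMN * 4]
}

// Decode BATCH_SIZE rows starting at `row_offset` from each column's page.
fn decode_batch(pages: &[Vec<u8>], row_offset: usize) -> Vec<Vec<i32>> {
    pages
        .iter()
        .map(|page| {
            page[row_offset * 4..(row_offset + BATCH_SIZE) * 4]
                .chunks_exact(4)
                .map(|b| i32::from_le_bytes([b[0], b[1], b[2], b[3]]))
                .collect()
        })
        .collect()
}

fn main() {
    // Step 1: issue the three page reads in parallel, one per column.
    let io_handles: Vec<_> = (0..3).map(|col| thread::spawn(move || read_page(col))).collect();
    let pages: Arc<Vec<Vec<u8>>> = Arc::new(io_handles.into_iter().map(|h| h.join().unwrap()).collect());

    // Step 2: once the pages have arrived, fan out 200 decode tasks of 10k rows each.
    let decode_handles: Vec<_> = (0..ROWS_PER_COLUMN / BATCH_SIZE)
        .map(|batch| {
            let pages = Arc::clone(&pages);
            thread::spawn(move || decode_batch(&pages, batch * BATCH_SIZE))
        })
        .collect();

    for handle in decode_handles {
        let _decoded_batch = handle.join().unwrap(); // each batch would flow into the query engine here
    }
}

The file layout only has to guarantee that each column's page can be fetched in one large request; after that, the available parallelism comes from the decode step, not from row group boundaries.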
>
> > As such, couldn't you do the same "Decode Based Parallelism" also with Parquet as it is today?
>
> Yes, absolutely, and I would recommend it.
>
> > I do not fully understand what the proposed parallelism has to do with the file format.
>
> Nothing, you are correct. In my blog posts I am generally discussing "designing a high performing file reader" and not "designing a new file format". My concerns with the parquet format are only the A/B/C points I mentioned in my previous email.
>
> > So all in all, do I see correctly that your main argument here basically is "don't force pages to be contiguous!". Doing away with row groups is just an added bonus for easier maintenance, as you can just use files instead of row groups.
>
> Absolutely. As long as pages don't have to be contiguous then I am happy. I'd simply always create parquet files with 1 row group.
>
> > As such, it seems that the two grains that Parquet has benefit us, as they give us flexibility of both being able to scan with large requests and doing point accesses without too much read amplification by using small single-page requests.
>
> Again, you are spot on. Both grains are needed. My point is that one of these (compression chunk size) is a file format concern and the other one (encoding size) is an encoding concern. In fact, there is a third grain, which is the size used for zone maps in pushdown indices. There may be many other grains too (e.g. run end encoded arrays need skip tables for point lookups). Parquet forces all of these to be the same value (page size).
>
> The solution to this problem in Lance (if you want to do some kind of block compression) is to create a single 8MB page. That page would consist of many compressed chunks stored contiguously. If the compression has variable sized blocks (not sure if this is a thing, I don't know compression well) then you need to store block sizes in a column metadata buffer. If the blocks are fixed size then you need to store how many rows are in each block in a column metadata buffer. A compression encoder can then pick whatever compression chunk size makes the most sense.
>
> So, for example, you could have 64KB compression chunks, zone maps for every 1024 rows, and 8MB pages for I/O.
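A sketch of the per-column metadata that the "one large page, many compression chunks" idea implies; the struct and field names are invented for illustration and are not the actual Lance column metadata layout:

/// Column metadata buffer for a single large (e.g. 8MB) page that holds many
/// independently compressed chunks laid out back to back.
struct ChunkedPageMeta {
    /// Byte offset of each chunk within the page (supports variable-sized
    /// compressed blocks); chunk i spans chunk_offsets[i]..chunk_offsets[i + 1].
    chunk_offsets: Vec<u64>,
    /// Number of rows stored in each chunk.
    chunk_row_counts: Vec<u64>,
}

impl ChunkedPageMeta {
    /// For a point lookup, find the byte range to decompress and the row's
    /// position inside that chunk, without touching the rest of the page.
    fn chunk_for_row(&self, row: u64) -> Option<(std::ops::Range<u64>, u64)> {
        let mut rows_seen = 0u64;
        for (i, rows_in_chunk) in self.chunk_row_counts.iter().enumerate() {
            if row < rows_seen + rows_in_chunk {
                let bytes = self.chunk_offsets[i]..self.chunk_offsets[i + 1];
                return Some((bytes, row - rows_seen));
            }
            rows_seen += rows_in_chunk;
        }
        None
    }
}

fn main() {
    // Two 64KB chunks of 16,384 int32 rows each.
    let meta = ChunkedPageMeta {
        chunk_offsets: vec![0, 65_536, 131_072],
        chunk_row_counts: vec![16_384, 16_384],
    };
    // Row 20,000 lives in the second chunk: decompress bytes 65_536..131_072 only.
    assert_eq!(meta.chunk_for_row(20_000), Some((65_536..131_072, 3_616)));
}

A scan still fetches the whole 8MB page with a single request, while a point lookup only has to decompress the one ~64KB chunk that contains the target row.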
>
> On Tue, May 21, 2024 at 2:40 PM Jan Finis <jpfi...@gmail.com> wrote:
>
>> Thanks Weston for posting here!
>>
>> I appreciate this a lot, as it gives us the opportunity to discuss modern formats in depth with the authors themselves, who probably know the design trade-offs they took best and thus can give us a deeper understanding of what certain features would mean for Parquet.
>>
>> I read both your linked posts. I read them with the mindset as if they were the documentation for a file format that I myself would need to add to our engine, so I always double-checked whether I would agree with your reasoning and where I would see problems in the implementation.
>>
>> I ended up with some points where I cannot yet follow your reasoning, or where I feel clarification would be good. It would be nice if you could go a bit into detail here:
>>
>> Regarding your "parallelism without row groups" post [2]:
>>
>> 1. Do I understand correctly that you basically replace row groups with files. Thus, the task for reading row groups in parallel boils down to reading files in parallel. Your post does *not* claim that the new format would be able to parallelize *inside* a row group/file, correct?
>>
>> 2. I do not fully understand what the proposed parallelism has to do with the file format. As you mention yourself, files and row groups are basically the same thing. As such, couldn't you do the same "Decode Based Parallelism" also with Parquet as it is today? E.g., the file reader in our engine looks basically exactly like what you propose, employing what you call Mini Batches and not reading a whole row group at once (which could lead to out of memory in case a row group contains an insane amount of rows, so it is a big no-no anyway for us). It seems that the shortcomings of the code listed in "Our First Parallel File Reader" are solely shortcomings of that code, not of the underlying format.
>>
>> Regarding [1]:
>>
>> 3. This one is mostly about understanding your rationales:
>>
>> As one main argument for abolishing row groups, you mention that sizing them well is hard (I fully agree!). But since you replace row groups with files, don't you have the same problem for the file again? Small row groups/files are bad due to small I/O requests and metadata explosion, agree! So let's use bigger ones. Here you argue that Parquet readers will load the whole row group into memory and therefore suffer memory issues. This is a strawman IMHO, as this is just a shortcoming of the reader, not of the format. Nothing in the Parquet spec forces a reader to read a row group at once (and in fact, our implementation doesn't do this for exactly the reasons you mentioned). Just like in LanceV2, Parquet readers can opt to read only a few pages ahead of the decoding.
>>
>> On the writing side, I see your point that a Lance V2 writer never has to buffer more than a page and this is great! However, this seems to be just a result of allowing pages to not be contiguous, not of the fact that row groups were abolished. You could still support multiple row groups with non-contiguous pages and reap all the benefits you mention. Your post intermingles the two design choices "contiguous pages yes/no" and "row groups as horizontal partitions within a file yes/no". I would argue that the two features are basically fully orthogonal. You can have one without the other and vice versa.
>>
>> So all in all, do I see correctly that your main argument here basically is "don't force pages to be contiguous!". Doing away with row groups is just an added bonus for easier maintenance, as you can just use files instead of row groups.
>>
>> 4. Considering contiguous pages and I/O granularity:
>>
>> The format basically proposes to have pages as the only granularity below a file (+ metadata & footer), while Parquet has two granularities: Row group, or rather Column Chunk, and Page. You argue that a page in Lance V2 should basically be as big as is necessary for good I/O performance (say, 8 MiB for Amazon S3). Thus, the Parquet counterpart of a Lance V2 page would actually be - at least in terms of I/O efficiency - a Parquet Column Chunk. A Parquet page can instead be quite small, as it does not need to be the grain of the I/O but just the grain of the encoding.
>>
>> The fact that Parquet has these two grains has advantages when considering a scan vs. a point look-up. When doing a scan, we can load whole column chunks at once, having large I/O requests to not overwhelm the I/O with too many requests. When doing a point access, we can use the page & offset index to find and load only the one page (per column) in which the row we are looking for is located.
>>
>> As such, it seems that the two grains that Parquet has benefit us, as they give us flexibility of both being able to scan with large requests and doing point accesses without too much read amplification by using small single-page requests. With Lance V2, either I make large pages to make scans take fewer I/O requests (e.g., 8 MiB), but then I will have large read amplification for point accesses, or I make my pages quite small to benefit point accesses, but then scans will need to emit tons of I/O operations, which is what you are trying to avoid. How does Lance V2 solve this challenge? Or did I understand the format wrong here?
>>
>> Cheers,
>> Jan
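The trade-off in point 4 can be made concrete with the offset index. The sketch below uses a hand-rolled stand-in for the per-page locations (Parquet's offset index records the same offset / compressed size / first row information, but the real types live in the Thrift metadata and the reader APIs): a scan turns into one large request per column chunk, a point lookup into a single small one.

// `PageLoc` mirrors the information in Parquet's offset index (page offset,
// compressed size, first row index) but is a hand-rolled stand-in, not the
// parquet-rs / parquet-cpp API.
struct PageLoc {
    offset: u64,
    compressed_page_size: u64,
    first_row_index: u64,
}

// A scan reads the whole column chunk with one large request.
fn scan_range(pages: &[PageLoc]) -> std::ops::Range<u64> {
    let first = pages.first().expect("column chunk has at least one page");
    let last = pages.last().unwrap();
    first.offset..last.offset + last.compressed_page_size
}

// A point access reads only the single page that contains `row`
// (assumes `row` is not smaller than the first page's first_row_index).
fn point_lookup_range(pages: &[PageLoc], row: u64) -> std::ops::Range<u64> {
    let idx = pages.partition_point(|p| p.first_row_index <= row) - 1;
    let page = &pages[idx];
    page.offset..page.offset + page.compressed_page_size
}

fn main() {
    // A toy column chunk: four 64KB pages of 10k rows each, starting at byte 4096.
    let pages: Vec<PageLoc> = (0..4u64)
        .map(|i| PageLoc {
            offset: 4_096 + i * 65_536,
            compressed_page_size: 65_536,
            first_row_index: i * 10_000,
        })
        .collect();

    assert_eq!(scan_range(&pages), 4_096..266_240);                   // one ~256KB request
    assert_eq!(point_lookup_range(&pages, 25_000), 135_168..200_704); // one 64KB request
}

Weston's reply above keeps the large page as the I/O grain and moves the smaller grains (compression chunks, zone-map intervals) into per-column metadata buffers, which is how the single-grain design aims to serve both access patterns.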
>>
>> On Tue, May 21, 2024 at 6:07 PM Weston Pace <weston.p...@gmail.com> wrote:
>>
>> > As the author of one of these new formats I'll chime in. The main issues I have with parquet are:
>> >
>> > A. Pages in a column chunk must be contiguous (this is Lance's biggest issue with parquet)
>> > B. Encodings should be extensible
>> > C. Flexibility in what is considered data / metadata
>> >
>> > I outline my reasoning for these in [1] and so I'll avoid repeating that here. I think B has been discussed pretty thoroughly in this thread.
>> >
>> > As for C, a format should be flexible, and then it is pretty straightforward. If a file is likely to be used in "search" (very selective filters, ability to cache, etc.) then lots of data should be put in the column metadata. If the file is mostly for cold full scans then almost nothing should go in column metadata (either don't write the metadata at all or, I guess, you can put it in the data pages). The format shouldn't force a choice.
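As a rough illustration of point C, this flexibility amounts to a per-file (or per-column) writer choice like the one sketched below; these option names are hypothetical and do not come from an existing Lance or Parquet API:

/// Where optional per-column statistics (zone maps, bloom filters, ...) end up.
enum StatsPlacement {
    /// "Search" files: keep stats in the column metadata so a reader can prune
    /// pages after one small metadata read.
    ColumnMetadata,
    /// Cold-full-scan files: keep the metadata tiny and inline any stats next
    /// to the data pages.
    WithDataPages,
    /// Don't write stats at all.
    Omitted,
}

struct ColumnWriteOptions {
    page_bytes: usize,                   // the I/O grain, e.g. 8 * 1024 * 1024
    compression_chunk_bytes: usize,      // the encoding grain, e.g. 64 * 1024
    zone_map_interval_rows: Option<u64>, // the pushdown grain, e.g. 1024
    stats_placement: StatsPlacement,
}

fn main() {
    // The same schema written for two very different workloads.
    let _search = ColumnWriteOptions {
        page_bytes: 8 * 1024 * 1024,
        compression_chunk_bytes: 64 * 1024,
        zone_map_interval_rows: Some(1024),
        stats_placement: StatsPlacement::ColumnMetadata,
    };
    let _cold_scan = ColumnWriteOptions {
        page_bytes: 8 * 1024 * 1024,
        compression_chunk_bytes: 64 * 1024,
        zone_map_interval_rows: None,
        stats_placement: StatsPlacement::Omitted,
    };
}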
>> >
>> > Personally, I am more excited about A than I am about B & C (though I do think both B & C should be addressed if going through the trouble of a new format). Addressing A lets us get rid of row groups, allows for APIs such as "array-at-a-time writing", lets us make large data pages, and generally leads to more foolproof files.
>> >
>> > I agree with Andrew that any discussion of B & C is usually based on assumptions rather than concrete measurements of reader performance. In the scattered profiling I've done of parquet-cpp and parquet-rs I've found that poor parquet reader performance typically has very little to do with B & C. Actually, I would guess that the most widespread (though not necessarily most important) obstacle to parquet has been user knowledge. To get the best performance from a reader, users need to be familiar not just with the format but also with the features available in a particular reader. I think simplifying the user experience should be a secondary goal for any new changes.
>> >
>> > At the risk of arrogant self-promotion I would recommend people read [1] for inspiration if nothing else. I'm also hoping to detail design decisions and tradeoffs that we come across (starting in [2] and continuing throughout the summer).
>> >
>> > [1] https://blog.lancedb.com/lance-v2/
>> > [2] https://blog.lancedb.com/file-readers-in-depth-parallelism-without-row-groups/
>> >
>> > On Mon, May 20, 2024 at 11:06 AM Parth Chandra <par...@apache.org> wrote:
>> >
>> > > Hi Parquet team,
>> > >
>> > > It is very exciting to see this effort. Thanks Micah for starting this.
>> > >
>> > > For most use cases that our team sees, the broad areas for improvement appear to be:
>> > > 1) Optimizing for cloud storage (latency is high, seeks are expensive)
>> > > 2) Optimized metadata reading - we've seen 30% (sometimes more) of Spark's scan operator time spent in reading footers.
>> > > 3) Anything that improves support for data lakes.
>> > >
>> > > Also I'll be happy to help wherever I can.
>> > >
>> > > Parth
>> > >
>> > > On Sun, May 19, 2024 at 10:59 AM Xinli shang <sha...@uber.com.invalid> wrote:
>> > >
>> > > > Sorry I am late to the party! It's great to see this discussion! Thank you everyone for the many good points and thank you, Micah, for starting the discussion and putting it together into a document, which is very helpful! I agree with most of the points we discussed above, and we need to improve Parquet and sometimes even speed up to catch up with industry changes.
>> > > >
>> > > > With that said, we need people to work on it, as Julien mentioned. The document <https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit> that Micah created covers pretty much everything we discussed here. I encourage all of us to contribute by raising questions, providing suggestions, adding missing functionality, etc. Once we reach a consensus on each topic, we can create different tracks and working streams to kick off the implementations.
>> > > >
>> > > > I believe continuously improving Parquet would benefit the industry more than creating a new format, which could add friction. These improvement ideas are exciting opportunities. If you, your team members, or friends have time and interest, please encourage them to contribute.
>> > > >
>> > > > Our Parquet community meeting is next week, on May 28, 2024. We can have discussions there if you can join. Currently, it is scheduled for 7:00 am PDT, but I can change it according to the majority's availability.
>> > > >
>> > > > On Fri, May 17, 2024 at 3:58 PM Rok Mihevc <rok.mih...@gmail.com> wrote:
>> > > >
>> > > > > Hi all,
>> > > > >
>> > > > > I've discussed with my colleagues and we would dedicate two engineers for 4-6 months on tasks related to implementing the format changes. We're already active in design discussions and can help with C++, Rust and C# implementations. I thought it'd be good to state this explicitly FWIW.
>> > > > >
>> > > > > Our main areas of interest are efficient reads for tables with wide schemas and faster random rowgroup access [1].
>> > > > >
>> > > > > To work around the wide schemas issue we actually implemented an internal tool [2] for storing index information into a separate file, which allows for reading only the necessary subset of metadata. We would offer this for consideration as a possible approach to solve the wide schema problem.
>> > > > >
>> > > > > [1] https://github.com/apache/arrow/issues/39676
>> > > > > [2] https://github.com/G-Research/PalletJack
>> > > > >
>> > > > > Rok
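A toy illustration of that side-car metadata index idea (the general shape only, not PalletJack's actual layout): record where each column chunk's metadata lives so that a reader with a narrow projection deserializes just those byte ranges instead of the whole footer.

use std::collections::HashMap;
use std::ops::Range;

// Maps each column chunk to the byte range of its metadata inside the footer.
struct MetadataIndex {
    // (row group, column name) -> byte range of that column's metadata.
    column_meta: HashMap<(usize, String), Range<u64>>,
}

impl MetadataIndex {
    /// Byte ranges of footer metadata needed for `columns` in `row_group`.
    fn ranges_for(&self, row_group: usize, columns: &[&str]) -> Vec<Range<u64>> {
        columns
            .iter()
            .filter_map(|c| self.column_meta.get(&(row_group, c.to_string())).cloned())
            .collect()
    }
}

fn main() {
    // Imagine a 10,000-column schema with a multi-megabyte footer; only two
    // index entries are shown here.
    let mut column_meta = HashMap::new();
    column_meta.insert((0, "col_0042".to_string()), 1_000u64..1_250);
    column_meta.insert((0, "col_7731".to_string()), 940_000..940_250);
    let index = MetadataIndex { column_meta };

    // A two-column projection touches ~500 bytes of metadata, not the whole footer.
    let ranges = index.ranges_for(0, &["col_0042", "col_7731"]);
    assert_eq!(ranges.len(), 2);
}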
>> > > > >
>> > > > > On Sun, May 12, 2024 at 12:59 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> > > > >
>> > > > > > Hi Parquet Dev,
>> > > > > > I wanted to start a conversation within the community about working on a new revision of Parquet. For context there have been a bunch of new formats [1][2][3] that show there is decent room for improvement across data encodings and how metadata is organized.
>> > > > > >
>> > > > > > Specifically, in a new format revision I think we should be thinking about the following areas for improvements:
>> > > > > > 1. More efficient encodings that allow for data skipping and SIMD optimizations.
>> > > > > > 2. More efficient metadata handling for deserialization and projection to address areas where metadata deserialization time is not trivial [4].
>> > > > > > 3. Possibly thinking about different encodings instead of repetition/definition for repeated and nested fields.
>> > > > > > 4. Support for optimizing semi-structured data (e.g. JSON or Variant type) that can shred elements into individual columns (a recent thread in Iceberg mentions doing this at the metadata level [5]).
>> > > > > >
>> > > > > > I think the goals of V3 would be to provide existing API compatibility as broadly as possible (possibly with some performance loss) and expose new API surface areas where appropriate to make use of new elements. New encodings could be backported so they can be made use of without metadata changes. I think unfortunately that for points 2 and 3 we would want to break file level compatibility. More thought would be needed to consider whether 4 could be backported effectively.
>> > > > > >
>> > > > > > This is a non-trivial amount of work to get good coverage across implementations, so before putting together a more formal proposal it would be nice to know:
>> > > > > >
>> > > > > > 1. If there is an appetite in the general community to consider these changes
>> > > > > > 2. If anybody from the community is interested in collaborating on proposals/implementation in this area.
>> > > > > >
>> > > > > > Thanks,
>> > > > > > Micah
>> > > > > >
>> > > > > > [1] https://github.com/maxi-k/btrblocks
>> > > > > > [2] https://github.com/facebookincubator/nimble
>> > > > > > [3] https://blog.lancedb.com/lance-v2/
>> > > > > > [4] https://github.com/apache/arrow/issues/39676
>> > > > > > [5] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34
>> > > >
>> > > > --
>> > > > Xinli Shang