Thanks Weston for the answers, very insightful! I appreciate your input very much.
Some follow-ups :)

> My point is that one of these (compression chunk size) is a file format
> concern and the other one (encoding size) is an encoding concern.

The fact that Lance V2 can still have multiple grains is interesting and was so far not clear in our discussion, thanks for the clarification! As you describe, Lance V2 allows further "sub-pages" inside a page. Parquet does not have that. Thus, if we now just dropped the restriction that Parquet pages (which, at least as of today, do not allow nesting of smaller sub-pages) need to be contiguous, we would end up with something that is less powerful than Lance V2.

I'm wondering how we can best arrive at something as powerful as the concept in Lance V2 without throwing all Parquet concepts out of the window. Here is my first draft of what this could look like:

When you think it through in detail, a Lance V2 page is actually the equivalent of a Parquet *column chunk*, not of a Parquet *page*, as the former is the (preferred) unit of I/O while the latter is the unit of encoding. And as you describe, a Lance V2 page is the unit of I/O, not the unit of encoding. That is an important point to keep in mind in our discussion, as it might otherwise lead to a lot of misunderstandings.

Thus, to make Parquet column chunks as powerful as Lance V2 pages, we must allow *multiple column chunks* per column inside a Parquet row group. This should give us the desired equivalent of the advantages of Lance V2. So, if we want this (and I think we do), the change for Parquet V3 should be: *a row group can consist of one or many column chunks per column* (whereas before it could consist of only one). We should *not* phrase it as "Parquet pages no longer need to be contiguous", as Parquet pages are not the grain of I/O but the grain of encoding.

Do you agree with this, or did I get something wrong? (I see that Lance V2 is even more flexible than that.
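To make the proposal concrete, here is a minimal sketch in Python of what such metadata could look like. To be clear: these are made-up structures and names for illustration, not the actual Parquet Thrift definitions, and the sizes are invented. The only point is that `chunks` is a list, so chunk sizes can track a good I/O granularity independently of row-group size:

```python
# Hypothetical sketch of row-group metadata where a row group holds a
# *list* of column chunks per column instead of exactly one.
from dataclasses import dataclass

@dataclass
class ColumnChunk:
    file_offset: int   # where this chunk's pages start in the file
    total_bytes: int   # chunk size: the (preferred) unit of I/O
    num_rows: int      # rows covered by this chunk

@dataclass
class ColumnMeta:
    name: str
    # Parquet today: exactly one chunk per column per row group.
    # Proposed: one *or many*, so each chunk can be sized for I/O
    # (e.g., around 8 MiB for object stores) regardless of row count.
    chunks: list[ColumnChunk]

@dataclass
class RowGroup:
    num_rows: int
    columns: list[ColumnMeta]

# A wide string column split into several I/O-sized chunks while a
# narrow int column needs only one chunk for the same row range:
rg = RowGroup(
    num_rows=1_000_000,
    columns=[
        ColumnMeta("id", [ColumnChunk(0, 4 << 20, 1_000_000)]),
        ColumnMeta("payload", [
            ColumnChunk(4 << 20, 8 << 20, 250_000),
            ColumnChunk(12 << 20, 8 << 20, 250_000),
            ColumnChunk(20 << 20, 8 << 20, 500_000),
        ]),
    ],
)

# Every column still covers the full row group; readers simply issue
# one I/O request per chunk instead of one per column chunk pair.
for col in rg.columns:
    assert sum(c.num_rows for c in col.chunks) == rg.num_rows
```

Readers could still address data at row-group granularity; the only new requirement is iterating a list of chunks per column.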
But I guess this is a good in-between solution that allows us to leverage the advantages of "non-contiguity" without needing to make all of Parquet's concepts pluggable, which could be too big a change.)

> We do parallelize inside a file.

Interesting, but this comes with challenges. How do you solve these? As page boundaries are not necessarily aligned between columns, mini-batches can start and end in the middle of pages. This raises the following challenges:

- Multiple concurrent decoders can share the same page (e.g., one decodes the first part of it and the other the last part). Thus, either we request it twice from storage, wasting I/O bandwidth, or we make readers share I/O requests. So we would need some kind of reference counting or a similar technique to check how many decoders still need a page that has been fetched from I/O before we can free the read memory. This is possible, but it increases the complexity of the I/O + decode stack considerably. Does the Lance V2 format presuppose a reader that is this clever? (This is not a real downside, as I am more than happy to implement complex I/O-sharing code if it leads to a more efficient format; I just wanted to mention the trade-off.)

- Page sharing assumes that you have a fast (hopefully O(1), at worst I'd say O(log n)) way to read only parts of a page. Whether this is possible depends on the encoding. E.g., Parquet has encodings where this is possible (PLAIN) and ones where it is not (RLE, DELTA_*, etc.). With the latter encodings, two parallel read threads would need to do duplicate work to each read only part of a page. Also, black-box compression (LZ4, etc.) can only decompress a page as a whole. How do you solve this challenge in LanceDB?

@Steve:

> Jan, assuming you are one of the authors of "Get Real: How Benchmarks Fail
> to Represent the Real World", can I get a preprint copy? As it's relevant
> here.
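As an aside, to make the first challenge above concrete, here is a minimal sketch in Python of reference-counted page sharing. This is my own toy design with made-up names, not a claim about how Lance's reader actually works; a production reader would use async I/O rather than fetching under a lock:

```python
# Toy sketch: a cache that deduplicates page fetches and reference-counts
# each page so its memory is freed only when the last decoder releases it.
import threading

class PageCache:
    def __init__(self, fetch):
        self._fetch = fetch    # fetch(page_id) -> bytes; the real I/O call
        self._pages = {}       # page_id -> [refcount, data]
        self._lock = threading.Lock()

    def get(self, page_id) -> bytes:
        """Fetch a page, or share an already-fetched copy; bumps the refcount."""
        with self._lock:
            entry = self._pages.get(page_id)
            if entry is None:
                # NOTE: a real reader would not hold the lock across the
                # fetch and would issue async I/O; simplified for the sketch.
                entry = self._pages[page_id] = [0, self._fetch(page_id)]
            entry[0] += 1
            return entry[1]

    def put(self, page_id):
        """Decoder is done with the page; free it once nobody else needs it."""
        with self._lock:
            entry = self._pages[page_id]
            entry[0] -= 1
            if entry[0] == 0:
                del self._pages[page_id]

# Two decoders whose mini-batches overlap page 7 trigger only one fetch:
fetches = []
def fetch(page_id):
    fetches.append(page_id)
    return b"...page bytes..."

cache = PageCache(fetch)
first = cache.get(7)    # performs the I/O
second = cache.get(7)   # shares the buffer, no second request
cache.put(7)
cache.put(7)            # last release frees the memory
assert fetches == [7] and 7 not in cache._pages
```

This only covers the buffer-lifetime side; it does nothing about the second challenge (cheap partial decoding), which depends on the encoding itself.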
> My acm library access is gone until I upload another 10 Banksy mural
> photos to wikimedia.

This link should work without any library access: https://dl.acm.org/doi/pdf/10.1145/3209950.3209952 (I tried it from my phone, which definitely doesn't have any ACM access). You might need to press the "PDF" button. Let me know if it doesn't work; then I'll try to find a pre-print.

Cheers,
Jan

On Wed, 22 May 2024 at 16:10, Steve Loughran <ste...@cloudera.com.invalid> wrote:

> On Tue, 21 May 2024 at 22:40, Jan Finis <jpfi...@gmail.com> wrote:
>
> > Thanks Weston for posting here!
> >
> > I appreciate this a lot, as it gives us the opportunity to discuss modern
> > formats in depth with the authors themselves, who probably know best the
> > design trade-offs they took and thus can give us a deeper understanding of
> > what certain features would mean for Parquet.
>
> This is becoming an interesting question.
>
> > I read both your linked posts. I read them with the mindset as if they were
> > the documentation for a file format that I myself would need to add to our
> > engine, so I always double-checked whether I would agree with your
> > reasoning and where I would see problems in the implementation.
> >
> > I ended up with some points where I cannot follow your reasoning yet, or
> > where I feel clarification would be good. It would be nice if you could go
> > a bit into detail here:
> >
> > Regarding your "parallelism without row groups" post [2]:
> >
> > 1. Do I understand correctly that you basically replace row groups with
> > files? Thus, the task of reading row groups in parallel boils down to
> > reading files in parallel. Your post does *not* claim that the new format
> > would be able to parallelize *inside* a row group/file, correct?
> >
> > 2. I do not fully understand what the proposed parallelism has to do with
> > the file format. As you mention yourself, files and row groups are
> > basically the same thing. As such, couldn't you do the same "Decode Based
> > Parallelism" also with Parquet as it is today? E.g., the file reader in our
> > engine looks basically exactly like what you propose, employing what you
> > call Mini Batches and not reading a row group as a whole (which could lead
> > to out of memory in case a row group contains an insane amount of rows, so
> > it is a big no-no anyway for us). It seems that the shortcomings of the
> > code listed in "Our First Parallel File Reader" are solely shortcomings of
> > that code, not of the underlying format.
> >
> > Regarding [1]:
> >
> > 3. This one is mostly about understanding your rationales:
> >
> > As one main argument for abolishing row groups, you mention that sizing
> > them well is hard (I fully agree!). But since you replace row groups with
> > files, don't you have the same problem for the file again? Small row
> > groups/files are bad due to small I/O requests and metadata explosion,
> > agreed! So let's use bigger ones. Here you argue that Parquet readers will
> > load the whole row group into memory and therefore suffer memory issues.
> > This is a strawman IMHO, as it is just a shortcoming of the reader, not
> > of the format. Nothing in the Parquet spec forces a reader to read a row
> > group at once (and in fact, our implementation doesn't do this, for exactly
> > the reasons you mentioned). Just like in Lance V2, Parquet readers can opt
> > to read only a few pages ahead of the decoding.
>
> Parquet-mr now supports parallel GET requests on different ranges on S3;
> it'd be even better if object stores supported multiple range requests in a
> single GET. Doing all this reading in a single file is more efficient
> client-side than having different files open.
>
> > On the writing side, I see your point that a Lance V2 writer never has to
> > buffer more than a page, and this is great! However, this seems to be just
> > a result of allowing pages to not be contiguous, not of the fact that row
> > groups were abolished. You could still support multiple row groups with
> > non-contiguous pages and reap all the benefits you mention. Your post
> > intermingles the two design choices "contiguous pages yes/no" and "row
> > groups as horizontal partitions within a file yes/no". I would argue that
> > the two features are basically fully orthogonal. You can have one without
> > the other and vice versa.
> >
> > So all in all, do I see correctly that your main argument here basically is
> > "don't force pages to be contiguous"? Doing away with row groups is just an
> > added bonus for easier maintenance, as you can use files instead of row
> > groups.
> >
> > 4. Considering contiguous pages and I/O granularity:
> >
> > The format basically proposes to have pages as the only granularity below a
> > file (+ metadata & footer), while Parquet has two granularities: row group,
> > or rather column chunk, and page. You argue that a page in Lance V2 should
> > basically be as big as is necessary for good I/O performance (say, 8 MiB
> > for Amazon S3).
>
> Way bigger, please. Small files are so expensive on S3, whenever you start
> doing any directory operations, bulk copies, etc. Same for the other stores.
>
> Jan, assuming you are one of the authors of "Get Real: How Benchmarks Fail
> to Represent the Real World", can I get a preprint copy? As it's relevant
> here. My acm library access is gone until I upload another 10 Banksy mural
> photos to wikimedia.
>
> Steve