On Tue, 21 May 2024 at 22:40, Jan Finis <jpfi...@gmail.com> wrote:

> Thanks Weston for posting here!
>
> I appreciate this a lot, as it gives us the opportunity to discuss modern
> formats in depth with the authors themselves, who probably know best the
> design trade-offs they took and can thus give us a deeper understanding
> of what certain features would mean for Parquet.

This is becoming an interesting question.

> I read both your linked posts. I read them with the mindset as if they
> were the documentation for a file format that I myself would need to add
> to our engine, so I always double-checked whether I would agree with your
> reasoning and where I would see problems in the implementation.
>
> I ended up with some points where I cannot yet follow your reasoning, or
> where I feel clarification would be good. It would be nice if you could
> go into a bit more detail here:
>
> Regarding your "parallelism without row groups" post [2]:
>
> 1. Do I understand correctly that you basically replace row groups with
> files? Thus, the task of reading row groups in parallel boils down to
> reading files in parallel. Your post does *not* claim that the new format
> would be able to parallelize *inside* a row group/file, correct?
>
> 2. I do not fully understand what the proposed parallelism has to do
> with the file format. As you mention yourself, files and row groups are
> basically the same thing. As such, couldn't you do the same "Decode
> Based Parallelism" with Parquet as it is today? E.g., the file reader in
> our engine looks basically exactly like what you propose, employing what
> you call Mini Batches and not reading a row group as a whole (which
> could lead to running out of memory if a row group contains an insane
> number of rows, so it is a big no-no for us anyway). It seems that the
> shortcomings of the code listed in "Our First Parallel File Reader" are
> solely shortcomings of that code, not of the underlying format.
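That matches how I'd expect a careful reader to behave. For anyone
following along, the pattern Jan describes is roughly the loop below. The
interfaces are hypothetical, just a sketch of the shape of it; the point
is that the memory high-water mark is one encoded page plus one output
batch, independent of how many rows the row group holds.

  import java.io.IOException;
  import java.nio.ByteBuffer;
  import java.util.Arrays;
  import java.util.function.Consumer;

  /** Sketch only: decode a column chunk in bounded batches instead of
   *  materializing the whole row group. Interfaces are hypothetical. */
  final class MiniBatchReader {
    static final int BATCH_ROWS = 4096;

    /** Keeps a small window of encoded pages in flight. */
    interface PagePrefetcher {
      ByteBuffer nextPage() throws IOException;  // null: chunk exhausted
    }

    /** Decodes up to maxRows rows from the page; 0 means page drained. */
    interface PageDecoder {
      int decode(ByteBuffer page, Object[] batch, int maxRows);
    }

    void run(PagePrefetcher pages, PageDecoder decoder,
             Consumer<Object[]> sink) throws IOException {
      ByteBuffer page = pages.nextPage();
      Object[] batch = new Object[BATCH_ROWS];
      while (page != null) {
        int produced = decoder.decode(page, batch, BATCH_ROWS);
        if (produced == 0) {       // current page drained; fetch the next
          page = pages.nextPage();
          continue;
        }
        // Hand a bounded batch downstream; memory stays O(page + batch).
        sink.accept(Arrays.copyOf(batch, produced));
      }
    }
  }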
> Regarding [1]:
>
> 3. This one is mostly about understanding your rationale:
>
> As one main argument for abolishing row groups, you mention that sizing
> them well is hard (I fully agree!). But since you replace row groups
> with files, don't you have the same problem again for the file? Small
> row groups/files are bad due to small I/O requests and metadata
> explosion, agreed! So let's use bigger ones. Here you argue that Parquet
> readers will load the whole row group into memory and therefore suffer
> memory issues. This is a strawman IMHO, as it is just a shortcoming of
> the reader, not of the format. Nothing in the Parquet spec forces a
> reader to read a row group at once (and in fact, our implementation
> doesn't, for exactly the reasons you mentioned). Just like in Lance V2,
> Parquet readers can opt to read only a few pages ahead of the decoding.

Parquet-mr now supports parallel GET requests on different ranges of an
object on S3; it would be even better if object stores supported multiple
range requests in a single GET. Doing all this reading within a single
file is more efficient client-side than having multiple files open.
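To illustrate the client side: with one open stream you can hand the
filesystem client a whole set of ranges and let it coalesce adjacent ones
and fetch the rest in parallel. A rough sketch against the vectored read
API in recent Hadoop releases (illustrative, untested; not lifted from
parquet-mr):

  import java.nio.ByteBuffer;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileRange;

  /** Sketch: fetch several page/column-chunk ranges from one open file. */
  final class VectoredRangeRead {
    static List<ByteBuffer> readRanges(FSDataInputStream in,
        long[] offsets, int[] lengths) throws Exception {
      List<FileRange> ranges = new ArrayList<>();
      for (int i = 0; i < offsets.length; i++) {
        ranges.add(FileRange.createFileRange(offsets[i], lengths[i]));
      }
      // One call: the store client may coalesce adjacent ranges and
      // issue the rest as parallel ranged GETs on the same object.
      in.readVectored(ranges, ByteBuffer::allocate);
      List<ByteBuffer> out = new ArrayList<>();
      for (FileRange r : ranges) {
        out.add(r.getData().get());  // block until that range has landed
      }
      return out;
    }
  }

The nice property is that the coalescing and parallelism policy lives in
the store client, not in every file-format reader.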
> On the writing side, I see your point that a Lance V2 writer never has
> to buffer more than a page, and this is great! However, this seems to be
> a result of allowing pages to be non-contiguous, not of the fact that
> row groups were abolished. You could still support multiple row groups
> with non-contiguous pages and reap all the benefits you mention. Your
> post intermingles the two design choices "contiguous pages yes/no" and
> "row groups as horizontal partitions within a file yes/no". I would
> argue that the two features are basically fully orthogonal: you can have
> one without the other and vice versa.
>
> So, all in all, do I see correctly that your main argument here
> basically is "don't force pages to be contiguous!"? Doing away with row
> groups is just an added bonus for easier maintenance, as you can simply
> use files instead of row groups.
>
> 4. Considering contiguous pages and I/O granularity:
>
> The format basically proposes to have pages as the only granularity
> below a file (+ metadata & footer), while Parquet has two granularities:
> the row group, or rather the column chunk, and the page. You argue that
> a page in Lance V2 should basically be as big as is necessary for good
> I/O performance (say, 8 MiB for Amazon S3).

Way bigger, please. Small files are so expensive on S3, whenever you start
doing any directory operations, bulk copies, etc. The same goes for the
other stores.

Jan, assuming you are one of the authors of "Get Real: How Benchmarks Fail
to Represent the Real World", can I get a preprint copy? It's relevant
here. My ACM library access is gone until I upload another 10 Banksy mural
photos to Wikimedia.

Steve