On 23/05/2024 16:50, Jan Finis wrote:
Now to the implied question: Is the default good? Or if not, what would be a good default?

Weston gave the very high-level argument. Let me try to tease it apart a bit more to get a good common understanding of the trade-offs. I'll also try to estimate which points are relevant and which are rather academic, and what a resulting good default would be.

The considerations that benefit from a *bigger* page size:

- *[Most relevant] Per-page decoding overhead:* When scanning, the page should be big enough to make the constant work that has to be performed per page negligible in contrast to the work required to read the actual values. This work includes, among other things:
  - Decoding the page header (basically reading some Thrift var-size integers)
  - Choosing and instantiating the decoders for data, R, and D levels
  - If applicable, checking the statistics (min/max) to apply scan restrictions
  - "Lost opportunity" cost for leaving the hot decoding loop of the previous page, i.e., the vectorized, unrolled loop ends, etc.

  This is the most important consideration of all, I would say, but it is only relevant once the page size gets quite small. Once your page reaches a certain number of values, the handling of the actual values will dominate the decoding. I would argue (from doing a lot of experiments on Parquet and our own, comparable format) that 20000 is in the realm that should make all these costs mostly negligible. Maybe cranking it up a bit further (e.g., to 100000) would give you a few more percent of performance, but you will not see much difference.

- *[Somewhat relevant] Per-page size overhead:* The page should be big enough so that the space overhead of the page header is not too high. A page header is in the ballpark of around 10 bytes without statistics and maybe 20-30 with statistics (very rough back-of-the-envelope guesstimate). Thus, a page in the kilobytes should be big enough (~1% header overhead), so for uncompressed PLAIN pages, we should be good with 20000 values (see the rough numbers after this list). If the page compresses heavily (say, super long RLE runs), then the page header might end up being a more significant percentage of all data, say 10%. But note that we then already compress very well, so it's 10% of very little memory to begin with. You can construct a worst case in which we compress basically infinitely (say, one long RLE run) and then the header size becomes very significant. But I would argue that this case isn't too relevant, as this column's memory usage will be super low, even if you pay a lot of overhead for the header in this case.

- *[Somewhat relevant] Compression performance:* Most black-box compression schemes are better when they can compress more memory at once. I don't have numbers here, but I guess we will compress a few percent worse (say 5-20%) if we compress a few kilobytes instead of a megabyte at once.

- *[Mostly irrelevant] Encoding disadvantages:* Certain encodings are "broken up" by beginning a new page, reducing their efficiency. RLE needs to prematurely end its current run or bit pack, DELTA_BINARY_PACKED needs to prematurely end its current mini block, etc. Here again, 20000 is big enough that this doesn't matter at all. This would matter if we just had 10 values in the page, but with 20000, this effect should mostly be nil.
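To put rough numbers on the per-page size overhead point, a tiny illustrative calculation (the header sizes are the guesstimates from above, not measurements, and the 200-byte "heavily compressed" page is a made-up example):

    // Back-of-the-envelope header overhead for a 20,000-value page.
    // Header sizes are rough guesses (10-30 bytes), not measurements.
    public class PageHeaderOverhead {
        public static void main(String[] args) {
            int headerBytes = 20;                    // guesstimated header incl. statistics
            long plainPageBytes = 20_000L * 4;       // PLAIN-encoded INT32 page: ~80 kB
            long compressedPageBytes = 200;          // hypothetical page that compressed extremely well

            System.out.printf("PLAIN page:              %.3f%% header overhead%n",
                    100.0 * headerBytes / (plainPageBytes + headerBytes));      // ~0.025%
            System.out.printf("heavily compressed page: %.1f%% header overhead%n",
                    100.0 * headerBytes / (compressedPageBytes + headerBytes)); // ~9%
        }
    }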
Considerations for a *smaller* page size:

- *[Relevant, if you have point look-ups] Point lookups:* By using the offset index, we can load and decode single pages. For a point lookup that's searching for a single row, a smaller page is therefore better. On the I/O side, we're mostly optimal once we're smaller than the block size of our medium (say 4 kB for SSD). On the decoding side, the smaller the better. Some encodings allow random access but others do not (RLE (and therefore DICT), DELTA_*). Compression is also all-or-nothing, so once you compress your pages, you never have random access into a page, so smaller is better. If you have a scenario that is point-lookup heavy, then 20000 is almost too much and you would rather have something like 1000 or even less. But I guess we should not tune the defaults for point lookups, but rather for mostly scans while keeping point lookups in mind. For this, 20000 seems to be a good trade-off, as scans do not suffer, while it's still okay for point access. 1 million values would be way worse here, speaking against the current default of parquet-rs.

- *[Relevant, if you have scan restrictions] More fine-granular min/max bounds:* The point lookups above are an extreme case. Much more common is the case that you have a scan with scan restrictions. The min/max synopses of pages / the page index can be used to skip pages that cannot have matches. Here, finer-grained pages allow for tighter min/max bounds, so they are at an advantage. E.g., if you store all values x from 1 to 1 million (in ascending order) and now have a query WHERE x BETWEEN 100000 AND 199999, we could skip all but five pages if we had a page size of 20000, but we would need to read everything if our page size was 1 million (see the small sketch after this list).

- *[Somewhat relevant] Memory usage while decoding:* A reader will usually need at least memory proportional to the page size (e.g., for decompressing a compressed page). You could argue that 1 MiB is still not too much, and I agree. A naive reader could however also require memory proportional to the number of values in the page (e.g., it could decode the whole page into a buffer). So you want to have an upper bound on the number of values, to avoid cases where an infinite compression (e.g., a single RLE run) makes a page contain an unbounded number of values.

- *[Somewhat relevant] Memory usage and complexity while encoding:* Encoders also have to work harder for larger pages. E.g., a dictionary encoder has to find out which dictionary ids the values would have if the page were encoded by it. This either requires a reverse dictionary or sorting the values. This is faster for fewer values (the reverse dictionary fits into cache, sorting is O(n log n)).

- *[Could be relevant, but only with an optimized implementation] Fine-grained encoding choice:* As each page may use a different encoding, we can react better to skew in the data distribution by being able to switch the encoding more often. E.g., if we first only have unique values and then later a lot of repeated values, we could first use PLAIN and later DICTIONARY. This is mostly irrelevant for now, as the library (at least parquet-mr, haven't checked the others) isn't this clever. It has very simple fallback rules (first dictionary, otherwise one fallback encoding), which are unlikely to take advantage of a more fine-granular encoding choice. But in a perfect world where the writer chooses encodings very carefully, more fine-granular encoding is valuable. E.g., the BtrBlocks paper chooses the encoding per page by sampling the page contents. With this strategy, smaller pages could make a difference.

- *[Quite relevant if the reader supports it] Intra-row-group parallelism:* See our v3 discussion thread. If you want to parallelize inside a row group, then you will have concurrent readers that have to read *parts* of a page. And since compression and some encodings (e.g., RLE and thus DICT, DELTA_*) do not allow you to have O(1) random access into a page, scanning parts of a page is easier if the page is small.

These are all the arguments I have considered so far.
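To make the min/max example above concrete, here is a small stand-alone sketch of the page-pruning logic; it is not the actual parquet-java page-index API, just the idea, and I let x run from 0 to 999,999 so the query range aligns cleanly with page boundaries:

    // Illustration of page skipping with per-page min/max statistics.
    // 50 pages of 20,000 ascending values each, x = 0 .. 999,999.
    public class PagePruning {
        public static void main(String[] args) {
            int pageSize = 20_000;
            int pageCount = 1_000_000 / pageSize;      // 50 pages
            long lo = 100_000, hi = 199_999;           // WHERE x BETWEEN 100000 AND 199999

            int pagesToScan = 0;
            for (int page = 0; page < pageCount; page++) {
                long pageMin = (long) page * pageSize;              // min taken from the page index
                long pageMax = (long) (page + 1) * pageSize - 1;    // max taken from the page index
                if (pageMax >= lo && pageMin <= hi) {               // page may contain matches
                    pagesToScan++;
                }
            }
            // Prints "5 of 50": all but five pages are skipped.
            // With a single 1,000,000-value page, nothing could be skipped.
            System.out.println(pagesToScan + " of " + pageCount);
        }
    }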
All in all, the conclusion I draw from this is that the current defaults of parquet-mr are fine. 20000 seems like a sane in-between value, with the 1 MiB page size limit ensuring that long strings do not create huge pages. 1 million values would be too much IMHO. It's okay if you only do scans, do not use min/max values often, and do not use intra-row-group parallelism, but I would say it's not a good default, as some readers will have some of these use cases.

One final questionable point to me is that parquet-mr applies 20000 as a *row limit*, not as a *value limit*. Consequently, pages of nested columns will not abide by the 20000 limit, but can have many more values. Whether this is a good idea is somewhat questionable to me. I think I myself would rather apply a 20000 value limit, so that pages in nested columns also abide by the same rules. I don't see a big argument why you would not want these limits for nested columns.

Cheers,
Jan

On Thu, 23 May 2024 at 17:28, Ed Seidl <etse...@live.com> wrote:

I haven't seen it mentioned in this thread, but for the curious: the 20000 row limit appears to come from a 2020 blog post by Cloudera [1] (in the section "Testing with Parquet-MR").

Cheers,
Ed

[1] https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/

On 5/23/24 6:50 AM, Raphael Taylor-Davies wrote:

The Rust implementation supports limiting the number of rows in a page, although this is disabled by default. If there is consensus that 20,000 is the recommended limit, I don't see any issue with changing this default.

On 23/05/2024 14:39, Jan Finis wrote:

Addendum, since Fokko mentioned Iceberg. Iceberg does the same, also applying a 20000 row limit by default (https://github.com/apache/iceberg/blob/b3c25fb7608934d975a054b353823ca001ca3742/core/src/main/java/org/apache/iceberg/TableProperties.java#L137C3-L137C67).

On Thu, 23 May 2024 at 15:38, Jan Finis <jpfi...@gmail.com> wrote:

The 1 MiB page size limit of parquet-mr is a red herring. Parquet-mr (now parquet-java) actually writes *way smaller* pages by default. parquet-mr actually has *three limits* for deciding when to finish a page:

- The size limit, which is 1 MiB by default, as you mention (DEFAULT_PAGE_SIZE).
- A value limit, which is INT_MAX / 2 by default, so not really a limit if the default is used (DEFAULT_PAGE_VALUE_COUNT_THRESHOLD).
- A row count limit, which is 20000 by default (DEFAULT_PAGE_ROW_COUNT_LIMIT). This limit will in practice be hit *way* before the page size limit of 1 MiB is reached.

(See https://github.com/apache/parquet-java/blob/9b11410f15410b4d76d9f73f9545cf9110488517/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriteStoreBase.java#L238 for the code that checks all three limits; a small configuration sketch follows below.)

Thus, the page size limit is rather an upper bound for very large values (e.g., long strings) or very many values in case of nested columns. It will usually not be reached at all for a normal non-nested, non-long-string column.
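To make the three knobs concrete, here is a small sketch of how they map to the writer builder in parquet-java (using ExampleParquetWriter purely for illustration; whether the value-count threshold is exposed as a builder method depends on the parquet-java version, so treat that part as an assumption to verify):

    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.example.data.simple.SimpleGroupFactory;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.example.ExampleParquetWriter;
    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.MessageTypeParser;

    public class PageLimitSketch {
        public static void main(String[] args) throws Exception {
            MessageType schema = MessageTypeParser.parseMessageType(
                    "message example { required int32 x; }");

            try (ParquetWriter<Group> writer = ExampleParquetWriter
                    .builder(new Path("/tmp/pages.parquet"))
                    .withType(schema)
                    .withPageSize(1024 * 1024)      // size limit, DEFAULT_PAGE_SIZE = 1 MiB
                    .withPageRowCountLimit(20_000)  // row count limit, DEFAULT_PAGE_ROW_COUNT_LIMIT
                    // The third knob, DEFAULT_PAGE_VALUE_COUNT_THRESHOLD, is checked in
                    // ColumnWriteStoreBase as well; its builder exposure varies by version.
                    .build()) {
                SimpleGroupFactory groups = new SimpleGroupFactory(schema);
                for (int i = 0; i < 100_000; i++) {
                    writer.write(groups.newGroup().append("x", i)); // ~5 pages per column chunk
                }
            }
        }
    }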
Rather, the pages will actually be quite small due to the 20000 row limit, e.g., in PLAIN encoding, a page without any R and D levels would be 80 kB for 4-byte values and 160 kB for 8-byte values. And this is *before* applying compression. If your values compress very well, or if you use an encoding that is way smaller (e.g., dict), pages will be way smaller. E.g., say you only have 16 distinct values in the page, then dictionary encoding with 4-bit keys will be used, leading to a page of only 10 kB, even if there aren't any runs in it. As some data types compress very well (either due to RLE (dict keys), DELTA_*, or due to black-box compression applied on top), I have seen many pages < 1 kB in practice.

Cheers,
Jan

On Thu, 23 May 2024 at 15:05, Antoine Pitrou <anto...@python.org> wrote:

Speaking of which and responding to my own question, parquet-java also defaults to 1 MiB:
https://github.com/apache/parquet-java/blob/9b11410f15410b4d76d9f73f9545cf9110488517/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L49

Regards
Antoine.

On Thu, 23 May 2024 01:39:58 -1000, Jacques Nadeau <jacq...@apache.org> wrote:

I've found that a variable page size based on the expected number of columns read back is necessary, since you'll need read-back memory equal to the number of columns times the page size times the number of concurrently read files. So if one is reading back 1000 columns, one may need 1 GB+ of memory per file for reads. This resulted in sizing things down as width went up, to avoid spending excessive budget on read memory. This often resulted in pages closer to 64 kB - 128 kB. (In the work I did, we typically expected many files to be concurrently read across many requested ops.)

On Wed, May 22, 2024, 11:50 PM Andrew Lamb <andrewlamb11-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:

The Rust implementation uses 1 MB pages by default [1].

Andrew

[1]: https://github.com/apache/arrow-rs/blob/bd5d4a59db5d6d0e1b3bdf00644dbaf317f3be03/parquet/src/file/properties.rs#L28-L29

On Thu, May 23, 2024 at 4:10 AM Fokko Driesprong <fokko-1odqgaof3llqfi55v6+...@public.gmane.org> wrote:

Hey Antoine,

Thanks for raising this. In Iceberg we also use the 1 MiB page size:
https://github.com/apache/iceberg/blob/b3c25fb7608934d975a054b353823ca001ca3742/core/src/main/java/org/apache/iceberg/TableProperties.java#L133

Kind regards,
Fokko

On Thu, 23 May 2024 at 10:06, Antoine Pitrou <antoine-+zn9apsxkcednm+yrof...@public.gmane.org> wrote:

Hello,

The Parquet format itself (or at least the README) recommends an 8 kiB page size, suggesting that data pages are the unit of computation. However, Parquet C++ has long chosen a 1 MiB page size by default (*), suggesting that data pages are considered as the unit of IO there.

(*) even bumping it to 64 MiB at some point, perhaps by mistake:
https://github.com/apache/arrow/commit/4078b876e0cc7503f4da16693ce7901a6ae503d3

What are the typical choices in other writers?

Regards
Antoine.