Now to the implied question: Is the default good? Or if not, what would be
a good default?
Weston gave the very high-level argument. Let me try to tease it apart a
bit more to get a good common understanding of the trade-offs. I'll also
try to estimate which points are relevant and which are rather academic, and
what a resulting good default would be.
The considerations that benefit from a *bigger* page size:
- *[Most relevant] Per-page decoding overhead:* When scanning, the page
should be big enough that the constant work that has to be performed per
page is negligible compared to the work required to read the actual values.
This work includes, among other things:
  - Decoding the page header (basically reading some Thrift variable-size
  integers)
  - Choosing and instantiating the decoders for data, R, and D levels
  - If applicable, checking the statistics (min/max) to apply scan
  restrictions
  - "Lost opportunity" cost for leaving the hot decoding loop of the
  previous page (i.e., the vectorized, unrolled loop ends, etc.)
This is the most important consideration of all, I would say, but it is
only relevant once the page size gets quite small. Once your page reaches
a certain number of values, the handling of the actual values will
dominate the decoding. I would argue (from doing a lot of experiments on
Parquet and our own, comparable format) that 20000 is in the realm that
should make all these costs mostly negligible; see also the
back-of-the-envelope sketch after this list. Maybe cranking it up a bit
further (e.g., to 100000) would give you a few more percent of
performance, but you will not see much difference.
- *[Somewhat relevant] Per-page size overhead:* The page should be big
enough that the space overhead of the page header is not too high. A page
header is in the ballpark of 10 bytes without statistics and maybe 20-30
bytes with statistics (very rough back-of-the-envelope guesstimate).
Thus, a page in the kilobytes should be big enough (~1% header overhead),
so for uncompressed PLAIN pages, we should be good with 20000 values. If
the page compresses heavily (say, super long RLE runs), then the page header
might end up being a more significant percentage of all data, say 10%. But
note that we then already compress very well, so it's 10% of very little
memory to begin with.
You can construct a worst case in which we compress basically infinitely
(say one long RLE run) and then the header size becomes very significant.
But I would argue that this case isn't too relevant, as this column's
memory usage will be super low, even if you pay a lot of overhead for the
header in this case.
- *[Somewhat relevant] Compression performance:* Most black-box
compression schemes are better when they can compress more memory at once.
I don't have numbers here, but I guess we will compress a few percent worse
(say 5-20%) if we compress a few kilobytes instead of a megabyte at once;
the small experiment sketched after this list is one way to measure the
effect on your own data.
- *[Mostly irrelevant] Encoding disadvantages:* Certain encodings are
"broken up" by beginning a new page, reducing their efficiency. RLE needs
to prematurely end its current run or bit pack, DELTA_BINARY_PACKED needs
to prematurely end its current mini block, etc. Here again, 20000 values
is big enough that this doesn't matter. It would matter if we only had 10
values in the page, but with 20000, the effect should be mostly nil.
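To make the first two points above a bit more tangible, here is a rough
back-of-the-envelope sketch (plain Python; the per-page CPU cost and the
per-value cost are assumed ballpark numbers, not measurements):

# Rough amortization model; the two cost constants are assumptions.
PER_PAGE_FIXED_NS = 500    # assumed: header decode + decoder setup + leaving the hot loop
PER_VALUE_NS = 0.5         # assumed: decoding one PLAIN value in a tight loop
HEADER_BYTES = 20          # roughly a page header with statistics
VALUE_BYTES = 4            # 4-byte values, PLAIN encoded

for values_per_page in (100, 1_000, 20_000, 100_000, 1_000_000):
    cpu_overhead = PER_PAGE_FIXED_NS / (PER_PAGE_FIXED_NS + values_per_page * PER_VALUE_NS)
    size_overhead = HEADER_BYTES / (HEADER_BYTES + values_per_page * VALUE_BYTES)
    print(f"{values_per_page:>9} values/page: ~{cpu_overhead:.1%} per-page CPU overhead, "
          f"~{size_overhead:.2%} header size overhead")

With these assumed constants, the fixed cost dominates for tiny pages,
sits in the low single digits at 20000 values, and only shaves off a
little more when going to 100000 or beyond, which is the shape of the
argument above.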
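For the compression point, a quick experiment along these lines gives a
feel for the effect (zlib here is just a stand-in for whatever black-box
codec is used, and the generated data is made up, so only the direction of
the result matters, not the exact percentage):

import random
import zlib

# Made-up column data: 256k 4-byte integers drawn from a small domain.
random.seed(0)
data = b"".join(random.randrange(1000).to_bytes(4, "little")
                for _ in range(256 * 1024))

whole = len(zlib.compress(data))                        # one ~1 MiB "page"
chunked = sum(len(zlib.compress(data[i:i + 8 * 1024]))  # 128 independent 8 KiB "pages"
              for i in range(0, len(data), 8 * 1024))

print(f"one 1 MiB buffer : {whole} bytes")
print(f"128 x 8 KiB pages: {chunked} bytes ({(chunked - whole) / whole:+.1%})")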
Considerations for a *smaller* page size:
- *[Relevant, if you have point look-ups] Point lookups:* By using the
offset index, we can load and decode single pages. For a point lookup that
searches for a single row, a smaller page is therefore better. On the I/O
side, we're mostly optimal once we're smaller than the block size of our
medium (say 4 KiB for SSD). On the decoding side, the smaller the better.
Some encodings allow random access but others do not (RLE (and therefore
DICT), DELTA_*). Compression is also all-or-nothing, so once you compress
your pages, you never have random access into a page, and smaller is
better; see the sketch after this list for how the page size bounds the
number of values a lookup has to decode.
If you have a scenario that is point-lookup heavy, then 20000 is almost
too much and you would rather have something like 1000 or even less. But I
guess we should not tune the defaults for point lookups, but rather for
mostly scans while keeping point lookups in mind. For this, 20000 seems to
be a good trade-off, as scans do not suffer, while it's still okay for
point access. 1 million values would be way worse here, speaking against
the current default of parquet-rs.
- *[Relevant, if you have scan restrictions] More fine granular min/max
bounds:* The point look-ups above are an extreme case. Much more common is
a scan with scan restrictions. The min/max synopses of the pages / the
page index can be used to skip pages that cannot have matches. Here, finer
grained pages allow for tighter min/max bounds, so they are at an
advantage. E.g., if you store all values x from 1 to 1 million (in
ascending order) and now have a query WHERE x BETWEEN 100000 AND 199999,
we could skip all but five or six pages (depending on page alignment) if
we had pages of 20000 values, but we would need to read everything if our
page size was 1 million (see the sketch after this list).
- *[Somewhat relevant] Memory usage while decoding:* A reader will usually
need at least memory proportional to the page size (e.g., for
decompressing a compressed page). You could argue that 1 MiB is still not
too much, and I agree. A naive reader could however also require memory
proportional to the number of values in the page (e.g., it could decode
the whole page into a buffer). So you want an upper bound on the number of
values, to avoid cases where an extreme compression ratio (e.g., a single
RLE run) makes a page contain an unbounded number of values.
- *[Somewhat relevant] Memory usage and complexity while encoding:*
Encoders also have to work harder for larger pages. E.g., a dictionary
encoder has to find out which dictionary ids the values would have if the
page were encoded by it. This requires either a reverse dictionary or
sorting the values. Both are faster for fewer values (the reverse
dictionary fits into cache, sorting is O(n log n)); see the small sketch
after this list.
- *[Could be relevant, but only with an optimized implementation] Fine
grained encoding choice:* As each page may use a different encoding, we
can react better to skew in the data distribution by being able to switch
the encoding more often. E.g., if we first have only unique values and
later a lot of repeated values, we could first use PLAIN and later
DICTIONARY. This is mostly irrelevant for now, as the libraries (at least
parquet-mr; I haven't checked the others) aren't this clever. parquet-mr
has very simple fallback rules (first dictionary, then a single fallback
encoding) which are unlikely to take advantage of a more fine granular
encoding choice. But in a perfect world where the writer chooses encodings
very carefully, more fine granular encoding is valuable. E.g., the
BtrBlocks paper chooses the encoding per page by sampling the page
contents (see the sketch after this list for the idea). With this
strategy, smaller pages could make a difference.
- *[Quite relevant if the reader supports it] Intra-row-group
parallelism:* See our v3 discussion thread. If you want to parallelize
inside a row group, you will have concurrent readers that have to read
*parts* of a page. And since compression and some encodings (e.g., RLE and
thus DICT, DELTA_*) do not allow O(1) random access into a page, scanning
parts of a page is easier if the page is small.
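To put some numbers on the point-lookup and min/max arguments, here is the
x BETWEEN 100000 AND 199999 example from above as a small sketch (plain
Python; it assumes the values 1 to 1,000,000 stored in ascending order,
with a page cut every page_values values):

# How many pages overlap the predicate, for different page sizes?
TOTAL = 1_000_000
lo, hi = 100_000, 199_999

for page_values in (1_000, 20_000, 1_000_000):
    n_pages = TOTAL // page_values
    # Page p holds values p*page_values+1 .. (p+1)*page_values; its min/max
    # entry in the page index lets us skip it if it cannot overlap [lo, hi].
    needed = sum(1 for p in range(n_pages)
                 if p * page_values + 1 <= hi and (p + 1) * page_values >= lo)
    print(f"{page_values:>9} values/page: scan {needed:>3} of {n_pages:>4} pages; "
          f"a point lookup decodes at most {page_values} values")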
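For the encoder-side cost, a minimal sketch of the reverse-dictionary work
a dictionary encoder has to do per page (toy code; real writers work on
typed arrays and have fallback logic, this only shows what has to scale
with the number of values in a page):

def dict_encode_page(values):
    """Toy dictionary encoding of one page: returns (dictionary, indices)."""
    reverse = {}        # value -> dictionary id; grows with the page's distinct values
    dictionary = []
    indices = []
    for v in values:
        idx = reverse.get(v)
        if idx is None:
            idx = len(dictionary)
            reverse[v] = idx
            dictionary.append(v)
        indices.append(idx)
    return dictionary, indices

dictionary, indices = dict_encode_page(["a", "b", "a", "c", "b", "a"])
print(dictionary, indices)   # ['a', 'b', 'c'] [0, 1, 0, 2, 1, 0]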
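And for the fine granular encoding choice, the BtrBlocks-style idea of
picking the encoding per page by sampling could look roughly like this
(the 50% distinct-ratio threshold and the two candidate encodings are made
up for illustration; no Parquet writer does exactly this today):

import random

def choose_encoding(values, sample_size=100, max_distinct_ratio=0.5):
    """Pick an encoding for one page by sampling its contents (illustrative only)."""
    sample = random.sample(values, min(sample_size, len(values)))
    distinct_ratio = len(set(sample)) / len(sample)
    return "DICTIONARY" if distinct_ratio <= max_distinct_ratio else "PLAIN"

# Skewed column: early pages are all unique, later pages are highly repetitive.
random.seed(42)
unique_page = list(range(20_000))
repetitive_page = [random.randrange(16) for _ in range(20_000)]
print(choose_encoding(unique_page))      # -> PLAIN
print(choose_encoding(repetitive_page))  # -> DICTIONARY

The smaller the page, the more often such a decision can be revisited,
which is exactly where the advantage would come from.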
These are all the arguments I have considered so far. All in all, the
conclusion I draw from this is that the current defaults of parquet-mr are
fine. 20000 seems like a sane in-between value, with the 1 MiB page size
limit ensuring that long strings do not create huge pages. 1 million
values would be too much IMHO. It's okay if you only do scans, do not use
min/max values often, and do not use intra-row-group parallelism, but I
would say it's not a good default, as some readers will have some of these
use cases.
One final point that is questionable to me is that parquet-mr applies
20000 as a *row limit*, not as a *value limit*. Consequently, pages of
nested columns will not abide by the 20000 limit, but can have many more
values (see the example below). Whether this is a good idea is somewhat
questionable to me. I myself would rather apply a 20000 *value* limit, so
that pages in nested columns also abide by the same rules. I don't see a
big argument why you would not want these limits for nested columns.
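A quick back-of-the-envelope to illustrate the difference (the average
list lengths are of course made up):

# With a *row* limit, a page of a nested column holds 20000 rows, but each
# row may contribute many leaf values (plus R/D levels) to the page.
ROW_LIMIT = 20_000
for avg_list_len in (1, 10, 100):            # assumed average elements per row
    values_per_page = ROW_LIMIT * avg_list_len
    print(f"avg {avg_list_len:>3} elements/row -> ~{values_per_page:>9,} values per page")
# A 20000 *value* limit would keep pages of nested columns in the same
# range as flat columns, regardless of how long the lists are.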
Cheers,
Jan
On Thu, May 23, 2024 at 17:28, Ed Seidl <etse...@live.com> wrote:
I haven't seen it mentioned in this thread, but for the curious the
20000 row limit appears to come from a 2020 blog post by Cloudera [1]
(in the section "Testing with Parquet-MR").
Cheers,
Ed
[1]
https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
On 5/23/24 6:50 AM, Raphael Taylor-Davies wrote:
The rust implementation supports limiting the number of rows in a
page, although this is disabled by default. If there is consensus that
20,000 is the recommended limit, I don't see any issue with changing
this default.
On 23/05/2024 14:39, Jan Finis wrote:
Addendum, since Fokko mentioned Iceberg.
Iceberg does the same, also applying a 20000 row limit by default
(https://github.com/apache/iceberg/blob/b3c25fb7608934d975a054b353823ca001ca3742/core/src/main/java/org/apache/iceberg/TableProperties.java#L137C3-L137C67).
On Thu, May 23, 2024 at 15:38, Jan Finis <jpfi...@gmail.com> wrote:
The 1 MiB page size limit of parquet-mr is a red herring. parquet-mr (now
parquet-java) actually writes *way smaller* pages by default. parquet-mr
actually has *three limits* for deciding when to finish a page:
- The size limit, which is 1 MiB by default, as you mention
(DEFAULT_PAGE_SIZE).
- A value limit, which is INT_MAX / 2 by default, so not really a limit if
the default is used (DEFAULT_PAGE_VALUE_COUNT_THRESHOLD).
- A row count limit, which is 20000 by default
(DEFAULT_PAGE_ROW_COUNT_LIMIT). This limit will, in practice, be hit *way*
before the page size limit of 1 MiB is reached.
(See
https://github.com/apache/parquet-java/blob/9b11410f15410b4d76d9f73f9545cf9110488517/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriteStoreBase.java#L238
for the code that checks all three limits.)
Thus, the page size limit is rather an upper bound for very large values
(e.g., long strings) or very many values in the case of nested columns. It
will usually not be reached at all for a normal non-nested,
non-long-string column.
Rather, the pages will actually be quite small due to the 20000 row limit.
E.g., in PLAIN encoding, a page without any R and D levels would be 80 kB
for 4-byte values and 160 kB for 8-byte values. And this is *before*
applying compression. If your values compress very well, or if you use an
encoding that is way smaller (e.g., dict), pages will be way smaller.
E.g., say you only have 16 distinct values in the page, then dictionary
encoding with 4-bit keys will be used, leading to a page of only 10 kB,
even if there are no runs in there. As some data types compress very well
(either due to RLE (dict keys), DELTA_*, or due to black-box compression
applied on top), I have seen many pages < 1 kB in practice.
Cheers,
Jan
On Thu, May 23, 2024 at 15:05, Antoine Pitrou <anto...@python.org> wrote:
Speaking of which and responding to my own question, parquet-java also
defaults to 1 MiB:
https://github.com/apache/parquet-java/blob/9b11410f15410b4d76d9f73f9545cf9110488517/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L49
Regards
Antoine.
On Thu, 23 May 2024 01:39:58 -1000
Jacques Nadeau <jacq...@apache.org> wrote:
I've found that a variable page size based on the expected read-back
number of columns is necessary, since you'll need read-back memory equal
to number of columns times page size times number of concurrent files
being read. So if one is reading back 1000 columns, one may need 1 GB+ of
memory per file for reads. This resulted in sizing things down as width
went up, to avoid spending excessive budget on read memory. This often
resulted in pages closer to 64k - 128k. (In the work I did, we typically
expected many files to be concurrently read across many requested ops.)
On Wed, May 22, 2024, 11:50 PM Andrew Lamb
<andrewlamb11-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:
The Rust implementation uses 1MB pages by default[1]
Andrew
[1]:
https://github.com/apache/arrow-rs/blob/bd5d4a59db5d6d0e1b3bdf00644dbaf317f3be03/parquet/src/file/properties.rs#L28-L29
On Thu, May 23, 2024 at 4:10 AM Fokko Driesprong
<fokko-1odqgaof3llqfi55v6+...@public.gmane.org> wrote:
Hey Antoine,
Thanks for raising this. In Iceberg we also use the 1 MiB page
size:
https://github.com/apache/iceberg/blob/b3c25fb7608934d975a054b353823ca001ca3742/core/src/main/java/org/apache/iceberg/TableProperties.java#L133
Kind regards,
Fokko
On Thu, May 23, 2024 at 10:06, Antoine Pitrou
<antoine-+zn9apsxkcednm+yrof...@public.gmane.org> wrote:
Hello,
The Parquet format itself (or at least the README) recommends an 8 KiB
page size, suggesting that data pages are the unit of computation.
However, Parquet C++ has long chosen a 1 MiB page size by default (*),
suggesting that data pages are considered the unit of I/O there.
(*) even bumping it to 64 MiB at some point, perhaps by mistake:
https://github.com/apache/arrow/commit/4078b876e0cc7503f4da16693ce7901a6ae503d3
What are the typical choices in other writers?
Regards
Antoine.