On 23/05/2024 16:50, Jan Finis wrote:
Now to the implied question: Is the default good? Or if not, what would be a good default?

Weston gave the very high-level argument. Let me try to tease it apart a bit more to get a good common understanding of the trade-offs. I'll also try to estimate which points are relevant and which are rather academic, and what a resulting good default would be.

The considerations that benefit from a *bigger* page size:

- *[Most relevant] Per-page decoding overhead:* When scanning, the page should be big enough to make the constant work that has to be performed per page negligible in contrast to the work required to read the actual values. This work includes, among other things:
  - Decoding the page header (basically reading some Thrift var-size integers)
  - Choosing and instantiating the decoders for data, R, and D levels
  - If applicable, checking the statistics (min/max) to apply scan restrictions
  - "Lost opportunity" cost for leaving the hot decoding loop of the previous page, i.e., the vectorized, unrolled loop ends, etc.

  This is the most important consideration of all, I would say, but it is only relevant once the page size gets quite small. Once your page reaches a certain number of values, the handling of the actual values will dominate the decoding. I would argue (from doing a lot of experiments on Parquet and our own, comparable format) that 20000 is in the realm that should make all these costs mostly negligible. Maybe cranking it up a bit further (e.g., to 100000) would give you a few more percent of performance, but you will not see much difference.

- *[Somewhat relevant] Per-page size overhead:* The page should be big enough so that the space overhead of the page header is not too high. A page header is in the ballpark of around 10 bytes without statistics and maybe 20-30 with statistics (very rough back-of-the-envelope guesstimate). Thus, a page in the kilobytes should be big enough (~1% header overhead), so for uncompressed PLAIN pages, we should be good with 20000 values (see the rough numbers after this list). If the page compresses heavily (say, super long RLE runs), then the page header might end up being a more significant percentage of all data, say 10%. But note that we then already compress very well, so it's 10% of very little memory to begin with. You can construct a worst case in which we compress basically infinitely (say, one long RLE run) and then the header size becomes very significant. But I would argue that this case isn't too relevant, as this column's memory usage will be super low, even if you pay a lot of overhead for the header in this case.

- *[Somewhat relevant] Compression performance:* Most black-box compression schemes are better when they can compress more memory at once. I don't have numbers here, but I guess we will compress a few percent worse (say 5-20%) if we compress a few kilobytes instead of a megabyte at once.

- *[Mostly irrelevant] Encoding disadvantages:* Certain encodings are "broken up" by beginning a new page, reducing their efficiency. RLE needs to prematurely end its current run or bit pack, DELTA_BINARY_PACKED needs to prematurely end its current mini block, etc. Here again, 20000 is big enough that this doesn't matter at all. This would matter if we just had 10 values in the page, but with 20000, this effect should mostly be nil.
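To put rough numbers on the per-page size overhead point, a tiny illustrative calculation (the header sizes are the guesstimates from above, not measurements, and the 200-byte "heavily compressed" page is a made-up example):

    // Back-of-the-envelope header overhead for a 20,000-value page.
    // Header sizes are rough guesses (10-30 bytes), not measurements.
    public class PageHeaderOverhead {
        public static void main(String[] args) {
            int headerBytes = 20;                    // guesstimated header incl. statistics
            long plainPageBytes = 20_000L * 4;       // PLAIN-encoded INT32 page: ~80 kB
            long compressedPageBytes = 200;          // hypothetical page that compressed extremely well

            System.out.printf("PLAIN page:              %.3f%% header overhead%n",
                    100.0 * headerBytes / (plainPageBytes + headerBytes));      // ~0.025%
            System.out.printf("heavily compressed page: %.1f%% header overhead%n",
                    100.0 * headerBytes / (compressedPageBytes + headerBytes)); // ~9%
        }
    }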
Considerations for a *smaller* page size:

- *[Relevant, if you have point look-ups] Point lookups:* By using the offset index, we can load and decode single pages. For a point lookup that's searching for a single row, a smaller page is therefore better. On the I/O side, we're mostly optimal once we're smaller than the block size of our medium (say 4 kB for SSD). On the decoding side, the smaller the better. Some encodings allow random access but others do not (RLE (and therefore DICT), DELTA_*). Compression is also all-or-nothing, so once you compress your pages, you never have random access into a page, so smaller is better. If you have a scenario that is point-lookup heavy, then 20000 is almost too much and you would rather have something like 1000 or even less. But I guess we should not tune the defaults for point lookups, but rather for mostly scans while keeping point lookups in mind. For this, 20000 seems to be a good trade-off, as scans do not suffer, while it's still okay for point access. 1 million values would be way worse here, speaking against the current default of parquet-rs.

- *[Relevant, if you have scan restrictions] More fine-granular min/max bounds:* The point lookups above are an extreme case. Much more common is the case that you have a scan with scan restrictions. The min/max synopses of pages / the page index can be used to skip pages that cannot have matches. Here, finer-grained pages allow for tighter min/max bounds, so they are at an advantage. E.g., if you store all values x from 1 to 1 million (in ascending order) and now have a query WHERE x BETWEEN 100000 AND 199999, we could skip all but five pages if we had a page size of 20000, but we would need to read everything if our page size was 1 million (see the small sketch after this list).

- *[Somewhat relevant] Memory usage while decoding:* A reader will usually need at least memory proportional to the page size (e.g., for decompressing a compressed page). You could argue that 1 MiB is still not too much, and I agree. A naive reader could however also require memory proportional to the number of values in the page (e.g., it could decode the whole page into a buffer). So you want to have an upper bound on the number of values, to avoid cases where an infinite compression (e.g., a single RLE run) makes a page contain an unbounded number of values.

- *[Somewhat relevant] Memory usage and complexity while encoding:* Encoders also have to work harder for larger pages. E.g., a dictionary encoder has to find out which dictionary ids the values would have if the page were encoded by it. This either requires a reverse dictionary or sorting the values. This is faster for fewer values (the reverse dictionary fits into cache, sorting is O(n log n)).

- *[Could be relevant, but only with an optimized implementation] Fine-grained encoding choice:* As each page may use a different encoding, we can react better to skew in the data distribution by being able to switch the encoding more often. E.g., if we first only have unique values and then later a lot of repeated values, we could first use PLAIN and later DICTIONARY. This is mostly irrelevant for now, as the library (at least parquet-mr, haven't checked the others) isn't this clever. It has very simple fallback rules (first dictionary, otherwise one fallback encoding), which are unlikely to take advantage of a more fine-granular encoding choice. But in a perfect world where the writer chooses encodings very carefully, more fine-granular encoding is valuable. E.g., the BtrBlocks paper chooses the encoding per page by sampling the page contents. With this strategy, smaller pages could make a difference.

- *[Quite relevant if the reader supports it] Intra-row-group parallelism:* See our v3 discussion thread. If you want to parallelize inside a row group, then you will have concurrent readers that have to read *parts* of a page. And since compression and some encodings (e.g., RLE and thus DICT, DELTA_*) do not allow you to have O(1) random access into a page, scanning parts of a page is easier if the page is small.

These are all the arguments I have considered so far.
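To make the min/max example above concrete, here is a small stand-alone sketch of the page-pruning logic; it is not the actual parquet-java page-index API, just the idea, and I let x run from 0 to 999,999 so the query range aligns cleanly with page boundaries:

    // Illustration of page skipping with per-page min/max statistics.
    // 50 pages of 20,000 ascending values each, x = 0 .. 999,999.
    public class PagePruning {
        public static void main(String[] args) {
            int pageSize = 20_000;
            int pageCount = 1_000_000 / pageSize;      // 50 pages
            long lo = 100_000, hi = 199_999;           // WHERE x BETWEEN 100000 AND 199999

            int pagesToScan = 0;
            for (int page = 0; page < pageCount; page++) {
                long pageMin = (long) page * pageSize;              // min taken from the page index
                long pageMax = (long) (page + 1) * pageSize - 1;    // max taken from the page index
                if (pageMax >= lo && pageMin <= hi) {               // page may contain matches
                    pagesToScan++;
                }
            }
            // Prints "5 of 50": all but five pages are skipped.
            // With a single 1,000,000-value page, nothing could be skipped.
            System.out.println(pagesToScan + " of " + pageCount);
        }
    }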
All in all, the conclusion I draw from this is that the current defaults of parquet-mr are fine. 20000 seems like a sane in-between value, with the 1 MiB page size limit ensuring that long strings do not create huge pages. 1 million values would be too much IMHO. It's okay if you only do scans, do not use min/max values often, and do not use intra-row-group parallelism, but I would say it's not a good default, as some readers will have some of these use cases.

One final questionable point to me is that parquet-mr applies 20000 as a *row limit*, not as a *value limit*. Consequently, pages of nested columns will not abide by the 20000 limit, but can have many more values. Whether this is a good idea is somewhat questionable to me. I think I myself would rather apply a 20000 value limit, so that pages in nested columns also abide by the same rules. I don't see a big argument why you would not want these limits for nested columns.

Cheers,
Jan

On Thu, 23 May 2024 at 17:28, Ed Seidl <etse...@live.com> wrote:

I haven't seen it mentioned in this thread, but for the curious: the 20000 row limit appears to come from a 2020 blog post by Cloudera [1] (in the section "Testing with Parquet-MR").

Cheers,
Ed

[1] https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/

On 5/23/24 6:50 AM, Raphael Taylor-Davies wrote:

The Rust implementation supports limiting the number of rows in a page, although this is disabled by default. If there is consensus that 20,000 is the recommended limit, I don't see any issue with changing this default.

On 23/05/2024 14:39, Jan Finis wrote:

Addendum, since Fokko mentioned Iceberg. Iceberg does the same, also applying a 20000 row limit by default (https://github.com/apache/iceberg/blob/b3c25fb7608934d975a054b353823ca001ca3742/core/src/main/java/org/apache/iceberg/TableProperties.java#L137C3-L137C67).

On Thu, 23 May 2024 at 15:38, Jan Finis <jpfi...@gmail.com> wrote:

The 1 MiB page size limit of parquet-mr is a red herring. Parquet-mr (now parquet-java) actually writes *way smaller* pages by default. parquet-mr actually has *three limits* for deciding when to finish a page:

- The size limit, which is 1 MiB by default, as you mention (DEFAULT_PAGE_SIZE).
- A value limit, which is INT_MAX / 2 by default, so not really a limit if the default is used (DEFAULT_PAGE_VALUE_COUNT_THRESHOLD).
- A row count limit, which is 20000 by default (DEFAULT_PAGE_ROW_COUNT_LIMIT). This limit will in practice be hit *way* before the page size limit of 1 MiB is reached.

(See https://github.com/apache/parquet-java/blob/9b11410f15410b4d76d9f73f9545cf9110488517/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriteStoreBase.java#L238 for the code that checks all three limits; a small configuration sketch follows below.)

Thus, the page size limit is rather an upper bound for very large values (e.g., long strings) or very many values in case of nested columns. It will usually not be reached at all for a normal non-nested, non-long-string column.
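To make the three knobs concrete, here is a small sketch of how they map to the writer builder in parquet-java (using ExampleParquetWriter purely for illustration; whether the value-count threshold is exposed as a builder method depends on the parquet-java version, so treat that part as an assumption to verify):

    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.example.data.Group;
    import org.apache.parquet.example.data.simple.SimpleGroupFactory;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.example.ExampleParquetWriter;
    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.MessageTypeParser;

    public class PageLimitSketch {
        public static void main(String[] args) throws Exception {
            MessageType schema = MessageTypeParser.parseMessageType(
                    "message example { required int32 x; }");

            try (ParquetWriter<Group> writer = ExampleParquetWriter
                    .builder(new Path("/tmp/pages.parquet"))
                    .withType(schema)
                    .withPageSize(1024 * 1024)      // size limit, DEFAULT_PAGE_SIZE = 1 MiB
                    .withPageRowCountLimit(20_000)  // row count limit, DEFAULT_PAGE_ROW_COUNT_LIMIT
                    // The third knob, DEFAULT_PAGE_VALUE_COUNT_THRESHOLD, is checked in
                    // ColumnWriteStoreBase as well; its builder exposure varies by version.
                    .build()) {
                SimpleGroupFactory groups = new SimpleGroupFactory(schema);
                for (int i = 0; i < 100_000; i++) {
                    writer.write(groups.newGroup().append("x", i)); // ~5 pages per column chunk
                }
            }
        }
    }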
Rather, the pages will actually be quite small due to the 20000 row limit, e.g., in PLAIN encoding, a page without any R and D levels would be 80 kB for 4-byte values and 160 kB for 8-byte values. And this is *before* applying compression. If your values compress very well, or if you use an encoding that is way smaller (e.g., dict), pages will be way smaller. E.g., say you only have 16 distinct values in the page, then dictionary encoding with 4-bit keys will be used, leading to a page of only 10 kB, even if there aren't any runs in it. As some data types compress very well (either due to RLE (dict keys), DELTA_*, or due to black-box compression applied on top), I have seen many pages < 1 kB in practice.

Cheers,
Jan

On Thu, 23 May 2024 at 15:05, Antoine Pitrou <anto...@python.org> wrote:

Speaking of which and responding to my own question, parquet-java also defaults to 1 MiB:
https://github.com/apache/parquet-java/blob/9b11410f15410b4d76d9f73f9545cf9110488517/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L49

Regards
Antoine.

On Thu, 23 May 2024 01:39:58 -1000, Jacques Nadeau <jacq...@apache.org> wrote:

I've found that a variable page size based on the expected number of columns read back is necessary, since you'll need read-back memory equal to the number of columns times the page size times the number of concurrently read files. So if one is reading back 1000 columns, one may need 1 GB+ of memory per file for reads. This resulted in sizing things down as width went up, to avoid spending excessive budget on read memory. This often resulted in pages closer to 64 kB - 128 kB. (In the work I did, we typically expected many files to be concurrently read across many requested ops.)

On Wed, May 22, 2024, 11:50 PM Andrew Lamb <andrewlamb11-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:

The Rust implementation uses 1 MB pages by default [1].

Andrew

[1]: https://github.com/apache/arrow-rs/blob/bd5d4a59db5d6d0e1b3bdf00644dbaf317f3be03/parquet/src/file/properties.rs#L28-L29

On Thu, May 23, 2024 at 4:10 AM Fokko Driesprong <fokko-1odqgaof3llqfi55v6+...@public.gmane.org> wrote:

Hey Antoine,

Thanks for raising this. In Iceberg we also use the 1 MiB page size:
https://github.com/apache/iceberg/blob/b3c25fb7608934d975a054b353823ca001ca3742/core/src/main/java/org/apache/iceberg/TableProperties.java#L133

Kind regards,
Fokko

On Thu, 23 May 2024 at 10:06, Antoine Pitrou <antoine-+zn9apsxkcednm+yrof...@public.gmane.org> wrote:

Hello,

The Parquet format itself (or at least the README) recommends an 8 kiB page size, suggesting that data pages are the unit of computation. However, Parquet C++ has long chosen a 1 MiB page size by default (*), suggesting that data pages are considered as the unit of IO there.

(*) even bumping it to 64 MiB at some point, perhaps by mistake:
https://github.com/apache/arrow/commit/4078b876e0cc7503f4da16693ce7901a6ae503d3

What are the typical choices in other writers?

Regards
Antoine.