Hi Jan,

Thanks for the details! I'm comfortable continuing with v1 pages.
--phil

On Tue, 8 Apr 2025 at 03:09, Jan Finis <jpfi...@gmail.com> wrote:

> Hi Phil,
>
> I can just make an educated guess here:
> I would think that in the design of v2 pages, it was indeed expected that
> RLE compresses the levels well enough on its own, and that the fact that
> no decompression is needed would result in certain queries being faster.
> That's why v2 doesn't compress the levels.
>
> But you are of course right that there are pathological cases, which are
> actually not too uncommon, where RLE doesn't compress well while a
> compression like ZSTD would compress very well. So, one could argue that
> v2 never compressing levels is actually not a good design for this case;
> it would rather be good if v2 could decide to compress levels with a flag
> (there is a flag indicating whether the data is compressed, but sadly
> none for the levels). But v2 is what it is and therefore doesn't compress
> levels. So I think your train of thought is correct and valid: for your
> specific use case, v1 compresses way better than v2 and might therefore
> be superior.
>
> Note though that v1 is actually the default for most writers, so using v2
> in the first place is still somewhat a niche thing. So, it's not only
> "that's why v1 pages are still supported" but rather "there is v2, but
> most people write v1 anyway". So there is nothing wrong with writing v1
> pages in your use case; they won't go away anytime soon and will likely
> be supported by readers forever.
>
> Cheers,
> Jan
>
> On Mon, 7 Apr 2025 at 19:16, Phil Langdale <plangd...@roblox.com.invalid>
> wrote:
>
> > Hi everyone,
> >
> > This is the first time I've had to look deeply into Parquet internals,
> > so let me apologise if this has been discussed elsewhere in the past.
> > I've tried to do my due diligence in terms of searching online, but I
> > haven't found a clear answer.
> >
> > I'm currently trying to define a suitable Parquet schema for storing
> > Prometheus-originated time-series data, focusing on space efficiency.
> > I have what looks like a highly effective solution using V1 data pages,
> > but with V2, the lack of compression of repetition levels results in a
> > massive loss of comparative efficiency. I'm trying to understand
> > whether this behaviour was considered in the original V2 design, and
> > whether I'm missing something in how I'm trying to use the format. Is
> > continuing to use V1 data pages the correct solution for my use-case?
> >
> > Here are my specifics:
> >
> > * I am exporting data from Prometheus that is already well sorted. If
> >   it wasn't, I would do the sorting myself. This ensures that metrics
> >   are sorted first by name, then by the set of labels, then by
> >   timestamp. This should lead to best-case data for encoding and
> >   compression, and my results support this.
> >
> > I have the following schema:
> >
> > message spark_schema {
> >   required binary metric_name (STRING);
> >   required group labels (MAP) {
> >     repeated group key_value {
> >       required binary key (STRING);
> >       optional binary value (STRING);
> >     }
> >   }
> >   required double value;
> >   required int64 timestamp (TIMESTAMP(MILLIS,true));
> > }
> >
> > and I'm explicitly forcing the use of DELTA_BINARY_PACKED for
> > timestamps and BYTE_STREAM_SPLIT for the double values.
> >
> > For metric names and label key/values, I'm using normal dictionary
> > encoding, and the cardinality of these is low in the sample data I'm
> > working with. So far so good.
> >
> > In terms of labels, each sample has a few 10s of labels (e.g. 26 in one
> > of the worked examples below). Due to the sorting, each data page will
> > typically be made up of rows where the set of labels is identical for
> > every row. This means that the repetition level sequence for each row
> > will also look identical. And so, although RLE is in use and leads to
> > the repetitionLevels for a given row taking up 4 bytes, this 4-byte
> > sequence is then repeated ~3000 times. With v1 data pages, the
> > whole-page compression will naturally handle this incredibly well, as
> > it's a best-case scenario. But with v2 data pages, this block is left
> > uncompressed and ends up being the largest contributor to the final
> > file size, leading to files that are 15x bigger than with v1 pages.
> >
> > Here is some data.
> >
> > With v1 pages
> > -------------------
> >
> > Meta:
> >
> > Row group 0:  count: 2741760  0.58 B records  start: 4
> >   total(compressed): 1.526 MB  total(uncompressed): 126.067 MB
> > ----------------------------------------------------------------------
> >                         type    encodings  count     avg size  nulls
> > metric_name             BINARY  Z _ R      2741760   0.00 B    0
> > labels.key_value.key    BINARY  Z _ R      57358080  0.00 B    0
> > labels.key_value.value  BINARY  Z _ R      57358080  0.01 B    0
> > value                   DOUBLE  Z          2741760   0.32 B    0
> > timestamp               INT64   Z D        2741760   0.03 B    0
> >
> > and some debug data for a page:
> >
> > ColumnChunkPageWriteStore: writePageV1: compressor: ZSTD,
> >   row count: 3787, uncompressed size: 69016, compressed size: 267
> >
> > With v2 pages
> > -------------------
> >
> > Meta:
> >
> > Row group 0:  count: 2741760  8.58 B records  start: 4
> >   total(compressed): 22.427 MB  total(uncompressed): 126.167 MB
> > ----------------------------------------------------------------------
> >                         type    encodings  count     avg size  nulls
> > metric_name             BINARY  Z _ R      2741760   0.00 B    0
> > labels.key_value.key    BINARY  Z _ R      57358080  0.20 B    0
> > labels.key_value.value  BINARY  Z _ R      57358080  0.20 B    0
> > value                   DOUBLE  Z          2741760   0.32 B    0
> > timestamp               INT64   Z D        2741760   0.03 B    0
> >
> > and
> >
> > ColumnChunkPageWriteStore: writePageV2: compressor: ZSTD,
> >   row count: 3773, uncompressed size: 51612, compressed size: 264,
> >   repetitionLevels size 15092, definitionLevels size 4
> >
> > So we can see that all the extra space is due to uncompressed
> > repetition levels. Is this use-case considered pathological? I'm not
> > sure how, but maybe there's something else that will trip me up down
> > the line that you can tell me about. Similarly, maybe I'll discover
> > that the decompression overhead of v1 is so painful that this is
> > unusable.
> >
> > In the end, is this simply a case of "that's why v1 pages are still
> > supported" and I move ahead with that, or should it be possible for me
> > to use v2 pages, and something else is going wrong?
> >
> > Thank you for your insights!
> >
> > --phil
> >
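
P.S. In case it's useful for anyone who wants to poke at this behaviour
without my Prometheus export, here is a rough, untested pyarrow sketch that
writes a small map-heavy table twice, once with v1 data pages and once with
v2, and prints the per-column chunk sizes from the footer. Everything in it
is illustrative: the synthetic data and names are made up, it is not the
parquet-java writer that produced the debug output above, and it uses
pyarrow's default encodings rather than forcing DELTA_BINARY_PACKED or
BYTE_STREAM_SPLIT, so only the relative gap on the label columns is
meaningful, not the absolute numbers.

import pyarrow as pa
import pyarrow.parquet as pq

N_ROWS = 5000
N_LABELS = 26

# Every row gets the identical set of 26 labels, mimicking the sorted
# export where all rows in a page share one label set.
labels = [[(f"label_{i}", f"value_{i}") for i in range(N_LABELS)]] * N_ROWS

table = pa.table({
    "metric_name": pa.array(["some_metric_total"] * N_ROWS),
    "labels": pa.array(labels, type=pa.map_(pa.string(), pa.string())),
    "value": pa.array([float(i) for i in range(N_ROWS)], type=pa.float64()),
    "timestamp": pa.array(range(N_ROWS), type=pa.timestamp("ms", tz="UTC")),
})

for page_version in ("1.0", "2.0"):
    path = f"levels_v{page_version}.parquet"
    pq.write_table(table, path, compression="zstd",
                   data_page_version=page_version)
    rg = pq.ParquetFile(path).metadata.row_group(0)
    print(f"data_page_version={page_version}")
    for i in range(rg.num_columns):
        col = rg.column(i)
        print(f"  {col.path_in_schema}: "
              f"compressed={col.total_compressed_size} "
              f"uncompressed={col.total_uncompressed_size}")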
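And an even rougher sketch of the repetition-level arithmetic itself. As
described above, the RLE/bit-packed repetition levels for one row come out
to about 4 bytes and that little unit repeats once per row in the page, so
the block is ~15 KB uncompressed while a general-purpose compressor reduces
it to almost nothing. The 4-byte pattern below is an arbitrary stand-in,
not Parquet's real level encoding, and zlib is used only as a stdlib
stand-in for ZSTD:

import zlib

ROWS_PER_PAGE = 3773  # row count from the writePageV2 debug line above
PER_ROW_LEVELS = bytes([0x19, 0x01, 0x03, 0x32])  # arbitrary 4-byte unit

page_rep_levels = PER_ROW_LEVELS * ROWS_PER_PAGE

print("raw repetition levels:", len(page_rep_levels), "bytes")  # 15092
print("zlib-compressed:", len(zlib.compress(page_rep_levels, 9)), "bytes")

# v1 pages get to compress this block together with the rest of the page;
# v2 pages store it as-is, which is the ~15 KB per page seen above.

For what it's worth, the v2 numbers above already tell the same story as a
back-of-the-envelope check: 15092 bytes of repetition levels for 3773 rows
is exactly 4 bytes per row, so roughly 4 B x 2741760 rows = ~11 MB of raw
levels per label leaf column. Both labels.key_value.key and
labels.key_value.value carry their own copy of those levels, which matches
the ~0.20 B average per map entry and accounts for essentially all of the
22.4 MB vs 1.5 MB difference between the two row groups.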