Hi Alkis,

Fortunately, our data is such that compression of repetition levels will
always be highly effective and desirable, so it'll be sufficient to make a
global decision to always use v1 pages.

Thanks,

--phil
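
(For reference, a minimal sketch of that global decision in parquet-java, assuming
the example Group writer from parquet-hadoop; the actual write path here may
differ, and for Hadoop/Spark jobs the equivalent switch should be the
parquet.writer.version property.)

// Minimal sketch: pin the writer to v1 data pages so the repetition/definition
// levels are compressed together with the page body. Class names are from
// parquet-hadoop's example module and are illustrative; the key switch is
// WriterVersion.PARQUET_1_0.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ParquetProperties.WriterVersion;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;

public class V1PageWriter {
  public static ParquetWriter<Group> open(String file, MessageType schema) throws Exception {
    return ExampleParquetWriter.builder(new Path(file))
        .withConf(new Configuration())
        .withType(schema)                                // e.g. the spark_schema quoted below
        .withWriterVersion(WriterVersion.PARQUET_1_0)    // v1 data pages
        .withCompressionCodec(CompressionCodecName.ZSTD) // whole page, levels included, compressed
        .build();
  }
}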


On Wed, 9 Apr 2025 at 09:52, Alkis Evlogimenos
<alkis.evlogime...@databricks.com.invalid> wrote:

> Every v2-capable reader can decode v1 pages. With that in mind, could you
> try to compress the repetition levels and, if that is effective, emit a v1
> page, otherwise emit a v2 page? You would lose the ability to choose whether
> or not to compress the data, since v1 pages are either all compressed or all
> uncompressed.
>
> This is a bit hacky, but maybe it can address the issue for this particular
> case?
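
(A rough sketch of the per-page heuristic described above; preferV1 and the
threshold are hypothetical, not parquet-java API, and java.util.zip's Deflater
merely stands in for whatever codec the page writer actually uses.)

// Illustrative only: speculatively compress the repetition levels and prefer a
// v1 page when compression pays off, otherwise keep the v2 page with its
// uncompressed, directly readable levels.
import java.util.zip.Deflater;

final class PageFormatChooser {
  private static final double MIN_RATIO = 2.0; // assumed threshold, tune as needed

  /** Returns true if the page should be emitted as v1 (levels compress well). */
  static boolean preferV1(byte[] repetitionLevels) {
    Deflater deflater = new Deflater(Deflater.BEST_SPEED);
    deflater.setInput(repetitionLevels);
    deflater.finish();
    byte[] out = new byte[repetitionLevels.length + 64];
    int compressedSize = deflater.deflate(out);
    boolean complete = deflater.finished(); // false => levels are essentially incompressible
    deflater.end();
    return complete
        && compressedSize > 0
        && (double) repetitionLevels.length / compressedSize >= MIN_RATIO;
  }
}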
>
>
> On Wed, Apr 9, 2025 at 6:30 PM Phil Langdale <plangd...@roblox.com.invalid>
> wrote:
>
> > Hi Jan,
> >
> > Thanks for the details! I'm comfortable continuing with v1 pages.
> >
> > --phil
> >
> >
> > On Tue, 8 Apr 2025 at 03:09, Jan Finis <jpfi...@gmail.com> wrote:
> >
> > > Hi Phil,
> > >
> > > I can just make an educated guess here:
> > > I would think that in the design of v2 pages it was indeed expected that
> > > RLE would compress the levels well enough, and that the fact that no
> > > decompression is needed would make certain queries faster. That's why v2
> > > doesn't compress the levels.
> > >
> > > But you are of course right that there are pathological cases, which are
> > > actually not too uncommon, where RLE doesn't compress well while a codec
> > > like ZSTD would compress very well. So one could argue that v2 never
> > > compressing the levels is not a good design for this case; it would be
> > > better if v2 could decide to compress the levels via a flag (there is a
> > > flag indicating whether the data is compressed, but sadly none for the
> > > levels). But v2 is what it is and therefore doesn't compress levels. So I
> > > think your train of thought is correct and valid: for your specific use
> > > case, v1 compresses far better than v2 and might therefore be superior.
> > >
> > > Note, though, that v1 is actually the default for most writers, so using
> > > v2 in the first place is still somewhat of a niche thing. It's not only
> > > "that's why v1 pages are still supported" but rather "there is v2, but
> > > most people write v1 anyway". So there is nothing wrong with writing v1
> > > pages in your use case; they won't go away anytime soon and will likely
> > > be supported by readers forever.
> > >
> > > Cheers,
> > > Jan
> > >
> > > On Mon, 7 Apr 2025 at 19:16, Phil Langdale <plangd...@roblox.com.invalid>
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > This is the first time I've had to look deeply into Parquet internals,
> > > > so let me apologise if this has been discussed elsewhere in the past.
> > > > I've tried to do my due diligence in terms of searching online, but I
> > > > haven't found a clear answer.
> > > >
> > > > I'm currently trying to define a suitable Parquet schema for storing
> > > > Prometheus-originated time-series data, focusing on space efficiency. I
> > > > have what looks like a highly effective solution using V1 data pages,
> > > > but with V2, the lack of compression of repetition levels results in a
> > > > massive loss of comparative efficiency. I'm trying to understand whether
> > > > this behaviour was considered in the original V2 design, and whether I'm
> > > > missing something in how I'm trying to use the format. Is continuing to
> > > > use V1 data pages the correct solution for my use-case?
> > > >
> > > > Here are my specifics:
> > > >
> > > > * I am exporting data from Prometheus that is already well sorted. If
> > > > it wasn't, I would do the sorting myself. This ensures that metrics are
> > > > initially sorted by name, then by the set of labels, then by timestamp.
> > > > This should lead to best-case data for encoding and compression, and my
> > > > results support this.
> > > >
> > > > I have the following schema:
> > > >
> > > > message spark_schema {
> > > >   required binary metric_name (STRING);
> > > >   required group labels (MAP) {
> > > >     repeated group key_value {
> > > >       required binary key (STRING);
> > > >       optional binary value (STRING);
> > > >     }
> > > >   }
> > > >   required double value;
> > > >   required int64 timestamp (TIMESTAMP(MILLIS,true));
> > > > }
> > > >
> > > > and I'm explicitly forcing the use of DELTA_BINARY_PACKED for the
> > > > timestamps and BYTE_STREAM_SPLIT for the double values.
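
(As a sketch of where those knobs live in parquet-java's ParquetProperties
builder; withByteStreamSplitEncoding is assumed to exist in the release in use,
and forcing DELTA_BINARY_PACKED for a single column under the v1 writer
typically needs a custom ValuesWriterFactory, which is only hinted at here, not
implemented.)

// Sketch only: encoding-related writer properties. The ForceDeltaTimestampsFactory
// below is a hypothetical placeholder for a custom ValuesWriterFactory.
import org.apache.parquet.column.ParquetProperties;
import org.apache.parquet.column.ParquetProperties.WriterVersion;

public class EncodingProps {
  public static ParquetProperties props() {
    return ParquetProperties.builder()
        .withWriterVersion(WriterVersion.PARQUET_1_0) // v1 pages, as discussed in this thread
        .withByteStreamSplitEncoding(true)            // BYTE_STREAM_SPLIT for FLOAT/DOUBLE columns
        // .withValuesWriterFactory(new ForceDeltaTimestampsFactory()) // hypothetical
        .build();
  }
}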
> > > >
> > > > For metric names and label key/values, I'm using normal dictionary
> > > > encoding, and the cardinality of these is low in the sample data I'm
> > > > working with. So far so good. In terms of labels, each sample has a few
> > > > tens of labels (e.g. 26 in one of the worked examples below). Due to the
> > > > sorting, each data page will typically be made up of rows where the set
> > > > of labels is identical for every row. This means that the repetition
> > > > level sequence for each row will also look identical. And so, although
> > > > RLE is in use and the repetition levels for a given row take up only 4
> > > > bytes, this 4-byte sequence is then repeated ~3000 times. With v1 data
> > > > pages, the whole-page compression naturally handles this incredibly
> > > > well, as it's a best-case scenario. But with v2 data pages, this block
> > > > is left uncompressed and ends up being the largest contributor to the
> > > > final file size, leading to files that are 15x bigger than with v1
> > > > pages.
> > > >
> > > > Here is some data.
> > > >
> > > > With v1 pages
> > > > -------------------
> > > >
> > > > Meta:
> > > >
> > > > Row group 0:  count: 2741760  0.58 B records  start: 4
> > > > total(compressed): 1.526 MB  total(uncompressed): 126.067 MB
> > > > ----------------------------------------------------------------------
> > > >                         type      encodings count     avg size   nulls
> > > > metric_name             BINARY    Z _ R     2741760   0.00 B     0
> > > > labels.key_value.key    BINARY    Z _ R     57358080  0.00 B     0
> > > > labels.key_value.value  BINARY    Z _ R     57358080  0.01 B     0
> > > > value                   DOUBLE    Z         2741760   0.32 B     0
> > > > timestamp               INT64     Z   D     2741760   0.03 B     0
> > > >
> > > > and some debug data for a page:
> > > >
> > > > ColumnChunkPageWriteStore: writePageV1: compressor: ZSTD, row count: 3787,
> > > > uncompressed size: 69016, compressed size: 267
> > > >
> > > > With v2 pages
> > > > -------------------
> > > >
> > > > Meta:
> > > >
> > > > Row group 0:  count: 2741760  8.58 B records  start: 4
> > > > total(compressed): 22.427 MB  total(uncompressed): 126.167 MB
> > > > ----------------------------------------------------------------------
> > > >                         type      encodings count     avg size   nulls
> > > > metric_name             BINARY    Z _ R     2741760   0.00 B     0
> > > > labels.key_value.key    BINARY    Z _ R     57358080  0.20 B     0
> > > > labels.key_value.value  BINARY    Z _ R     57358080  0.20 B     0
> > > > value                   DOUBLE    Z         2741760   0.32 B     0
> > > > timestamp               INT64     Z   D     2741760   0.03 B     0
> > > >
> > > > and
> > > >
> > > > ColumnChunkPageWriteStore: writePageV2: compressor: ZSTD, row count: 3773,
> > > > uncompressed size: 51612, compressed size: 264, repetitionLevels size 15092,
> > > > definitionLevels size 4
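
(Cross-checking the numbers above with some simple arithmetic, assuming roughly
4 bytes of RLE-encoded repetition levels per row as described earlier in the
thread:)

// Back-of-the-envelope check, no Parquet API involved: ~4 bytes of RLE runs per
// row (one short run for the leading 0, one for the ~25 trailing 1s) across the
// 3773 rows of this page accounts for the entire repetitionLevels block.
public class RepLevelSizeCheck {
  public static void main(String[] args) {
    int bytesPerRow = 4;    // RLE-encoded repetition levels per row
    int rowsInPage = 3773;  // row count from the writePageV2 line above
    System.out.println(bytesPerRow * rowsInPage); // prints 15092, the reported size
  }
}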
> > > >
> > > > So we can see that all the extra space is due to the uncompressed
> > > > repetition levels. Is this use-case considered pathological? I'm not
> > > > sure how it would be, but maybe there's something else that will trip
> > > > me up down the line that you can tell me about. Similarly, maybe I'll
> > > > discover that the decompression overhead of v1 is so painful that this
> > > > is unusable.
> > > >
> > > > In the end, is this simply a case of "that's why v1 pages are still
> > > > supported" and I move ahead with that, or should it be possible for me
> > > > to use v2 pages and something else is going wrong?
> > > >
> > > > Thank you for your insights!
> > > >
> > > > --phil
> > > >
> > >
> >
>
