Every v2-capable reader can decode v1 pages. With that in mind, could you
try compressing the repetition levels and, if that is effective, emit a v1
page, otherwise a v2 page? You would lose the ability to choose whether or
not to compress the data independently of the levels, since a v1 page is
either compressed as a whole or not at all.

This is a bit hacky, but maybe it can address the issue for this particular
case?
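
A rough sketch of that per-page decision is below. The helper and writer
method names are made up for illustration and are not actual parquet-java
APIs:

// Per page: check whether the block codec actually helps the repetition levels.
byte[] repLevels = rleEncodedRepetitionLevels();    // hypothetical helper
byte[] defLevels = rleEncodedDefinitionLevels();    // hypothetical helper
int compressedRepSize = pageCodec.compress(repLevels).length;  // hypothetical codec wrapper

if (compressedRepSize < repLevels.length / 2) {
    // Levels compress well: fall back to a v1 page, where levels and data
    // are compressed together as a single block.
    emitV1Page(values, repLevels, defLevels);       // hypothetical emitter
} else {
    // Levels are already compact: keep v2 and leave them uncompressed.
    emitV2Page(values, repLevels, defLevels);       // hypothetical emitter
}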


On Wed, Apr 9, 2025 at 6:30 PM Phil Langdale <plangd...@roblox.com.invalid>
wrote:

> Hi Jan,
>
> Thanks for the details! I'm comfortable continuing with v1 pages.
>
> --phil
>
>
> On Tue, 8 Apr 2025 at 03:09, Jan Finis <jpfi...@gmail.com> wrote:
>
> > Hi Phil,
> >
> > I can only make an educated guess here:
> > I would think that the design of v2 pages did indeed assume that RLE
> > compresses the levels well enough on its own, and that the fact that no
> > decompression is needed would make certain queries faster. That's why v2
> > doesn't compress the levels.
> >
> > But you are of course right that there are pathological cases, which are
> > actually not too uncommon, where RLE doesn't compress well while a codec
> > like ZSTD would compress very well. So one could argue that never
> > compressing levels is not a good design choice in v2 for this case; it
> > would have been better if v2 could opt into compressing the levels via a
> > flag (there is a flag saying whether the data is compressed, but sadly
> > none for the levels). But v2 is what it is and therefore doesn't compress
> > levels. So I think your train of thought is correct and valid: for your
> > specific use case, v1 compresses far better than v2 and might therefore
> > be the better choice.
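> >
> > For reference, here is a conceptual sketch of the v2 data page header as
> > described in the parquet-format spec, with field names paraphrased (this
> > is not the generated thrift class, and the encoding and statistics fields
> > are omitted); it shows where that flag lives and what is missing for the
> > levels:
> >
> > // Paraphrase of DataPageHeaderV2 from the parquet-format spec,
> > // for illustration only.
> > final class DataPageV2HeaderSketch {
> >     int numValues;
> >     int numNulls;
> >     int numRows;
> >     int definitionLevelsByteLength;  // level bytes are stored uncompressed
> >     int repetitionLevelsByteLength;  // level bytes are stored uncompressed
> >     boolean isCompressed;            // applies to the data section only;
> >                                      // there is no equivalent flag for levels
> > }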
> >
> > Note, though, that v1 is still the default for most writers, so using v2
> > in the first place is somewhat of a niche choice. It's not so much a case
> > of "that's why v1 pages are still supported" as of "there is v2, but most
> > people write v1 anyway". So there is nothing wrong with writing v1 pages
> > in your use case; they won't go away anytime soon and will likely be
> > supported by readers forever.
> >
> > Cheers,
> > Jan
> >
> > On Mon, 7 Apr 2025 at 19:16, Phil Langdale
> > <plangd...@roblox.com.invalid> wrote:
> >
> > > Hi everyone,
> > >
> > > This is the first time I've had to look deeply into Parquet internals, so
> > > let me apologise if this has been discussed elsewhere in the past. I've
> > > tried to do my due diligence in terms of searching online, but I haven't
> > > found a clear answer.
> > >
> > > I'm currently trying to define a suitable Parquet schema for storing
> > > Prometheus-originated time-series data, focusing on space efficiency. I
> > > have what looks like a highly effective solution using V1 data pages, but
> > > with V2, the lack of compression of repetition levels results in a massive
> > > loss of comparative efficiency. I'm trying to understand whether this
> > > behaviour was considered in the original V2 design, and whether I'm
> > > missing something in how I'm trying to use the format. Is continuing to
> > > use V1 data pages the correct solution for my use case?
> > >
> > > Here are my specifics:
> > >
> > > * I am exporting data from Prometheus that is already well sorted. If it
> > > wasn't, I would do the sorting myself. This ensures that metrics are
> > > sorted first by name, then by the set of labels, then by timestamp. This
> > > should lead to best-case data for encoding and compression, and my
> > > results support this.
> > >
> > > I have the following schema:
> > >
> > > message spark_schema {
> > >   required binary metric_name (STRING);
> > >   required group labels (MAP) {
> > >     repeated group key_value {
> > >       required binary key (STRING);
> > >       optional binary value (STRING);
> > >     }
> > >   }
> > >   required double value;
> > >   required int64 timestamp (TIMESTAMP(MILLIS,true));
> > > }
> > >
> > > and I'm explicitly forcing the use of DELTA_BINARY_PACKED for timestamps
> > > and BYTE_STREAM_SPLIT for the double values.
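> > >
> > > (A minimal sketch of how that setup might look with parquet-java's
> > > ExampleParquetWriter, assuming the two runs compared below differ only in
> > > the writer version; the output path is hypothetical and the per-column
> > > encoding forcing is not shown here:)
> > >
> > > ParquetWriter<Group> writer = ExampleParquetWriter
> > >     .builder(new Path("/tmp/metrics.parquet"))       // hypothetical path
> > >     .withType(schema)  // MessageType parsed from the schema above,
> > >                        // e.g. via MessageTypeParser.parseMessageType(...)
> > >     .withCompressionCodec(CompressionCodecName.ZSTD)
> > >     .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_1_0)
> > >     // vs. ParquetProperties.WriterVersion.PARQUET_2_0 for the v2 run
> > >     .build();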
> > >
> > > For metric names and label keys/values, I'm using normal dictionary
> > > encoding, and the cardinality of these is low in the sample data I'm
> > > working with. So far so good. In terms of labels, each sample has a few
> > > tens of labels (e.g. 26 in one of the worked examples below). Due to the
> > > sorting, each data page will typically be made up of rows where the set
> > > of labels is identical for every row. This means that the repetition
> > > level sequence for each row will also look identical. And so, although
> > > RLE is in use and leads to the repetitionLevels for a given row taking up
> > > only 4 bytes, this 4-byte sequence is then repeated ~3000 times within a
> > > page. With v1 data pages, the whole-page compression naturally handles
> > > this incredibly well, as it's a best-case scenario. But with v2 data
> > > pages, this block is left uncompressed and ends up being the largest
> > > contributor to the final file size, leading to files that are 15x bigger
> > > than with v1 pages.
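> > >
> > > (As a quick, self-contained illustration of that last point, using only
> > > the JDK, with made-up bytes standing in for one row's RLE-encoded
> > > repetition levels and DEFLATE standing in for the page codec:)
> > >
> > > import java.util.zip.Deflater;
> > >
> > > public class RepLevelBlowupDemo {
> > >     public static void main(String[] args) {
> > >         // ~4 bytes per row, repeated once per row in the page; these byte
> > >         // values are illustrative, not real RLE output.
> > >         byte[] perRow = {0x00, 0x03, 0x32, 0x01};
> > >         int rows = 3773;
> > >         byte[] levels = new byte[perRow.length * rows];
> > >         for (int i = 0; i < rows; i++) {
> > >             System.arraycopy(perRow, 0, levels, i * perRow.length, perRow.length);
> > >         }
> > >         Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
> > >         deflater.setInput(levels);
> > >         deflater.finish();
> > >         byte[] out = new byte[levels.length + 64];
> > >         int compressedLen = deflater.deflate(out);
> > >         deflater.end();
> > >         // Prints roughly "15092 -> a few dozen bytes": v1 gets this saving
> > >         // for free via whole-page compression; v2 stores the block as-is.
> > >         System.out.println(levels.length + " -> " + compressedLen + " bytes");
> > >     }
> > > }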
> > >
> > > Here is some data.
> > >
> > > With v1 pages
> > > -------------------
> > >
> > > Meta:
> > >
> > > Row group 0:  count: 2741760  0.58 B records  start: 4
> > > total(compressed): 1.526 MB  total(uncompressed): 126.067 MB
> > > ----------------------------------------------------------------------
> > >                         type      encodings count     avg size   nulls
> > > metric_name             BINARY    Z _ R     2741760   0.00 B     0
> > > labels.key_value.key    BINARY    Z _ R     57358080  0.00 B     0
> > > labels.key_value.value  BINARY    Z _ R     57358080  0.01 B     0
> > > value                   DOUBLE    Z         2741760   0.32 B     0
> > > timestamp               INT64     Z   D     2741760   0.03 B     0
> > >
> > > and some debug data for a page:
> > >
> > > ColumnChunkPageWriteStore: writePageV1: compressor: ZSTD, row count: 3787,
> > > uncompressed size: 69016, compressed size: 267
> > >
> > > With v2 pages
> > > -------------------
> > >
> > > Meta:
> > >
> > > Row group 0:  count: 2741760  8.58 B records  start: 4
> > > total(compressed): 22.427 MB  total(uncompressed): 126.167 MB
> > > ----------------------------------------------------------------------
> > >                         type      encodings count     avg size   nulls
> > > metric_name             BINARY    Z _ R     2741760   0.00 B     0
> > > labels.key_value.key    BINARY    Z _ R     57358080  0.20 B     0
> > > labels.key_value.value  BINARY    Z _ R     57358080  0.20 B     0
> > > value                   DOUBLE    Z         2741760   0.32 B     0
> > > timestamp               INT64     Z   D     2741760   0.03 B     0
> > >
> > > and
> > >
> > > ColumnChunkPageWriteStore: writePageV2: compressor: ZSTD, row count: 3773,
> > > uncompressed size: 51612, compressed size: 264, repetitionLevels size 15092,
> > > definitionLevels size 4
> > >
> > > So we can see that all the extra space is due to the uncompressed
> > > repetition levels (3773 rows x 4 bytes = 15092 bytes, which is exactly the
> > > reported repetitionLevels size). Is this use-case considered pathological?
> > > I'm not sure how it would be, but maybe there's something else that will
> > > trip me up down the line that you can tell me about. Similarly, maybe I'll
> > > discover that the decompression overhead of v1 is so painful that this is
> > > unusable.
> > >
> > > In the end, is this simply a case of "that's why v1 pages are still
> > > supported" and I should move ahead with that, or should it be possible
> > > for me to use v2 pages, and something else is going wrong?
> > >
> > > Thank you for your insights!
> > >
> > > --phil
> > >
> >
>
