Hi everyone,

This is the first time I've had to look deeply into Parquet internals, so
apologies if this has been discussed before. I've tried to do my due
diligence in searching online, but I haven't found a clear answer.

I'm currently trying to define a suitable Parquet schema for storing
Prometheus-originated time-series data, with a focus on space efficiency.
I have what looks like a highly effective solution using V1 data pages,
but with V2 the lack of compression of repetition levels results in a
massive loss of efficiency by comparison. I'm trying to understand whether
this behaviour was considered in the original V2 design, and whether I'm
missing something in how I'm using the format. Is sticking with V1 data
pages the correct solution for my use case?

Here are my specifics:

* I am exporting data from Prometheus that is already well sorted. If it
wasn't, I would do the sorting myself. This ensures that samples are
sorted first by metric name, then by the set of labels, then by timestamp
(roughly the ordering sketched just below). This should give best-case
data for encoding and compression, and my results support this.
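
To make the ordering concrete, here is a small sketch of what I mean (the
Sample record and the label canonicalisation are illustrative only, not my
actual exporter code):

import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustrative only -- my exporter works on Prometheus' own types.
record Sample(String metricName, Map<String, String> labels,
              long timestampMs, double value) {

    // Canonical form of the label set: keys sorted, joined as "k=v" pairs,
    // so identical label sets compare equal and similar ones end up adjacent.
    String labelsKey() {
        return new TreeMap<>(labels).entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(","));
    }
}

class ExportOrder {
    // Sort by metric name, then by the canonicalised label set, then by timestamp.
    static final Comparator<Sample> ORDER = Comparator
            .comparing(Sample::metricName)
            .thenComparing(Sample::labelsKey)
            .thenComparingLong(Sample::timestampMs);
}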

I have the following schema:

message spark_schema {
  required binary metric_name (STRING);
  required group labels (MAP) {
    repeated group key_value {
      required binary key (STRING);
      optional binary value (STRING);
    }
  }
  required double value;
  required int64 timestamp (TIMESTAMP(MILLIS,true));
}

and I'm explicitly forcing the use of DELTA_BINARY_PACKED for timestamps
and BYTE_STREAM_SPLIT for the double values.
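
For reference, my writer setup looks roughly like the sketch below. It is
not a verbatim copy of my code: I'm assuming parquet-java's example writer
API here, the withByteStreamSplitEncoding knob is the one I believe exists
in recent builder versions, and the tweak I use to force DELTA_BINARY_PACKED
on the timestamp column isn't shown.

import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ParquetProperties.WriterVersion;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class WriterSketch {
    public static ParquetWriter<Group> open(String file, boolean useV2Pages)
            throws Exception {
        MessageType schema = MessageTypeParser.parseMessageType(
            "message spark_schema {\n" +
            "  required binary metric_name (STRING);\n" +
            "  required group labels (MAP) {\n" +
            "    repeated group key_value {\n" +
            "      required binary key (STRING);\n" +
            "      optional binary value (STRING);\n" +
            "    }\n" +
            "  }\n" +
            "  required double value;\n" +
            "  required int64 timestamp (TIMESTAMP(MILLIS,true));\n" +
            "}");

        return ExampleParquetWriter.builder(new Path(file))
            .withType(schema)
            .withCompressionCodec(CompressionCodecName.ZSTD)
            // Toggling this between PARQUET_1_0 and PARQUET_2_0 is how I
            // compare v1 and v2 data pages.
            .withWriterVersion(useV2Pages ? WriterVersion.PARQUET_2_0
                                          : WriterVersion.PARQUET_1_0)
            // Dictionary encoding for the low-cardinality string columns.
            .withDictionaryEncoding(true)
            // BYTE_STREAM_SPLIT for the double values (assumed builder knob).
            .withByteStreamSplitEncoding(true)
            // (Not shown: forcing DELTA_BINARY_PACKED on the timestamp column.)
            .build();
    }
}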

For metric names and label keys/values, I'm using normal dictionary
encoding, and the cardinality of these is low in the sample data I'm
working with. So far so good. As for labels, each sample has a few tens of
them (e.g. 26 in one of the worked examples below). Because of the
sorting, each data page will typically be made up of rows where the set of
labels is identical for every row. This means that the repetition level
sequence for each row also looks identical. So although RLE is in use and
the repetition levels for a given row take up only 4 bytes, that 4-byte
sequence is then repeated ~3000 times. With v1 data pages, the whole-page
compression naturally handles this incredibly well, as it's a best-case
scenario. But with v2 data pages, this block is left uncompressed and ends
up being the largest contributor to the final file size, leading to files
that are ~15x bigger than with v1 pages.
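
To make that 4-bytes-per-row claim concrete, here is a back-of-envelope
model of the RLE/bit-packing hybrid size for one row's repetition levels.
It's written from my reading of the spec rather than using parquet-java's
internal encoder, and it assumes one run per stretch of identical values
(the real encoder may pick slightly different run boundaries), but the
totals come out the same as the debug output further down.

// Back-of-envelope model of the RLE/bit-packing hybrid size for the
// repetition levels of one row, then one page.
public class RepLevelSize {

    // Length in bytes of an unsigned value encoded as a ULEB128 varint.
    static int ulebLen(int v) {
        int n = 1;
        while ((v >>>= 7) != 0) n++;
        return n;
    }

    // Size of one RLE run: varint header (runLength << 1) plus the repeated
    // value, stored in ceil(bitWidth / 8) bytes.
    static int rleRunSize(int runLength, int bitWidth) {
        return ulebLen(runLength << 1) + (bitWidth + 7) / 8;
    }

    public static void main(String[] args) {
        int labelsPerRow = 26;   // from my sample data
        int rowsPerPage = 3773;  // from the v2 page in the debug output below
        int bitWidth = 1;        // max repetition level is 1 for this schema

        // Per row: one level 0 for the first key_value entry, then
        // (labelsPerRow - 1) levels of 1.
        int perRow = rleRunSize(1, bitWidth)
                   + rleRunSize(labelsPerRow - 1, bitWidth);
        System.out.println("per row:  " + perRow + " bytes");                // 4
        System.out.println("per page: " + perRow * rowsPerPage + " bytes");  // 15092
    }
}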

Here is some data.

With v1 pages
-------------------

Meta:

Row group 0:  count: 2741760  0.58 B records  start: 4  total(compressed):
1.526 MB total(uncompressed):126.067 MB
----------------------------------------------------------------------
                        type      encodings count     avg size   nulls
metric_name             BINARY    Z _ R     2741760   0.00 B     0
labels.key_value.key    BINARY    Z _ R     57358080  0.00 B     0
labels.key_value.value  BINARY    Z _ R     57358080  0.01 B     0
value                   DOUBLE    Z         2741760   0.32 B     0
timestamp               INT64     Z   D     2741760   0.03 B     0

and some debug data for a page:

ColumnChunkPageWriteStore: writePageV1: compressor: ZSTD, row count: 3787,
uncompressed size: 69016, compressed size: 267

With v2 pages
-------------------

Meta:

Row group 0:  count: 2741760  8.58 B records  start: 4  total(compressed):
22.427 MB total(uncompressed):126.167 MB
----------------------------------------------------------------------
                        type      encodings count     avg size   nulls
metric_name             BINARY    Z _ R     2741760   0.00 B     0
labels.key_value.key    BINARY    Z _ R     57358080  0.20 B     0
labels.key_value.value  BINARY    Z _ R     57358080  0.20 B     0
value                   DOUBLE    Z         2741760   0.32 B     0
timestamp               INT64     Z   D     2741760   0.03 B     0

and

ColumnChunkPageWriteStore: writePageV2: compressor: ZSTD, row count: 3773,
uncompressed size: 51612, compressed size: 264, repetitionLevels size
15092, definitionLevels size 4

So we can see that all the extra space is due to the uncompressed
repetition levels (3773 rows x 4 bytes = 15092 bytes, which matches the
repetitionLevels size above). Is this use case considered pathological?
I'm not sure how, but maybe there's something else that will trip me up
down the line that you can point out. Similarly, maybe I'll discover that
the decompression overhead of v1 pages is so painful that this approach is
unusable.

In the end, is this simply a case of "that's why v1 pages are still
supported", in which case I'll move ahead with that, or should it be
possible for me to use v2 pages and something else is going wrong?

Thank you for your insights!

--phil
