Hi everyone,

This is the first time I've had to look deeply into Parquet internals, so apologies if this has been discussed elsewhere in the past. I've tried to do my due diligence in searching online, but I haven't found a clear answer.
I'm currently trying to define a suitable Parquet schema for storing Prometheus-originated time-series data, with a focus on space efficiency. I have what looks like a highly effective solution using V1 data pages, but with V2, the lack of compression of repetition levels results in a massive loss of comparative efficiency. I'm trying to understand whether this behaviour was considered in the original V2 design, and whether I'm missing something in how I'm using the format. Is continuing to use V1 data pages the correct solution for my use-case?

Here are my specifics:

* I am exporting data from Prometheus that is already well sorted (if it weren't, I would do the sorting myself). Metrics are sorted first by name, then by the set of labels, then by timestamp. This should produce best-case data for encoding and compression, and my results support this.

* I have the following schema:

message spark_schema {
  required binary metric_name (STRING);
  required group labels (MAP) {
    repeated group key_value {
      required binary key (STRING);
      optional binary value (STRING);
    }
  }
  required double value;
  required int64 timestamp (TIMESTAMP(MILLIS,true));
}

* I'm explicitly forcing DELTA_BINARY_PACKED for the timestamps and BYTE_STREAM_SPLIT for the double values. For metric names and label keys/values I'm using normal dictionary encoding; the cardinality of these is low in the sample data I'm working with.

So far so good. In terms of labels, each sample has a few tens of labels (e.g. 26 in one of the worked examples below). Due to the sorting, each data page will typically consist of rows where the set of labels is identical for every row, which means the repetition-level sequence for each row is also identical. So although RLE is in use, the repetition levels for a given row still take up 4 bytes, and that 4-byte sequence is then repeated ~3000 times within a page. With v1 data pages, whole-page compression naturally handles this incredibly well, as it's a best-case scenario. But with v2 data pages, this block is left uncompressed and ends up being the largest contributor to the final file size, leading to files that are 15x bigger than with v1 pages.
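For context, here is roughly how I'm constructing the writer. This is a simplified sketch using parquet-java's example Group API rather than my actual code; the class name, output path and the sample record are illustrative placeholders, and I've left out how I force DELTA_BINARY_PACKED onto the timestamp column:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ParquetProperties.WriterVersion;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class WriteSamples {

  // Same schema as above, parsed from its string form.
  private static final MessageType SCHEMA = MessageTypeParser.parseMessageType(
      "message spark_schema {\n"
      + "  required binary metric_name (STRING);\n"
      + "  required group labels (MAP) {\n"
      + "    repeated group key_value {\n"
      + "      required binary key (STRING);\n"
      + "      optional binary value (STRING);\n"
      + "    }\n"
      + "  }\n"
      + "  required double value;\n"
      + "  required int64 timestamp (TIMESTAMP(MILLIS,true));\n"
      + "}");

  public static void main(String[] args) throws Exception {
    try (ParquetWriter<Group> writer = ExampleParquetWriter
        .builder(new Path("/tmp/samples.parquet"))     // placeholder output path
        .withConf(new Configuration())
        .withType(SCHEMA)
        .withWriterVersion(WriterVersion.PARQUET_1_0)  // v1 data pages; PARQUET_2_0 for v2
        .withCompressionCodec(CompressionCodecName.ZSTD)
        .withDictionaryEncoding(true)                  // metric names and label keys/values
        .withByteStreamSplitEncoding(true)             // doubles; needs a recent parquet-java
        .build()) {

      // One illustrative record; the real data arrives pre-sorted from the exporter.
      SimpleGroupFactory factory = new SimpleGroupFactory(SCHEMA);
      Group sample = factory.newGroup()
          .append("metric_name", "up")
          .append("value", 1.0d)
          .append("timestamp", System.currentTimeMillis());
      Group kv = sample.addGroup("labels").addGroup("key_value");
      kv.append("key", "job");
      kv.append("value", "prometheus");
      writer.write(sample);
    }
  }
}

In the sketch, flipping withWriterVersion between PARQUET_1_0 and PARQUET_2_0 is what selects v1 vs v2 data pages.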
Here is some data.

With v1 pages
-------------------

Meta:

Row group 0:  count: 2741760  0.58 B records  start: 4  total(compressed): 1.526 MB total(uncompressed):126.067 MB
----------------------------------------------------------------------
                        type    encodings  count     avg size  nulls
metric_name             BINARY  Z _ R      2741760   0.00 B    0
labels.key_value.key    BINARY  Z _ R      57358080  0.00 B    0
labels.key_value.value  BINARY  Z _ R      57358080  0.01 B    0
value                   DOUBLE  Z          2741760   0.32 B    0
timestamp               INT64   Z D        2741760   0.03 B    0

and some debug data for a page:

ColumnChunkPageWriteStore: writePageV1: compressor: ZSTD, row count: 3787, uncompressed size: 69016, compressed size: 267

With v2 pages
-------------------

Meta:

Row group 0:  count: 2741760  8.58 B records  start: 4  total(compressed): 22.427 MB total(uncompressed):126.167 MB
----------------------------------------------------------------------
                        type    encodings  count     avg size  nulls
metric_name             BINARY  Z _ R      2741760   0.00 B    0
labels.key_value.key    BINARY  Z _ R      57358080  0.20 B    0
labels.key_value.value  BINARY  Z _ R      57358080  0.20 B    0
value                   DOUBLE  Z          2741760   0.32 B    0
timestamp               INT64   Z D        2741760   0.03 B    0

and

ColumnChunkPageWriteStore: writePageV2: compressor: ZSTD, row count: 3773, uncompressed size: 51612, compressed size: 264, repetitionLevels size 15092, definitionLevels size 4

So we can see that all the extra space is due to the uncompressed repetition levels: 3773 rows x 4 bytes per row = 15092 bytes in that page, sitting outside the compressed block.

Is this use-case considered pathological? I'm not sure how it would be, but maybe there's something else that will trip me up down the line that you can tell me about. Similarly, maybe I'll discover that the decompression overhead of v1 pages is so painful that this approach is unusable. In the end, is this simply a case of "that's why v1 pages are still supported", and I should move ahead with that, or should it be possible for me to use v2 pages and something else is going wrong?

Thank you for your insights!

--phil