Hi Jan,

Thanks for the details! I'm comfortable continuing with v1 pages.
--phil

On Tue, 8 Apr 2025 at 03:09, Jan Finis <jpfi...@gmail.com> wrote:

> Hi Phil,
>
> I can just make an educated guess here:
> I would think that in the design of v2 pages, it was indeed expected that
> RLE compresses the levels well enough on its own, and that the fact that
> no decompression is needed would result in certain queries being faster.
> That's why v2 doesn't compress the levels.
>
> But you are of course right that there are pathological cases, which are
> actually not too uncommon, where RLE doesn't compress well while a
> compression like ZSTD would compress very well. So, one could argue that
> v2 never compressing levels is actually not a good design for this case;
> it would rather be good if v2 could decide to compress levels with a flag
> (there is a flag indicating whether the data is compressed, but sadly
> none for the levels). But v2 is what it is and therefore doesn't compress
> levels. So I think your train of thought is correct and valid: for your
> specific use case, v1 compresses way better than v2 and might therefore
> be superior.
>
> Note though that v1 is actually the default for most writers, so using v2
> in the first place is still somewhat a niche thing. So, it's not only
> "that's why v1 pages are still supported" but rather "there is v2, but
> most people write v1 anyway". So there is nothing wrong with writing v1
> pages in your use case; they won't go away anytime soon and will likely
> be supported by readers forever.
>
> Cheers,
> Jan
>
> On Mon, 7 Apr 2025 at 19:16, Phil Langdale <plangd...@roblox.com.invalid>
> wrote:
>
> > Hi everyone,
> >
> > This is the first time I've had to look deeply into Parquet internals,
> > so let me apologise if this has been discussed elsewhere in the past.
> > I've tried to do my due diligence in terms of searching online, but I
> > haven't found a clear answer.
> >
> > I'm currently trying to define a suitable Parquet schema for storing
> > Prometheus-originated time-series data, focusing on space efficiency.
> > I have what looks like a highly effective solution using V1 data pages,
> > but with V2, the lack of compression of repetition levels results in a
> > massive loss of comparative efficiency. I'm trying to understand
> > whether this behaviour was considered in the original V2 design, and
> > whether I'm missing something in how I'm trying to use the format. Is
> > continuing to use V1 data pages the correct solution for my use-case?
> >
> > Here are my specifics:
> >
> > * I am exporting data from Prometheus that is already well sorted. If
> >   it wasn't, I would do the sorting myself. This ensures that metrics
> >   are sorted first by name, then by the set of labels, then by
> >   timestamp. This should lead to best-case data for encoding and
> >   compression, and my results support this.
> >
> > I have the following schema:
> >
> > message spark_schema {
> >   required binary metric_name (STRING);
> >   required group labels (MAP) {
> >     repeated group key_value {
> >       required binary key (STRING);
> >       optional binary value (STRING);
> >     }
> >   }
> >   required double value;
> >   required int64 timestamp (TIMESTAMP(MILLIS,true));
> > }
> >
> > and I'm explicitly forcing the use of DELTA_BINARY_PACKED for
> > timestamps and BYTE_STREAM_SPLIT for the double values.
> >
> > For metric names and label key/values, I'm using normal dictionary
> > encoding, and the cardinality of these is low in the sample data I'm
> > working with. So far so good.
> >
> > In terms of labels, each sample has a few 10s of labels (e.g. 26 in one
> > of the worked examples below). Due to the sorting, each data page will
> > typically be made up of rows where the set of labels is identical for
> > every row. This means that the repetition level sequence for each row
> > will also look identical. And so, although RLE is in use and leads to
> > the repetitionLevels for a given row taking up 4 bytes, this 4-byte
> > sequence is then repeated ~3000 times. With v1 data pages, the
> > whole-page compression will naturally handle this incredibly well, as
> > it's a best-case scenario. But with v2 data pages, this block is left
> > uncompressed and ends up being the largest contributor to the final
> > file size, leading to files that are 15x bigger than with v1 pages.
> >
> > Here is some data.
> >
> > With v1 pages
> > -------------------
> >
> > Meta:
> >
> > Row group 0:  count: 2741760  0.58 B records  start: 4
> >   total(compressed): 1.526 MB  total(uncompressed): 126.067 MB
> > ----------------------------------------------------------------------
> >                         type    encodings  count     avg size  nulls
> > metric_name             BINARY  Z _ R      2741760   0.00 B    0
> > labels.key_value.key    BINARY  Z _ R      57358080  0.00 B    0
> > labels.key_value.value  BINARY  Z _ R      57358080  0.01 B    0
> > value                   DOUBLE  Z          2741760   0.32 B    0
> > timestamp               INT64   Z D        2741760   0.03 B    0
> >
> > and some debug data for a page:
> >
> > ColumnChunkPageWriteStore: writePageV1: compressor: ZSTD,
> >   row count: 3787, uncompressed size: 69016, compressed size: 267
> >
> > With v2 pages
> > -------------------
> >
> > Meta:
> >
> > Row group 0:  count: 2741760  8.58 B records  start: 4
> >   total(compressed): 22.427 MB  total(uncompressed): 126.167 MB
> > ----------------------------------------------------------------------
> >                         type    encodings  count     avg size  nulls
> > metric_name             BINARY  Z _ R      2741760   0.00 B    0
> > labels.key_value.key    BINARY  Z _ R      57358080  0.20 B    0
> > labels.key_value.value  BINARY  Z _ R      57358080  0.20 B    0
> > value                   DOUBLE  Z          2741760   0.32 B    0
> > timestamp               INT64   Z D        2741760   0.03 B    0
> >
> > and
> >
> > ColumnChunkPageWriteStore: writePageV2: compressor: ZSTD,
> >   row count: 3773, uncompressed size: 51612, compressed size: 264,
> >   repetitionLevels size 15092, definitionLevels size 4
> >
> > So we can see that all the extra space is due to uncompressed
> > repetition levels. Is this use-case considered pathological? I'm not
> > sure how, but maybe there's something else that will trip me up down
> > the line that you can tell me about. Similarly, maybe I'll discover
> > that the decompression overhead of v1 is so painful that this is
> > unusable.
> >
> > In the end, is this simply a case of "that's why v1 pages are still
> > supported" and I move ahead with that, or should it be possible for me
> > to use v2 pages, and something else is going wrong?
> >
> > Thank you for your insights!
> >
> > --phil
> >
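
P.S. In case it's useful for anyone who wants to poke at this behaviour
without my Prometheus export, here is a rough, untested pyarrow sketch that
writes a small map-heavy table twice, once with v1 data pages and once with
v2, and prints the per-column chunk sizes from the footer. Everything in it
is illustrative: the synthetic data and names are made up, it is not the
parquet-java writer that produced the debug output above, and it uses
pyarrow's default encodings rather than forcing DELTA_BINARY_PACKED or
BYTE_STREAM_SPLIT, so only the relative gap on the label columns is
meaningful, not the absolute numbers.

import pyarrow as pa
import pyarrow.parquet as pq

N_ROWS = 5000
N_LABELS = 26

# Every row gets the identical set of 26 labels, mimicking the sorted
# export where all rows in a page share one label set.
labels = [[(f"label_{i}", f"value_{i}") for i in range(N_LABELS)]] * N_ROWS

table = pa.table({
    "metric_name": pa.array(["some_metric_total"] * N_ROWS),
    "labels": pa.array(labels, type=pa.map_(pa.string(), pa.string())),
    "value": pa.array([float(i) for i in range(N_ROWS)], type=pa.float64()),
    "timestamp": pa.array(range(N_ROWS), type=pa.timestamp("ms", tz="UTC")),
})

for page_version in ("1.0", "2.0"):
    path = f"levels_v{page_version}.parquet"
    pq.write_table(table, path, compression="zstd",
                   data_page_version=page_version)
    rg = pq.ParquetFile(path).metadata.row_group(0)
    print(f"data_page_version={page_version}")
    for i in range(rg.num_columns):
        col = rg.column(i)
        print(f"  {col.path_in_schema}: "
              f"compressed={col.total_compressed_size} "
              f"uncompressed={col.total_uncompressed_size}")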
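And an even rougher sketch of the repetition-level arithmetic itself. As
described above, the RLE/bit-packed repetition levels for one row come out
to about 4 bytes and that little unit repeats once per row in the page, so
the block is ~15 KB uncompressed while a general-purpose compressor reduces
it to almost nothing. The 4-byte pattern below is an arbitrary stand-in,
not Parquet's real level encoding, and zlib is used only as a stdlib
stand-in for ZSTD:

import zlib

ROWS_PER_PAGE = 3773  # row count from the writePageV2 debug line above
PER_ROW_LEVELS = bytes([0x19, 0x01, 0x03, 0x32])  # arbitrary 4-byte unit

page_rep_levels = PER_ROW_LEVELS * ROWS_PER_PAGE

print("raw repetition levels:", len(page_rep_levels), "bytes")  # 15092
print("zlib-compressed:", len(zlib.compress(page_rep_levels, 9)), "bytes")

# v1 pages get to compress this block together with the rest of the page;
# v2 pages store it as-is, which is the ~15 KB per page seen above.

For what it's worth, the v2 numbers above already tell the same story as a
back-of-the-envelope check: 15092 bytes of repetition levels for 3773 rows
is exactly 4 bytes per row, so roughly 4 B x 2741760 rows = ~11 MB of raw
levels per label leaf column. Both labels.key_value.key and
labels.key_value.value carry their own copy of those levels, which matches
the ~0.20 B average per map entry and accounts for essentially all of the
22.4 MB vs 1.5 MB difference between the two row groups.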