Hi Alkis,

Fortunately, our data is such that compression of repetition levels will always be highly effective and desirable, so it'll be sufficient to make a global decision to always use v1 pages.
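For reference, this is roughly how the writer can be pinned to v1 pages with parquet-java's example Group writer. It's a sketch rather than our actual pipeline code: the output path, the sample row, and the single-label map entry are made up, and the exact builder methods may differ a little between parquet-java releases.

import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ParquetProperties.WriterVersion;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class WriteV1Pages {
    public static void main(String[] args) throws Exception {
        // Same schema as in the original mail below, parsed from its string form.
        MessageType schema = MessageTypeParser.parseMessageType(
            "message spark_schema {\n"
            + "  required binary metric_name (STRING);\n"
            + "  required group labels (MAP) {\n"
            + "    repeated group key_value {\n"
            + "      required binary key (STRING);\n"
            + "      optional binary value (STRING);\n"
            + "    }\n"
            + "  }\n"
            + "  required double value;\n"
            + "  required int64 timestamp (TIMESTAMP(MILLIS,true));\n"
            + "}");

        try (ParquetWriter<Group> writer = ExampleParquetWriter
                .builder(new Path("/tmp/metrics.parquet"))        // made-up output path
                .withType(schema)
                .withWriterVersion(WriterVersion.PARQUET_1_0)     // v1 data pages
                .withCompressionCodec(CompressionCodecName.ZSTD)  // whole page body gets compressed
                .withDictionaryEncoding(true)                     // names/labels are low cardinality
                .build()) {
            // One made-up sample, just to show the writer is usable as configured.
            SimpleGroupFactory factory = new SimpleGroupFactory(schema);
            Group row = factory.newGroup();
            row.append("metric_name", "http_requests_total");
            Group kv = row.addGroup("labels").addGroup("key_value");
            kv.append("key", "job");
            kv.append("value", "node");
            row.append("value", 1.0);
            row.append("timestamp", 1744200000000L);
            writer.write(row);
        }
    }
}

(When writing through Spark, I believe the same choice can be made via the "parquet.writer.version" Hadoop property, e.g. set to PARQUET_1_0, but don't quote me on the exact key and accepted values.)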
Thanks,
--phil

On Wed, 9 Apr 2025 at 09:52, Alkis Evlogimenos
<alkis.evlogime...@databricks.com.invalid> wrote:

> Every v2 capable reader can decode v1 pages. With that in mind, could you
> try to compress the repetition levels and, if that is effective, emit a v1
> page, otherwise a v2 page? You would lose the ability to choose whether or
> not to compress the data, since v1 pages are either all compressed or all
> uncompressed.
>
> This is a bit hacky, but maybe it can address the issue for this
> particular case?
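As a sketch, the per-page fallback you're describing might look something like this in the writer. This is pseudocode only: Codec, PageSink, writeV1Page and writeV2Page are stand-ins I've invented rather than parquet-java APIs, a real version would live inside the page write path, and the 4x threshold is arbitrary.

// Trial-compress the RLE'd repetition levels and let the result pick the page version.
final class AdaptivePageSketch {

    interface Codec {                     // stand-in for the configured codec (e.g. ZSTD)
        byte[] compress(byte[] input);
    }

    interface PageSink {                  // stand-in for the real page writer hooks
        void writeV1Page(byte[] compressedBody);                                      // levels + data compressed together
        void writeV2Page(byte[] repLevels, byte[] defLevels, byte[] compressedData);  // levels stay uncompressed
    }

    // Arbitrary cut-off: treat a 4x reduction of the levels as "compresses well".
    private static final int MIN_RATIO = 4;

    static void writePage(Codec codec, PageSink sink,
                          byte[] repLevels, byte[] defLevels, byte[] data) {
        byte[] trial = codec.compress(repLevels);
        if (trial.length * MIN_RATIO < repLevels.length) {
            // v1 layout: rep levels, def levels and values are concatenated and the
            // whole body is compressed, so the repeating level pattern shrinks too.
            sink.writeV1Page(codec.compress(concat(repLevels, defLevels, data)));
        } else {
            // v2 layout: levels are written as-is, only the values are compressed.
            sink.writeV2Page(repLevels, defLevels, codec.compress(data));
        }
    }

    private static byte[] concat(byte[]... parts) {
        int total = 0;
        for (byte[] p : parts) {
            total += p.length;
        }
        byte[] out = new byte[total];
        int offset = 0;
        for (byte[] p : parts) {
            System.arraycopy(p, 0, out, offset, p.length);
            offset += p.length;
        }
        return out;
    }
}

In our case the repetition levels always compress extremely well, so this check would always pick v1 anyway and a global v1 setting is simpler.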
>
> On Wed, Apr 9, 2025 at 6:30 PM Phil Langdale
> <plangd...@roblox.com.invalid> wrote:
>
> > Hi Jan,
> >
> > Thanks for the details! I'm comfortable continuing with v1 pages.
> >
> > --phil
> >
> > On Tue, 8 Apr 2025 at 03:09, Jan Finis <jpfi...@gmail.com> wrote:
> >
> > > Hi Phil,
> > >
> > > I can just make an educated guess here: I would think that in the
> > > design of v2 pages it was indeed expected that RLE compresses the
> > > levels well enough to be sufficient, and the fact that no
> > > decompression is needed would result in certain queries being
> > > faster. That's why v2 doesn't compress the levels.
> > >
> > > But you are of course right that there are pathological cases, which
> > > are actually not too uncommon, where RLE doesn't compress well while
> > > a compression like ZSTD would compress very well. So one could argue
> > > that v2 never compressing levels is actually not a good design for
> > > this case; it would rather be good if v2 could decide to compress
> > > levels with a flag (there is a flag for whether the data is
> > > compressed, but sadly none for the levels). But v2 is what it is and
> > > therefore doesn't compress levels. So I think your train of thought
> > > is correct and valid: for your specific use case, v1 compresses way
> > > better than v2 and might therefore be superior.
> > >
> > > Note though that v1 is actually the default for most writers, so
> > > using v2 in the first place is still somewhat of a niche thing. So
> > > it's not only "that's why v1 pages are still supported" but rather
> > > "there is v2, but most people write v1 anyway". So there is nothing
> > > wrong with writing v1 pages in your use case; they won't go away
> > > anytime soon and will likely be supported by readers forever.
> > >
> > > Cheers,
> > > Jan
> > >
> > > Am Mo., 7. Apr. 2025 um 19:16 Uhr schrieb Phil Langdale
> > > <plangd...@roblox.com.invalid>:
> > >
> > > > Hi everyone,
> > > >
> > > > This is the first time I've had to look deeply into Parquet
> > > > internals, so let me apologise if this has been discussed
> > > > elsewhere in the past. I've tried to do my due diligence in terms
> > > > of searching online, but I haven't found a clear answer.
> > > >
> > > > I'm currently trying to define a suitable Parquet schema for
> > > > storing Prometheus-originated time-series data, focusing on space
> > > > efficiency. I have what looks like a highly effective solution
> > > > using V1 data pages, but with V2, the lack of compression of
> > > > repetition levels results in a massive loss of comparative
> > > > efficiency. I'm trying to understand whether this behaviour was
> > > > considered in the original V2 design, and whether I'm missing
> > > > something in how I'm trying to use the format. Is continuing to
> > > > use V1 data pages the correct solution for my use-case?
> > > >
> > > > Here are my specifics:
> > > >
> > > > * I am exporting data from Prometheus that is already well sorted.
> > > > If it wasn't, I would do the sorting myself. This ensures that
> > > > metrics are sorted first by name, then by the set of labels, then
> > > > by timestamp. This should lead to best-case data for encoding and
> > > > compression, and my results support this.
> > > >
> > > > I have the following schema:
> > > >
> > > > message spark_schema {
> > > >   required binary metric_name (STRING);
> > > >   required group labels (MAP) {
> > > >     repeated group key_value {
> > > >       required binary key (STRING);
> > > >       optional binary value (STRING);
> > > >     }
> > > >   }
> > > >   required double value;
> > > >   required int64 timestamp (TIMESTAMP(MILLIS,true));
> > > > }
> > > >
> > > > and I'm explicitly forcing the use of DELTA_BINARY_PACKED for
> > > > timestamps and BYTE_STREAM_SPLIT for the double values.
> > > >
> > > > For metric names and label keys/values, I'm using normal
> > > > dictionary encoding, and the cardinality of these is low in the
> > > > sample data I'm working with. So far so good. In terms of labels,
> > > > each sample has a few tens of labels (e.g. 26 in one of the worked
> > > > examples below). Due to the sorting, each data page will typically
> > > > be made up of rows where the set of labels is identical for every
> > > > row. This means that the repetition level sequence for each row
> > > > will also look identical. And so, although RLE is in use and leads
> > > > to the repetition levels for a given row taking up only 4 bytes,
> > > > this 4-byte sequence is then repeated ~3000 times. With v1 data
> > > > pages, the whole-page compression naturally handles this
> > > > incredibly well, as it's a best-case scenario. But with v2 data
> > > > pages, this block is left uncompressed and ends up being the
> > > > largest contributor to the final file size, leading to files that
> > > > are 15x bigger than with v1 pages.
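To make the repetition-level arithmetic above concrete, here is a small self-contained sketch: it builds the roughly 4-bytes-per-row RLE block for a page of ~3000 rows and runs it through a general-purpose codec. The per-row byte values are my own assumption, chosen to be consistent with the sizes reported below rather than being the encoder's exact output, and Deflate stands in for ZSTD so the snippet needs no extra dependencies.

import java.util.zip.Deflater;

// Rough model of one page's repetition-level block: ~3000 rows, each row's
// levels (one 0 followed by 25 repeated 1s) costing about 4 bytes of RLE
// output. What matters is the repeating 4-byte-per-row pattern.
public class RepLevelSketch {
    public static void main(String[] args) {
        int rows = 3000;
        byte[] perRow = {0x02, 0x00, 0x32, 0x01}; // assumed: run(1 x level 0), run(25 x level 1)
        byte[] repLevels = new byte[rows * perRow.length];
        for (int r = 0; r < rows; r++) {
            System.arraycopy(perRow, 0, repLevels, r * perRow.length, perRow.length);
        }

        // Deflate stands in for ZSTD; any general-purpose codec collapses a
        // block this repetitive to almost nothing.
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(repLevels);
        deflater.finish();
        byte[] buffer = new byte[repLevels.length];
        int compressedSize = deflater.deflate(buffer);
        deflater.end();

        System.out.printf("rep levels: %d bytes uncompressed, %d bytes compressed%n",
                repLevels.length, compressedSize);
    }
}

The repeating pattern collapses to a tiny fraction of its size, which is consistent with the v1 page below, where a 69016-byte page body compresses to 267 bytes.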
> > > >
> > > > Here is some data.
> > > >
> > > > With v1 pages
> > > > -------------------
> > > >
> > > > Meta:
> > > >
> > > > Row group 0:  count: 2741760  0.58 B records  start: 4  total(compressed): 1.526 MB  total(uncompressed): 126.067 MB
> > > > ----------------------------------------------------------------------
> > > >                         type     encodings   count      avg size   nulls
> > > > metric_name             BINARY   Z _ R       2741760    0.00 B     0
> > > > labels.key_value.key    BINARY   Z _ R       57358080   0.00 B     0
> > > > labels.key_value.value  BINARY   Z _ R       57358080   0.01 B     0
> > > > value                   DOUBLE   Z           2741760    0.32 B     0
> > > > timestamp               INT64    Z D         2741760    0.03 B     0
> > > >
> > > > and some debug data for a page:
> > > >
> > > > ColumnChunkPageWriteStore: writePageV1: compressor: ZSTD, row count: 3787, uncompressed size: 69016, compressed size: 267
> > > >
> > > > With v2 pages
> > > > -------------------
> > > >
> > > > Meta:
> > > >
> > > > Row group 0:  count: 2741760  8.58 B records  start: 4  total(compressed): 22.427 MB  total(uncompressed): 126.167 MB
> > > > ----------------------------------------------------------------------
> > > >                         type     encodings   count      avg size   nulls
> > > > metric_name             BINARY   Z _ R       2741760    0.00 B     0
> > > > labels.key_value.key    BINARY   Z _ R       57358080   0.20 B     0
> > > > labels.key_value.value  BINARY   Z _ R       57358080   0.20 B     0
> > > > value                   DOUBLE   Z           2741760    0.32 B     0
> > > > timestamp               INT64    Z D         2741760    0.03 B     0
> > > >
> > > > and
> > > >
> > > > ColumnChunkPageWriteStore: writePageV2: compressor: ZSTD, row count: 3773, uncompressed size: 51612, compressed size: 264, repetitionLevels size 15092, definitionLevels size 4
> > > >
> > > > So we can see that all the extra space is due to uncompressed
> > > > repetition levels. Is this use-case considered pathological? I'm
> > > > not sure how, but maybe there's something else that will trip me
> > > > up down the line that you can tell me about. Similarly, maybe I'll
> > > > discover that the decompression overhead of v1 is so painful that
> > > > this is unusable.
> > > >
> > > > In the end, is this simply a case of "that's why v1 pages are
> > > > still supported" and I move ahead with that, or should it be
> > > > possible for me to use v2 pages, and something else is going
> > > > wrong?
> > > >
> > > > Thank you for your insights!
> > > >
> > > > --phil