GitHub user mobiusklein closed a discussion: Sorting and encoding of dictionary pages
I am working on a schema for storing multi-dimensional signal data (float32/float64) from a mass spectrometer in Parquet, using Rust for the prototype implementation. In some cases dictionary encoding has helped, because one dimension's values are repeated many times as another dimension varies. Sometimes that dictionary can be quite large, and it may not compress well on its own when stored with the plain encoding. I base this on experiments in which I collected the set of all unique values and compressed them with Zstd. If I first sort the values and byte-shuffle them (as in the `BYTE_STREAM_SPLIT` encoding), the compression improves substantially.

Reading https://parquet.apache.org/docs/file-format/metadata/#page-header, it looks like there is a flag on the `DictionaryPageHeader` that says whether the dictionary is sorted, and that the dictionary page can be encoded. As far as I can tell, this implementation doesn't support writing a sorted dictionary or using any encoding other than `PLAIN` on that page. Is that correct? Is this something that is technically supported but not implemented here, or is it not supported by the Parquet format definition?

If I understand what would be involved, sorting the dictionary page wouldn't force new behavior on a reader, but using an encoding other than `PLAIN` might?

GitHub link: https://github.com/apache/arrow-rs/discussions/8778
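
For illustration, the sort-and-byte-shuffle preprocessing described in the post can be sketched roughly as follows. This is a minimal sketch, assuming `f32` values and using only the Rust standard library. The function names `sort_values`, `byte_stream_split`, and `byte_stream_unsplit` are hypothetical and are not part of the `parquet` crate, and the Zstd compression step is left out.

```rust
/// Sort dictionary values using the IEEE 754 total order,
/// so that NaN values (if any) are placed deterministically.
fn sort_values(values: &mut [f32]) {
    values.sort_by(|a, b| a.total_cmp(b));
}

/// Byte-shuffle: write byte 0 of every value, then byte 1 of every value,
/// and so on. This is the layout that `BYTE_STREAM_SPLIT` uses.
fn byte_stream_split(values: &[f32]) -> Vec<u8> {
    let n = values.len();
    let mut out = vec![0u8; n * 4];
    for (i, v) in values.iter().enumerate() {
        for (k, b) in v.to_le_bytes().iter().enumerate() {
            out[k * n + i] = *b;
        }
    }
    out
}

/// Inverse of `byte_stream_split`: reassemble each value from its four streams.
/// Assumes the input length is a multiple of 4.
fn byte_stream_unsplit(bytes: &[u8]) -> Vec<f32> {
    let n = bytes.len() / 4;
    (0..n)
        .map(|i| {
            f32::from_le_bytes([bytes[i], bytes[n + i], bytes[2 * n + i], bytes[3 * n + i]])
        })
        .collect()
}

fn main() {
    // Stand-in for the set of unique values collected from one dimension.
    let mut dictionary = vec![501.25f32, 100.5, 250.75, 100.25];
    sort_values(&mut dictionary);

    // This buffer is what would be handed to a general-purpose compressor.
    let shuffled = byte_stream_split(&dictionary);

    // The transform is lossless: the sorted values can be recovered exactly.
    let restored = byte_stream_unsplit(&shuffled);
    assert_eq!(restored, dictionary);
    println!("{} values -> {} shuffled bytes", dictionary.len(), shuffled.len());
}
```

Sorting places numerically close values next to each other, and the shuffle then groups their similar high-order bytes into long runs, which is likely why a general-purpose compressor does better on the transformed buffer.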
