GitHub user mobiusklein closed a discussion: Sorting and encoding of dictionary 
pages

I am working on a schema for storing multi-dimensional signal data
(float32/float64) from a mass spectrometer in Parquet, using Rust for the
prototype implementation. In some cases dictionary encoding has helped, because
one dimension's values are repeated many times as another dimension varies.

Sometimes that dictionary can be quite large, and it may not compress well on
its own when stored with the plain encoding. I base this on experiments where I
collected the set of all unique values and compressed them with Zstd. If I
sort them and byte-shuffle them (as in the `BYTE_STREAM_SPLIT` encoding),
the compression improves substantially. Reading
https://parquet.apache.org/docs/file-format/metadata/#page-header, it looks
like there is a flag on the `DictionaryPageHeader` that says whether the
dictionary is sorted, and that the dictionary page can have its own encoding.
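
For reference, the sort-and-shuffle step from that experiment can be sketched roughly as below. This is a minimal illustration using only the standard library, assuming float32 values; the function names `sort_and_byte_shuffle` and `byte_unshuffle` are made up for this post and are not part of the parquet crate. The Zstd step is left out, so the output is just the buffer that would be handed to the compressor.

```rust
/// Sort the unique values, then split them into byte streams in the style of
/// BYTE_STREAM_SPLIT: stream k holds byte k (little-endian) of every value,
/// and the streams are concatenated.
/// Hypothetical helper for illustration, not a parquet crate API.
fn sort_and_byte_shuffle(values: &[f32]) -> Vec<u8> {
    let mut sorted = values.to_vec();
    // total_cmp gives a total order over floats, so NaN cannot break the sort.
    sorted.sort_by(|a, b| a.total_cmp(b));

    let n = sorted.len();
    let width = std::mem::size_of::<f32>();
    let mut out = vec![0u8; n * width];
    for (i, v) in sorted.iter().enumerate() {
        for (k, byte) in v.to_le_bytes().iter().enumerate() {
            // Byte k of value i lands at position i within stream k.
            out[k * n + i] = *byte;
        }
    }
    out
}

/// Inverse of the shuffle: gather byte k of value i from stream k.
/// The values come back in sorted order, not the original input order.
fn byte_unshuffle(bytes: &[u8]) -> Vec<f32> {
    let width = std::mem::size_of::<f32>();
    let n = bytes.len() / width;
    (0..n)
        .map(|i| {
            let mut buf = [0u8; 4];
            for k in 0..width {
                buf[k] = bytes[k * n + i];
            }
            f32::from_le_bytes(buf)
        })
        .collect()
}
```

Sorting places numerically close values next to each other, so the high-order byte streams become long runs of identical or slowly changing bytes, which is presumably why the compressor does so much better on this layout.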

So far as I can tell, this implementation doesn't support writing out a sorted 
dictionary, or using any encoding other than `PLAIN` on that page, correct? Is 
this something that is technically supported but not implemented here, or not 
supported by the Parquet format definition?

If I understand what would be involved, writing a sorted dictionary page
wouldn't force new behavior on a reader, but using an encoding other than
`PLAIN` might?

GitHub link: https://github.com/apache/arrow-rs/discussions/8778
