XiangpengHao commented on issue #7739: URL: https://github.com/apache/arrow-rs/issues/7739#issuecomment-3001804953
According to [this paper](https://www.vldb.org/pvldb/vol17/p148-zeng.pdf) by @XinyuZeng:

> For integer columns, Parquet first dictionary encodes and then applies a hybrid of RLE and Bitpacking to the dictionary codes. If the same value repeats ≥ 8 times consecutively, it uses RLE; otherwise, it uses bitpacking. Interestingly, we found that the RLE-threshold 8 is a non-configurable parameter hard-coded in every implementation of Parquet. Although it saves Parquet a tuning knob, such inflexibility could lead to suboptimal compression ratios for specific data sets (e.g., when the common repetition length is 7).

So this magic number of 8 seems to be a convention and not part of the spec. That means we can definitely tune it, but we need an interface to expose this configuration.
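To make the effect of the threshold concrete, here is a rough sketch of a simplified cost model, not the actual arrow-rs encoder. The names `Segment`, `segment`, `estimated_bytes`, and the `rle_threshold` parameter are all hypothetical. It also assumes a one-byte header per run and ignores the 8-value group alignment that real bit-packed runs use, so the numbers are only indicative.

```rust
/// One segment of the hybrid encoding (simplified, hypothetical model).
#[derive(Debug, PartialEq)]
enum Segment {
    /// `len` repetitions of `value`, stored as a single RLE run.
    Rle { value: u32, len: usize },
    /// Literal values stored bit-packed.
    BitPacked(Vec<u32>),
}

/// Splits `codes` into segments: a run of at least `rle_threshold` equal
/// values becomes an RLE segment; shorter runs are accumulated as literals.
fn segment(codes: &[u32], rle_threshold: usize) -> Vec<Segment> {
    let mut out = Vec::new();
    let mut literals: Vec<u32> = Vec::new();
    let mut i = 0;
    while i < codes.len() {
        // Find the end of the run of equal values starting at `i`.
        let mut j = i + 1;
        while j < codes.len() && codes[j] == codes[i] {
            j += 1;
        }
        let run_len = j - i;
        if run_len >= rle_threshold {
            // Flush pending literals before emitting the RLE run.
            if !literals.is_empty() {
                out.push(Segment::BitPacked(std::mem::take(&mut literals)));
            }
            out.push(Segment::Rle { value: codes[i], len: run_len });
        } else {
            literals.extend_from_slice(&codes[i..j]);
        }
        i = j;
    }
    if !literals.is_empty() {
        out.push(Segment::BitPacked(literals));
    }
    out
}

/// Rough size in bytes, ASSUMING a 1-byte header per segment:
/// an RLE run stores one value in ceil(bit_width / 8) bytes,
/// a bit-packed segment stores `bit_width` bits per value.
fn estimated_bytes(segments: &[Segment], bit_width: usize) -> usize {
    let value_bytes = (bit_width + 7) / 8;
    segments
        .iter()
        .map(|s| match s {
            Segment::Rle { .. } => 1 + value_bytes,
            Segment::BitPacked(v) => 1 + (v.len() * bit_width + 7) / 8,
        })
        .sum()
}

fn main() {
    // Four runs of length 7, i.e. just below the conventional threshold of 8.
    let codes: Vec<u32> = [10u32, 20, 30, 40]
        .iter()
        .flat_map(|&v| std::iter::repeat(v).take(7))
        .collect();
    let bit_width = 8;
    for threshold in [8, 7] {
        let segs = segment(&codes, threshold);
        println!(
            "threshold {}: {} segment(s), ~{} bytes",
            threshold,
            segs.len(),
            estimated_bytes(&segs, bit_width)
        );
    }
    // prints:
    // threshold 8: 1 segment(s), ~29 bytes
    // threshold 7: 4 segment(s), ~8 bytes
}
```

Under these assumptions, the "common repetition length is 7" case from the paper goes from about 29 bytes to about 8 bytes just by lowering the threshold by one, which is the kind of win a configurable threshold could unlock.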
