XiangpengHao commented on issue #7739:
URL: https://github.com/apache/arrow-rs/issues/7739#issuecomment-3001804953

   According to [this paper](https://www.vldb.org/pvldb/vol17/p148-zeng.pdf) by @XinyuZeng:
   
   > For integer columns, Parquet first dictionary encodes and then
   applies a hybrid of RLE and Bitpacking to the dictionary codes.
   If the same value repeats ≥ 8 times consecutively, it uses RLE;
   otherwise, it uses bitpacking. Interestingly, we found that the RLE-
   threshold 8 is a non-configurable parameter hard-coded in every
   implementation of Parquet. Although it saves Parquet a tuning
   knob, such inflexibility could lead to suboptimal compression ratios
   for specific data sets (e.g., when the common repetition length is 7).
   
   So this magic number of 8 seems to be a convention rather than part of the spec.
   That means we can definitely tune it, but we need an interface to expose this configuration.
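
To make the effect of the threshold concrete, here is a minimal sketch of the run-splitting decision with the threshold passed in as a parameter. It is not the arrow-rs encoder: `Segment`, `split_runs`, and `rle_threshold` are hypothetical names, and the sketch ignores the real format's bit widths and its grouping of bit-packed values into blocks of 8.

```rust
/// One segment of a simplified RLE / bit-packing hybrid stream.
/// (Hypothetical type for illustration, not part of the parquet crate.)
#[derive(Debug, PartialEq)]
enum Segment {
    /// A single dictionary code repeated `len` times.
    Rle { value: u32, len: usize },
    /// Codes stored one by one (a real encoder would bit-pack them).
    BitPacked(Vec<u32>),
}

/// Splits `codes` into segments. A run of at least `rle_threshold` equal
/// codes becomes `Rle`; shorter runs are gathered into `BitPacked`.
fn split_runs(codes: &[u32], rle_threshold: usize) -> Vec<Segment> {
    let mut out = Vec::new();
    let mut pending: Vec<u32> = Vec::new();
    let mut i = 0;
    while i < codes.len() {
        // Find the end of the run of equal codes starting at `i`.
        let mut j = i + 1;
        while j < codes.len() && codes[j] == codes[i] {
            j += 1;
        }
        let run_len = j - i;
        if run_len >= rle_threshold {
            // Flush the literal codes collected so far, then emit the run.
            if !pending.is_empty() {
                out.push(Segment::BitPacked(std::mem::take(&mut pending)));
            }
            out.push(Segment::Rle { value: codes[i], len: run_len });
        } else {
            // Run is too short for RLE: keep its codes as literals.
            pending.extend_from_slice(&codes[i..j]);
        }
        i = j;
    }
    if !pending.is_empty() {
        out.push(Segment::BitPacked(pending));
    }
    out
}
```

For the input `[1, 1, 1, 1, 1, 1, 1, 2, 3]` (seven repeats of `1`), a threshold of 8 keeps everything in one bit-packed segment, while a threshold of 7 emits an RLE run for the `1`s followed by a bit-packed segment for `[2, 3]`. This is the "common repetition length is 7" case the paper mentions.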
   
   

