Re: [PR] feat(parquet): dictionary fallback heuristic based on compression efficiency [arrow-rs]

via GitHub Thu, 16 Apr 2026 15:34:32 -0700


etseidl commented on PR #9700:
URL: https://github.com/apache/arrow-rs/pull/9700#issuecomment-4263852617


   > > If V2 page headers are enabled, I believe we fall back to one of the
   > > delta encodings (at least for ints and byte arrays). Estimating those
   > > sizes might be a good deal harder.
   > 
   > Since this is only a heuristic, and the wrong decision is not fatal, I
   > thought that the estimation does not have to be perfect. The plain encoded
   > size is easy and quick to compute – no need to even read the values for
   > fixed-length types – and it gives a good approximation of the worst case
   > (all the other encodings were invented to improve over the plain one,
   > after all). I'll consider developing this further by giving a cheaply
   > computed upper size bound for the fallback encoding that is actually used,
   > but I don't want to make it too precise at the cost of extra computation
   > and memory reads.
   
   I think that's fine for now, and probably always OK for string columns 
(at least if they fall back to DELTA_LENGTH_BYTE_ARRAY). As you say, the worst 
case here is sticking with dictionary encoding when DELTA_BINARY_PACKED might 
have been superior. Then again, these are just defaults, and power users should 
know their data and pick encodings appropriate to their use cases (or use 
something like https://github.com/XiangpengHao/parquet-linter).
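
For readers following along, here is a minimal sketch of the kind of size comparison discussed above. It is not the code from the PR: the function names and the decision rule are hypothetical, and the dictionary estimate ignores page headers, compression, and RLE runs. The plain-size formulas follow the Parquet PLAIN encoding (fixed-width values are stored back to back, and each BYTE_ARRAY value carries a 4-byte length prefix).

```rust
/// PLAIN-encoded size of `num_values` fixed-width values.
/// `type_width` is the physical width in bytes (e.g. 4 for INT32, 8 for INT64).
/// No value needs to be read to compute this.
fn plain_size_fixed(num_values: usize, type_width: usize) -> usize {
    num_values * type_width
}

/// PLAIN-encoded size of BYTE_ARRAY values:
/// a 4-byte length prefix followed by the raw bytes of each value.
fn plain_size_byte_array(values: &[&[u8]]) -> usize {
    values.iter().map(|v| 4 + v.len()).sum()
}

/// Rough size of dictionary-encoded data (hypothetical estimate):
/// the plain-encoded dictionary plus bit-packed indices.
/// Assumes no RLE runs, so it errs on the large side for repetitive data.
fn dict_size_estimate(num_values: usize, num_distinct: usize, dict_plain_size: usize) -> usize {
    // Number of bits needed to address `num_distinct` dictionary entries.
    let bit_width = if num_distinct <= 1 {
        0
    } else {
        (usize::BITS - (num_distinct - 1).leading_zeros()) as usize
    };
    // Bit-packed indices, rounded up to whole bytes, plus one byte
    // that records the bit width at the start of the data page.
    let indices = (num_values * bit_width + 7) / 8 + 1;
    dict_plain_size + indices
}

/// Hypothetical decision rule: abandon the dictionary once it is
/// estimated to be no smaller than the plain encoding.
fn should_fall_back(dict_estimate: usize, plain_estimate: usize) -> bool {
    dict_estimate >= plain_estimate
}
```

With 1000 INT64 values and only 4 distinct entries, the dictionary estimate is far below the 8000-byte plain size, so the sketch keeps the dictionary. With 1000 distinct values the dictionary alone already costs 8000 bytes, so the sketch falls back. An upper bound for a delta encoding would need to inspect the values, which is the extra cost mentioned in the quoted reply.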


