etseidl commented on PR #9700:
URL: https://github.com/apache/arrow-rs/pull/9700#issuecomment-4263852617

> > If V2 page headers are enabled, I believe we fallback to one of the delta encodings (at least for ints and byte arrays). Estimating those sizes might be a good deal harder.
>
> Since this is only a heuristic, and the wrong decision is not fatal, I thought that the estimation does not have to be perfect. The plain encoded size is easy and quick to compute – no need to even read the values for fixed-length types – and it gives a good approximation of the worst case (all the other encodings were invented to improve over the plain one, after all). I'll think of further developing this by giving a cheaply computed upper size bound for the actually used fallback encoding, but I don't want to make it too precise at the cost of extra computation and memory reads.

I think that's fine for now, and probably always OK for string columns (well, if they fall back to DELTA_LENGTH_BYTE_ARRAY at least). And as you say, the worst case here is sticking with dictionary when perhaps DELTA_BINARY_PACKED might be superior.

Then again, these are just defaults, and power users should know their data and pick encodings appropriate to their use cases. (Or use something like https://github.com/XiangpengHao/parquet-linter)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
