Re: [I] Parquet UTF-8 max statistics are overly pessimistic [arrow-rs]

via GitHub Wed, 11 Dec 2024 09:46:40 -0800


etseidl commented on issue #6867:
URL: https://github.com/apache/arrow-rs/issues/6867#issuecomment-2536687828


   > It would be amazing if we figured out how to do this truncation / 
incrementing / decrementing correctly in one place (the parquet crate) and 
then just reused the same logic in datafusion
   
   They are slightly different use cases, though. Here we're taking an 
N-character, M-byte string and truncating it to at most T bytes (possibly 
fewer, so that the cut falls on a character boundary), and then incrementing 
the final character if possible. Because we're constrained by the size of the 
byte vector we're operating over, we can't promote a 2-byte character to a 
3-byte one, and so we get a looser bound than necessary. What @adriangb et al. 
are doing in datafusion is a bit different: they have a prefix that they want 
to increment, but they're not constrained by size, so they are free to switch 
to wider characters if necessary. We could do the same in parquet-rs if we 
were willing to have a truncated max statistic that's 1 byte larger than 
requested (which seems OK to me as long as it's communicated that the 
truncation is best effort, just like with page and row group sizes).
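
   To make the difference concrete, here is a rough sketch of the two 
strategies. This is not the code in the parquet crate or in datafusion; the 
function names (`truncate_utf8`, `next_char`, `increment_same_width`, 
`increment_any_width`) are made up for illustration, and it works on `char`s 
rather than raw bytes to keep it short.

```rust
/// Truncate `s` to at most `max_len` bytes, backing up to a char boundary.
/// (Hypothetical helper, not the parquet crate's API.)
fn truncate_utf8(s: &str, max_len: usize) -> &str {
    let mut end = max_len.min(s.len());
    // Index 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

/// Next Unicode scalar value after `c`, skipping the surrogate gap.
fn next_char(c: char) -> Option<char> {
    match c {
        char::MAX => None,
        '\u{D7FF}' => Some('\u{E000}'),
        _ => char::from_u32(c as u32 + 1),
    }
}

/// Size-constrained increment: bump the last character whose successor has
/// the same encoded width, dropping everything after it. The result is never
/// longer than `prefix`.
fn increment_same_width(prefix: &str) -> Option<String> {
    let mut chars: Vec<char> = prefix.chars().collect();
    while let Some(c) = chars.pop() {
        if let Some(n) = next_char(c) {
            if n.len_utf8() == c.len_utf8() {
                chars.push(n);
                return Some(chars.into_iter().collect());
            }
        }
    }
    None
}

/// Unconstrained increment: bump the last character that has any successor,
/// even if the successor needs one more byte to encode.
fn increment_any_width(prefix: &str) -> Option<String> {
    let mut chars: Vec<char> = prefix.chars().collect();
    while let Some(c) = chars.pop() {
        if let Some(n) = next_char(c) {
            chars.push(n);
            return Some(chars.into_iter().collect());
        }
    }
    None
}
```

   For example, truncating `"é\u{7FF}xyz"` to 4 bytes gives `"é\u{7FF}"`. 
U+07FF is the last 2-byte character, so the size-constrained version has to 
drop it and bump the preceding character, yielding `"ê"` (2 bytes). The 
unconstrained version yields `"é\u{800}"` (5 bytes, one over the limit), which 
is still an upper bound but sorts below `"ê"`, i.e. it is tighter.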
   
   I've submitted #6870, which continues with the do-not-overshoot approach. 
If we want to relax the bounds a bit, then we could adopt what's being 
proposed in https://github.com/apache/datafusion/pull/12978. I'd also like to 
do some testing to see if there are performance impacts (although I'd expect 
these to be minimal given that the truncation happens at most once per page 
and once per column chunk).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
