etseidl commented on issue #6867: URL: https://github.com/apache/arrow-rs/issues/6867#issuecomment-2536687828
> It would be amazing if we figured out how to do this truncation / incrementing / decrementing correctly in one place (the parquet crate) and then just reused the same logic in datafusion

They are slightly different use cases, though. Here we're taking an N-character, M-byte string and truncating it to no more than T bytes (the result may be shorter because of character boundaries), and then incrementing the final character if possible. Since we're constrained by the size of the byte vector we're operating on, we can't promote a 2-byte character to a 3-byte one, and so we get a less ideal bound.

What @adriangb et al. are doing in datafusion is a bit different. There they have a prefix that they want to increment, but they're not constrained by size, so they are free to switch to wider characters if necessary. We could do the same in parquet-rs if we were willing to have a truncated max statistic that's 1 byte larger than requested (which seems ok to me as long as it's communicated that the truncation is best effort, just like with page and row group sizes).

I've submitted #6870, which continues with the do-not-overshoot approach. If we want to relax the bounds a bit, then we could adopt what's being proposed in https://github.com/apache/datafusion/pull/12978. I'd also like to do some testing to see if there are performance impacts (although I'd expect these to be minimal, given that the truncation happens at most once per page and once per column chunk).
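To make the difference between the two approaches concrete, here is a minimal sketch in plain Rust. It is not the code in #6870 or in the datafusion PR; the function names (`truncate_utf8`, `next_char`, `increment_same_width`, `increment_allow_wider`) are hypothetical and chosen only for illustration. The first incrementer never lets the result grow in bytes, so it may have to drop trailing characters; the second is allowed to move to a wider character.

```rust
/// Truncate `s` to at most `max_len` bytes, backing up to a char boundary.
/// The result may be shorter than `max_len`.
fn truncate_utf8(s: &str, max_len: usize) -> &str {
    if s.len() <= max_len {
        return s;
    }
    let mut end = max_len;
    // Index 0 is always a char boundary, so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}

/// Next Unicode scalar value after `c`, skipping the surrogate gap.
/// Returns None for U+10FFFF, which has no successor.
fn next_char(c: char) -> Option<char> {
    match c as u32 {
        0xD7FF => Some('\u{E000}'),
        0x10FFFF => None,
        cp => char::from_u32(cp + 1),
    }
}

/// Size-constrained increment: bump the last character only if its
/// successor has the same UTF-8 width. Otherwise drop that character
/// and try the one before it. The result is never longer than `prefix`.
fn increment_same_width(prefix: &str) -> Option<String> {
    let mut chars: Vec<char> = prefix.chars().collect();
    while let Some(last) = chars.pop() {
        if let Some(next) = next_char(last) {
            if next.len_utf8() == last.len_utf8() {
                chars.push(next);
                return Some(chars.into_iter().collect());
            }
        }
        // Could not increment in place; fall back to the previous character.
    }
    None
}

/// Unconstrained increment: the successor may be a wider character, so the
/// result can be one byte longer than `prefix`.
fn increment_allow_wider(prefix: &str) -> Option<String> {
    let mut chars: Vec<char> = prefix.chars().collect();
    while let Some(last) = chars.pop() {
        if let Some(next) = next_char(last) {
            chars.push(next);
            return Some(chars.into_iter().collect());
        }
        // `last` was U+10FFFF; drop it and try the previous character.
    }
    None
}
```

For the prefix `"a\u{7f}"`, the size-constrained version has to give up on the last character and returns `"b"`, while the unconstrained version returns `"a\u{80}"`: a tighter upper bound, at the cost of one extra byte.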
