adamreeve opened a new pull request, #47998: URL: https://github.com/apache/arrow/pull/47998
### Rationale for this change

Prevents silently writing invalid data when using dictionary encoding and the number of bits in the estimated max buffer size exceeds the maximum int32 value. Also fixes an overflow that produced a "Negative buffer resize" error when the buffer size in bytes exceeds the maximum int32; a more helpful exception is now thrown instead.

### What changes are included in this PR?

* Fix an overflow when computing the bit position in `BitWriter::PutValue`. This overflow would cause the method to return without writing data, and the return value is only checked in debug builds.
* Change buffer size calculations to use int64 and check for overflow before casting to int (see the sketch after these sections).

### Are these changes tested?

Yes, I've added unit tests for both issues. These require enabling `ARROW_LARGE_MEMORY_TESTS` as they allocate a lot of memory.

### Are there any user-facing changes?

**This PR contains a "Critical Fix".** This fixes a bug where invalid Parquet files could be silently written when the buffer size for dictionary indices is large.
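For illustration, here is a minimal C++ sketch of the two fix patterns described above. The function names (`CanFit`, `CheckedBufferSize`) and the offset variables are hypothetical, not the actual Arrow internals; the point is that the bit position is computed in 64-bit arithmetic before any comparison, and size estimates are range-checked before narrowing to int.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Pattern 1: compute the absolute bit position in 64 bits so that
// byte_offset * 8 cannot wrap around int32 for large buffers.
bool CanFit(int64_t byte_offset, int bit_offset, int num_bits,
            int64_t max_bytes) {
  // int64 arithmetic throughout; with int32, byte_offset * 8 would
  // overflow once the buffer exceeds 256 MiB.
  const int64_t bit_position = byte_offset * 8 + bit_offset;
  return bit_position + num_bits <= max_bytes * 8;
}

// Pattern 2: keep size estimates in int64 and range-check before any
// narrowing cast, throwing instead of wrapping to a negative size.
int CheckedBufferSize(int64_t estimated_size) {
  if (estimated_size > std::numeric_limits<int32_t>::max()) {
    throw std::overflow_error("Buffer size exceeds int32 limit");
  }
  return static_cast<int>(estimated_size);
}
```

Failing loudly on overflow rather than returning silently matches the intent of the fix: a clear exception is preferable to a corrupt Parquet file.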
