etseidl opened a new issue, #6867: https://github.com/apache/arrow-rs/issues/6867
**Describe the bug**

A recent [comment](https://github.com/apache/datafusion/pull/12978#discussion_r1823412482) regarding incrementing UTF-8 encoded characters got me wondering about the implementation in parquet-rs. It turns out that the `increment_utf8` function in the `parquet::column::writer` module isn't optimal.

Take for example the Unicode character `U+00FF` ('ÿ'), which is encoded as `0xC3BF` (`0b11000011 0b10111111`). Incrementing this code point by one should yield `U+0100` (`0xC480`, 'Ā'). `increment_utf8`, however, will instead yield `U+013F` ('Ŀ'), due to how overflow is handled.

https://github.com/apache/arrow-rs/blob/08ac58761cb9a80b6ffd651727aa957db0c32bd7/parquet/src/column/writer/mod.rs#L1444-L1451

What happens in this case is that the lower byte `0xBF` is incremented to `0xC0`, which is not a valid UTF-8 continuation byte. That byte is then left alone and the next higher byte `0xC3` is incremented to `0xC4`, yielding a final UTF-8 encoding of `0xC4BF`, which is Unicode `U+013F`. This is analogous to incrementing the high nybble of an ASCII character, for instance 'O' (`0x4F`) becoming '_' (`0x5F`) rather than 'P' (`0x50`). The end result is still a valid upper bound, but it is not as tight as it could be, and could result in some row groups or pages being read when they don't actually need to be.

**To Reproduce**

Add the following to `test_increment_utf8` and it will pass:

```rust
let s = "\u{ff}\u{ff}";
let s1 = "\u{ff}\u{100}";
let s2 = "\u{ff}\u{13f}";
let v = increment_utf8(s.as_bytes().to_vec()).unwrap();
assert_eq!(v, s2.as_bytes());
assert_ne!(v, s1.as_bytes());
```

**Expected behavior**

The above test should fail, i.e. the incremented string should be "ÿĀ" rather than "ÿĿ".

**Additional context**

One possible fix would be to do something more akin to what is being done in https://github.com/apache/datafusion/pull/12978, where the string data is first converted to chars, incremented, and then converted back to UTF-8 (see the first sketch below). This can result in some characters growing in size, which may not be desired here given that we're trying to truncate to under a certain number of bytes.

Another alternative is to set overflowed continuation bytes to `0x80` (the minimum valid continuation byte) when moving on to the next higher byte. If the character is at a byte-width boundary, the bytes for that character can be set to `0` and the next higher character can be incremented (see the second sketch below). This would not increase the byte count, but can result in less optimal upper bounds (though still better than the current ones).
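For illustration, here is a minimal sketch of the char-based approach, assuming hypothetical helpers `increment_utf8_chars` and `next_char` (neither name is from the existing code). It increments the last character that has a successor, drops everything after it (truncation after a strictly incremented character preserves the upper-bound property), and skips the surrogate gap:

```rust
/// Hypothetical sketch of a char-based increment: increment the last
/// char that has a successor and drop everything after it. The result
/// is a valid (exclusive) upper bound for the original string, but a
/// char may grow in encoded size (e.g. U+007F -> U+0080, 1 -> 2 bytes).
fn increment_utf8_chars(data: &str) -> Option<String> {
    let chars: Vec<char> = data.chars().collect();
    for idx in (0..chars.len()).rev() {
        if let Some(next) = next_char(chars[idx]) {
            let mut out: String = chars[..idx].iter().collect();
            out.push(next);
            return Some(out);
        }
        // chars[idx] == char::MAX: drop it and carry into the previous char
    }
    None
}

/// Smallest scalar value greater than `c`, skipping the UTF-16
/// surrogate range U+D800..=U+DFFF, which is not a valid `char`.
fn next_char(c: char) -> Option<char> {
    match c {
        char::MAX => None,                  // no larger scalar value exists
        '\u{D7FF}' => Some('\u{E000}'),     // jump over the surrogate gap
        _ => char::from_u32(c as u32 + 1),
    }
}
```

With this sketch, `increment_utf8_chars("\u{ff}\u{ff}")` yields "ÿĀ", matching the expected behavior above.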
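And a sketch of the second alternative, written as a hypothetical variant of the current byte-oriented loop: an overflowed or contextually invalid continuation byte is reset to `0x80` before carrying into the next higher byte, and the existing `from_utf8` check continues to guard against invalid sequences (including the surrogate range):

```rust
/// Hypothetical variant of the existing byte-oriented loop: when a
/// continuation byte cannot be incremented in place, reset it to 0x80
/// (the minimum continuation byte) before carrying, instead of leaving
/// it at its old value. The byte length never changes.
fn increment_utf8_bytes(mut data: Vec<u8>) -> Option<Vec<u8>> {
    for idx in (0..data.len()).rev() {
        let original = data[idx];
        let (byte, overflow) = original.overflowing_add(1);
        if !overflow {
            data[idx] = byte;
            // from_utf8 rejects malformed sequences, including surrogates
            if std::str::from_utf8(&data).is_ok() {
                return Some(data);
            }
        }
        if original & 0xC0 == 0x80 {
            // continuation byte that overflowed or became invalid in
            // context: reset to the minimum and carry to the next byte
            data[idx] = 0x80;
        } else {
            // lead or ASCII byte that can't be incremented in place:
            // restore it and keep carrying leftward
            data[idx] = original;
        }
    }
    None
}
```

With this change, `0xC3 0xBF` (`U+00FF`) increments to `0xC4 0x80` (`U+0100`) as desired, and the surrogate boundary also works out: `U+D7FF` (`0xED 0x9F 0xBF`) carries through to `U+E000` (`0xEE 0x80 0x80`). Note this sketch does not implement the zeroing refinement described above: when a character sits at its byte-width boundary, its lead byte is simply left alone and the carry moves to the preceding character, which still yields a valid, if somewhat looser, bound.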
