etseidl opened a new issue, #6867:
URL: https://github.com/apache/arrow-rs/issues/6867

   **Describe the bug**
   A recent 
[comment](https://github.com/apache/datafusion/pull/12978#discussion_r1823412482)
 regarding incrementing UTF-8 encoded characters got me wondering about the 
implementation in parquet-rs. It turns out that the `increment_utf8` function 
in the `parquet::column::writer` module isn't optimal.
   
   Take for example the Unicode character `U+00FF` ('ÿ'), which is encoded in UTF-8 as `0xC3 0xBF` (`0b11000011 0b10111111`). Incrementing this code point by one should yield `U+0100` ('Ā', encoded as `0xC4 0x80`). `increment_utf8`, however, yields `U+013F` ('Ŀ') instead, due to how overflow is handled.
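
   The encodings involved are easy to double-check with a quick sanity test in Rust:

   ```rust
   // Verify the UTF-8 encodings cited above.
   fn main() {
       assert_eq!("\u{ff}".as_bytes(), &[0xC3, 0xBF]); // 'ÿ' = 0b11000011 0b10111111
       assert_eq!("\u{100}".as_bytes(), &[0xC4, 0x80]); // 'Ā'
       assert_eq!("\u{13f}".as_bytes(), &[0xC4, 0xBF]); // 'Ŀ'
   }
   ```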
   
   
https://github.com/apache/arrow-rs/blob/08ac58761cb9a80b6ffd651727aa957db0c32bd7/parquet/src/column/writer/mod.rs#L1444-L1451
   
   What happens in this case is that the lower byte `0xBF` is incremented to `0xC0`, which is not a valid UTF-8 continuation byte. The increment is therefore rolled back, and the next higher byte `0xC3` is incremented to `0xC4`, yielding a final UTF-8 encoding of `0xC4 0xBF`, which is Unicode `U+013F`. This is analogous to incrementing the high nybble of an ASCII character, for instance 'O' (`0x4F`) becoming '_' (`0x5F`) rather than 'P' (`0x50`).
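
   In other words, the logic amounts to roughly the following sketch (a paraphrase of the linked function based on the behavior described, not the exact source):

   ```rust
   // Rough paraphrase of increment_utf8 (see link above): walk the bytes
   // from the end, increment one, and keep the result only if the buffer
   // is still valid UTF-8; otherwise restore the byte and move left.
   fn increment_utf8(mut data: Vec<u8>) -> Option<Vec<u8>> {
       for idx in (0..data.len()).rev() {
           let original = data[idx];
           let (byte, overflow) = original.overflowing_add(1);
           if !overflow {
               data[idx] = byte;
               if std::str::from_utf8(&data).is_ok() {
                   return Some(data);
               }
               data[idx] = original;
           }
       }
       None
   }
   ```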
   
   The end result is still a valid upper bound, but a looser one than necessary, which can cause some row groups or pages to be read when they could have been pruned.
   
   **To Reproduce**
   Add the following to `test_increment_utf8` and it will pass.
   ```rust
        // "\u{ff}" encodes as 0xC3 0xBF; incrementing its trailing byte
        // overflows out of the valid continuation-byte range
        let s = "\u{ff}\u{ff}";
        let s1 = "\u{ff}\u{100}"; // the correct increment: "ÿĀ"
        let s2 = "\u{ff}\u{13f}"; // what increment_utf8 currently returns: "ÿĿ"
        let v = increment_utf8(s.as_bytes().to_vec()).unwrap();
        assert_eq!(v, s2.as_bytes());
        assert_ne!(v, s1.as_bytes());
   ```
   
   **Expected behavior**
   The above test should fail, i.e. the incremented string should be "ÿĀ" 
rather than "ÿĿ".
   
   **Additional context**
   One possible fix would be to do something akin to what is done in https://github.com/apache/datafusion/pull/12978, where the string data is first converted to chars, incremented, and then converted back to UTF-8. This can cause some characters to grow in encoded size, which may not be desirable here given that the goal is to truncate to fewer than a certain number of bytes. Another alternative is to set an overflowed continuation byte to `0x80` (the minimum valid continuation byte) before moving on to the next higher byte; if the character is at a byte-width boundary, the bytes for that character can be set to `0` and the next higher character incremented instead (a sketch follows below). This approach would not increase the byte count, but can produce less optimal upper bounds (though still better than the current ones).
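
   Below is a minimal sketch of that second alternative. This is hypothetical code, not a proposed patch; for simplicity it gives up at a byte-width boundary instead of zeroing the character and carrying into the next higher one:

   ```rust
   // Hypothetical sketch of the carry-based fix: an overflowed continuation
   // byte is reset to 0x80 (the minimum valid continuation byte) and the
   // carry propagates into the next higher byte. Crossing a byte-width
   // boundary (e.g. 0xDF -> 0xE0) is not handled here; a full fix would
   // zero out that character and increment the next higher one instead.
   fn increment_utf8_carry(mut data: Vec<u8>) -> Option<Vec<u8>> {
       for idx in (0..data.len()).rev() {
           let byte = data[idx];
           if (0x80..=0xBF).contains(&byte) {
               // continuation byte
               if byte < 0xBF {
                   data[idx] = byte + 1;
                   return Some(data);
               }
               // overflow: reset and carry into the next higher byte
               data[idx] = 0x80;
           } else {
               // leading (or ASCII) byte
               data[idx] = byte + 1;
               if std::str::from_utf8(&data).is_ok() {
                   return Some(data);
               }
               return None; // byte-width boundary crossed; see note above
           }
       }
       None
   }
   ```

   With this approach, the reproduction above would produce `s1` ("ÿĀ") rather than `s2` ("ÿĿ").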

