alamb commented on code in PR #7721: URL: https://github.com/apache/arrow-rs/pull/7721#discussion_r2160008103
########## arrow-json/src/reader/tape.rs: ########## @@ -705,9 +705,16 @@ fn err(b: u8, ctx: &str) -> ArrowError { /// Creates a character from an UTF-16 surrogate pair fn char_from_surrogate_pair(low: u16, high: u16) -> Result<char, ArrowError> { - let n = (((high - 0xD800) as u32) << 10) | ((low - 0xDC00) as u32 + 0x1_0000); - char::from_u32(n) - .ok_or_else(|| ArrowError::JsonError(format!("Invalid UTF-16 surrogate pair {n}"))) + match (low, high) { + (0xDC00..=0xDFFF, 0xD800..=0xDBFF) => { + let n = (((high - 0xD800) as u32) << 10) | ((low - 0xDC00) as u32 + 0x1_0000); + char::from_u32(n) + .ok_or_else(|| ArrowError::JsonError(format!("Invalid UTF-16 surrogate pair {n}"))) + } + _ => Err(ArrowError::JsonError(format!( + "Invalid UTF-16 surrogate pair. High: {high:#02X}, Low: {low:#02X}" + ))), + } Review Comment: For what it is worth, I think the current way this is encoded is quite understandable and will generate clear error messages. I double checked and it conforms to my reading of UTF-16 on https://en.wikipedia.org/wiki/UTF-16 ########## arrow-json/src/reader/tape.rs: ########## @@ -951,4 +958,15 @@ mod tests { let err = decoder.finish().unwrap_err().to_string(); assert_eq!(err, "Json error: Encountered truncated UTF-8 sequence"); } + + #[test] + fn test_invalid_surrogates() { Review Comment: I verified this test panic's on main: ``` attempt to subtract with overflow thread 'reader::tape::tests::test_invalid_surrogates' panicked at arrow-json/src/reader/tape.rs:708:49: attempt to subtract with overflow ``` And the test actually fails to detect an error in release mode ``` ---- reader::tape::tests::test_invalid_surrogates stdout ---- thread 'reader::tape::tests::test_invalid_surrogates' panicked at arrow-json/src/reader/tape.rs:959:9: assertion failed: res.is_err() note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org