alamb commented on code in PR #7721:
URL: https://github.com/apache/arrow-rs/pull/7721#discussion_r2160008103


##########
arrow-json/src/reader/tape.rs:
##########
@@ -705,9 +705,16 @@ fn err(b: u8, ctx: &str) -> ArrowError {
 
 /// Creates a character from an UTF-16 surrogate pair
 fn char_from_surrogate_pair(low: u16, high: u16) -> Result<char, ArrowError> {
-    let n = (((high - 0xD800) as u32) << 10) | ((low - 0xDC00) as u32 + 
0x1_0000);
-    char::from_u32(n)
-        .ok_or_else(|| ArrowError::JsonError(format!("Invalid UTF-16 surrogate 
pair {n}")))
+    match (low, high) {
+        (0xDC00..=0xDFFF, 0xD800..=0xDBFF) => {
+            let n = (((high - 0xD800) as u32) << 10) | ((low - 0xDC00) as u32 
+ 0x1_0000);
+            char::from_u32(n)
+                .ok_or_else(|| ArrowError::JsonError(format!("Invalid UTF-16 
surrogate pair {n}")))
+        }
+        _ => Err(ArrowError::JsonError(format!(
+            "Invalid UTF-16 surrogate pair. High: {high:#02X}, Low: {low:#02X}"
+        ))),
+    }

Review Comment:
   For what it is worth, I think the current way this is encoded is quite 
understandable and will generate clear error messages. I double checked and it 
conforms to my reading of UTF-16 on https://en.wikipedia.org/wiki/UTF-16



##########
arrow-json/src/reader/tape.rs:
##########
@@ -951,4 +958,15 @@ mod tests {
         let err = decoder.finish().unwrap_err().to_string();
         assert_eq!(err, "Json error: Encountered truncated UTF-8 sequence");
     }
+
+    #[test]
+    fn test_invalid_surrogates() {

Review Comment:
   I verified this test panic's on main:
   ```
   attempt to subtract with overflow
   thread 'reader::tape::tests::test_invalid_surrogates' panicked at 
arrow-json/src/reader/tape.rs:708:49:
   attempt to subtract with overflow
   ```
   
   And the test actually fails to detect an error in release mode 
   ```
   
   ---- reader::tape::tests::test_invalid_surrogates stdout ----
   
   thread 'reader::tape::tests::test_invalid_surrogates' panicked at 
arrow-json/src/reader/tape.rs:959:9:
   assertion failed: res.is_err()
   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
   
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to