Re: [PR] fix(spark): preserve raw number text in json_tuple to match Spark [datafusion]

via GitHub Mon, 30 Mar 2026 14:50:05 -0700


Copilot commented on code in PR #21264:
URL: https://github.com/apache/datafusion/pull/21264#discussion_r3012366734



##########
datafusion/spark/src/function/json/json_tuple.rs:
##########
@@ -144,14 +147,32 @@ fn json_tuple_inner(args: &[ArrayRef], return_type: 
&DataType) -> Result<ArrayRe
                     }
                     let field_name = field_arrays[field_idx].value(row_idx);
                     match map.get(field_name) {
-                        Some(serde_json::Value::Null) => {
-                            builder.append_null();
-                        }
-                        Some(serde_json::Value::String(s)) => {
-                            builder.append_value(s);
-                        }
-                        Some(other) => {
-                            builder.append_value(other.to_string());
+                        Some(raw) => {
+                            let raw_str = raw.get();
+                            if raw_str == "null" {
+                                builder.append_null();
+                            } else if raw_str.starts_with('"') {
+                                // String value: parse to unescape
+                                match serde_json::from_str::<String>(raw_str) {
+                                    Ok(s) => builder.append_value(s),
+                                    Err(_) => builder.append_value(raw_str),
+                                }
+                            } else {
+                                // Numbers, booleans, objects, arrays: raw text
+                                // Spark uppercases exponent in numeric 
literals:
+                                // 1.5e10 → 1.5E10
+                                // Only apply to numbers (not booleans like 
"false")

Review Comment:
   This implementation now returns the raw JSON token text not only for 
numbers, but also for booleans/objects/arrays 
(`builder.append_value(raw_str)`), whereas the previous `serde_json::Value` + 
`to_string()` path would have produced a canonicalized serialization (e.g., 
potentially different whitespace/key ordering). If the intended user-facing 
change is *only* number formatting/precision, consider keeping `to_string()` 
for non-number values or updating the PR description/docs to reflect the 
broader behavior change.



##########
datafusion/spark/src/function/json/json_tuple.rs:
##########
@@ -144,14 +147,32 @@ fn json_tuple_inner(args: &[ArrayRef], return_type: 
&DataType) -> Result<ArrayRe
                     }
                     let field_name = field_arrays[field_idx].value(row_idx);
                     match map.get(field_name) {
-                        Some(serde_json::Value::Null) => {
-                            builder.append_null();
-                        }
-                        Some(serde_json::Value::String(s)) => {
-                            builder.append_value(s);
-                        }
-                        Some(other) => {
-                            builder.append_value(other.to_string());
+                        Some(raw) => {
+                            let raw_str = raw.get();
+                            if raw_str == "null" {
+                                builder.append_null();
+                            } else if raw_str.starts_with('"') {
+                                // String value: parse to unescape
+                                match serde_json::from_str::<String>(raw_str) {
+                                    Ok(s) => builder.append_value(s),
+                                    Err(_) => builder.append_value(raw_str),
+                                }
+                            } else {
+                                // Numbers, booleans, objects, arrays: raw text
+                                // Spark uppercases exponent in numeric 
literals:
+                                // 1.5e10 → 1.5E10
+                                // Only apply to numbers (not booleans like 
"false")
+                                let first = raw_str.as_bytes().first();

Review Comment:
   `raw_str.as_bytes().first()` returns `Option<&u8>`, but `matches!(first, 
Some(b'0'..=b'9' | b'-'))` is matching as if it were `Option<u8>`. This should 
fail to compile due to the `&u8` vs `u8` mismatch. Consider using `.copied()` 
(or otherwise dereferencing) before matching so the pattern works on 
`Option<u8>`.
   ```suggestion
                                   let first = 
raw_str.as_bytes().first().copied();
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] fix(spark): preserve raw number text in json_tuple to match Spark [datafusion]

Reply via email to