scovich commented on issue #6522:
URL: https://github.com/apache/arrow-rs/issues/6522#issuecomment-2638283760

   > As far as I know, arrow json reader is for cases where you have a json for
   > each row of a table; but I have a column of jsons and wanted to keep them all
   > in a single column
   
   That's definitely the advertised use case for the arrow json reader. But it 
turns out that an arrow `RecordBatch` is almost (not quite) a `StructArray` if 
you 
[squint](https://docs.rs/arrow/latest/arrow/array/struct.StructArray.html#comparison-with-recordbatch),
 and they support both 
[forward](https://docs.rs/arrow/latest/arrow/record_batch/struct.RecordBatch.html#impl-From%3CStructArray%3E-for-RecordBatch)
 and 
[reverse](https://docs.rs/arrow/latest/arrow/array/struct.StructArray.html#impl-From%3CRecordBatch%3E-for-StructArray)
 conversions.
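   
   For example, here's a minimal sketch of that round trip (the `From` impls are the ones linked above; the single-column array is just made up for illustration):
   ```rust
   use std::sync::Arc;
   use arrow_array::{ArrayRef, Int64Array, RecordBatch, StructArray};
   use arrow_schema::{DataType, Field};
   
   fn main() {
       let b: ArrayRef = Arc::new(Int64Array::from(vec![10, 30]));
       let struct_array = StructArray::from(vec![(
           Arc::new(Field::new("b", DataType::Int64, false)),
           b,
       )]);
   
       // StructArray -> RecordBatch...
       let batch = RecordBatch::from(struct_array.clone());
       // ...and RecordBatch -> StructArray again.
       let round_trip = StructArray::from(batch);
       assert_eq!(struct_array, round_trip);
   }
   ```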
   
   Consider the following newline-delimited JSON string:
   ```
   { "a": { "b": 10, "c": 20 } }
   { "a": { "b": 30, "c": 40 } }
   ```
   
   If we wrote those 60 bytes out to a file, a decoder created by 
[arrow_json::reader::ReaderBuilder::build_decoder](https://docs.rs/arrow-json/50.0.0/arrow_json/reader/struct.ReaderBuilder.html#method.build_decoder)
 could read the bytes back in and parse them into a 2-row `RecordBatch` (which 
started out internally as a `StructArray` and can be converted back to a 
`StructArray` if desired).
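   
   In decoder terms, that looks something like this (a sketch of the `decode`/`flush` cycle; the function name is mine):
   ```rust
   use std::sync::Arc;
   use arrow_json::ReaderBuilder;
   use arrow_schema::{ArrowError, DataType, Field, Fields, Schema};
   
   fn decode_ndjson(bytes: &[u8]) -> Result<(), ArrowError> {
       let fields = Fields::from(vec![
           Field::new("b", DataType::Int64, false),
           Field::new("c", DataType::Int64, false),
       ]);
       let schema = Arc::new(Schema::new(vec![
           Field::new("a", DataType::Struct(fields), false),
       ]));
       let mut decoder = ReaderBuilder::new(schema).build_decoder()?;
       // decode() buffers complete records and reports how many bytes it consumed.
       assert_eq!(decoder.decode(bytes)?, bytes.len());
       // flush() emits everything buffered so far as a single RecordBatch.
       let batch = decoder.flush()?.expect("two complete rows");
       assert_eq!(batch.num_rows(), 2);
       Ok(())
   }
   ```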
   
   We could also take those exact same 60 bytes and store them in a 
`StringArray`, using offsets `[0, 30, 60]` to delimit the records... but then 
there's no way to reliably parse the result into a `StructArray`, because the 
decoder doesn't accept arrow arrays as input and also doesn't expose enough 
state for an outsider to manage the decoding manually in a robust way.
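   
   For concreteness, that layout can be built directly from the offsets (a sketch; note each 30-byte record keeps its trailing newline, which is how the count comes out to 60):
   ```rust
   use arrow_array::{Array, StringArray};
   use arrow_buffer::{Buffer, OffsetBuffer, ScalarBuffer};
   
   fn main() {
       let bytes = "{ \"a\": { \"b\": 10, \"c\": 20 } }\n{ \"a\": { \"b\": 30, \"c\": 40 } }\n";
       assert_eq!(bytes.len(), 60);
       // Offsets [0, 30, 60] carve the 60 bytes into two 30-byte records.
       let offsets = OffsetBuffer::new(ScalarBuffer::from(vec![0i32, 30, 60]));
       let array = StringArray::new(offsets, Buffer::from(bytes.as_bytes()), None);
       assert_eq!(array.len(), 2);
       assert_eq!(array.value(1), "{ \"a\": { \"b\": 30, \"c\": 40 } }\n");
   }
   ```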
   
   NOTE: We could _almost_ get it working by observing that `StringArray` is 
really just a 
[GenericByteArray](https://docs.rs/arrow/latest/arrow/array/struct.GenericByteArray.html),
 and the backing bytes are a contiguous array we could wrap up as a `Reader`. 
But there's no guarantee such a string array had the newline endings that the 
public json reader expects, which messes up error handling:
   
   <details>
   
   The following executable example demonstrates the robustness problem:
   ```rust
   use arrow_array::{GenericByteArray, RecordBatch};
   use arrow_array::types::Utf8Type;
   use arrow_json::ReaderBuilder;
   use arrow_schema::{DataType, Field, Schema};
   use std::io::BufReader;
   use std::sync::Arc;
   
   fn read_json(bytes: &[u8]) -> RecordBatch {
       let fields = vec![
           Field::new("b", DataType::Int64, false),
           Field::new("c", DataType::Int64, false),
       ];
       let schema = Arc::new(Schema::new(vec![
           Field::new("a", DataType::Struct(fields.into_iter().collect()), false),
       ]));
       ReaderBuilder::new(schema)
           .build(BufReader::new(bytes))
           .unwrap()
           .next()
           .unwrap()
           .unwrap()
   }
   
   #[test]
   fn test_json_parse() {
       let strings = vec![
           r#"{ "a": { "b": 10, "c": 20 } }"#,
           r#"{ "a": { "b": 30, "c": 40 } }"#,
       ];
       let array: GenericByteArray<Utf8Type> = strings.iter().map(Some).collect();
       let parsed_arrays = read_json(array.values().as_slice());
       println!("string array: {:?}", parsed_arrays.columns());
   
       let newlines = strings.join("\n");
       let parsed_newlines = read_json(newlines.as_bytes());
       println!("newline-delimited: {:?}", parsed_newlines.columns());
       assert_eq!(parsed_arrays, parsed_newlines);
   }
   
   #[test]
   fn test_evil_json_parse() {
       // Two invalid JSON strings whose concatenation looks like a single
       // valid JSON object literal
       let evil_strings = vec![
           r#"{ "a": { "b": 1"#,
                          r#"0, "c": 40 } }"#,
       ];
       let array: GenericByteArray<Utf8Type> = evil_strings.iter().map(Some).collect();
       let parsed_arrays = read_json(array.values().as_slice());
       println!("string array: {:?}", parsed_arrays.columns());
       let newlines = evil_strings.join("\n");
       let parsed_newlines = read_json(newlines.as_bytes());
       println!("newline-delimited: {:?}", parsed_newlines.columns());
       assert_eq!(parsed_arrays, parsed_newlines);
   }
   ```
   
   For valid input (`test_json_parse`), everything works ok, because 
`{...}{...}` parses as two separate JSON object literals with or without the 
newline.
   
   But for invalid input (`test_evil_json_parse`), the array data incorrectly 
parses as a single valid JSON object while the newline-delimited parse 
correctly blows up:
   ```
   test json_tests::test_json_parse ... ok
   test json_tests::test_evil_json_parse ... FAILED
   
   failures:
   
   ---- json_tests::test_evil_json_parse stdout ----
   string array: [StructArray
   [
   -- child 0: "b" (Int64)
   PrimitiveArray<Int64>
   [
     10,
   ]
   -- child 1: "c" (Int64)
   PrimitiveArray<Int64>
   [
     40,
   ]
   ]]
   thread 'json_tests::test_evil_json_parse' panicked at json_tests.rs:16:86:
   called `Result::unwrap()` on an `Err` value: JsonError("Encountered unexpected '0' whilst parsing object")
   ```
   
   </details>
   
   Exposing a little more `Decoder` state would allow a caller to enforce 
robust parsing even when newlines are missing.
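   
   For illustration only, a caller-side sketch of what that could enable. `decode` is the existing API; the commented-out check stands in for the missing state, and `is_mid_record` is a hypothetical name, not a real method:
   ```rust
   use arrow_array::{Array, StringArray};
   use arrow_json::reader::Decoder;
   use arrow_schema::ArrowError;
   
   // Feed each string value to the decoder separately, one row at a time
   // (assumes batch_size is large enough that decode() never stops early).
   fn decode_rows(decoder: &mut Decoder, array: &StringArray) -> Result<(), ArrowError> {
       for i in 0..array.len() {
           let bytes = array.value(i).as_bytes();
           if decoder.decode(bytes)? < bytes.len() {
               return Err(ArrowError::JsonError(format!("row {i}: unconsumed bytes")));
           }
           // Missing piece: detect that a row like `{ "a": { "b": 1` left the
           // decoder in the middle of a record instead of silently letting the
           // next row complete it.
           // if decoder.is_mid_record() {  // hypothetical method
           //     return Err(ArrowError::JsonError(format!("row {i}: incomplete JSON")));
           // }
       }
       Ok(())
   }
   ```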

