scovich commented on issue #6522: URL: https://github.com/apache/arrow-rs/issues/6522#issuecomment-2638283760
> As far as I know, arrow json reader is for cases where you have a json for each row of a table; but I have a column of jsons and wanted to keep them all in a single column

That's definitely the advertised use case for the arrow json reader. But it turns out that an arrow `RecordBatch` is almost (not quite) a `StructArray` if you [squint](https://docs.rs/arrow/latest/arrow/array/struct.StructArray.html#comparison-with-recordbatch), and the two support both [forward](https://docs.rs/arrow/latest/arrow/record_batch/struct.RecordBatch.html#impl-From%3CStructArray%3E-for-RecordBatch) and [reverse](https://docs.rs/arrow/latest/arrow/array/struct.StructArray.html#impl-From%3CRecordBatch%3E-for-StructArray) conversions (sketched in the example below). Consider the following newline-delimited JSON string:

```
{ "a": { "b": 10, "c": 20 } }
{ "a": { "b": 30, "c": 40 } }
```

If we wrote those 60 bytes out to a file, [arrow_json::reader::ReaderBuilder::build_decoder](https://docs.rs/arrow-json/50.0.0/arrow_json/reader/struct.ReaderBuilder.html#method.build_decoder) could read them back in and parse them into a 2-row `RecordBatch` (which starts out internally as a `StructArray` and can be converted back to one if desired).

We could also take those exact same 60 bytes and store them in a `StringArray`, using offsets `[0, 30, 60]` to delimit the records... but then there's no way to reliably parse the result into a `StructArray`, because the decoder doesn't accept an arrow string array as input, and it doesn't expose enough state for an outsider to manage the decoding manually in a robust way.
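Here's a minimal sketch of both halves of that claim (assuming a recent arrow-rs, with `arrow_buffer` as a direct dependency for the offsets): the newline-delimited bytes decode into a 2-row `RecordBatch` that round-trips through `StructArray`, while the very same 60 bytes can be stored in a 2-element `StringArray`:

```rust
use std::sync::Arc;

use arrow_array::{Array, RecordBatch, StringArray, StructArray};
use arrow_buffer::{Buffer, OffsetBuffer, ScalarBuffer};
use arrow_json::ReaderBuilder;
use arrow_schema::{DataType, Field, Schema};

fn main() {
    // The 60 bytes from above, newline-delimited.
    let json = "{ \"a\": { \"b\": 10, \"c\": 20 } }\n{ \"a\": { \"b\": 30, \"c\": 40 } }\n";

    // Parse them into a 2-row RecordBatch...
    let fields = vec![
        Field::new("b", DataType::Int64, false),
        Field::new("c", DataType::Int64, false),
    ];
    let schema = Arc::new(Schema::new(vec![Field::new(
        "a",
        DataType::Struct(fields.into_iter().collect()),
        false,
    )]));
    let batch: RecordBatch = ReaderBuilder::new(schema)
        .build(json.as_bytes())
        .unwrap()
        .next()
        .unwrap()
        .unwrap();
    assert_eq!(batch.num_rows(), 2);

    // ...and round-trip RecordBatch -> StructArray -> RecordBatch.
    let as_struct = StructArray::from(batch);
    let round_tripped = RecordBatch::from(as_struct);
    assert_eq!(round_tripped.num_rows(), 2);

    // The very same 60 bytes stored in a 2-element StringArray, delimited by
    // offsets [0, 30, 60] (each slice keeps its trailing newline)...
    let offsets = OffsetBuffer::new(ScalarBuffer::from(vec![0i32, 30, 60]));
    let strings = StringArray::new(offsets, Buffer::from(json.as_bytes()), None);
    assert_eq!(strings.len(), 2);
    // ...but no decoder entry point accepts `strings` directly.
}
```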
NOTE: We could _almost_ get this working by observing that `StringArray` is really just a [GenericByteArray](https://docs.rs/arrow/latest/arrow/array/struct.GenericByteArray.html) whose backing bytes are one contiguous buffer, which we could wrap up as a `Reader`. But there's no guarantee such a string array has the newline endings the public json reader expects, and that breaks error handling:

<details>

The following executable example demonstrates the robustness problem:

```rust
use std::io::BufReader;
use std::sync::Arc;

use arrow_array::types::Utf8Type;
use arrow_array::{GenericByteArray, RecordBatch};
use arrow_json::ReaderBuilder;
use arrow_schema::{DataType, Field, Schema};

/// Parses `bytes` as newline-delimited JSON records of the form
/// `{ "a": { "b": <int>, "c": <int> } }`, returning the first batch.
fn read_json(bytes: &[u8]) -> RecordBatch {
    let fields = vec![
        Field::new("b", DataType::Int64, false),
        Field::new("c", DataType::Int64, false),
    ];
    let schema = Arc::new(Schema::new(vec![Field::new(
        "a",
        DataType::Struct(fields.into_iter().collect()),
        false,
    )]));
    ReaderBuilder::new(schema)
        .build(BufReader::new(bytes))
        .unwrap()
        .next()
        .unwrap()
        .unwrap()
}

#[test]
fn test_json_parse() {
    let strings = vec![
        r#"{ "a": { "b": 10, "c": 20 } }"#,
        r#"{ "a": { "b": 30, "c": 40 } }"#,
    ];
    let array: GenericByteArray<Utf8Type> = strings.iter().map(Some).collect();

    // Parse the string array's backing bytes directly (no newlines between records).
    let parsed_arrays = read_json(array.values().as_slice());
    println!("string array: {:?}", parsed_arrays.columns());

    // Parse the same records joined with newlines, as the reader expects.
    let newlines = strings.join("\n");
    let parsed_newlines = read_json(newlines.as_bytes());
    println!("newline-delimited: {:?}", parsed_newlines.columns());

    assert_eq!(parsed_arrays, parsed_newlines);
}

#[test]
fn test_evil_json_parse() {
    // Two invalid JSON strings whose concatenation looks like a single valid
    // JSON object literal.
    let evil_strings = vec![
        r#"{ "a": { "b": 1"#,
        r#"0, "c": 40 } }"#,
    ];
    let array: GenericByteArray<Utf8Type> = evil_strings.iter().map(Some).collect();

    // Without newlines, the two invalid fragments fuse into one "valid" record.
    let parsed_arrays = read_json(array.values().as_slice());
    println!("string array: {:?}", parsed_arrays.columns());

    // With newlines, the reader correctly rejects the input (this unwrap panics).
    let newlines = evil_strings.join("\n");
    let parsed_newlines = read_json(newlines.as_bytes());
    println!("newline-delimited: {:?}", parsed_newlines.columns());

    assert_eq!(parsed_arrays, parsed_newlines);
}
```

For valid input (`test_json_parse`), everything works, because `{...}{...}` parses as two separate JSON object literals with or without a newline between them. But for invalid input (`test_evil_json_parse`), the array bytes incorrectly parse as a single valid JSON object, while the newline-delimited parse correctly blows up:

```
test json_tests::test_json_parse ... ok
test json_tests::test_evil_json_parse ... FAILED

failures:

---- json_tests::test_evil_json_parse stdout ----
string array: [StructArray
[
-- child 0: "b" (Int64)
PrimitiveArray<Int64>
[
  10,
]
-- child 1: "c" (Int64)
PrimitiveArray<Int64>
[
  40,
]
]]
thread 'json_tests::test_evil_json_parse' panicked at json_tests.rs:16:86:
called `Result::unwrap()` on an `Err` value: JsonError("Encountered unexpected '0' whilst parsing object")
```

</details>

Exposing a little more `Decoder` state would allow a caller to enforce robust parsing even when the newlines are missing.
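As a concrete (and purely hypothetical) illustration of what that could enable: suppose the decoder reported whether it stopped mid-record. A caller could then feed it one array element at a time and reject any element that doesn't end on a record boundary. The `has_partial_record` method below is invented for the sake of the sketch; `build_decoder`, `decode`, and `flush` are the existing API:

```rust
// Hypothetical sketch: `has_partial_record` is invented for illustration and
// is NOT part of today's arrow-json Decoder API; everything else is real.
use arrow_array::{Array, RecordBatch, StringArray};
use arrow_json::ReaderBuilder;
use arrow_schema::{ArrowError, SchemaRef};

fn parse_string_array(
    schema: SchemaRef,
    strings: &StringArray,
) -> Result<Option<RecordBatch>, ArrowError> {
    let mut decoder = ReaderBuilder::new(schema).build_decoder()?;
    for i in 0..strings.len() {
        // Feed exactly one array element to the decoder. `decode` returns the
        // number of bytes consumed (real API); a production version would loop
        // in case the decoder stops early at its batch size.
        let bytes = strings.value(i).as_bytes();
        let consumed = decoder.decode(bytes)?;
        assert_eq!(consumed, bytes.len());

        // The missing piece: detect a record left dangling across the element
        // boundary, instead of silently fusing it with the next element's
        // bytes (the `test_evil_json_parse` failure above).
        if decoder.has_partial_record() {
            return Err(ArrowError::JsonError(format!(
                "array element {i} did not contain a complete JSON record"
            )));
        }
    }
    decoder.flush()
}
```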
