scovich opened a new issue, #6574:
URL: https://github.com/apache/arrow-rs/issues/6574
**Describe the bug**
Reading this json file:
```json
{"foo":{"a": "a"}}
{"bar":{"b": "b"}}
```
with this schema:
```rust
Schema::new(vec![
Field::new_struct("foo", [
Arc::new(Field::new("a", DataType::Utf8, false))
], true),
Field::new_struct("bar", [
Arc::new(Field::new("b", DataType::Utf8, false))
], true),
])
```
Produces a `StructArray` that has non-nullable arrays with null masks, and
so fails a round trip through
[into_parts](https://docs.rs/arrow/latest/arrow/array/struct.StructArray.html#method.into_parts)
and
[new](https://docs.rs/arrow/latest/arrow/array/struct.StructArray.html#method.new):
```rust
let Some(Ok(sa)) = json_reader.next() else {
panic!("read failed");
};
let (fields, children, nulls) = sa.into_parts();
let sa = StructArray::new(fields, children, nulls); // <<=== PANIC!
```
**To Reproduce**
```rust
use std::{fs::File, io::BufReader, sync::Arc};
use arrow::datatypes::{DataType, Field, Schema};
fn main() {
let schema = Arc::new(Schema::new(vec![
Field::new_struct("foo", [
Arc::new(Field::new("a", DataType::Utf8, false))
], true),
Field::new_struct("bar", [
Arc::new(Field::new("b", DataType::Utf8, false))
], true),
]));
let file = File::open("test.json").unwrap();
let mut json =
arrow_json::ReaderBuilder::new(schema).build(BufReader::new(file)).unwrap();
let batch = json.next().unwrap().unwrap();
println!("Batch: {batch:#?}");
let (fields, children, nulls) = StructArray::from(batch).into_parts();
let sa = StructArray::new(fields, children, nulls); // <<=== PANICS
println!("StructArray: {sa:#?}");
}
```
The docs for
[StructArray::try_new](https://docs.rs/arrow/latest/arrow/array/struct.StructArray.html#method.try_new)
specifically call this out as expected behavior:
> Errors if
> * ...
> * `!fields[i].is_nullable() && !nulls.contains(arrays[i].nulls())`
But that seems over-strict if we assume the JSON reader produces valid
record batches?
It is technically possible to use `unsafe { StructArray::new_unchecked(...)
}`, but that bypasses all the other checks as well.
The relevant arrow-json code is in two places,
[StructArrayDecoder::new](https://arrow.apache.org/rust/src/arrow_json/reader/struct_array.rs.html#45)
(which marks the child as nullable if the parent is nullable), and
[StructArrayDecoder::decode](https://arrow.apache.org/rust/src/arrow_json/reader/struct_array.rs.html#135),
which performs a more precise validation before calling `new_unchecked` (tho
the SAFTEY comment there doesn't seem aware that it's bypassing null checks):
```rust
for (c, f) in child_data.iter().zip(fields) {
// Sanity check
assert_eq!(c.len(), pos.len());
if let Some(a) = c.nulls() {
let nulls_valid =
f.is_nullable() || nulls.as_ref().map(|n|
n.contains(a)).unwrap_or_default();
if !nulls_valid {
return Err(ArrowError::JsonError(format!(
"Encountered unmasked nulls in non-nullable
StructArray child: {f}"
)));
}
}
}
let data = ArrayDataBuilder::new(self.data_type.clone())
.len(pos.len())
.nulls(nulls)
.child_data(child_data);
// Safety
// Validated lengths above
Ok(unsafe { data.build_unchecked() })
```
**Expected behavior**
A valid `StructArray` should survive a round trip through
`StructArray::into_parts` and `StructArray::new`.
**Additional context**
This seems related to https://github.com/apache/arrow-rs/issues/6391 and
https://github.com/apache/arrow-rs/issues/6510?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]