scovich opened a new issue, #6574:
URL: https://github.com/apache/arrow-rs/issues/6574

   **Describe the bug**
   
   Reading this json file:
   ```json
   {"foo":{"a": "a"}}
   {"bar":{"b": "b"}}
   ```
   with this schema: 
   ```rust
   Schema::new(vec![
       Field::new_struct("foo", [
           Arc::new(Field::new("a", DataType::Utf8, false))
       ], true),
       Field::new_struct("bar", [
           Arc::new(Field::new("b", DataType::Utf8, false))
       ], true),
   ])
   ```
   Produces a `StructArray` that has non-nullable arrays with null masks, and 
so fails a round trip through 
[into_parts](https://docs.rs/arrow/latest/arrow/array/struct.StructArray.html#method.into_parts)
 and 
[new](https://docs.rs/arrow/latest/arrow/array/struct.StructArray.html#method.new):
   ```rust
   let Some(Ok(sa)) = json_reader.next() else { 
     panic!("read failed");
   };
   let (fields, children, nulls) = sa.into_parts();
   let sa = StructArray::new(fields, children, nulls); // <<=== PANIC!
   ```
   
   **To Reproduce**
   
   ```rust
   use std::{fs::File, io::BufReader, sync::Arc};
   
   use arrow::datatypes::{DataType, Field, Schema};
   
   fn main() {
       let schema = Arc::new(Schema::new(vec![
           Field::new_struct("foo", [
               Arc::new(Field::new("a", DataType::Utf8, false))
           ], true),
           Field::new_struct("bar", [
               Arc::new(Field::new("b", DataType::Utf8, false))
           ], true),
       ]));
   
       let file = File::open("test.json").unwrap();
   
       let mut json = 
arrow_json::ReaderBuilder::new(schema).build(BufReader::new(file)).unwrap();
       let batch = json.next().unwrap().unwrap();
       println!("Batch: {batch:#?}");
       let (fields, children, nulls) = StructArray::from(batch).into_parts();
       let sa = StructArray::new(fields, children, nulls); // <<=== PANICS
       println!("StructArray: {sa:#?}");
   }
   ```
   The docs for 
[StructArray::try_new](https://docs.rs/arrow/latest/arrow/array/struct.StructArray.html#method.try_new)
 specifically call this out as expected behavior:
   > Errors if
   > * ... 
   > * `!fields[i].is_nullable() && !nulls.contains(arrays[i].nulls())`
   
   But that seems over-strict if we assume the JSON reader produces valid 
record batches?
   
   It is technically possible to use `unsafe { StructArray::new_unchecked(...) 
}`, but that bypasses all the other checks as well.
   
   The relevant arrow-json code is in two places, 
[StructArrayDecoder::new](https://arrow.apache.org/rust/src/arrow_json/reader/struct_array.rs.html#45)
 (which marks the child as nullable if the parent is nullable), and 
[StructArrayDecoder::decode](https://arrow.apache.org/rust/src/arrow_json/reader/struct_array.rs.html#135),
 which performs a more precise validation before calling `new_unchecked` (tho 
the SAFTEY comment there doesn't seem aware that it's bypassing null checks):
   ```rust
           for (c, f) in child_data.iter().zip(fields) {
               // Sanity check
               assert_eq!(c.len(), pos.len());
               if let Some(a) = c.nulls() {
                   let nulls_valid =
                       f.is_nullable() || nulls.as_ref().map(|n| 
n.contains(a)).unwrap_or_default();
   
                   if !nulls_valid {
                       return Err(ArrowError::JsonError(format!(
                           "Encountered unmasked nulls in non-nullable 
StructArray child: {f}"
                       )));
                   }
               }
           }
   
           let data = ArrayDataBuilder::new(self.data_type.clone())
               .len(pos.len())
               .nulls(nulls)
               .child_data(child_data);
   
           // Safety
           // Validated lengths above
           Ok(unsafe { data.build_unchecked() })
   ```
   
   **Expected behavior**
   
   A valid `StructArray` should survive a round trip through 
`StructArray::into_parts` and `StructArray::new`.
   
   **Additional context**
   
   This seems related to https://github.com/apache/arrow-rs/issues/6391 and 
https://github.com/apache/arrow-rs/issues/6510?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to