jecsand838 opened a new issue, #8934:
URL: https://github.com/apache/arrow-rs/issues/8934

   **Describe the bug**
   
   There appear to be two issues in the `arrow-avro` `RecordEncoder`:
   
   1. **Nullable struct with non-nullable child field + row-wise sliced encoding**

      When row-wise encoding a `RecordBatch` with a *nullable* `Struct` field containing a *non-nullable* child field, the Avro writer fails with:

      ```text
      Invalid argument error: Avro site '{field}' is non-nullable, but array contains nulls
      ```

      This happens even though the struct value is null at that row (i.e., the parent struct is null, so the child field's value should be ignored).
   
   2. **Dense `UnionArray` with non-zero & non-consecutive type ids**

      When encoding a dense `UnionArray` whose `UnionFields` use non-consecutive type ids (e.g. `2` and `5`), the Avro `Writer` fails with:

      ```text
      Error: SchemaError("Binding and field mismatch")
      ```

      The same data layout should be valid per Arrow's semantics for dense unions, where type ids are not required to be contiguous or start at 0, as long as they are listed in `UnionFields`.
   
   **To Reproduce**
   
   1. **Nullable Child Field Encoder Bug**
   
   Running this test in `arrow-avro/src/writer/mod.rs`:
   
   ```rust
       #[test]
       fn test_nullable_struct_with_nonnullable_field_sliced_encoding() {
           use arrow_array::{ArrayRef, Int32Array, StringArray, StructArray};
           use arrow_buffer::NullBuffer;
           use arrow_schema::{DataType, Field, Fields, Schema};
           use std::sync::Arc;
           let inner_fields = Fields::from(vec![
               Field::new("id", DataType::Int32, false), // non-nullable
               Field::new("name", DataType::Utf8, true), // nullable
           ]);
           let inner_struct_type = DataType::Struct(inner_fields.clone());
           let schema = Schema::new(vec![
               Field::new("before", inner_struct_type.clone(), true), // nullable struct
               Field::new("after", inner_struct_type.clone(), true),  // nullable struct
               Field::new("op", DataType::Utf8, false),               // non-nullable
           ]);
           let before_ids = Int32Array::from(vec![None, None]);
           let before_names = StringArray::from(vec![None::<&str>, None]);
           let before_struct = StructArray::new(
               inner_fields.clone(),
               vec![
                   Arc::new(before_ids) as ArrayRef,
                   Arc::new(before_names) as ArrayRef,
               ],
               Some(NullBuffer::from(vec![false, false])),
           );
           let after_ids = Int32Array::from(vec![1, 2]); // non-nullable, no nulls
           let after_names = StringArray::from(vec![Some("Alice"), Some("Bob")]);
           let after_struct = StructArray::new(
               inner_fields.clone(),
               vec![
                   Arc::new(after_ids) as ArrayRef,
                   Arc::new(after_names) as ArrayRef,
               ],
               Some(NullBuffer::from(vec![true, true])),
           );
           let op_col = StringArray::from(vec!["r", "r"]);
           let batch = RecordBatch::try_new(
               Arc::new(schema.clone()),
               vec![
                   Arc::new(before_struct) as ArrayRef,
                   Arc::new(after_struct) as ArrayRef,
                   Arc::new(op_col) as ArrayRef,
               ],
           )
           .expect("failed to create test batch");
           let mut sink = Vec::new();
           let mut writer = WriterBuilder::new(schema)
               .with_fingerprint_strategy(FingerprintStrategy::Id(1))
               .build::<_, AvroSoeFormat>(&mut sink)
               .expect("failed to create writer");
           for row_idx in 0..batch.num_rows() {
               let single_row = batch.slice(row_idx, 1);
               let after_col = single_row.column(1);
               assert_eq!(
                   after_col.null_count(),
                   0,
                   "after column should have no nulls in sliced row"
               );
               writer
                   .write(&single_row)
                   .unwrap_or_else(|e| panic!("Failed to encode row {row_idx}: {e}"));
           }
           writer.finish().expect("failed to finish writer");
           assert!(!sink.is_empty(), "encoded output should not be empty");
       }
   ```
   
   Results in this error:
   
   ```text
   thread 'writer::tests::test_nullable_struct_with_nonnullable_field_sliced_encoding' (769056) panicked at arrow-avro/src/writer/mod.rs:553:37:
   Failed to encode row 0: Invalid argument error: Avro site 'id' is non-nullable, but array contains nulls
   stack backtrace:
      0: __rustc::rust_begin_unwind
                at /rustc/ed61e7d7e242494fb7057f2657300d9e77bb4fcb/library/std/src/panicking.rs:698:5
      1: core::panicking::panic_fmt
                at /rustc/ed61e7d7e242494fb7057f2657300d9e77bb4fcb/library/core/src/panicking.rs:75:14
      2: arrow_avro::writer::tests::test_nullable_struct_with_nonnullable_field_sliced_encoding::{{closure}}
                at ./src/writer/mod.rs:553:37
      3: core::result::Result<T,E>::unwrap_or_else
                at /Users/connorsanders/.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/result.rs:1615:23
      4: arrow_avro::writer::tests::test_nullable_struct_with_nonnullable_field_sliced_encoding
                at ./src/writer/mod.rs:553:18
      5: arrow_avro::writer::tests::test_nullable_struct_with_nonnullable_field_sliced_encoding::{{closure}}
                at ./src/writer/mod.rs:493:69
      6: core::ops::function::FnOnce::call_once
                at /Users/connorsanders/.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
      7: core::ops::function::FnOnce::call_once
                at /rustc/ed61e7d7e242494fb7057f2657300d9e77bb4fcb/library/core/src/ops/function.rs:250:5
   note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
   ```
   
   2. **Union Type Id Encoder Bug**
   
   Running this test in `arrow-avro/src/writer/mod.rs`:
   
   ```rust
       #[test]
       fn test_union_nonzero_type_ids() -> Result<(), ArrowError> {
           use arrow_array::UnionArray;
           use arrow_buffer::Buffer;
           use arrow_schema::{UnionFields, UnionMode};
           let union_fields = UnionFields::new(
               vec![2, 5],
               vec![
                   Field::new("v_str", DataType::Utf8, true),
                   Field::new("v_int", DataType::Int32, true),
               ],
           );
           let strings = StringArray::from(vec!["hello", "world"]);
           let ints = Int32Array::from(vec![10, 20, 30]);
           let type_ids = Buffer::from_slice_ref([2_i8, 5, 5, 2, 5]);
           let offsets = Buffer::from_slice_ref([0_i32, 0, 1, 1, 2]);
           let union_array = UnionArray::try_new(
               union_fields.clone(),
               type_ids.into(),
               Some(offsets.into()),
               vec![Arc::new(strings) as ArrayRef, Arc::new(ints) as ArrayRef],
           )?;
           let schema = Schema::new(vec![Field::new(
               "union_col",
               DataType::Union(union_fields, UnionMode::Dense),
               false,
           )]);
           let batch = RecordBatch::try_new(
               Arc::new(schema.clone()),
               vec![Arc::new(union_array) as ArrayRef],
           )?;
           let mut writer = AvroWriter::new(Vec::<u8>::new(), schema.clone())?;
           assert!(writer.write(&batch).is_ok(), "Expected no error from writing");
           assert!(
               writer.finish().is_ok(),
               "Expected no error from finishing writer"
           );
           Ok(())
       }
   ```
   
   Results in this error:
   
   ```text
   thread 'writer::tests::test_union_nonzero_type_ids' (831962) panicked at arrow-avro/src/writer/mod.rs:707:9:
   Expected no error from writing
   stack backtrace:
      0: __rustc::rust_begin_unwind
                at /rustc/ed61e7d7e242494fb7057f2657300d9e77bb4fcb/library/std/src/panicking.rs:698:5
      1: core::panicking::panic_fmt
                at /rustc/ed61e7d7e242494fb7057f2657300d9e77bb4fcb/library/core/src/panicking.rs:75:14
      2: arrow_avro::writer::tests::test_union_nonzero_type_ids
                at ./src/writer/mod.rs:707:9
      3: arrow_avro::writer::tests::test_union_nonzero_type_ids::{{closure}}
                at ./src/writer/mod.rs:676:41
      4: core::ops::function::FnOnce::call_once
                at /Users/connorsanders/.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
      5: core::ops::function::FnOnce::call_once
                at /rustc/ed61e7d7e242494fb7057f2657300d9e77bb4fcb/library/core/src/ops/function.rs:250:5
   note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
   ```
   
   **Expected behavior**
   
   1. **Nullable struct / sliced encoding**
      * Encoding each `RecordBatch` slice of length 1 should succeed without error.
      * The fact that the parent `Struct` value is null at a given row should mean its non-nullable child fields are not validated/encoded for that row.
      * The writer should complete successfully and produce non-empty Avro output for each row.
   2. **Dense union with non-zero type ids**
      * A dense `UnionArray` whose `UnionFields` use non-zero and non-consecutive type ids (e.g. `2` and `5`) should be encoded without error.
   
   **Additional context**
   
   * Environment:
       - Rust toolchain: `1.91.0` `aarch64-apple-darwin`
   * Both issues arise when using the new `arrow-avro` writer APIs (`WriterBuilder` with `AvroSoeFormat` and `AvroWriter`).
   * For **Bug 1**, it looks like the encoder validates the nullability of a non-nullable child field without masking by the parent struct's null bitmap (or performs a column-level check that does not account for the parent struct being null at that row); see the first sketch after this list.
   * For **Bug 2**, the union encoder/decoder appears to expect a 0-based, contiguous mapping between union type ids and the underlying child bindings, and fails when the type ids are `2` and `5` even though the `UnionFields` and `type_ids` buffer are consistent with Arrow's union semantics; see the second sketch after this list.
   * Both bugs surfaced while testing row-wise / streaming-style encodes, which rely heavily on correct handling of nested structures and unions in the encoder.
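
   For **Bug 1**, here is a minimal sketch of the masking I'd expect, written as a hypothetical helper rather than the actual `RecordEncoder` code: a non-nullable child should only count as "null" at rows where the parent struct itself is valid.

   ```rust
   use arrow_array::{Array, StructArray};

   /// Hypothetical helper: count child nulls only at rows where the parent
   /// struct is valid. Rows where the parent is null carry no child value,
   /// so they should not trip the non-nullable check.
   fn effective_child_null_count(parent: &StructArray, child_idx: usize) -> usize {
       let child = parent.column(child_idx);
       (0..parent.len())
           .filter(|&i| parent.is_valid(i) && child.is_null(i))
           .count()
   }
   ```

   With the `before_struct` from the repro above (parent null at both rows), this returns `0` for the non-nullable `id` child even though the raw child array contains two nulls; presumably the real encoder could get the same effect by combining the parent and child validity buffers (e.g. via `NullBuffer::union`) rather than looping per row.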
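
   For **Bug 2**, here is the lookup the encoder presumably needs, again as a hypothetical sketch rather than the writer's actual code: resolve each union type id to its positional child index via `UnionFields`, instead of treating the type id itself as the child/branch index.

   ```rust
   use std::collections::HashMap;

   use arrow_schema::{DataType, Field, UnionFields};

   /// Hypothetical helper: map each union type id to the positional index of
   /// its child array (and hence its Avro branch). Arrow type ids are not
   /// required to be contiguous or to start at 0.
   fn type_id_to_child_index(fields: &UnionFields) -> HashMap<i8, usize> {
       fields
           .iter()
           .enumerate()
           .map(|(child_idx, (type_id, _field))| (type_id, child_idx))
           .collect()
   }

   fn main() {
       let fields = UnionFields::new(
           vec![2, 5],
           vec![
               Field::new("v_str", DataType::Utf8, true),
               Field::new("v_int", DataType::Int32, true),
           ],
       );
       let lookup = type_id_to_child_index(&fields);
       // Type id 2 resolves to child 0 (first Avro branch), type id 5 to child 1.
       assert_eq!(lookup[&2], 0);
       assert_eq!(lookup[&5], 1);
   }
   ```

   Indexing children directly by type id (e.g. `children[type_id as usize]`) only works when the ids happen to be `0..n`, which matches the observed failure mode.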

