jecsand838 opened a new issue, #8934:
URL: https://github.com/apache/arrow-rs/issues/8934
**Describe the bug**
There appear to be two issues in the `arrow-avro` `RecordEncoder`:
1. **Nullable struct with non-nullable child field + row-wise sliced
encoding**
When row-wise encoding a `RecordBatch` with a *nullable* `Struct` field
containing a *non-nullable* child field, the Avro writer fails with:
```text
Invalid argument error: Avro site '{field}' is non-nullable, but array contains nulls
```
This happens even though the struct value is null at that row (i.e., the
parent struct is null, so the child field’s value should be ignored).
2. **Dense `UnionArray` with non-zero & non-consecutive type ids**
When encoding a dense `UnionArray` whose `UnionFields` use
non-consecutive type IDs (e.g. `2` and `5`), the Avro `Writer` fails with:
```text
Error: SchemaError("Binding and field mismatch")
```
The same data layout should be valid per Arrow’s semantics for dense
unions, where type ids are not required to be contiguous or start at 0, as long
as they are listed in `UnionFields`.
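For reference, Arrow's dense-union layout resolves each row through the *position* of its type id in `UnionFields`, not through the raw id value. A minimal pure-Rust sketch of that resolution, using the exact `type_ids`/`offsets` buffers from the reproduction below (the `resolve` helper is illustrative, not an arrow-rs API):

```rust
// Sketch of dense-union slot resolution per Arrow semantics: each row's
// value lives in child[position_of(type_id in fields)] at the row's offset.
// `fields` models the order in which type ids were registered in UnionFields.
fn resolve(type_ids: &[i8], offsets: &[i32], fields: &[i8]) -> Vec<(usize, i32)> {
    type_ids
        .iter()
        .zip(offsets)
        .map(|(&tid, &off)| {
            // Map the type id to a child via its position in the field list,
            // never by treating the id itself as an index.
            let child = fields
                .iter()
                .position(|&f| f == tid)
                .expect("type id not registered in fields");
            (child, off)
        })
        .collect()
}

fn main() {
    // Buffers from the repro: fields registered as [2, 5], so id 2 -> child 0
    // (strings) and id 5 -> child 1 (ints).
    let slots = resolve(&[2, 5, 5, 2, 5], &[0, 0, 1, 1, 2], &[2, 5]);
    assert_eq!(slots, vec![(0, 0), (1, 0), (1, 1), (0, 1), (1, 2)]);
}
```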
**To Reproduce**
1. **Reproducing the Nullable Child Field Encoder Bug**
Running this test in `arrow-avro/src/writer/mod.rs`:
```rust
#[test]
fn test_nullable_struct_with_nonnullable_field_sliced_encoding() {
    use arrow_array::{ArrayRef, Int32Array, RecordBatch, StringArray, StructArray};
    use arrow_buffer::NullBuffer;
    use arrow_schema::{DataType, Field, Fields, Schema};
    use std::sync::Arc;
    let inner_fields = Fields::from(vec![
        Field::new("id", DataType::Int32, false), // non-nullable
        Field::new("name", DataType::Utf8, true), // nullable
    ]);
    let inner_struct_type = DataType::Struct(inner_fields.clone());
    let schema = Schema::new(vec![
        Field::new("before", inner_struct_type.clone(), true), // nullable struct
        Field::new("after", inner_struct_type.clone(), true),  // nullable struct
        Field::new("op", DataType::Utf8, false),               // non-nullable
    ]);
    let before_ids = Int32Array::from(vec![None, None]);
    let before_names = StringArray::from(vec![None::<&str>, None]);
    let before_struct = StructArray::new(
        inner_fields.clone(),
        vec![
            Arc::new(before_ids) as ArrayRef,
            Arc::new(before_names) as ArrayRef,
        ],
        Some(NullBuffer::from(vec![false, false])),
    );
    let after_ids = Int32Array::from(vec![1, 2]); // non-nullable, no nulls
    let after_names = StringArray::from(vec![Some("Alice"), Some("Bob")]);
    let after_struct = StructArray::new(
        inner_fields.clone(),
        vec![
            Arc::new(after_ids) as ArrayRef,
            Arc::new(after_names) as ArrayRef,
        ],
        Some(NullBuffer::from(vec![true, true])),
    );
    let op_col = StringArray::from(vec!["r", "r"]);
    let batch = RecordBatch::try_new(
        Arc::new(schema.clone()),
        vec![
            Arc::new(before_struct) as ArrayRef,
            Arc::new(after_struct) as ArrayRef,
            Arc::new(op_col) as ArrayRef,
        ],
    )
    .expect("failed to create test batch");
    let mut sink = Vec::new();
    let mut writer = WriterBuilder::new(schema)
        .with_fingerprint_strategy(FingerprintStrategy::Id(1))
        .build::<_, AvroSoeFormat>(&mut sink)
        .expect("failed to create writer");
    for row_idx in 0..batch.num_rows() {
        let single_row = batch.slice(row_idx, 1);
        let after_col = single_row.column(1);
        assert_eq!(
            after_col.null_count(),
            0,
            "after column should have no nulls in sliced row"
        );
        writer
            .write(&single_row)
            .unwrap_or_else(|e| panic!("Failed to encode row {row_idx}: {e}"));
    }
    writer.finish().expect("failed to finish writer");
    assert!(!sink.is_empty(), "encoded output should not be empty");
}
```
Results in this error:
```text
thread 'writer::tests::test_nullable_struct_with_nonnullable_field_sliced_encoding' (769056) panicked at arrow-avro/src/writer/mod.rs:553:37:
Failed to encode row 0: Invalid argument error: Avro site 'id' is non-nullable, but array contains nulls
stack backtrace:
   0: __rustc::rust_begin_unwind
             at /rustc/ed61e7d7e242494fb7057f2657300d9e77bb4fcb/library/std/src/panicking.rs:698:5
   1: core::panicking::panic_fmt
             at /rustc/ed61e7d7e242494fb7057f2657300d9e77bb4fcb/library/core/src/panicking.rs:75:14
   2: arrow_avro::writer::tests::test_nullable_struct_with_nonnullable_field_sliced_encoding::{{closure}}
             at ./src/writer/mod.rs:553:37
   3: core::result::Result<T,E>::unwrap_or_else
             at /Users/connorsanders/.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/result.rs:1615:23
   4: arrow_avro::writer::tests::test_nullable_struct_with_nonnullable_field_sliced_encoding
             at ./src/writer/mod.rs:553:18
   5: arrow_avro::writer::tests::test_nullable_struct_with_nonnullable_field_sliced_encoding::{{closure}}
             at ./src/writer/mod.rs:493:69
   6: core::ops::function::FnOnce::call_once
             at /Users/connorsanders/.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
   7: core::ops::function::FnOnce::call_once
             at /rustc/ed61e7d7e242494fb7057f2657300d9e77bb4fcb/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
```
2. **Reproducing the Union Type Id Encoder Bug**
Running this test in `arrow-avro/src/writer/mod.rs`:
```rust
#[test]
fn test_union_nonzero_type_ids() -> Result<(), ArrowError> {
    use arrow_array::{ArrayRef, Int32Array, RecordBatch, StringArray, UnionArray};
    use arrow_buffer::Buffer;
    use arrow_schema::{DataType, Field, Schema, UnionFields, UnionMode};
    use std::sync::Arc;
    let union_fields = UnionFields::new(
        vec![2, 5],
        vec![
            Field::new("v_str", DataType::Utf8, true),
            Field::new("v_int", DataType::Int32, true),
        ],
    );
    let strings = StringArray::from(vec!["hello", "world"]);
    let ints = Int32Array::from(vec![10, 20, 30]);
    let type_ids = Buffer::from_slice_ref([2_i8, 5, 5, 2, 5]);
    let offsets = Buffer::from_slice_ref([0_i32, 0, 1, 1, 2]);
    let union_array = UnionArray::try_new(
        union_fields.clone(),
        type_ids.into(),
        Some(offsets.into()),
        vec![Arc::new(strings) as ArrayRef, Arc::new(ints) as ArrayRef],
    )?;
    let schema = Schema::new(vec![Field::new(
        "union_col",
        DataType::Union(union_fields, UnionMode::Dense),
        false,
    )]);
    let batch = RecordBatch::try_new(
        Arc::new(schema.clone()),
        vec![Arc::new(union_array) as ArrayRef],
    )?;
    let mut writer = AvroWriter::new(Vec::<u8>::new(), schema.clone())?;
    assert!(writer.write(&batch).is_ok(), "Expected no error from writing");
    assert!(writer.finish().is_ok(), "Expected no error from finishing writer");
    Ok(())
}
```
Results in this error:
```text
thread 'writer::tests::test_union_nonzero_type_ids' (831962) panicked at arrow-avro/src/writer/mod.rs:707:9:
Expected no error from writing
stack backtrace:
   0: __rustc::rust_begin_unwind
             at /rustc/ed61e7d7e242494fb7057f2657300d9e77bb4fcb/library/std/src/panicking.rs:698:5
   1: core::panicking::panic_fmt
             at /rustc/ed61e7d7e242494fb7057f2657300d9e77bb4fcb/library/core/src/panicking.rs:75:14
   2: arrow_avro::writer::tests::test_union_nonzero_type_ids
             at ./src/writer/mod.rs:707:9
   3: arrow_avro::writer::tests::test_union_nonzero_type_ids::{{closure}}
             at ./src/writer/mod.rs:676:41
   4: core::ops::function::FnOnce::call_once
             at /Users/connorsanders/.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
   5: core::ops::function::FnOnce::call_once
             at /rustc/ed61e7d7e242494fb7057f2657300d9e77bb4fcb/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
```
**Expected behavior**
1. **Nullable struct / sliced encoding**
* Encoding each `RecordBatch` slice of length 1 should succeed without
error.
* When the parent `Struct` value is null at a given row, its non-nullable
child fields should not be validated or encoded for that row.
* The writer should complete successfully and produce non-empty Avro
output for each row.
2. **Dense union with non-zero type ids**
* A dense `UnionArray` whose `UnionFields` use non-zero and
non-consecutive type ids (e.g. `2` and `5`) should be encoded without error.
**Additional context**
* Environment:
- Rust toolchain: `1.91.0` `aarch64-apple-darwin`
* Both issues arise when using the new `arrow-avro` writer APIs
(`WriterBuilder` with `AvroSoeFormat` and `AvroWriter`).
* For **Bug 1**, it looks like the encoder is validating the nullability of
a non-nullable child field without masking by the parent struct's null bitmap
(or doing a column-level check that doesn't consider that the parent struct is
null).
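A fix along those lines would mask the child's nullability check with the parent struct's validity. A minimal sketch with plain `bool` slices standing in for the Arrow null buffers (`true` = valid); the function name and shapes are illustrative, not the actual `RecordEncoder` code:

```rust
// Sketch: a non-nullable child only needs a valid value at rows where the
// parent struct itself is valid. Rows where the parent is null are never
// encoded, so child nulls there must be ignored.
fn check_non_nullable_child(
    parent_validity: &[bool],
    child_validity: &[bool],
) -> Result<(), String> {
    for (row, (&parent_ok, &child_ok)) in
        parent_validity.iter().zip(child_validity).enumerate()
    {
        if parent_ok && !child_ok {
            return Err(format!("non-nullable child is null at row {row}"));
        }
    }
    Ok(())
}

fn main() {
    // The 'before' column from the repro: parent null at both rows, child
    // all-null. A masked check accepts this.
    assert!(check_non_nullable_child(&[false, false], &[false, false]).is_ok());
    // A genuinely invalid layout: parent valid at row 1 but child null there.
    assert!(check_non_nullable_child(&[true, true], &[true, false]).is_err());
}
```

In other words, the current column-level `null_count` style check is too coarse; the decision has to be made per row against the parent's null bitmap (or against the intersection of the two bitmaps).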
* For **Bug 2**, the union encoder/decoder expects a 0-based, contiguous
mapping between union type IDs and underlying child bindings, and fails when
the type IDs are `2` and `5` even though the `UnionFields` and `type_ids`
buffer are consistent with Arrow's union semantics.
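One way to decouple branch selection from the raw id values is to build a lookup from type id to the field's position in `UnionFields` once, then index through it per row. A hypothetical sketch (not the actual encoder code):

```rust
use std::collections::HashMap;

// Sketch: map each Arrow union type id to its Avro branch index, i.e. the
// position of that id in the UnionFields registration order. This works for
// any ids, contiguous or not.
fn branch_lookup(type_ids_in_fields: &[i8]) -> HashMap<i8, usize> {
    type_ids_in_fields
        .iter()
        .enumerate()
        .map(|(branch, &tid)| (tid, branch))
        .collect()
}

fn main() {
    // Ids 2 and 5 from the repro: branch 0 encodes "v_str", branch 1 "v_int".
    let lookup = branch_lookup(&[2, 5]);
    assert_eq!(lookup[&2], 0);
    assert_eq!(lookup[&5], 1);
}
```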
* Both bugs surfaced while testing row-wise / streaming-style encodes, which
rely heavily on correct handling of nested structures and unions in the encoder.