jecsand838 opened a new pull request, #8935:
URL: https://github.com/apache/arrow-rs/pull/8935
# Which issue does this PR close?

- Closes https://github.com/apache/arrow-rs/issues/8934

# Rationale for this change

The `arrow-avro` writer currently fails on two classes of *valid* Arrow inputs:

1. **Nullable `Struct` with non-nullable children + row-wise sliced encoding**

   When encoding a `RecordBatch` row by row, a nullable `Struct` field whose child is non-nullable can cause the writer to error with `Invalid argument error: Avro site '{field}' is non-nullable, but array contains nulls`, even when the parent `Struct` is null at that row and the child value should be ignored.

2. **Dense `UnionArray` with non-zero, non-consecutive type ids**

   A dense `UnionArray` whose `UnionFields` use type ids such as `2` and `5` will currently fail with a `SchemaError("Binding and field mismatch")`, even though this layout is valid per Arrow's union semantics.

This PR updates the `RecordEncoder` to resolve both issues and better respect Arrow's struct/union semantics.

# What changes are included in this PR?

This PR touches only the `arrow-avro` writer implementation, specifically `arrow-avro/src/writer/encoder.rs` and `arrow-avro/src/writer/mod.rs`.

**1. Fix nullable struct + non-nullable child handling**

* Adjusts the `RecordEncoder` / `StructEncoder` path so that **child field null validation is masked by the parent `Struct`'s null bitmap**.
* For rows where the parent `Struct` value is null, the encoder now **skips encoding the non-nullable children** for that row, instead of treating child-side nulls as a violation of the Avro site's nullability.
* This ensures that row-wise encoding of a sliced `RecordBatch`, like the one in the issue's reproduction, now succeeds without triggering `Invalid argument error: Avro site '{field}' is non-nullable, but array contains nulls`. A minimal sketch of the previously failing input follows this list.
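As a rough illustration of this first failure mode (not the PR's actual test code; the `arrow_avro` import paths and the exact `build` call shape are assumptions inferred from the builder API quoted in the tests section below), the sketch builds a nullable struct whose non-nullable child holds a null that is masked by the parent's validity bitmap, then encodes one sliced row at a time:

```rust
use std::sync::Arc;

// NOTE: the arrow_avro module paths below are approximate.
use arrow_array::{Int32Array, RecordBatch, StructArray};
use arrow_avro::schema::FingerprintStrategy;
use arrow_avro::writer::{format::AvroSoeFormat, WriterBuilder};
use arrow_buffer::NullBuffer;
use arrow_schema::{DataType, Field, Fields, Schema};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Non-nullable child "id" inside a nullable parent struct "s".
    let child_fields = Fields::from(vec![Field::new("id", DataType::Int32, false)]);

    // The child holds a null at row 1, but that null is masked by the
    // parent's validity bitmap, which Arrow permits for non-nullable fields.
    let child = Int32Array::from(vec![Some(1), None, Some(3)]);
    let nulls = NullBuffer::from(vec![true, false, true]);
    let parent =
        StructArray::try_new(child_fields.clone(), vec![Arc::new(child)], Some(nulls))?;

    let schema = Schema::new(vec![Field::new("s", DataType::Struct(child_fields), true)]);
    let batch = RecordBatch::try_new(Arc::new(schema.clone()), vec![Arc::new(parent)])?;

    let mut writer = WriterBuilder::new(schema)
        .with_fingerprint_strategy(FingerprintStrategy::Id(1))
        .build::<_, AvroSoeFormat>(Vec::new())?;

    // Row-wise sliced encoding: row 1 (parent null) previously errored with
    // "Avro site 'id' is non-nullable, but array contains nulls".
    for row in 0..batch.num_rows() {
        writer.write(&batch.slice(row, 1))?;
    }
    Ok(())
}
```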
**2. Support dense unions with non-zero, non-consecutive type ids**

* Updates the union encoding path (`UnionEncoder`) so that it no longer assumes Arrow dense union type ids are `0..N-1`.
* The encoder now **builds an explicit mapping from Arrow `type_ids` (as declared in `UnionFields`) to Avro union branch indices**, and uses this mapping when:
  * constructing the union's Avro schema binding, and
  * writing out the branch index and value for each union element.
* As a result, dense unions with type ids such as `2` and `5` now encode successfully, matching Arrow's union semantics, which only require type ids to be consistent with `UnionFields`, not contiguous or zero-based. A minimal sketch of such an input follows this list.
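Again as an illustrative sketch rather than the PR's exact test (`AvroWriter::try_new(writer, schema)` is assumed here as the OCF writer constructor referenced in the tests section below): a dense union whose `UnionFields` declare type ids `2` and `5`, written through the Avro writer.

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, Int32Array, RecordBatch, StringArray, UnionArray};
use arrow_avro::writer::AvroWriter;
use arrow_buffer::ScalarBuffer;
use arrow_schema::{DataType, Field, Schema, UnionFields, UnionMode};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // UnionFields with non-zero, non-consecutive type ids 2 and 5.
    let union_fields = UnionFields::new(
        vec![2, 5],
        vec![
            Field::new("str", DataType::Utf8, false),
            Field::new("int", DataType::Int32, false),
        ],
    );

    // Three rows: "a" (type id 2), 1 (type id 5), "b" (type id 2).
    let type_ids = ScalarBuffer::<i8>::from(vec![2, 5, 2]);
    let offsets = ScalarBuffer::<i32>::from(vec![0, 0, 1]);
    let children: Vec<ArrayRef> = vec![
        Arc::new(StringArray::from(vec!["a", "b"])),
        Arc::new(Int32Array::from(vec![1])),
    ];
    let union = UnionArray::try_new(union_fields.clone(), type_ids, Some(offsets), children)?;

    let schema = Schema::new(vec![Field::new(
        "u",
        DataType::Union(union_fields, UnionMode::Dense),
        false,
    )]);
    let batch = RecordBatch::try_new(Arc::new(schema.clone()), vec![Arc::new(union)])?;

    // Previously failed with SchemaError("Binding and field mismatch"); the
    // encoder now maps type ids {2, 5} to Avro branch indices {0, 1}.
    let mut writer = AvroWriter::try_new(Vec::new(), schema)?;
    writer.write(&batch)?;
    writer.finish()?;
    Ok(())
}
```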
**3. Regression tests for both bugs**

Adds targeted regression tests under `arrow-avro/src/writer/mod.rs`'s test module to validate the fixes:

1. **`test_nullable_struct_with_nonnullable_field_sliced_encoding`**
   * Builds the nullable `Struct` + non-nullable child scenario from the issue.
   * Encodes the `RecordBatch` one row at a time via `WriterBuilder::new(schema).with_fingerprint_strategy(FingerprintStrategy::Id(1)).build::<_, AvroSoeFormat>(...)` and asserts all rows encode successfully.
2. **`test_nullable_struct_with_decimal_and_timestamp_sliced`**
   * Constructs a `RecordBatch` containing nullable `Struct` fields populated with `Decimal128` and `TimestampMicrosecond` values to verify encoding of complex nested data.
   * Encodes the `RecordBatch` one row at a time using `AvroSoeFormat` and `FingerprintStrategy::Id(1)`, asserting that each sliced row encodes successfully.
3. **`non_nullable_child_in_nullable_struct_should_encode_per_row`**
   * Builds a test case with a nullable `Struct` column containing a non-nullable child field, alongside a timestamp column.
   * Slices a single row from the batch and writes it via `AvroSoeFormat`, asserting that `writer.write` returns `Ok` to confirm the fix for per-row sliced encoding.
4. **`test_union_nonzero_type_ids`**
   * Constructs a dense `UnionArray` whose `UnionFields` use type ids `[2, 5]` and a mix of string/int values.
   * Encodes via `AvroWriter` and asserts that writing and finishing the writer both succeed without error.

Together these tests reproduce the failures described in #8934 and confirm that the new encoder behavior handles them correctly.

# Are these changes tested?

Yes.

* New unit tests cover both regression scenarios (nullable struct + non-nullable child, and dense union with non-zero, non-consecutive type ids).
* Existing writer/reader integration tests (round-trip tests, nested record tests, etc.) continue to pass unchanged, ensuring that the crate's previously tested behavior and public API remain intact.

# Are there any user-facing changes?

1. **Behavioral change (bug fix):**
   * Previously, valid and supported Arrow inputs could cause the Avro writer to error or panic in the two scenarios described above.
   * After this change, those inputs encode successfully and produce Avro output consistent with the generated or provided Avro schema.
2. **APIs and configuration:**
   * No public APIs, types, or configuration options are added, removed, or renamed.
   * The on-wire Avro representation for already-supported layouts is unchanged; the encoder now simply accepts valid Arrow layouts that previously failed.

The change is strictly a non-breaking, backwards-compatible bug fix that makes the `arrow-avro` writer behave as expected.