jecsand838 opened a new pull request, #8935:
URL: https://github.com/apache/arrow-rs/pull/8935

   # Which issue does this PR close?
   
   - Closes https://github.com/apache/arrow-rs/issues/8934
   
   # Rationale for this change
   
   The `arrow-avro` writer currently fails on two classes of *valid* Arrow 
inputs:
   
   1. **Nullable `Struct` with non‑nullable children + row‑wise sliced 
encoding**
   
      When encoding a `RecordBatch` row‑by‑row, a nullable `Struct` field whose 
child is non‑nullable can cause the writer to error with `Invalid argument 
error: Avro site '{field}' is non-nullable, but array contains nulls`, even 
when the parent `Struct` is null at that row and the child value should be 
ignored.
   
   2. **Dense `UnionArray` with non‑zero, non‑consecutive type ids**
   
      A dense `UnionArray` whose `UnionFields` use type ids such as `2` and `5` 
will currently fail with a `SchemaError("Binding and field mismatch")`, even 
though this layout is valid per Arrow’s union semantics.
   
   This PR updates the `RecordEncoder` to resolve both issues and to better 
respect Arrow’s struct and union semantics. A minimal reproduction of scenario 
1 is sketched below.
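   
   As a reproduction sketch of scenario 1 (field names and values here are 
illustrative, not the literal test code):
   
   ```rust
   use std::sync::Arc;
   
   use arrow_array::{ArrayRef, Int64Array, RecordBatch, StructArray};
   use arrow_buffer::NullBuffer;
   use arrow_schema::{DataType, Field, Fields, Schema};
   
   fn build_batch() -> RecordBatch {
       // A non-nullable child field inside a *nullable* struct.
       let child_fields = Fields::from(vec![Field::new("id", DataType::Int64, false)]);
       // The child is null at row 1, but the parent struct is also null at
       // row 1, so the child value is masked and should never be read.
       let child: ArrayRef = Arc::new(Int64Array::from(vec![Some(1), None, Some(3)]));
       let parent_nulls = NullBuffer::from(vec![true, false, true]);
       let strct = StructArray::new(child_fields.clone(), vec![child], Some(parent_nulls));
   
       let schema = Schema::new(vec![Field::new("s", DataType::Struct(child_fields), true)]);
       RecordBatch::try_new(Arc::new(schema), vec![Arc::new(strct) as ArrayRef]).unwrap()
   }
   
   fn encode_row_wise(batch: &RecordBatch) {
       // Row-wise (sliced) encoding as in the tests described below; prior
       // to this PR, writing the slice at row 1 errored even though the
       // child null is masked by the parent struct's null.
       for row in 0..batch.num_rows() {
           let _slice = batch.slice(row, 1);
           // writer.write(&_slice)?;
       }
   }
   ```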
   
   # What changes are included in this PR?
   
   This PR touches only the `arrow-avro` writer implementation, specifically 
`arrow-avro/src/writer/encoder.rs` and `arrow-avro/src/writer/mod.rs`.
   
   **1. Fix nullable struct + non‑nullable child handling**
   
   * Adjusts the `RecordEncoder` / `StructEncoder` path so that **child field 
null validation is masked by the parent `Struct`’s null bitmap** (see the 
sketch after this list).
   * For rows where the parent `Struct` value is null, the encoder now **skips 
encoding the non‑nullable children** for that row, instead of treating any 
child‑side nulls as a violation of the Avro site’s nullability.
   * This ensures that row‑wise encoding of a sliced `RecordBatch`, such as 
the one in the issue’s reproduction test, now succeeds without triggering 
`Invalid argument error: Avro site '{field}' is non-nullable, but array 
contains nulls`.
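   
   The rule can be sketched as follows (illustrative only, not the actual 
`encoder.rs` code; the helper name and `parent_nulls` parameter are made up 
for exposition):
   
   ```rust
   use arrow_array::Array;
   use arrow_buffer::NullBuffer;
   
   /// A non-nullable child only violates the Avro site's nullability if it
   /// is null at a row where the parent struct is itself valid.
   fn child_violates_non_nullable(child: &dyn Array, parent_nulls: Option<&NullBuffer>) -> bool {
       (0..child.len()).any(|row| {
           let parent_valid = parent_nulls.map_or(true, |n| n.is_valid(row));
           parent_valid && child.is_null(row)
       })
   }
   ```
   
   Rows where the parent is null are skipped outright during encoding, so the 
masked child values are never read.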
   
   **2. Support dense unions with non‑zero, non‑consecutive type ids**
   
   * Updates the union encoding path (`UnionEncoder`) so that it no longer 
assumes Arrow dense union type ids are `0..N-1`.
   * The encoder now **builds an explicit mapping from Arrow `type_ids` (as 
declared in `UnionFields`) to Avro union branch indices** (sketched after this 
list), and uses this mapping when:
     * constructing the union’s Avro schema binding, and
     * writing out the branch index and value for each union element.
   * As a result, dense unions with type ids such as `2` and `5` now encode 
successfully, matching Arrow’s union semantics, which require only that type 
ids be consistent with `UnionFields`, not that they be contiguous or 
zero‑based.
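   
   The mapping idea, sketched (not the literal `UnionEncoder` code; since 
Arrow type ids are `i8` in `0..=127`, a fixed-size lookup table suffices):
   
   ```rust
   use arrow_array::UnionArray;
   use arrow_schema::UnionFields;
   
   /// Map each declared type id to its Avro branch index, i.e. its position
   /// within `UnionFields`, instead of assuming ids are 0..N-1.
   fn branch_lookup(fields: &UnionFields) -> [Option<usize>; 128] {
       let mut lookup = [None; 128];
       for (branch, (type_id, _field)) in fields.iter().enumerate() {
           lookup[type_id as usize] = Some(branch);
       }
       lookup
   }
   
   fn avro_branch(union: &UnionArray, lookup: &[Option<usize>; 128], row: usize) -> usize {
       lookup[union.type_id(row) as usize].expect("type id must be declared in UnionFields")
   }
   ```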
   
   **3. Regression tests for both bugs**
   
   Adds targeted regression tests to the test module in 
`arrow-avro/src/writer/mod.rs` to validate the fixes:
   
   1. **`test_nullable_struct_with_nonnullable_field_sliced_encoding`**
     * Builds the nullable `Struct` + non‑nullable child scenario from the 
issue.
     * Encodes the `RecordBatch` one row at a time via `WriterBuilder::new(schema).with_fingerprint_strategy(FingerprintStrategy::Id(1)).build::<_, AvroSoeFormat>(...)` and asserts all rows encode successfully.
   2. **`test_nullable_struct_with_decimal_and_timestamp_sliced`**
     * Constructs a `RecordBatch` containing nullable `Struct` fields populated 
with `Decimal128` and `TimestampMicrosecond` types to verify encoding of 
complex nested data.
     * Encodes the `RecordBatch` one row at a time using `AvroSoeFormat` and 
`FingerprintStrategy::Id(1)`, asserting that each sliced row encodes 
successfully.
   3. **`non_nullable_child_in_nullable_struct_should_encode_per_row`**
     * Builds a test case with a nullable `Struct` column containing a 
non-nullable child field, alongside a timestamp column.
     * Slices a single row from the batch and writes it via `AvroSoeFormat`, 
asserting that `writer.write` returns `Ok` to confirm the fix for sliced 
encoding constraints.
   4. **`test_union_nonzero_type_ids`**
     * Constructs a dense `UnionArray` whose `UnionFields` use type ids `[2, 
5]` and a mix of string/int values.
     * Encodes via `AvroWriter` and asserts that writing and finishing the 
writer both succeed without error.
   
   Together these tests reproduce the failures described in #8934 and confirm 
that the new encoder behavior handles them correctly.
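   
   For reference, the union fixture for the fourth test can be built roughly 
as follows (a sketch; the committed test’s field names and values may differ):
   
   ```rust
   use std::sync::Arc;
   
   use arrow_array::{ArrayRef, Int32Array, RecordBatch, StringArray, UnionArray};
   use arrow_buffer::ScalarBuffer;
   use arrow_schema::{DataType, Field, Schema, UnionFields, UnionMode};
   
   fn build_union_batch() -> RecordBatch {
       // UnionFields with deliberately non-zero, non-consecutive type ids.
       let fields = UnionFields::new(
           vec![2, 5],
           vec![
               Field::new("str", DataType::Utf8, false),
               Field::new("int", DataType::Int32, false),
           ],
       );
       let children: Vec<ArrayRef> = vec![
           Arc::new(StringArray::from(vec!["a", "b"])),
           Arc::new(Int32Array::from(vec![10])),
       ];
       // Rows: "a" (type id 2), 10 (type id 5), "b" (type id 2).
       let type_ids = ScalarBuffer::<i8>::from(vec![2, 5, 2]);
       let offsets = ScalarBuffer::<i32>::from(vec![0, 0, 1]);
       let union = UnionArray::try_new(fields.clone(), type_ids, Some(offsets), children).unwrap();
   
       let schema = Schema::new(vec![Field::new(
           "u",
           DataType::Union(fields, UnionMode::Dense),
           false,
       )]);
       RecordBatch::try_new(Arc::new(schema), vec![Arc::new(union) as ArrayRef]).unwrap()
   }
   ```
   
   Writing this batch previously failed with the `SchemaError` above; with the 
explicit type‑id mapping it now encodes successfully.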
   
   # Are these changes tested?
   
   Yes.
   
   * New unit tests are added for both regression scenarios (nullable struct + 
non‑nullable child, and dense union with non‑zero & non‑consecutive type ids).
   * Existing writer / reader integration tests (round‑trip tests, nested 
record tests, etc.) continue to pass unchanged, ensuring that the crate’s 
previously tested behavior and public API remain intact.
   
   # Are there any user-facing changes?
   
   1. **Behavioral change (bug fix):**
     * Previously, valid and supported Arrow inputs could cause the Avro writer 
to error or panic in the two scenarios described above.
     * After this change, those inputs encode successfully and produce Avro 
output consistent with the generated or provided Avro schema.
   2. **APIs and configuration:**
     * No public APIs, types, or configuration options are added, removed, or 
renamed.
     * The on‑wire Avro representation for already‑supported layouts is 
unchanged; the encoder now additionally accepts valid Arrow layouts that 
previously failed.
   
   The change is strictly a backwards‑compatible bug fix that makes the 
`arrow-avro` writer behave as expected.
   

