Jefffrey commented on code in PR #9275:
URL: https://github.com/apache/arrow-rs/pull/9275#discussion_r2740305226
##########
arrow-row/src/list.rs:
##########
@@ -120,17 +120,35 @@ pub unsafe fn decode<O: OffsetSizeTrait>(
let mut offsets = Vec::with_capacity(rows.len() + 1);
offsets.push(O::usize_as(0));
+ // Check if child is Null type - if so, we need special handling
+ // to count elements correctly even when they decode to 0 bytes
+ let is_null_child = matches!(&field.data_type, DataType::List(f) if
f.data_type() == &DataType::Null)
Review Comment:
> 1. Breaking change to the row format: We'd need to encode NullArray
elements differently (e.g., using NON_EMPTY_SENTINEL actual data blocks, or
changing the list terminator)
We shouldn't be too worried about breaking changes, because the row format
explicitly describes itself as potentially changing from release to release:
https://github.com/apache/arrow-rs/blob/caf1c6022c71af00ef712e9e865acfee74169f0d/arrow-row/src/lib.rs#L186-L189
> 2. Backward compatibility concerns: The row format is a stable
serialization format. Changing encoding would break compatlity with existing
serialized data
Same here
> 3. Complexity: Any encoding change would need to handle both
ascending/descending orders, nested lists, and ensure comparn semantics remain
correct
I believe the main encoding fix is ensuring `NULL` datatype is properly
encoded as a value that is null, and not as an empty value; see my comments on
the original issue. I feel this fix adds more complexity since it handles a bug
in the encoding logic, but at the decoding stage.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]