lichuang commented on code in PR #9275:
URL: https://github.com/apache/arrow-rs/pull/9275#discussion_r2739603634


##########
arrow-row/src/list.rs:
##########
@@ -120,17 +120,35 @@ pub unsafe fn decode<O: OffsetSizeTrait>(
     let mut offsets = Vec::with_capacity(rows.len() + 1);
     offsets.push(O::usize_as(0));
 
+    // Check if child is Null type - if so, we need special handling
+    // to count elements correctly even when they decode to 0 bytes
+    let is_null_child = matches!(&field.data_type, DataType::List(f) if 
f.data_type() == &DataType::Null)

Review Comment:
   Hi @alamb, thank you for the review. I agree that this solution is treating 
the symptom rather than the underlying root cause. Let me explain the rationale 
and what a proper fix would involve.
   
     The root cause is in the encoding phase: for List<Null>, both child 
elements and the list end marker are encoded as `EMPTY_SENTINEL` (1 byte), 
making them indistinguishable during decoding.
   
    However, fixing it at the encoding level would require:
   
     1. Breaking change to the row format: We'd need to encode NullArray 
elements differently (e.g., using `NON_EMPTY_SENTINEL` actual data blocks, or 
changing the list terminator)
     2. Backward compatibility concerns: The row format is a stable 
serialization format. Changing encoding would break compatlity with existing 
serialized data
     3. Complexity: Any encoding change would need to handle both 
ascending/descending orders, nested lists, and ensure comparn semantics remain 
correct
   
   Since `List<Null>` is an edge case, Only `NullArray` has this ambiguity 
since it produces zero bytes of data, so Existing serialized data continues to 
work.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to