alamb commented on code in PR #9176:
URL: https://github.com/apache/arrow-rs/pull/9176#discussion_r2738925649


##########
arrow-row/src/lib.rs:
##########
@@ -417,6 +417,54 @@ mod variable;
 ///
 ///```
 ///
+/// ## ListView Encoding
+///
+/// ListView arrays differ from List arrays in their representation: instead 
of using
+/// consecutive offset pairs to define each list, ListView uses explicit 
offset and size
+/// pairs for each element. This allows ListView elements to reference 
arbitrary (potentially
+/// overlapping) regions of the child array.
+///
+/// Despite this structural difference, ListView uses the **same row encoding 
as List**.
+/// Each list value is encoded as the concatenation of its child elements 
(each separately
+/// variable-length encoded), followed by a variable-length encoded empty byte 
array terminator.
+///
+/// **Important**: When a ListView is decoded back from row format, it is 
still a
+/// ListView, but any child element sharing that may have existed in the 
original
+/// (where multiple list entries could reference overlapping regions of the 
child
+/// array) is **not preserved** - each list's children are decoded 
independently
+/// with sequential offsets.
+///
+/// For example, given a ListView with offset/size pairs:
+///
+/// ```text
+/// offsets: [0, 1, 0]
+/// sizes:   [2, 2, 0]
+/// values:  [1_u8, 2_u8, 3_u8]
+///
+/// Resulting lists:
+/// [1_u8, 2_u8]  (offset=0, size=2 -> values[0..2])
+/// [2_u8, 3_u8]  (offset=1, size=2 -> values[1..3])
+/// []            (offset=0, size=0 -> empty)
+/// ```
+///
+/// The elements would be encoded identically to List encoding:
+///
+/// ```text
+///                         ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
+///  [1_u8, 2_u8]           │02│01│01│00│00│02│02│01│02│00│00│02│01│
+///                         └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
+///                          └──── 1_u8 ────┘   └──── 2_u8 ────┘
+///
+///                         ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
+///  [2_u8, 3_u8]           │02│01│02│00│00│02│02│01│03│00│00│02│01│
+///                         └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
+///                          └──── 2_u8 ────┘   └──── 3_u8 ────┘
+///
+///```
+///

Review Comment:
   Thank you 



##########
arrow-ord/src/ord.rs:
##########
@@ -233,6 +233,42 @@ fn compare_fixed_list(
     Ok(f)
 }
 
+fn compare_list_view<O: OffsetSizeTrait>(

Review Comment:
   While useful, I didn't see any test coverage for this code
   
   I think we should either break it into its own PR, or add tests. Here is a 
PR to do so:
   - https://github.com/brancz/arrow-rs/pull/1
   
   (I am also happy to review a separate PR)



##########
arrow-row/src/list.rs:
##########
@@ -323,3 +316,180 @@ pub unsafe fn decode_fixed_size_list(
         builder.build_unchecked()
     }))
 }
+
+/// Computes the encoded length for a single list element given its child rows.
+///
+/// This is used by list types (List, LargeList, ListView, LargeListView) to 
determine
+/// the encoded length of a list element. For null elements, returns 1 (null 
sentinel only).
+/// For valid elements, returns 1 + the sum of padded lengths for each child 
row.
+#[inline]
+fn list_element_encoded_len(rows: &Rows, range: Option<Range<usize>>) -> usize 
{
+    match range {
+        None => 1,
+        Some(range) => {
+            1 + range
+                .map(|i| 
super::variable::padded_length(Some(rows.row(i).as_ref().len())))
+                .sum::<usize>()
+        }
+    }
+}
+
+/// Computes the encoded lengths for a `GenericListViewArray`
+///
+/// `rows` should contain the encoded child elements
+pub fn compute_lengths_list_view<O: OffsetSizeTrait>(
+    lengths: &mut [usize],
+    rows: &Rows,
+    array: &GenericListViewArray<O>,
+    shift: usize,
+) {
+    let offsets = array.value_offsets();
+    let sizes = array.value_sizes();
+
+    lengths.iter_mut().enumerate().for_each(|(idx, length)| {
+        let start = offsets[idx].as_usize() - shift;
+        let size = sizes[idx].as_usize();
+        let range = array.is_valid(idx).then_some(start..start + size);
+        *length += list_element_encoded_len(rows, range);
+    });
+}
+
+/// Encodes the provided `GenericListViewArray` to `out` with the provided 
`SortOptions`
+///
+/// `rows` should contain the encoded child elements
+pub fn encode_list_view<O: OffsetSizeTrait>(
+    data: &mut [u8],
+    out_offsets: &mut [usize],
+    rows: &Rows,
+    opts: SortOptions,
+    array: &GenericListViewArray<O>,
+    shift: usize,
+) {
+    let offsets = array.value_offsets();
+    let sizes = array.value_sizes();
+
+    out_offsets
+        .iter_mut()
+        .skip(1)
+        .enumerate()
+        .for_each(|(idx, offset)| {
+            let start = offsets[idx].as_usize() - shift;
+            let size = sizes[idx].as_usize();
+            let range = array.is_valid(idx).then_some(start..start + size);
+            let out = &mut data[*offset..];
+            *offset += encode_one(out, rows, range, opts)
+        });
+}
+
+/// Decodes a `GenericListViewArray` from `rows` with the provided `options`
+///
+/// # Safety
+///
+/// `rows` must contain valid data for the provided `converter`
+pub unsafe fn decode_list_view<O: OffsetSizeTrait>(
+    converter: &RowConverter,
+    rows: &mut [&[u8]],
+    field: &SortField,
+    validate_utf8: bool,
+) -> Result<GenericListViewArray<O>, ArrowError> {
+    let opts = field.options;
+
+    let mut values_bytes = 0;
+
+    let mut child_count = 0usize;
+    let mut list_sizes: Vec<O> = Vec::with_capacity(rows.len());
+
+    // First pass: count children and compute sizes
+    for row in rows.iter_mut() {
+        let mut row_offset = 0;
+        let mut list_size = 0usize;
+        loop {
+            let decoded = super::variable::decode_blocks(&row[row_offset..], 
opts, |x| {
+                values_bytes += x.len();
+            });
+            if decoded <= 1 {
+                list_sizes.push(O::usize_as(list_size));
+                break;
+            }
+            row_offset += decoded;
+            child_count += 1;
+            list_size += 1;
+        }
+    }
+    O::from_usize(child_count).expect("overflow");

Review Comment:
   I think being consistent is fine -- especially since this is erroring when 
child_count doesn't fit in a usize (which means 4B children on 32 bit 
architectures, so I don't think it will be common)
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to