kosiew commented on code in PR #19500:
URL: https://github.com/apache/datafusion/pull/19500#discussion_r2703645171
##########
datafusion/common/src/hash_utils.rs:
##########
@@ -513,24 +514,41 @@ fn hash_list_array<OffsetSize>(
where
OffsetSize: OffsetSizeTrait,
{
- let values = array.values();
- let offsets = array.value_offsets();
- let nulls = array.nulls();
- let mut values_hashes = vec![0u64; values.len()];
- create_hashes([values], random_state, &mut values_hashes)?;
- if let Some(nulls) = nulls {
-     for (i, (start, stop)) in offsets.iter().zip(offsets.iter().skip(1)).enumerate() {
-         if nulls.is_valid(i) {
+     // In case values is sliced, hash only the bytes used by the offsets of this ListArray
+     let first_offset = array.value_offsets().first().cloned().unwrap_or_default();
+     let last_offset = array.value_offsets().last().cloned().unwrap_or_default();
+ let value_bytes_len = (last_offset - first_offset).as_usize();
+ let mut values_hashes = vec![0u64; value_bytes_len];
Review Comment:
This allocates a fresh `values_hashes` Vec for every list column hashed.
Could we reuse a buffer (similar to `HASH_BUFFER` above) or early-return when
`value_bytes_len` is zero, to trim the repeated allocations?
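
A minimal sketch of the suggestion, assuming a thread-local buffer analogous to the `HASH_BUFFER` mentioned above (the names `VALUES_HASH_BUFFER` and `hash_values_with_reused_buffer` are hypothetical, not DataFusion API):

```rust
use std::cell::RefCell;

// Hypothetical thread-local buffer, mirroring the HASH_BUFFER idea:
// cleared and resized on each call instead of freshly allocated.
thread_local! {
    static VALUES_HASH_BUFFER: RefCell<Vec<u64>> = RefCell::new(Vec::new());
}

// Sketch: early-return for empty input, otherwise reuse the buffer.
// In the real code, create_hashes(...) would fill the buffer in place.
fn hash_values_with_reused_buffer(value_bytes_len: usize) -> Vec<u64> {
    if value_bytes_len == 0 {
        // Nothing to hash; skip the buffer entirely.
        return Vec::new();
    }
    VALUES_HASH_BUFFER.with(|buf| {
        let mut buf = buf.borrow_mut();
        buf.clear();
        buf.resize(value_bytes_len, 0u64);
        // ... create_hashes([values], random_state, &mut buf)? here ...
        buf.clone() // returned only for illustration
    })
}

fn main() {
    assert!(hash_values_with_reused_buffer(0).is_empty());
    assert_eq!(hash_values_with_reused_buffer(4).len(), 4);
}
```

The `buf.clone()` is only so the sketch compiles as a standalone function; in the actual hot path the hashes would be consumed while borrowed, avoiding the copy.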
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]