kosiew commented on code in PR #19500:
URL: https://github.com/apache/datafusion/pull/19500#discussion_r2703645171
##########
datafusion/common/src/hash_utils.rs:
##########
@@ -513,24 +514,41 @@ fn hash_list_array<OffsetSize>(
where
OffsetSize: OffsetSizeTrait,
{
- let values = array.values();
- let offsets = array.value_offsets();
- let nulls = array.nulls();
- let mut values_hashes = vec![0u64; values.len()];
- create_hashes([values], random_state, &mut values_hashes)?;
- if let Some(nulls) = nulls {
-     for (i, (start, stop)) in offsets.iter().zip(offsets.iter().skip(1)).enumerate() {
-         if nulls.is_valid(i) {
+     // In case values is sliced, hash only the bytes used by the offsets of this ListArray
+     let first_offset = array.value_offsets().first().cloned().unwrap_or_default();
+     let last_offset = array.value_offsets().last().cloned().unwrap_or_default();
+ let value_bytes_len = (last_offset - first_offset).as_usize();
+ let mut values_hashes = vec![0u64; value_bytes_len];
Review Comment:
This allocates a fresh `values_hashes` Vec for every list column hashed.
Could we reuse a buffer (similar to `HASH_BUFFER` above) or early-return when
`value_bytes_len` is zero, to trim the repeated allocations?
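
A minimal sketch of the suggestion, assuming a thread-local buffer analogous to the `HASH_BUFFER` mentioned above (the names `VALUES_HASH_BUFFER` and `hash_values_with_reused_buffer` are hypothetical, not DataFusion API):

```rust
use std::cell::RefCell;

// Hypothetical thread-local buffer, mirroring the HASH_BUFFER idea:
// cleared and resized on each call instead of freshly allocated.
thread_local! {
    static VALUES_HASH_BUFFER: RefCell<Vec<u64>> = RefCell::new(Vec::new());
}

// Sketch: early-return for empty input, otherwise reuse the buffer.
// In the real code, create_hashes(...) would fill the buffer in place.
fn hash_values_with_reused_buffer(value_bytes_len: usize) -> Vec<u64> {
    if value_bytes_len == 0 {
        // Nothing to hash; skip the buffer entirely.
        return Vec::new();
    }
    VALUES_HASH_BUFFER.with(|buf| {
        let mut buf = buf.borrow_mut();
        buf.clear();
        buf.resize(value_bytes_len, 0u64);
        // ... create_hashes([values], random_state, &mut buf)? here ...
        buf.clone() // returned only for illustration
    })
}

fn main() {
    assert!(hash_values_with_reused_buffer(0).is_empty());
    assert_eq!(hash_values_with_reused_buffer(4).len(), 4);
}
```

The `buf.clone()` is only so the sketch compiles as a standalone function; in the actual hot path the hashes would be consumed while borrowed, avoiding the copy.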
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]