Jefffrey commented on code in PR #20154:
URL: https://github.com/apache/datafusion/pull/20154#discussion_r2766542459


##########
datafusion/common/src/hash_utils.rs:
##########
@@ -449,6 +472,14 @@ fn hash_struct_array(
     random_state: &RandomState,
     hashes_buffer: &mut [u64],
 ) -> Result<()> {
+    // This nested-type hasher currently always combines its computed struct-row hash
+    // into `hashes_buffer` (equivalent to `rehash=true`). This preserves existing
+    // behavior for single-column hashing of nested types.
+    //
+    // If we were to add a `rehash` flag here and make `rehash=false` overwrite the
+    // buffer, it would change the numeric hash values produced for standalone
+    // Struct columns.

Review Comment:
   I think we should look into fixing this rather than justifying it as 
"keep existing behaviour", especially since we don't know the root cause of 
that behaviour.
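
To make the distinction in the quoted comment concrete, here is a minimal sketch of the two behaviours it describes: combining a newly computed row hash into `hashes_buffer` versus overwriting the buffer. Both `combine_hashes` and `write_hashes` are hypothetical stand-ins written for illustration, not the actual code in `hash_utils.rs`, and the mixing constants are assumptions.

```rust
/// Hypothetical hash combiner, used only for illustration.
/// The constants are assumptions, not taken from the PR.
fn combine_hashes(l: u64, r: u64) -> u64 {
    let hash = (17 * 37u64).wrapping_add(l);
    hash.wrapping_mul(37).wrapping_add(r)
}

/// Hypothetical helper: writes one hash per row into `hashes_buffer`.
/// With `rehash = true` the new hash is combined with the value already
/// in the buffer; with `rehash = false` the buffer slot is overwritten.
fn write_hashes(row_hashes: &[u64], hashes_buffer: &mut [u64], rehash: bool) {
    for (slot, &h) in hashes_buffer.iter_mut().zip(row_hashes) {
        *slot = if rehash { combine_hashes(h, *slot) } else { h };
    }
}
```

Starting from a zeroed buffer, the two modes yield different numeric values for the same row hashes, which is the change in standalone Struct column hashes that the code comment warns about.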



##########
datafusion/common/src/hash_utils.rs:
##########
@@ -400,6 +417,8 @@ fn update_hash_for_dict_key(
     dict_hashes: &[u64],
     dict_values: &dyn Array,
     idx: usize,
+    // `multi_col` is the historical name for what is now referred to as `rehash` elsewhere

Review Comment:
   It's better to fix the naming than to add a comment trying to explain the 
discrepancy.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

