Rachelint commented on code in PR #11718:
URL: https://github.com/apache/datafusion/pull/11718#discussion_r1696979055


##########
datafusion/physical-plan/src/aggregates/group_values/row.rs:
##########
@@ -120,12 +120,13 @@ impl GroupValues for GroupValuesRows {
         batch_hashes.resize(n_rows, 0);
         create_hashes(cols, &self.random_state, batch_hashes)?;
 
-        for (row, &hash) in batch_hashes.iter().enumerate() {
-            let entry = self.map.get_mut(hash, |(_hash, group_idx)| {
+        for (row, &target_hash) in batch_hashes.iter().enumerate() {
+            let entry = self.map.get_mut(target_hash, |(exist_hash, 
group_idx)| {
                 // verify that a group that we are inserting with hash is
                 // actually the same key value as the group in
                 // existing_idx  (aka group_values @ row)
-                group_rows.row(row) == group_values.row(*group_idx)
+                target_hash == *exist_hash

Review Comment:
   > > I was curious -- I think the code @Rachelint is referring to in 
hashbrown is here
   > > 
https://github.com/rust-lang/hashbrown/blob/ac00a0bbef46f02f555e235f57ce263aefa361e0/src/raw/mod.rs#L2183-L2199
   > > It does appear that collisions could happen (it is doing some sort of 
abbreviated check by condensing the actual hash value to a byte or something), 
though I don't fully understand how it works.
   > > I wonder if there are other places we could use this observation 🤔
   > 
   > This talk explains the details of this one-byte abbreviation trick 
https://www.youtube.com/watch?v=ncHmEUmJZf4 , I vaguely remember they said when 
this 1 byte check is done, it's very likely to find the correct slot. Looks 
like it's not working well when the hash table size grows over some threshold?
   
   Seems make sense, maybe we should only add this check when found the hash 
table size is larger than a threshold 🤔 .



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to