[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #7029: Don't store hashes in GroupOrdering

via GitHub Wed, 19 Jul 2023 11:20:06 -0700


alamb commented on code in PR #7029:
URL: https://github.com/apache/arrow-datafusion/pull/7029#discussion_r1268484687



##########
datafusion/core/src/physical_plan/aggregates/row_hash.rs:
##########
@@ -624,15 +623,15 @@ impl GroupedHashAggregateStream {
                 }
                 std::mem::swap(&mut new_group_values, &mut self.group_values);
 
-                // rebuild hash table (maybe we should remove the
-                // entries for each group that was emitted rather than
-                // rebuilding the whole thing
-
-                let hashes = self.group_ordering.remove_groups(n);
-                assert_eq!(hashes.len(), self.group_values.num_rows());
-                self.map.clear();
-                for (idx, &hash) in hashes.iter().enumerate() {
-                    self.map.insert(hash, (hash, idx), |(hash, _)| *hash);
+                self.group_ordering.remove_groups(n);
+                // SAFETY: self.map outlives iterator and is not modified concurrently

Review Comment:
   I double checked: https://docs.rs/hashbrown/latest/hashbrown/raw/struct.RawTable.html#method.iter 👍



##########
datafusion/core/src/physical_plan/aggregates/row_hash.rs:
##########
@@ -624,15 +623,15 @@ impl GroupedHashAggregateStream {
                 }
                 std::mem::swap(&mut new_group_values, &mut self.group_values);
 
-                // rebuild hash table (maybe we should remove the
-                // entries for each group that was emitted rather than
-                // rebuilding the whole thing
-
-                let hashes = self.group_ordering.remove_groups(n);
-                assert_eq!(hashes.len(), self.group_values.num_rows());
-                self.map.clear();
-                for (idx, &hash) in hashes.iter().enumerate() {
-                    self.map.insert(hash, (hash, idx), |(hash, _)| *hash);
+                self.group_ordering.remove_groups(n);
+                // SAFETY: self.map outlives iterator and is not modified concurrently
+                unsafe {
+                    for bucket in self.map.iter() {
+                        match bucket.as_ref().1.checked_sub(n) {
+                            None => self.map.erase(bucket),
+                            Some(sub) => bucket.as_mut().1 = sub,
+                        }
+                    }

Review Comment:
   I think this is both wonderfully elegant and cryptic. How about some comments, so I don't have to figure this out again the next time I see this code:
   
   ```suggestion
                   unsafe {
                       for bucket in self.map.iter() {
                           // decrement group index by n                     
                           match bucket.as_ref().1.checked_sub(n) {
                               // group index was < n, so remove from table
                               None => self.map.erase(bucket),
                               // group index was >= n, shift value down
                               Some(sub) => bucket.as_mut().1 = sub,
                           }
                       }
   ```
   
   I double checked: https://docs.rs/hashbrown/latest/hashbrown/raw/struct.RawIter.html
   
   ```
   You must not free the hash table while iterating (including via growing/shrinking).
   It is fine to erase a bucket that has been yielded by the iterator.
   Erasing a bucket that has not yet been yielded by the iterator may still result in the iterator yielding that bucket (unless reflect_remove is called).
   It is unspecified whether an element inserted after the iterator was created will be yielded by that iterator (unless reflect_insert is called).
   The order in which the iterator yields bucket is unspecified and may change in the future.
   ```
   
   Which seems to be followed 👍 
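
For anyone reading this later, the same "drop the first `n` groups and renumber the rest" logic can be sketched without `unsafe`. This is only an illustration, not the code in the PR: it assumes a plain `std::collections::HashMap<u64, usize>` (hash to group index) instead of hashbrown's `RawTable`, and `shift_group_indices` is a hypothetical name.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of DataFusion): after the first `n` groups
/// have been emitted, remove their entries and shift the remaining group
/// indices down by `n`. Keys stand in for the group hash.
fn shift_group_indices(map: &mut HashMap<u64, usize>, n: usize) {
    map.retain(|_hash, group_idx| match group_idx.checked_sub(n) {
        // group index was < n: the group was emitted, so drop the entry
        None => false,
        // group index was >= n: shift it down and keep the entry
        Some(shifted) => {
            *group_idx = shifted;
            true
        }
    });
}
```

`retain` visits every entry once and removes those for which the closure returns `false`, which mirrors the erase-while-iterating pattern above while leaving the iteration safety to the standard library.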



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
