alamb commented on code in PR #8779:
URL: https://github.com/apache/arrow-datafusion/pull/8779#discussion_r1445277690
##########
datafusion/physical-expr/src/aggregate/count_distinct.rs:
##########
@@ -83,6 +83,11 @@ macro_rules! float_distinct_count_accumulator {
}};
}
+/// Returns the estimated number of hashbrown hashtables.
Review Comment:
I think we could potentially re-use the comments here:
https://github.com/apache/arrow-datafusion/blob/819d3577872a082f2aea7a68ae83d68534049662/datafusion/physical-plan/src/joins/hash_join.rs#L734-L749
I think the value of this PR / issue is to consolidate the logic in
`datafusion/physical-expr/src/aggregate/count_distinct.rs` with the logic in
`arrow-datafusion/datafusion/physical-plan/src/joins
/hash_join.rs` with comments explaining the rationale (aka answering
@crepererum 's in comments)
To that end what would you think about:
1. Adding the code to
`arrow-datafusion/datafusion/physical-plan/src/common.rs`
2. Use the comments from
https://github.com/apache/arrow-datafusion/blob/819d3577872a082f2aea7a68ae83d68534049662/datafusion/physical-plan/src/joins/hash_join.rs#L734-L749
to explain the calculation
3. Change the code in `hash_join.rs` to use it too
I think this may require changing the signature to something like
```rust
/// Estimates the memory allocated by a [`hashbrown::HashTable`].
///
/// (add explanation about size calculation here)
///
/// Note a [`hashbrown::HashSet`] is implemented as a HashTable with a zero
sized key
pub fn estimated_hashtable_size<T>(table: &HashTable<T, RandomState>) ->
usize {
...
}
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]