This is an automated email from the ASF dual-hosted git repository.
zanmato pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/main by this push:
new 52b3184f29 MINOR: [C++] Fix the wrong comment in #45104 (#45111)
52b3184f29 is described below
commit 52b3184f29395d89224c84f16b52d1e66d4813aa
Author: Rossi Sun <[email protected]>
AuthorDate: Thu Dec 26 23:01:06 2024 +0800
MINOR: [C++] Fix the wrong comment in #45104 (#45111)
### Rationale for this change
In my last typo fix in #45104 I made a mistake:
https://github.com/apache/arrow/pull/45104/files#diff-a1f24396d20cd510ae48838e684fbc5ba7b23475702ae849a2d13bf48e5251aaR284
I was too careless. The size of a block in swiss table is much more subtle
than I thought. It turns out the original `12KB` was right.
### What changes are included in this PR?
Removing my last wrong change and elaborate a bit about how the number
`12KB` comes.
### Are these changes tested?
### Are there any user-facing changes?
Authored-by: Rossi Sun <[email protected]>
Signed-off-by: Rossi Sun <[email protected]>
---
cpp/src/arrow/compute/key_map_internal.cc | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/cpp/src/arrow/compute/key_map_internal.cc
b/cpp/src/arrow/compute/key_map_internal.cc
index 81f1543cef..f134c91455 100644
--- a/cpp/src/arrow/compute/key_map_internal.cc
+++ b/cpp/src/arrow/compute/key_map_internal.cc
@@ -281,13 +281,18 @@ void SwissTable::early_filter_imp(const int num_keys,
const uint32_t* hashes,
// When we reach this limit, we need to break processing of any further rows
and resize.
//
uint64_t SwissTable::num_groups_for_resize() const {
- // Resize small hash tables when 50% full (up to 32KB).
- // Resize large hash tables when 75% full.
+ // Consider N = 9 (aka 2 ^ 9 = 512 blocks) as small.
+ // When N = 9, a slot id takes N + 3 = 12 bits, rounded up to 16 bits. This
is also the
+ // number of bits needed for a key id. Since each slot stores a status byte
and a key
+ // id, then a slot takes 1 byte + 16 bits = 3 bytes. Therefore a block of 8
slots takes
+ // 24 bytes. The threshold of a small hash table ends up being 24 bytes *
512 = 12 KB.
constexpr int log_blocks_small_ = 9;
uint64_t num_slots = 1ULL << (log_blocks_ + 3);
if (log_blocks_ <= log_blocks_small_) {
+ // Resize small hash tables when 50% full.
return num_slots / 2;
} else {
+ // Resize large hash tables when 75% full.
return num_slots * 3 / 4;
}
}