This is an automated email from the ASF dual-hosted git repository.

zanmato pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/main by this push:
     new 52b3184f29 MINOR: [C++] Fix the wrong comment in #45104 (#45111)
52b3184f29 is described below

commit 52b3184f29395d89224c84f16b52d1e66d4813aa
Author: Rossi Sun <[email protected]>
AuthorDate: Thu Dec 26 23:01:06 2024 +0800

    MINOR: [C++] Fix the wrong comment in #45104 (#45111)
    
    
    
    ### Rationale for this change
    
    In my last typo fix in #45104 I made a mistake: 
https://github.com/apache/arrow/pull/45104/files#diff-a1f24396d20cd510ae48838e684fbc5ba7b23475702ae849a2d13bf48e5251aaR284
 I was too careless. The size of a block in swiss table is much more subtle 
than I thought. It turns out the original `12KB` was right.
    
    ### What changes are included in this PR?
    
    Removing my last wrong change and elaborate a bit about how the number 
`12KB` comes.
    
    ### Are these changes tested?
    
    ### Are there any user-facing changes?
    
    Authored-by: Rossi Sun <[email protected]>
    Signed-off-by: Rossi Sun <[email protected]>
---
 cpp/src/arrow/compute/key_map_internal.cc | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/cpp/src/arrow/compute/key_map_internal.cc 
b/cpp/src/arrow/compute/key_map_internal.cc
index 81f1543cef..f134c91455 100644
--- a/cpp/src/arrow/compute/key_map_internal.cc
+++ b/cpp/src/arrow/compute/key_map_internal.cc
@@ -281,13 +281,18 @@ void SwissTable::early_filter_imp(const int num_keys, 
const uint32_t* hashes,
 // When we reach this limit, we need to break processing of any further rows 
and resize.
 //
 uint64_t SwissTable::num_groups_for_resize() const {
-  // Resize small hash tables when 50% full (up to 32KB).
-  // Resize large hash tables when 75% full.
+  // Consider N = 9 (aka 2 ^ 9 = 512 blocks) as small.
+  // When N = 9, a slot id takes N + 3 = 12 bits, rounded up to 16 bits. This 
is also the
+  // number of bits needed for a key id. Since each slot stores a status byte 
and a key
+  // id, then a slot takes 1 byte + 16 bits = 3 bytes. Therefore a block of 8 
slots takes
+  // 24 bytes. The threshold of a small hash table ends up being 24 bytes * 
512 = 12 KB.
   constexpr int log_blocks_small_ = 9;
   uint64_t num_slots = 1ULL << (log_blocks_ + 3);
   if (log_blocks_ <= log_blocks_small_) {
+    // Resize small hash tables when 50% full.
     return num_slots / 2;
   } else {
+    // Resize large hash tables when 75% full.
     return num_slots * 3 / 4;
   }
 }

Reply via email to