ZhangHuiGui commented on code in PR #40484:
URL: https://github.com/apache/arrow/pull/40484#discussion_r1543037286


##########
cpp/src/arrow/compute/key_hash.cc:
##########
@@ -378,20 +378,40 @@ void Hashing32::HashFixed(int64_t hardware_flags, bool combine_hashes, uint32_t
   }
 }
 
-void Hashing32::HashMultiColumn(const std::vector<KeyColumnArray>& cols,
-                                LightContext* ctx, uint32_t* hashes) {
-  uint32_t num_rows = static_cast<uint32_t>(cols[0].length());
-
-  constexpr uint32_t max_batch_size = util::MiniBatch::kMiniBatchLength;
+Status Hashing32::HashMultiColumn(const std::vector<KeyColumnArray>& cols,
+                                  LightContext* ctx, uint32_t* hashes) {
+  auto num_rows = static_cast<uint32_t>(cols[0].length());
+  // The max_batch_size represents the number of rows to be processed in each iteration,
+  // and it is used to allocate enough space for the allocated TempVectorStack.
+  const auto max_batch_size =
+      std::min(num_rows, static_cast<uint32_t>(util::MiniBatch::kMiniBatchLength));
+
+  // pre calculate alloc size in TempVectorStack for hash_temp_buf, null_hash_temp_buf
+  // and null_indices_buf
+  const auto alloc_hash_temp_buf =
+      util::TempVectorStack::EstimateAllocationSize(max_batch_size * sizeof(uint32_t));
+  const auto alloc_for_null_indices_buf =
+      util::TempVectorStack::EstimateAllocationSize(max_batch_size * sizeof(uint16_t));
+  const auto alloc_size = alloc_hash_temp_buf * 2 + alloc_for_null_indices_buf;
+
+  std::unique_ptr<util::TempVectorStack> temp_stack(nullptr);
+  auto stack = ctx->stack;
+  if (!stack) {

Review Comment:
   The PR's enhancement and fix:
   1. For the user-facing case: efficiently calculating the hash values of a RecordBatch is a relatively common operation (for example, users who want RecordBatch-level uniqueness can combine the key_hash.h and key_map.h interfaces to achieve it). Even though the hashing code is currently only used by the grouper and the hash join, this interface can also be called directly by users, so I think it's reasonable to support this.
   
   2. Fixed a bug hit by users who call this interface directly: a user cannot know how much TempVectorStack space HashBatch needs internally, so they may pass in a stack that is too small, which causes a crash. See the basic case below (and the sizing sketch after it):
   ```cpp
   TEST(HashBatch, BasicTest) {
     auto arr = arrow::ArrayFromJSON(arrow::int32(), "[9,2,6]");
     const int batch_len = arr->length();
     arrow::compute::ExecBatch exec_batch({arr}, batch_len);
     auto ctx = arrow::compute::default_exec_context();
     arrow::util::TempVectorStack stack;
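     // The stack is sized only for the hash output (batch_len * sizeof(uint32_t));
     // HashBatch also needs internal temporary buffers, so this allocation is too
     // small and triggers the crash described above.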
     ASSERT_OK(stack.Init(ctx->memory_pool(), batch_len * sizeof(uint32_t)));
   
     std::vector<uint32_t> hashes(batch_len);
     std::vector<arrow::compute::KeyColumnArray> temp_column_arrays;
     ASSERT_OK(arrow::compute::Hashing32::HashBatch(
         exec_batch, hashes.data(), temp_column_arrays,
         ctx->cpu_info()->hardware_flags(), &stack, 0, batch_len));
   
     for (int i = 0; i < batch_len; i++) {
       std::cout << hashes[i] << " ";
     }
   }
   ```
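   
   For reference, here is a minimal sizing sketch of my own (not part of this PR) that mirrors the estimation logic in the diff above. It reuses `batch_len` and `ctx` from the test, and assumes `util::TempVectorStack::EstimateAllocationSize` (as used in this PR's changes) is callable from user code:
   
   ```cpp
   // Sketch: size the stack for what HashMultiColumn actually uses per mini batch,
   // i.e. two uint32_t hash temp buffers plus one uint16_t null-indices buffer,
   // instead of only batch_len * sizeof(uint32_t).
   const auto max_batch_size =
       std::min(static_cast<uint32_t>(batch_len),
                static_cast<uint32_t>(arrow::util::MiniBatch::kMiniBatchLength));
   const auto hash_buf_size = arrow::util::TempVectorStack::EstimateAllocationSize(
       max_batch_size * sizeof(uint32_t));
   const auto null_indices_size = arrow::util::TempVectorStack::EstimateAllocationSize(
       max_batch_size * sizeof(uint16_t));
   arrow::util::TempVectorStack stack;
   ASSERT_OK(stack.Init(ctx->memory_pool(), 2 * hash_buf_size + null_indices_size));
   ```
   
   If I read the diff correctly, after this change HashMultiColumn computes this estimate itself and allocates its own TempVectorStack when `ctx->stack` is not provided, so direct callers no longer need to guess the right size.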
   


