kou commented on code in PR #40484:
URL: https://github.com/apache/arrow/pull/40484#discussion_r1542401276


##########
cpp/src/arrow/compute/util.cc:
##########
@@ -32,10 +32,10 @@ using internal::CpuInfo;
 namespace util {
 
 void TempVectorStack::alloc(uint32_t num_bytes, uint8_t** data, int* id) {
-  int64_t new_top = top_ + PaddedAllocationSize(num_bytes) + 2 * sizeof(uint64_t);
-  // Stack overflow check (see GH-39582).
+  const auto estimate_size = EstimateAllocationSize(num_bytes);

Review Comment:
   ```suggestion
     const auto estimated_size = EstimateAllocationSize(num_bytes);
   ```



##########
cpp/src/arrow/compute/key_hash_test.cc:
##########
@@ -311,5 +311,52 @@ TEST(VectorHash, FixedLengthTailByteSafety) {
   HashFixedLengthFrom(/*key_length=*/19, /*num_rows=*/64, /*start_row=*/63);
 }
 
+TEST(HashBatch, AllocTempStackAsNeeded) {
+  auto arr = arrow::ArrayFromJSON(arrow::int32(), "[9,2,6]");
+  const auto batch_size = static_cast<int32_t>(arr->length());
+  arrow::compute::ExecBatch exec_batch({arr}, batch_size);
+  auto ctx = arrow::compute::default_exec_context();
+  std::vector<arrow::compute::KeyColumnArray> temp_column_arrays;
+
+  // HashBatch using internally allocated buffer.
+  std::vector<uint32_t> hashes32(batch_size);
+  std::vector<uint64_t> hashes64(batch_size);
+  ASSERT_OK(arrow::compute::Hashing32::HashBatch(
+      exec_batch, hashes32.data(), temp_column_arrays, ctx->cpu_info()->hardware_flags(),
+      nullptr, 0, batch_size));
+  ASSERT_OK(arrow::compute::Hashing64::HashBatch(
+      exec_batch, hashes64.data(), temp_column_arrays, ctx->cpu_info()->hardware_flags(),
+      nullptr, 0, batch_size));
+
+  util::TempVectorStack hash32_stack, hash64_stack;
+  std::vector<uint32_t> new_hashes32(batch_size);
+  std::vector<uint64_t> new_hashes64(batch_size);
+
+  // HashBatch using pre-allocated buffer of insufficient size raises stack overflow.
+  ASSERT_OK(hash32_stack.Init(default_memory_pool(), batch_size));
+  ASSERT_NOT_OK(arrow::compute::Hashing32::HashBatch(
+      exec_batch, new_hashes32.data(), temp_column_arrays,
+      ctx->cpu_info()->hardware_flags(), &hash32_stack, 0, batch_size));
+  ASSERT_OK(hash64_stack.Init(default_memory_pool(), batch_size));
+  ASSERT_NOT_OK(arrow::compute::Hashing64::HashBatch(
+      exec_batch, new_hashes64.data(), temp_column_arrays,
+      ctx->cpu_info()->hardware_flags(), &hash64_stack, 0, batch_size));
+
+  // HashBatch using big enough pre-allocated buffer.
+  ASSERT_OK(hash32_stack.Init(default_memory_pool(), 1024));
+  ASSERT_OK(arrow::compute::Hashing32::HashBatch(
+      exec_batch, new_hashes32.data(), temp_column_arrays,
+      ctx->cpu_info()->hardware_flags(), &hash32_stack, 0, batch_size));
+  ASSERT_OK(hash64_stack.Init(default_memory_pool(), 1024));
+  ASSERT_OK(arrow::compute::Hashing64::HashBatch(
+      exec_batch, new_hashes64.data(), temp_column_arrays,
+      ctx->cpu_info()->hardware_flags(), &hash64_stack, 0, batch_size));
+
+  for (int i = 0; i < batch_size; i++) {
+    EXPECT_EQ(hashes32[i], new_hashes32[i]);
+    EXPECT_EQ(hashes64[i], new_hashes64[i]);

Review Comment:
   Can we use more meaningful variable names for these?
   Using the "new_" prefix for the results of the "big enough pre-allocated buffer" case is confusing.
   
   `hashes32_auto`/`hashes32_explicit` or something?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to