[I] [C++] Crashed at TempStack alloc when use Hashing32::HashBatch independently [arrow]

via GitHub Fri, 08 Mar 2024 23:06:07 -0800


ZhangHuiGui opened a new issue, #40431:
URL: https://github.com/apache/arrow/issues/40431


   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   This issue is similar to https://github.com/apache/arrow/pull/40007, but it is not the same problem.
   I want to use the `Hashing32::HashBatch` API to produce an array of hashes for a batch. Although `Hashing32` and `Hashing64` are used mainly in the join code, they can also be used independently.
   
   For example:
   ```cpp
     auto arr = arrow::ArrayFromJSON(arrow::int32(), "[9,2,6]");
     const int batch_len = arr->length();
     arrow::compute::ExecBatch exec_batch({arr}, batch_len);
     auto ctx = arrow::compute::default_exec_context();
     arrow::util::TempVectorStack stack;
     // Size the stack for exactly the number of bytes this batch needs.
     ASSERT_OK(stack.Init(ctx->memory_pool(), batch_len * sizeof(uint32_t)));
   
     std::vector<uint32_t> hashes(batch_len);
     std::vector<arrow::compute::KeyColumnArray> temp_column_arrays;
     ASSERT_OK(arrow::compute::Hashing32::HashBatch(
         exec_batch, hashes.data(), temp_column_arrays,
         ctx->cpu_info()->hardware_flags(), &stack, 0, batch_len));
   ```
   
   The crash happens inside `HashBatch`, with this call stack:
   ```shell
   arrow::compute::Hashing32::HashBatch
     arrow::compute::Hashing32::HashMultiColumn
         arrow::util::TempVectorHolder<unsigned int>::TempVectorHolder
           arrow::util::TempVectorStack::alloc
             ARROW_DCHECK(top_ <= buffer_size_); // top_=4176, buffer_size_=160
   ```
   
   The cause is the following code:
   
https://github.com/apache/arrow/blob/7e286dd004a8fcf2de0f58615793338076741208/cpp/src/arrow/compute/key_hash.cc#L385-L387
   
   The holder uses `max_batch_size`, which is `1024`, as its `num_elements`. That is far more than the temp stack's initial `buffer_size` can hold.
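   
   The two numbers in the DCHECK can be reproduced by hand. The sketch below is a standalone model, not Arrow code. It assumes that `TempVectorStack` rounds each request up to a multiple of 8 bytes, adds 64 bytes of padding, reserves two 8-byte guard words per allocation, and adds one more padding block to the buffer as a whole. These constants are assumptions recalled from the source around this commit; they are consistent with the `160` and `4176` in the trace above, and all helper names are hypothetical.
   
   ```cpp
     #include <cstdint>
     
     // Assumed constants modelling TempVectorStack's padding scheme.
     constexpr int64_t kPadding = 64;                        // assumed SIMD padding
     constexpr int64_t kGuardBytes = 2 * sizeof(uint64_t);   // assumed guard words
     
     // Round `value` up to the next multiple of `factor`.
     constexpr int64_t RoundUp(int64_t value, int64_t factor) {
       return (value + factor - 1) / factor * factor;
     }
     
     // Bytes reserved for the payload of one temp vector (hypothetical helper).
     constexpr int64_t PaddedAllocationSize(int64_t num_bytes) {
       return RoundUp(num_bytes, sizeof(int64_t)) + kPadding;
     }
     
     // Total buffer size after Init(pool, requested) under the assumed scheme.
     constexpr int64_t StackBufferSize(int64_t requested) {
       return PaddedAllocationSize(requested) + kPadding + kGuardBytes;
     }
     
     // Bytes consumed on the stack by one temp vector of `num_elements` items.
     constexpr int64_t AllocFootprint(int64_t num_elements, int64_t elem_size) {
       return PaddedAllocationSize(num_elements * elem_size) + kGuardBytes;
     }
     
     // Init with 3 * sizeof(uint32_t) = 12 bytes gives the 160-byte buffer.
     static_assert(StackBufferSize(3 * sizeof(uint32_t)) == 160, "buffer_size_");
     // One holder of 1024 uint32_t values already needs 4176 bytes.
     static_assert(AllocFootprint(1024, sizeof(uint32_t)) == 4176, "top_");
   ```
   
   Under these assumptions, the very first temp vector requested by `HashMultiColumn` is about 26 times larger than the whole stack, regardless of how many rows the batch actually has.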
   
   I know that `HashBatch` is currently used only in hash join and related code. For joins, the caller already splits the input at a higher level, which ensures that each input batch has at most `kMiniBatchLength` rows and that the stack is large enough.
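   
   To make the splitting step concrete, here is a minimal sketch of how a caller could cut a row range into mini-batches. It is illustrative only and is not the join code itself: `kMiniBatchLength` is taken to be `1024` as stated above, and `NumMiniBatches`, `MiniBatchStart`, `MiniBatchRows`, and `ForEachMiniBatch` are hypothetical names.
   
   ```cpp
     #include <cstdint>
     
     // Assumed to match util::MiniBatch::kMiniBatchLength (1024 per the text above).
     constexpr int64_t kMiniBatchLength = 1024;
     
     // Number of mini-batches needed to cover `num_rows` rows.
     constexpr int64_t NumMiniBatches(int64_t num_rows) {
       return (num_rows + kMiniBatchLength - 1) / kMiniBatchLength;
     }
     
     // First row of the mini-batch with the given index.
     constexpr int64_t MiniBatchStart(int64_t index) {
       return index * kMiniBatchLength;
     }
     
     // Row count of the mini-batch with the given index; only the last one may be short.
     constexpr int64_t MiniBatchRows(int64_t num_rows, int64_t index) {
       return num_rows - MiniBatchStart(index) < kMiniBatchLength
                  ? num_rows - MiniBatchStart(index)
                  : kMiniBatchLength;
     }
     
     // Calls fn(start_row, row_count) once per mini-batch, in order.
     template <typename Fn>
     void ForEachMiniBatch(int64_t num_rows, Fn&& fn) {
       for (int64_t i = 0; i < NumMiniBatches(num_rows); ++i) {
         fn(MiniBatchStart(i), MiniBatchRows(num_rows, i));
       }
     }
   ```
   
   A caller that splits its input this way never hands more than `kMiniBatchLength` rows to the hashing code in one call, which is why a stack sized for that fixed length is sufficient there.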
   
   But `HashBatch` can also be used independently, so maybe we could use `num_rows` rather than `util::MiniBatch::kMiniBatchLength` in the `HashBatch`-related APIs?
   
   
   ### Component(s)
   
   C++


