pitrou commented on code in PR #40484:
URL: https://github.com/apache/arrow/pull/40484#discussion_r1544470906
##########
cpp/src/arrow/compute/key_hash.cc:
##########
@@ -378,20 +378,40 @@ void Hashing32::HashFixed(int64_t hardware_flags, bool
combine_hashes, uint32_t
}
}
-void Hashing32::HashMultiColumn(const std::vector<KeyColumnArray>& cols,
- LightContext* ctx, uint32_t* hashes) {
- uint32_t num_rows = static_cast<uint32_t>(cols[0].length());
-
- constexpr uint32_t max_batch_size = util::MiniBatch::kMiniBatchLength;
+Status Hashing32::HashMultiColumn(const std::vector<KeyColumnArray>& cols,
+ LightContext* ctx, uint32_t* hashes) {
+ auto num_rows = static_cast<uint32_t>(cols[0].length());
+ // The max_batch_size represents the number of rows to be processed in each
iteration,
+ // and it is used to allocate enough space for the allocated TempVectorStack.
+ const auto max_batch_size =
+ std::min(num_rows,
static_cast<uint32_t>(util::MiniBatch::kMiniBatchLength));
+
+ // pre calculate alloc size in TempVectorStack for hash_temp_buf,
null_hash_temp_buf
+ // and null_indices_buf
+ const auto alloc_hash_temp_buf =
+ util::TempVectorStack::EstimateAllocationSize(max_batch_size *
sizeof(uint32_t));
+ const auto alloc_for_null_indices_buf =
+ util::TempVectorStack::EstimateAllocationSize(max_batch_size *
sizeof(uint16_t));
+ const auto alloc_size = alloc_hash_temp_buf * 2 + alloc_for_null_indices_buf;
+
+ std::unique_ptr<util::TempVectorStack> temp_stack(nullptr);
+ auto stack = ctx->stack;
+ if (!stack) {
Review Comment:
> So for this pr:
> We should mark the key_hash.h --> key_hash_internal.h and add them into
internal namespace?
Yes, +1 to this. We should make the same change for other similar
functionality IMHO (`key_map.h`, `light_array.h`).
> For the enhancement:
> Do you think we should provide users with a hash value that can generate
recordbatch-level?
We can. The API will just have to follow the usual Arrow C++ conventions,
and also be high-level enough.
Note there's another PR lying around for a long time, you might or might not
want to build on it: https://github.com/apache/arrow/pull/39836
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]