Re: [PR] GH-40431: [C++] Try to check/alloc the TempVectorStack size as HashBatch needed [arrow]

via GitHub Wed, 27 Mar 2024 08:26:42 -0700


zanmato1984 commented on code in PR #40484:
URL: https://github.com/apache/arrow/pull/40484#discussion_r1541340745



##########
cpp/src/arrow/compute/key_hash.cc:
##########
@@ -378,24 +378,42 @@ void Hashing32::HashFixed(int64_t hardware_flags, bool 
combine_hashes, uint32_t
   }
 }
 
-void Hashing32::HashMultiColumn(const std::vector<KeyColumnArray>& cols,
-                                LightContext* ctx, uint32_t* hashes) {
-  uint32_t num_rows = static_cast<uint32_t>(cols[0].length());
-
-  constexpr uint32_t max_batch_size = util::MiniBatch::kMiniBatchLength;
+Status Hashing32::HashMultiColumn(const std::vector<KeyColumnArray>& cols,
+                                  LightContext* ctx, uint32_t* hashes) {
+  auto num_rows = static_cast<uint32_t>(cols[0].length());
+  const auto alloc_batch_size =

Review Comment:
   I would suggest keeping the name `max_batch_size`. It carries the meaning of 
how many rows to process in each iteration. In addition, this name is used 
everywhere in hash join related code so keeping it may complies with existing 
code base more. Last, it doesn't seem to be in the same category of the 
following three `alloc` family variables - we can think any `alloc` variable is 
solely to make sure the stack is large enough.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] GH-40431: [C++] Try to check/alloc the TempVectorStack size as HashBatch needed [arrow]

Reply via email to