rtpsw commented on PR #36499:
URL: https://github.com/apache/arrow/pull/36499#issuecomment-1624849882

   > @rtpsw The rationale behind the hash caching / invalidation is unclear to 
me - if the purpose of the KeyHasher here is to compute the hash for each row 
in the batch, why don't we always keep a immutable tuple of <RecordBatch, 
std::vector<HashType>> in the input queues / input state to avoid dealing cache 
invalidation / mutation here?
   
   The key hasher already encapsulates the batch and its hashes, so I don't see 
how wrapping the two in a tuple gains anything. As for immutability, you'd need 
to explain what you have in mind in more detail (or try to implement it) 
because it could easily degrade performance, which is a major consideration 
here that greatly impacts the design. The current design has all processing, 
including for hashes, is done on the processing thread; it sounds like the 
proposed immutability would shift the computation of hashes to the 
input-receiving thread, as well as add allocation/deallocation cost for the 
hashes, and these could easily degrade performance.
   
   > And I don't see clearly why we cannot use the tuple in both places to 
avoid the cache invalidation issue.
   
   We can and the code effectively does use the "tuple", encapsulated by the 
key hasher, in both these places. The main question is when and in which thread 
to compute the hashes. The current code's answer is: upon initial querying for 
the hashes (of a given batch) within the processing thread.
   
   > One thing that is not clear to me is why do we need rehash (row hash for 
each batch is deterministic so not clear to me why we need rehash at all) , but 
it seems to be probably can be avoided by using the recordbatch/hash tuple as 
well?
   
   This is an orthogonal topic. Rehashing occurs only when the key is a single 
fixed-width (up to 64 bits, for now) column in which a null value is observed 
for the first time. So, no rehashing ever occurs if the key spans multiple 
columns or if the key is a single fixed-width column with no null values. This 
is a performance optimization - the fast path is when the key is a single 
fixed-width column without any nulls, which is a common case for which no 
hashes are computed, since its fixed-width values can be compared directly. The 
code is being optimistic - it hopes for no null values until it sees the first 
one.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to