rtpsw commented on PR #36499: URL: https://github.com/apache/arrow/pull/36499#issuecomment-1624849882
> @rtpsw The rationale behind the hash caching / invalidation is unclear to me - if the purpose of the KeyHasher here is to compute the hash for each row in the batch, why don't we always keep a immutable tuple of <RecordBatch, std::vector<HashType>> in the input queues / input state to avoid dealing cache invalidation / mutation here? The key hasher already encapsulates the batch and its hashes, so I don't see how wrapping the two in a tuple gains anything. As for immutability, you'd need to explain what you have in mind in more detail (or try to implement it) because it could easily degrade performance, which is a major consideration here that greatly impacts the design. The current design has all processing, including for hashes, is done on the processing thread; it sounds like the proposed immutability would shift the computation of hashes to the input-receiving thread, as well as add allocation/deallocation cost for the hashes, and these could easily degrade performance. > And I don't see clearly why we cannot use the tuple in both places to avoid the cache invalidation issue. We can and the code effectively does use the "tuple", encapsulated by the key hasher, in both these places. The main question is when and in which thread to compute the hashes. The current code's answer is: upon initial querying for the hashes (of a given batch) within the processing thread. > One thing that is not clear to me is why do we need rehash (row hash for each batch is deterministic so not clear to me why we need rehash at all) , but it seems to be probably can be avoided by using the recordbatch/hash tuple as well? This is an orthogonal topic. Rehashing occurs only when the key is a single fixed-width (up to 64 bits, for now) column in which a null value is observed for the first time. So, no rehashing ever occurs if the key spans multiple columns or if the key is a single fixed-width column with no null values. This is a performance optimization - the fast path is when the key is a single fixed-width column without any nulls, which is a common case for which no hashes are computed, since its fixed-width values can be compared directly. The code is being optimistic - it hopes for no null values until it sees the first one. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
