icexelloss commented on PR #36499:
URL: https://github.com/apache/arrow/pull/36499#issuecomment-1625411110

   > The key hasher already encapsulates the batch and its hashes, so I don't 
see how wrapping the two in a tuple gains anything. As for immutability, you'd 
need to explain what you have in mind in more detail (or try to implement it) 
because it could easily degrade performance, which is a major consideration 
here that greatly impacts the design. The current design has all processing, 
including for hashes, done on the processing thread; it sounds like the 
proposed immutability would shift the computation of hashes to the 
input-receiving thread, as well as add allocation/deallocation cost for the 
hashes, and these could easily degrade performance.
   
   Yes, the key hasher effectively contains a tuple of `<RecordBatch, Hash>`. 
However, the tuple is mutable, i.e., the class needs to ensure the Hash is 
indeed the hash of the RecordBatch (via `Invalidate`), and IIUC that's where bugs 
happen: either we have a batch but not its hash, or we associate the wrong hash 
with a batch. What I am proposing as an alternative is to have the key hasher 
class produce an *immutable* tuple of `<RecordBatch, Hash>`, so we avoid this 
wrong-hash problem altogether. So the code basically becomes:
   
   Currently:
   We have queues of RecordBatches; when the hash for a RecordBatch is needed, 
we call the key hasher to get the hash (and that's where the bug comes in, when 
we cache the wrong hash).
   
   What I am proposing:
   We have queues of immutable `<RecordBatch, Hash>` tuples; the tuple is 
created when a batch is received (either on the input-receiving thread or the 
processing thread). Since we only pop batches from the queue on the processing 
thread, we always have the hash for the batch that we want to process.
   
   In terms of performance, in both cases we compute the hash once per input 
batch, so I don't see why performance would degrade. The extra cost of what I 
am proposing is basically allocating one tuple per input batch, plus maybe 
some shared-pointer copies per batch, both of which seem fairly cheap.
   
   Do you not agree that this would simplify things?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
