icexelloss commented on PR #36499:
URL: https://github.com/apache/arrow/pull/36499#issuecomment-1625599730

   > > What I am proposing as an alternative is to have key hasher class 
produces an immutable tuple  ...
   > 
   > Yes, this was my understanding of your proposal when I wrote [this 
post](https://github.com/apache/arrow/pull/36499#issuecomment-1624849882), 
where I noted performance concerns.
   > 
   > > The extra cost of what I proposing is basically allocating one tuple for 
each input batches, maybe + some shared pointer copies per batch, both seems 
fairly cheap.
   > 
   > There's also additional deallocations, additional memory needed, 
additional costs due to computing in one thread and using the results in 
another thread, and probably more CPU cache misses. It's conceptually simpler 
but I don't think we can be sure it'd be more performant.
   > 
   > It's hard to predict performance impact of changes to complex concurrent 
code. Ultimately, if we want to know, there's no escaping from implementing the 
alternative and trying it out. It's going to take some effort to change 
`RecordBatch` to a tuple in a bunch of places, so whether to go down this path 
is a decision, I guess for you, to make - let me know your call. Note that in 
this PR, I was aiming for a minimal change; in fact, it's not even new code but 
a reversion of part of a [previous 
PR](https://github.com/apache/arrow/pull/36094).
   
   Ok I think we agree that an immutable tuple design is simpler but maybe have 
performance implications here. For the sake of pushing this PR forward, I will 
try to understand your current approach.
   
   > Given this background, we can see that the problem is not about an 
invalid/pointer to a record batch but about invalid cached hashes. In the 
pre-PR code, the failure could happen with the following order of operations:
   
   The input-receiving thread pushes a record batch to an empty queue, placing 
it at the front, thus making it the current record batch being processed.
   The processing thread invokes AdvanceAndMemoize, which invokes GetLatestKey, 
which invokes GetKey, which invokes HashesFor, which finds hashes that are 
incorrect for the above batch. This causes the problem, which is a concurrency 
one because it is driven by a non-deterministic order of operations.
   
   In the case here, since the input thread doesn't not `batch_` variable in 
side `KeyHasher`. I would expect when the processing thread invokes 
`AdvanceAndMemoize`, it would find the `batch_` is NULLPTR and then compute the 
hash for the first batch. What do you mean by " finds hashes that are incorrect 
for the above batch" here? 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to