rtpsw commented on PR #36499:
URL: https://github.com/apache/arrow/pull/36499#issuecomment-1627524361

   > If that is case, then I would argue this is not a concurrency issue, but 
rather, that the KeyHasher class is broken and failed to cache hash correctly - 
you can have the same wrong hash issue too if everything is on one thread
   
   I understand that if you take just the key hasher alone then the caching by 
address can (be driven to) fail using one thread. However, this argument is 
ignoring invalidation. I could make a similar argument that the KeyHasher with 
proper invalidation is correct and can be so even when everything is on one 
thread. I agree that it would be cleaner to completely remove the address 
checking from the key hasher, since invalidation makes this not needed, and 
instead just check for `nullptr` (see `if (batch_)` 
[here](https://github.com/apache/arrow/pull/36499#issuecomment-1623678477)) as 
set by the invalidation. I did not do it in the current version of this PR 
because my goal was the simplest change that fixes. I would prefer to do it in 
a separate PR, but we could decide to do this in one. In any case, when done it 
would eliminate the need for the above arguments.
   
   > therefore, I don't think moving things to another thread is the right fix.
   
   Presumably, you're referring to the invalidation from the input-receiving 
thread. Indeed, there is an alternative where the key hasher operations are all 
done from one thread, which we discussed starting in [this 
post](https://github.com/apache/arrow/pull/36499#issuecomment-1624267164). It 
may be considered better by some standard, certainly doable, yet also more 
complex. This is because, as discussed, `AdvanceAndMemoize` accesses the key 
hasher before calling `Advance`, which is where a new batch is handled and the 
key hasher can be invalidated, but for correctness we need to invalidate before 
accessing the key hasher. To deal with this in one thread, we'd need to detect 
a new batch at the front of the queue (in `AdvanceAndMemoize` before accessing 
the key hasher) as a trigger for invalidation. We can't do this by comparing 
batch addresses, so we'd probably need to change the queue to contain serially 
numbered batches (rather than just batches) and do the detection by comparing s
 erial numbers.
   
   Note that up until recently, I did not have access to a platform where I 
could reproduce the kind of failure discussed here. Now that I do, I'm more 
confident I could locally test changes relevant to this kind of failure.
   
   > I remembered we had this issue before and I thought one of your PR fixed 
this issue, but I cannot seem to find the change anymore. (There was one about 
Allocator reusing the buffer address for the second batch on MacOS or sth)
   
   My understanding of the timeline of relevant changes is:
   
   1. #34392 introduced invalidation, which was originally done from the 
input-receiving thread. IIRC, this is also the PR where we first observed an 
infrequent failure on macOS and fixed it.
   2. #36094 wished to simplify things by moving operations to one thread, but 
ended up introducing the bug we're dealing with here.
   3. #36499 (the current PR) reverts the moving of the key hasher invalidation.
   
   > Also, are you able to confirm that the failure happened pre-PR is because 
the second batch has the same address as the first batch?
   
   Right now I don't have debug-print confirmation of this or of the opposite. 
It would take some effort to get - let me know if you want this done.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to