rtpsw commented on PR #36499: URL: https://github.com/apache/arrow/pull/36499#issuecomment-1627524361
> If that is case, then I would argue this is not a concurrency issue, but rather, that the KeyHasher class is broken and failed to cache hash correctly - you can have the same wrong hash issue too if everything is on one thread

I understand that if you take just the key hasher alone, then the caching by address can (be driven to) fail using one thread. However, this argument ignores invalidation. I could make a similar argument that the KeyHasher with proper invalidation is correct, and remains so even when everything is on one thread. I agree that it would be cleaner to completely remove the address checking from the key hasher, since invalidation makes it unnecessary, and instead just check for `nullptr` (see `if (batch_)` [here](https://github.com/apache/arrow/pull/36499#issuecomment-1623678477)) as set by the invalidation; a rough sketch of what I mean is at the end of this comment. I did not do it in the current version of this PR because my goal was the simplest change that fixes the bug. I would prefer to do it in a separate PR, but we could decide to do it in this one. In any case, once done, it would eliminate the need for the above arguments.

> therefore, I don't think moving things to another thread is the right fix.

Presumably, you're referring to the invalidation from the input-receiving thread. Indeed, there is an alternative where the key hasher operations are all done from one thread, which we discussed starting in [this post](https://github.com/apache/arrow/pull/36499#issuecomment-1624267164). It may be considered better by some standard, and is certainly doable, yet it is also more complex. This is because, as discussed, `AdvanceAndMemoize` accesses the key hasher before calling `Advance`, which is where a new batch is handled and the key hasher can be invalidated, but for correctness we need to invalidate before accessing the key hasher. To deal with this in one thread, we'd need to detect a new batch at the front of the queue (in `AdvanceAndMemoize`, before accessing the key hasher) as a trigger for invalidation. We can't do this by comparing batch addresses, so we'd probably need to change the queue to contain serially numbered batches (rather than just batches) and do the detection by comparing serial numbers; the second sketch at the end of this comment illustrates this.

Note that up until recently, I did not have access to a platform where I could reproduce the kind of failure discussed here. Now that I do, I'm more confident I could locally test changes relevant to this kind of failure.

> I remembered we had this issue before and I thought one of your PR fixed this issue, but I cannot seem to find the change anymore. (There was one about Allocator reusing the buffer address for the second batch on MacOS or sth)

My understanding of the timeline of relevant changes is:
1. #34392 introduced invalidation, which was originally done from the input-receiving thread. IIRC, this is also the PR where we first observed an infrequent failure on macOS and fixed it.
2. #36094 aimed to simplify things by moving operations to one thread, but ended up introducing the bug we're dealing with here.
3. #36499 (the current PR) reverts the moving of the key hasher invalidation.

> Also, are you able to confirm that the failure happened pre-PR is because the second batch has the same address as the first batch?

Right now I don't have debug-print confirmation of this or of the opposite. It would take some effort to get - let me know if you want this done.
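To make the `if (batch_)` idea above concrete, here is a rough standalone sketch. The types and member names (`Batch`, `KeyHasherSketch`, `Invalidate`, `HashesFor`) are made up for illustration and are not the actual code in the asof-join node; the point is only that once invalidation is guaranteed to run before a new batch is hashed, the cache can be gated on the pointer being non-null rather than on an address comparison.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the real batch type; illustration only.
struct Batch {
  std::vector<int64_t> key_column;
};

// Minimal sketch of a hash cache that relies on explicit invalidation
// instead of comparing batch addresses.
class KeyHasherSketch {
 public:
  // Called when a new batch is handed over; clearing batch_ marks the
  // cached hashes as stale. Correctness depends on this running before
  // HashesFor is called with the new batch.
  void Invalidate() { batch_ = nullptr; }

  // Returns cached hashes if the cache is still valid, otherwise recomputes.
  // Note: no address comparison against a previous batch; validity is
  // signaled purely by batch_ being non-null.
  const std::vector<uint64_t>& HashesFor(const Batch* batch) {
    if (batch_) {
      return hashes_;  // cache considered valid until explicitly invalidated
    }
    hashes_.clear();
    hashes_.reserve(batch->key_column.size());
    for (int64_t key : batch->key_column) {
      hashes_.push_back(HashKey(key));
    }
    batch_ = batch;
    return hashes_;
  }

 private:
  static uint64_t HashKey(int64_t key) {
    // Toy hash for illustration only.
    return static_cast<uint64_t>(key) * 0x9E3779B97F4A7C15ULL;
  }

  const Batch* batch_ = nullptr;  // non-null: hashes_ matches the current batch
  std::vector<uint64_t> hashes_;
};
```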
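And here is a rough sketch of the single-threaded alternative using serially numbered batches. Again, the names (`NumberedBatch`, `InputStateSketch`, `AdvanceAndMemoize`, `InvalidateKeyHasher`) are hypothetical, and this is only meant to show where the serial-number comparison would trigger invalidation before any hash lookup.

```cpp
#include <cstdint>
#include <deque>
#include <optional>
#include <utility>

// Hypothetical, simplified types for illustration; not the actual Arrow classes.
struct Batch { /* columns elided */ };

struct NumberedBatch {
  uint64_t serial;  // assigned when the batch is queued
  Batch batch;
};

class InputStateSketch {
 public:
  void Push(Batch batch) {
    queue_.push_back(NumberedBatch{next_serial_++, std::move(batch)});
  }

  // Before touching any cached hashes, look at the serial number of the batch
  // at the front of the queue. A serial number we have not seen yet means this
  // is a new batch, so the key hasher cache must be invalidated first.
  void AdvanceAndMemoize() {
    if (queue_.empty()) return;
    const NumberedBatch& front = queue_.front();
    if (!last_seen_serial_ || front.serial != *last_seen_serial_) {
      InvalidateKeyHasher();  // must happen before any hash lookup
      last_seen_serial_ = front.serial;
    }
    // ... at this point the key hasher cache cannot be stale for front.batch ...
  }

 private:
  void InvalidateKeyHasher() { /* e.g. key_hasher_.Invalidate(); */ }

  std::deque<NumberedBatch> queue_;
  uint64_t next_serial_ = 0;
  std::optional<uint64_t> last_seen_serial_;
};
```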
