rtpsw commented on PR #36499: URL: https://github.com/apache/arrow/pull/36499#issuecomment-1624267164
> Hmm, this could probably have been fixed more idiomatically by storing a std::weak_ptr<RecordBatch> in KeyHasher. And you wouldn't have had to worry about concurrency. Unfortunately, it's not that simple. The concurrency problem is not caused by an invalid/dangling pointer to a record batch that a weak pointer would have avoided. I'll try to answer both your and Westons's questions. Some background first. The key hasher is used by the processing thread and caches hashes corresponding to a particular record batch. Therefore, the key hasher (i.e., its cache of hashes) should be invalidated when the record batch being handled by the processing thread changes. The complexity here is that this change is affected by the input-receiving thread. The post-PR code invalidates the key hasher just before this point using a thread-synchronizing operation. This ensures that the cached hashes at this point would not be used by the processing thread calling `HashesFor`; instead, `HashesFor` will compute the hashes for the new record batch the next time `HashesFor` would be invoked. Given this background, we can see that the problem is not about an invalid/pointer to a record batch but about invalid cached hashes. In the pre-PR code, the failure could happen with the following order of operations: 1. The input-receiving thread pushes a record batch to an empty queue, placing it at the front, thus making it the current record batch being processed. 2. The processing thread invokes `AdvanceAndMemoize`, which invokes `GetLatestKey`, which invokes `GetKey`, which invokes `HashesFor`, which finds hashes that are incorrect for the above batch. This causes the problem, which is a concurrency one because it is driven by a non-deterministic order of operations. Only thereafter, the processing thread, still in `AdvanceAndMemoize`, invokes `Advance`, which deals with the queue and could detect a new record batch in order to then invalidate the key hasher, but it is too late due to item 2 above. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
