icexelloss commented on PR #36499: URL: https://github.com/apache/arrow/pull/36499#issuecomment-1625599730
> > What I am proposing as an alternative is to have key hasher class produces an immutable tuple ... > > Yes, this was my understanding of your proposal when I wrote [this post](https://github.com/apache/arrow/pull/36499#issuecomment-1624849882), where I noted performance concerns. > > > The extra cost of what I proposing is basically allocating one tuple for each input batches, maybe + some shared pointer copies per batch, both seems fairly cheap. > > There's also additional deallocations, additional memory needed, additional costs due to computing in one thread and using the results in another thread, and probably more CPU cache misses. It's conceptually simpler but I don't think we can be sure it'd be more performant. > > It's hard to predict performance impact of changes to complex concurrent code. Ultimately, if we want to know, there's no escaping from implementing the alternative and trying it out. It's going to take some effort to change `RecordBatch` to a tuple in a bunch of places, so whether to go down this path is a decision, I guess for you, to make - let me know your call. Note that in this PR, I was aiming for a minimal change; in fact, it's not even new code but a reversion of part of a [previous PR](https://github.com/apache/arrow/pull/36094). Ok I think we agree that an immutable tuple design is simpler but maybe have performance implications here. For the sake of pushing this PR forward, I will try to understand your current approach. > Given this background, we can see that the problem is not about an invalid/pointer to a record batch but about invalid cached hashes. In the pre-PR code, the failure could happen with the following order of operations: The input-receiving thread pushes a record batch to an empty queue, placing it at the front, thus making it the current record batch being processed. The processing thread invokes AdvanceAndMemoize, which invokes GetLatestKey, which invokes GetKey, which invokes HashesFor, which finds hashes that are incorrect for the above batch. This causes the problem, which is a concurrency one because it is driven by a non-deterministic order of operations. In the case here, since the input thread doesn't not `batch_` variable in side `KeyHasher`. I would expect when the processing thread invokes `AdvanceAndMemoize`, it would find the `batch_` is NULLPTR and then compute the hash for the first batch. What do you mean by " finds hashes that are incorrect for the above batch" here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
