sririshindra opened a new pull request #31477: URL: https://github.com/apache/spark/pull/31477
### What changes were proposed in this pull request? Adding a new "number of matched pairs metric" to track number of pairs processed out of join. ### Why are the changes needed? Often users face a scenario where even a modest skew in a join can lead to tasks appearing to be stuck, due to the O(n^2) nature of a join considering all pairs of rows with matching keys. When this happens users think that spark has gotten deadlocked. If there is a bound condition, the "number of output rows" metric may look typical. Other metrics may look very modest (eg: shuffle read). In those cases, it is very hard to understand what the problem is. There is no conclusive proof without getting a heap dump and looking at some internal data structures. It would be much better if spark had a metric(which we propose be titled “number of matched pairs” as a companion to “number of output rows”) which showed the user how many pairs were being processed in the join. This would get updated in the live UI (when metrics get collected during heartbeats), so the user could easily see what was going on. This would even help in cases where there was some other cause of a stuck executor (eg. network issues) just to disprove this theory. For example, you may have 100k records with the same key on each side of a join. That probably won't really show up as extreme skew in task input data. But it'll become 10B join pairs that spark works through, in one task. See the Jira description for an example. ### Does this PR introduce _any_ user-facing change? Yes, users will be able to see the new "number of matched pairs" metric whenever they look at a join node in the UI. ### How was this patch tested? Added unit tests ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
