HeartSaVioR opened a new pull request #27010: [SPARK-30313][CORE] Ensure EndpointRef is available MasterWebUI/WorkerPage URL: https://github.com/apache/spark/pull/27010 ### What changes were proposed in this pull request? This patch fixes flaky tests "master/worker web ui available" & "master/worker web ui available with reverseProxy" in MasterSuite. Tracking back from stack trace, I found there's possible race condition in `Dispatcher.registerRpcEndpoint()`: https://github.com/apache/spark/blob/481fb63f97d87d5b2e9e1f9b30bee466605b5a72/core/src/main/scala/org/apache/spark/rpc/netty/Dispatcher.scala#L64-L77 `getMessageLoop()` initializes a new Inbox for this endpoint for both DedicatedMessageLoop and SharedMessageLoop, which calls `onStart()` "asynchronously" and "eventually" via posting `OnStart` message. `onStart()` will initialize UI page instance(s), so the execution of `endpointRefs.put()` and initializing UI page instance(s) are "concurrent". MasterPage and WorkerPage retrieve endpoint ref and store it as "val" assuming endpoint ref is valid when they're initialized - so in bad case they could store "null" as endpoint ref, and don't change. https://github.com/apache/spark/blob/481fb63f97d87d5b2e9e1f9b30bee466605b5a72/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala#L33-L38 https://github.com/apache/spark/blob/481fb63f97d87d5b2e9e1f9b30bee466605b5a72/core/src/main/scala/org/apache/spark/deploy/worker/ui/WorkerPage.scala#L35-L41 This patch simply adds loop to ensure endpoint ref is available when retrieving from MasterPage/WorkerPage, given the fact that endpoint ref assignment in Dispatcher won't take long. ### Why are the changes needed? We observed the test failures from Jenkins; below are the links: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115583/testReport/ https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115700/testReport/ ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UTs. Unfortunately it's hard to reproduce as it's due to race condition and it barely fails.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
