HeartSaVioR opened a new pull request #27010: [SPARK-30313][CORE] Ensure 
EndpointRef is available MasterWebUI/WorkerPage
URL: https://github.com/apache/spark/pull/27010
 
 
   ### What changes were proposed in this pull request?
   
   This patch fixes flaky tests "master/worker web ui available" & 
"master/worker web ui available with reverseProxy" in MasterSuite.
   
   Tracking back from stack trace, I found there's possible race condition in 
`Dispatcher.registerRpcEndpoint()`:
   
   
https://github.com/apache/spark/blob/481fb63f97d87d5b2e9e1f9b30bee466605b5a72/core/src/main/scala/org/apache/spark/rpc/netty/Dispatcher.scala#L64-L77
   
   `getMessageLoop()` initializes a new Inbox for this endpoint for both 
DedicatedMessageLoop 
    and SharedMessageLoop, which calls `onStart()`  "asynchronously" and 
"eventually" via posting `OnStart` message. `onStart()` will initialize UI page 
instance(s), so the execution of `endpointRefs.put()` and initializing UI page 
instance(s) are "concurrent".
   
   MasterPage and WorkerPage retrieve endpoint ref and store it as "val" 
assuming endpoint ref is valid when they're initialized - so in bad case they 
could store "null" as endpoint ref, and don't change.
   
   
https://github.com/apache/spark/blob/481fb63f97d87d5b2e9e1f9b30bee466605b5a72/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala#L33-L38
   
   
https://github.com/apache/spark/blob/481fb63f97d87d5b2e9e1f9b30bee466605b5a72/core/src/main/scala/org/apache/spark/deploy/worker/ui/WorkerPage.scala#L35-L41
   
   This patch simply adds loop to ensure endpoint ref is available when 
retrieving from MasterPage/WorkerPage, given the fact that endpoint ref 
assignment in Dispatcher won't take long.
   
   ### Why are the changes needed?
   
   We observed the test failures from Jenkins; below are the links:
   
   
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115583/testReport/
   
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115700/testReport/
   
   
   ### Does this PR introduce any user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Existing UTs. Unfortunately it's hard to reproduce as it's due to race 
condition and it barely fails.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to