YutingWang98 opened a new pull request, #2826:
URL: https://github.com/apache/celeborn/pull/2826

   
   ### What changes were proposed in this pull request?
   Fix an authentication bug in master HA mode that causes application failures when the leader master restarts. Also remove an application's secret from memory once the application is lost.
   
   The previous implementation adds the registration and secret info to the leader master's memory and pushes it to the other masters through https://github.com/apache/celeborn/pull/2346. After the leader restarts, the info exists only in Ratis (AbstractMetaManager), but the app still fetches it from the new leader's memory and fails to get it.
   
   Fix this by checking AbstractMetaManager's registration info when the info is not found in memory, so that the app is properly authorized.
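The lookup order described above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: `ReplicatedMeta` is a hypothetical stand-in for the Ratis-backed state held by AbstractMetaManager, `MasterSecretRegistry` is a hypothetical stand-in for a master's in-memory secret store, and the sketch assumes that the replicated entry is also dropped when an app is lost.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative sketch only; all class and method names here are hypothetical. */
public class SecretLookupSketch {

  /** Stand-in for registration info replicated to every master (survives a leader restart). */
  static final class ReplicatedMeta {
    private final Map<String, String> appSecrets = new ConcurrentHashMap<>();

    void register(String appId, String secret) {
      appSecrets.put(appId, secret);
    }

    String secretFor(String appId) {
      return appSecrets.get(appId);
    }

    void remove(String appId) {
      appSecrets.remove(appId);
    }
  }

  /** Stand-in for one master's in-memory secret store (empty after that master restarts). */
  static final class MasterSecretRegistry {
    private final Map<String, String> inMemory = new ConcurrentHashMap<>();
    private final ReplicatedMeta meta;

    MasterSecretRegistry(ReplicatedMeta meta) {
      this.meta = meta;
    }

    void register(String appId, String secret) {
      inMemory.put(appId, secret);
      meta.register(appId, secret);
    }

    /** Check memory first; on a miss, fall back to the replicated state and cache the hit. */
    Optional<String> lookup(String appId) {
      String secret = inMemory.get(appId);
      if (secret == null) {
        secret = meta.secretFor(appId);
        if (secret != null) {
          inMemory.put(appId, secret);
        }
      }
      return Optional.ofNullable(secret);
    }

    /** Drop the secret once the app is lost (assumption: the replicated entry goes too). */
    void onAppLost(String appId) {
      inMemory.remove(appId);
      meta.remove(appId);
    }
  }

  public static void main(String[] args) {
    ReplicatedMeta meta = new ReplicatedMeta();
    MasterSecretRegistry oldLeader = new MasterSecretRegistry(meta);
    oldLeader.register("app-1", "s3cret");

    // Simulate a leader restart: fresh memory, same replicated state.
    MasterSecretRegistry newLeader = new MasterSecretRegistry(meta);
    System.out.println(newLeader.lookup("app-1").isPresent());

    newLeader.onAppLost("app-1");
    System.out.println(newLeader.lookup("app-1").isPresent());
  }
}
```

In this sketch a restarted leader starts with an empty in-memory map, so the first lookup for a registered app succeeds only because of the fallback to the replicated state.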
   
   
   ### Why are the changes needed?
   When auth is enabled and the leader master restarts, the app side reports a "Registration information not found" error and fails to send heartbeats to the master. The app is then removed on the server side after the heartbeat timeout, causing the job to fail.
   ```
   24/10/14 01:56:55 ERROR [celeborn-netty-rpc-connection-executor-3] client.TransportClientFactory: Exception while bootstrapping client after 71.4 ms
   java.lang.RuntimeException: java.io.IOException: Exception in sendRpcSync to: celeborn-moka-test-manager-3/{ip}:9097
       at org.apache.celeborn.common.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:110)
       at org.apache.celeborn.common.network.sasl.registration.RegistrationClientBootstrap.doSaslBootstrap(RegistrationClientBootstrap.java:228)
       at org.apache.celeborn.common.network.sasl.registration.RegistrationClientBootstrap.doBootstrap(RegistrationClientBootstrap.java:103)
       at org.apache.celeborn.common.network.client.TransportClientFactory.internalCreateClient(TransportClientFactory.java:307)
       at org.apache.celeborn.common.network.client.TransportClientFactory.createClient(TransportClientFactory.java:205)
       at org.apache.celeborn.common.network.client.TransportClientFactory.createClient(TransportClientFactory.java:133)
       at org.apache.celeborn.common.network.client.TransportClientFactory.createClient(TransportClientFactory.java:212)
       at org.apache.celeborn.common.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:232)
       at org.apache.celeborn.common.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
       at org.apache.celeborn.common.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
       at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
       at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
       at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
       at java.base/java.lang.Thread.run(Thread.java:829)
   Caused by: java.io.IOException: Exception in sendRpcSync to: celeborn-moka-test-manager-3/{ip}:9097
       at org.apache.celeborn.common.network.client.TransportClient.sendRpcSync(TransportClient.java:324)
       at org.apache.celeborn.common.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:95)
       ... 13 more
   Caused by: java.util.concurrent.ExecutionException: java.io.IOException: java.lang.RuntimeException: Registration information not found for spark-402a80be70f74455b01
       at org.apache.celeborn.common.network.sasl.CelebornSaslServer$DigestCallbackHandler.handle(CelebornSaslServer.java:142)
       at java.security.sasl/com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589)
       at java.security.sasl/com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
       at org.apache.celeborn.common.network.sasl.CelebornSaslServer.response(CelebornSaslServer.java:84)
       at org.apache.celeborn.common.network.sasl.SaslRpcHandler.doAuthChallenge(SaslRpcHandler.java:99)
       at org.apache.celeborn.common.network.server.AbstractAuthRpcHandler.receive(AbstractAuthRpcHandler.java:58)
       at org.apache.celeborn.common.network.sasl.registration.RegistrationRpcHandler.processRpcMessage(RegistrationRpcHandler.java:175)
 
   ```
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Tested on a dev cluster; the job can properly get the secrets after master failover.
   

