This is an automated email from the ASF dual-hosted git repository.

mridulm80 pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/celeborn.git


The following commit(s) were added to refs/heads/main by this push:
     new 5b2030037 [CELEBORN-1664] Fix secret fetch failures after LEADER 
master failover
5b2030037 is described below

commit 5b2030037aa4f5245a7a947868d1a2af975f04ca
Author: YutingWang98 <[email protected]>
AuthorDate: Tue Oct 22 14:10:08 2024 -0500

    [CELEBORN-1664] Fix secret fetch failures after LEADER master failover
    
    ### What changes were proposed in this pull request?
    Fix a bug related to auth under master HA mode which would cause app 
failures when leader master restarts. Also, remove the secrets from memory 
after app lost.
    
    Previous implementation add the registration & secret info in leader 
Master's memory, and push to other masters though 
https://github.com/apache/celeborn/pull/2346. After leader restarts, the info 
will only be in Ratis (AbstractMetaManager), however app still fetch it from 
new leader's memory, and would fail to get it.
    
    Fix this by checking AbstractMetaManager's registration info if not found 
in memory, and properly authorize the app.
    
    ### Why are the changes needed?
    When auth enabled, and leader master restart, there will be "Registration 
information not found" error on app side, and failed to send heartbeat to 
master. It will cause app to be removed on server side after heartbeat timeout, 
causing job to fail.
    ```
    24/10/14 01:56:55 ERROR [celeborn-netty-rpc-connection-executor-3] 
client.TransportClientFactory: Exception while bootstrapping client after 71.4 
ms
    java.lang.RuntimeException: java.io.IOException: Exception in sendRpcSync 
to: celeborn-moka-test-manager-3/{ip}:9097
        at 
org.apache.celeborn.common.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:110)
        at 
org.apache.celeborn.common.network.sasl.registration.RegistrationClientBootstrap.doSaslBootstrap(RegistrationClientBootstrap.java:228)
        at 
org.apache.celeborn.common.network.sasl.registration.RegistrationClientBootstrap.doBootstrap(RegistrationClientBootstrap.java:103)
        at 
org.apache.celeborn.common.network.client.TransportClientFactory.internalCreateClient(TransportClientFactory.java:307)
        at 
org.apache.celeborn.common.network.client.TransportClientFactory.createClient(TransportClientFactory.java:205)
        at 
org.apache.celeborn.common.network.client.TransportClientFactory.createClient(TransportClientFactory.java:133)
        at 
org.apache.celeborn.common.network.client.TransportClientFactory.createClient(TransportClientFactory.java:212)
        at 
org.apache.celeborn.common.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:232)
        at 
org.apache.celeborn.common.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
        at 
org.apache.celeborn.common.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
    Caused by: java.io.IOException: Exception in sendRpcSync to: 
celeborn-moka-test-manager-3/{ip}:9097
        at 
org.apache.celeborn.common.network.client.TransportClient.sendRpcSync(TransportClient.java:324)
        at 
org.apache.celeborn.common.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:95)
        ... 13 more
    Caused by: java.util.concurrent.ExecutionException: java.io.IOException: 
java.lang.RuntimeException: Registration information not found for 
spark-402a80be70f74455b01
        at 
org.apache.celeborn.common.network.sasl.CelebornSaslServer$DigestCallbackHandler.handle(CelebornSaslServer.java:142)
        at 
java.security.sasl/com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589)
        at 
java.security.sasl/com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
        at 
org.apache.celeborn.common.network.sasl.CelebornSaslServer.response(CelebornSaslServer.java:84)
        at 
org.apache.celeborn.common.network.sasl.SaslRpcHandler.doAuthChallenge(SaslRpcHandler.java:99)
        at 
org.apache.celeborn.common.network.server.AbstractAuthRpcHandler.receive(AbstractAuthRpcHandler.java:58)
        at 
org.apache.celeborn.common.network.sasl.registration.RegistrationRpcHandler.processRpcMessage(RegistrationRpcHandler.java:175)
    ```
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Tested on dev cluster and job can properly get the secrets after master 
failover
    
    Closes #2826 from YutingWang98/fix_auth_master_ha.
    
    Authored-by: YutingWang98 <[email protected]>
    Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
---
 .../deploy/master/MasterSecretRegistryImpl.java    | 45 +++++++++++++++-------
 .../master/clustermeta/AbstractMetaManager.java    |  4 ++
 2 files changed, 36 insertions(+), 13 deletions(-)

diff --git 
a/master/src/main/java/org/apache/celeborn/service/deploy/master/MasterSecretRegistryImpl.java
 
b/master/src/main/java/org/apache/celeborn/service/deploy/master/MasterSecretRegistryImpl.java
index d78cf6628..0b9544c5c 100644
--- 
a/master/src/main/java/org/apache/celeborn/service/deploy/master/MasterSecretRegistryImpl.java
+++ 
b/master/src/main/java/org/apache/celeborn/service/deploy/master/MasterSecretRegistryImpl.java
@@ -22,29 +22,48 @@ import org.slf4j.LoggerFactory;
 
 import org.apache.celeborn.common.meta.ApplicationMeta;
 import org.apache.celeborn.common.network.sasl.SecretRegistry;
-import org.apache.celeborn.common.network.sasl.SecretRegistryImpl;
-import org.apache.celeborn.service.deploy.master.clustermeta.IMetadataHandler;
+import 
org.apache.celeborn.service.deploy.master.clustermeta.AbstractMetaManager;
 
 /**
- * A simple implementation of {@link SecretRegistry} that stores secrets in 
memory and Ratis. This
- * persists an application secret in Ratis but the deletion of that secret 
happens when
- * ApplicationLost is triggered.
+ * A simple implementation of {@link SecretRegistry} that stores secrets in 
Ratis. This persists an
+ * application secret in Ratis but the deletion of that secret happens when 
ApplicationLost is
+ * triggered.
  */
-public class MasterSecretRegistryImpl extends SecretRegistryImpl {
+public class MasterSecretRegistryImpl implements SecretRegistry {
 
   private static final Logger LOG = 
LoggerFactory.getLogger(MasterSecretRegistryImpl.class);
-  private IMetadataHandler metadataHandler;
+  private AbstractMetaManager statusSystem;
 
   @Override
   public void register(String appId, String secret) {
-    super.register(appId, secret);
-    if (metadataHandler != null) {
-      LOG.info("Persisting metadata for appId: {}", appId);
-      metadataHandler.handleApplicationMeta(new ApplicationMeta(appId, 
secret));
+    LOG.info("Persisting metadata for appId: {}", appId);
+    statusSystem.handleApplicationMeta(new ApplicationMeta(appId, secret));
+  }
+
+  @Override
+  public void unregister(String appId) {
+    LOG.info("Removing metadata for appId: {}", appId);
+    statusSystem.removeApplicationMeta(appId);
+  }
+
+  @Override
+  public String getSecretKey(String appId) {
+    String secret = null;
+    LOG.debug("Fetching secret from metadata manager for appId: {}", appId);
+    ApplicationMeta applicationMeta = statusSystem.applicationMetas.get(appId);
+    if (applicationMeta != null) {
+      secret = applicationMeta.secret();
     }
+    return secret;
+  }
+
+  @Override
+  public boolean isRegistered(String appId) {
+    LOG.info("Fetching registration status from metadata manager for appId: 
{}", appId);
+    return statusSystem.applicationMetas.containsKey(appId);
   }
 
-  void setMetadataHandler(IMetadataHandler metadataHandler) {
-    this.metadataHandler = metadataHandler;
+  void setMetadataHandler(AbstractMetaManager statusSystem) {
+    this.statusSystem = statusSystem;
   }
 }
diff --git 
a/master/src/main/java/org/apache/celeborn/service/deploy/master/clustermeta/AbstractMetaManager.java
 
b/master/src/main/java/org/apache/celeborn/service/deploy/master/clustermeta/AbstractMetaManager.java
index 5c36dd23f..67f41038c 100644
--- 
a/master/src/main/java/org/apache/celeborn/service/deploy/master/clustermeta/AbstractMetaManager.java
+++ 
b/master/src/main/java/org/apache/celeborn/service/deploy/master/clustermeta/AbstractMetaManager.java
@@ -538,6 +538,10 @@ public abstract class AbstractMetaManager implements 
IMetadataHandler {
     applicationMetas.putIfAbsent(applicationMeta.appId(), applicationMeta);
   }
 
+  public void removeApplicationMeta(String appId) {
+    applicationMetas.remove(appId);
+  }
+
   public int registeredShuffleCount() {
     return 
registeredAppAndShuffles.values().stream().mapToInt(Set::size).sum();
   }

Reply via email to