This is an automated email from the ASF dual-hosted git repository.
mridulm80 pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/celeborn.git
The following commit(s) were added to refs/heads/main by this push:
new 5b2030037 [CELEBORN-1664] Fix secret fetch failures after LEADER
master failover
5b2030037 is described below
commit 5b2030037aa4f5245a7a947868d1a2af975f04ca
Author: YutingWang98 <[email protected]>
AuthorDate: Tue Oct 22 14:10:08 2024 -0500
[CELEBORN-1664] Fix secret fetch failures after LEADER master failover
### What changes were proposed in this pull request?
Fix a bug related to auth under master HA mode which would cause app
failures when leader master restarts. Also, remove the secrets from memory
after app lost.
Previous implementation add the registration & secret info in leader
Master's memory, and push to other masters though
https://github.com/apache/celeborn/pull/2346. After leader restarts, the info
will only be in Ratis (AbstractMetaManager), however app still fetch it from
new leader's memory, and would fail to get it.
Fix this by checking AbstractMetaManager's registration info if not found
in memory, and properly authorize the app.
### Why are the changes needed?
When auth enabled, and leader master restart, there will be "Registration
information not found" error on app side, and failed to send heartbeat to
master. It will cause app to be removed on server side after heartbeat timeout,
causing job to fail.
```
24/10/14 01:56:55 ERROR [celeborn-netty-rpc-connection-executor-3]
client.TransportClientFactory: Exception while bootstrapping client after 71.4
ms
java.lang.RuntimeException: java.io.IOException: Exception in sendRpcSync
to: celeborn-moka-test-manager-3/{ip}:9097
at
org.apache.celeborn.common.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:110)
at
org.apache.celeborn.common.network.sasl.registration.RegistrationClientBootstrap.doSaslBootstrap(RegistrationClientBootstrap.java:228)
at
org.apache.celeborn.common.network.sasl.registration.RegistrationClientBootstrap.doBootstrap(RegistrationClientBootstrap.java:103)
at
org.apache.celeborn.common.network.client.TransportClientFactory.internalCreateClient(TransportClientFactory.java:307)
at
org.apache.celeborn.common.network.client.TransportClientFactory.createClient(TransportClientFactory.java:205)
at
org.apache.celeborn.common.network.client.TransportClientFactory.createClient(TransportClientFactory.java:133)
at
org.apache.celeborn.common.network.client.TransportClientFactory.createClient(TransportClientFactory.java:212)
at
org.apache.celeborn.common.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:232)
at
org.apache.celeborn.common.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at
org.apache.celeborn.common.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Exception in sendRpcSync to:
celeborn-moka-test-manager-3/{ip}:9097
at
org.apache.celeborn.common.network.client.TransportClient.sendRpcSync(TransportClient.java:324)
at
org.apache.celeborn.common.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:95)
... 13 more
Caused by: java.util.concurrent.ExecutionException: java.io.IOException:
java.lang.RuntimeException: Registration information not found for
spark-402a80be70f74455b01
at
org.apache.celeborn.common.network.sasl.CelebornSaslServer$DigestCallbackHandler.handle(CelebornSaslServer.java:142)
at
java.security.sasl/com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589)
at
java.security.sasl/com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
at
org.apache.celeborn.common.network.sasl.CelebornSaslServer.response(CelebornSaslServer.java:84)
at
org.apache.celeborn.common.network.sasl.SaslRpcHandler.doAuthChallenge(SaslRpcHandler.java:99)
at
org.apache.celeborn.common.network.server.AbstractAuthRpcHandler.receive(AbstractAuthRpcHandler.java:58)
at
org.apache.celeborn.common.network.sasl.registration.RegistrationRpcHandler.processRpcMessage(RegistrationRpcHandler.java:175)
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested on dev cluster and job can properly get the secrets after master
failover
Closes #2826 from YutingWang98/fix_auth_master_ha.
Authored-by: YutingWang98 <[email protected]>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
---
.../deploy/master/MasterSecretRegistryImpl.java | 45 +++++++++++++++-------
.../master/clustermeta/AbstractMetaManager.java | 4 ++
2 files changed, 36 insertions(+), 13 deletions(-)
diff --git
a/master/src/main/java/org/apache/celeborn/service/deploy/master/MasterSecretRegistryImpl.java
b/master/src/main/java/org/apache/celeborn/service/deploy/master/MasterSecretRegistryImpl.java
index d78cf6628..0b9544c5c 100644
---
a/master/src/main/java/org/apache/celeborn/service/deploy/master/MasterSecretRegistryImpl.java
+++
b/master/src/main/java/org/apache/celeborn/service/deploy/master/MasterSecretRegistryImpl.java
@@ -22,29 +22,48 @@ import org.slf4j.LoggerFactory;
import org.apache.celeborn.common.meta.ApplicationMeta;
import org.apache.celeborn.common.network.sasl.SecretRegistry;
-import org.apache.celeborn.common.network.sasl.SecretRegistryImpl;
-import org.apache.celeborn.service.deploy.master.clustermeta.IMetadataHandler;
+import
org.apache.celeborn.service.deploy.master.clustermeta.AbstractMetaManager;
/**
- * A simple implementation of {@link SecretRegistry} that stores secrets in
memory and Ratis. This
- * persists an application secret in Ratis but the deletion of that secret
happens when
- * ApplicationLost is triggered.
+ * A simple implementation of {@link SecretRegistry} that stores secrets in
Ratis. This persists an
+ * application secret in Ratis but the deletion of that secret happens when
ApplicationLost is
+ * triggered.
*/
-public class MasterSecretRegistryImpl extends SecretRegistryImpl {
+public class MasterSecretRegistryImpl implements SecretRegistry {
private static final Logger LOG =
LoggerFactory.getLogger(MasterSecretRegistryImpl.class);
- private IMetadataHandler metadataHandler;
+ private AbstractMetaManager statusSystem;
@Override
public void register(String appId, String secret) {
- super.register(appId, secret);
- if (metadataHandler != null) {
- LOG.info("Persisting metadata for appId: {}", appId);
- metadataHandler.handleApplicationMeta(new ApplicationMeta(appId,
secret));
+ LOG.info("Persisting metadata for appId: {}", appId);
+ statusSystem.handleApplicationMeta(new ApplicationMeta(appId, secret));
+ }
+
+ @Override
+ public void unregister(String appId) {
+ LOG.info("Removing metadata for appId: {}", appId);
+ statusSystem.removeApplicationMeta(appId);
+ }
+
+ @Override
+ public String getSecretKey(String appId) {
+ String secret = null;
+ LOG.debug("Fetching secret from metadata manager for appId: {}", appId);
+ ApplicationMeta applicationMeta = statusSystem.applicationMetas.get(appId);
+ if (applicationMeta != null) {
+ secret = applicationMeta.secret();
}
+ return secret;
+ }
+
+ @Override
+ public boolean isRegistered(String appId) {
+ LOG.info("Fetching registration status from metadata manager for appId:
{}", appId);
+ return statusSystem.applicationMetas.containsKey(appId);
}
- void setMetadataHandler(IMetadataHandler metadataHandler) {
- this.metadataHandler = metadataHandler;
+ void setMetadataHandler(AbstractMetaManager statusSystem) {
+ this.statusSystem = statusSystem;
}
}
diff --git
a/master/src/main/java/org/apache/celeborn/service/deploy/master/clustermeta/AbstractMetaManager.java
b/master/src/main/java/org/apache/celeborn/service/deploy/master/clustermeta/AbstractMetaManager.java
index 5c36dd23f..67f41038c 100644
---
a/master/src/main/java/org/apache/celeborn/service/deploy/master/clustermeta/AbstractMetaManager.java
+++
b/master/src/main/java/org/apache/celeborn/service/deploy/master/clustermeta/AbstractMetaManager.java
@@ -538,6 +538,10 @@ public abstract class AbstractMetaManager implements
IMetadataHandler {
applicationMetas.putIfAbsent(applicationMeta.appId(), applicationMeta);
}
+ public void removeApplicationMeta(String appId) {
+ applicationMetas.remove(appId);
+ }
+
public int registeredShuffleCount() {
return
registeredAppAndShuffles.values().stream().mapToInt(Set::size).sum();
}