[
https://issues.apache.org/jira/browse/STORM-4077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878594#comment-17878594
]
Hugo Abreu commented on STORM-4077:
-----------------------------------
We have found the root cause. We'll update the description as soon as possible.
Essentially, because the modTime is used as the version, we have found that, while
using the {{LocalFsBlobStoreFile}}, every time the Nimbus leader goes down the
following occurs:
# Nimbus (1), the leader, goes down and a new Nimbus (2) picks up the leadership.
# If the blobs on Nimbus (2) have a different modTime, the workers are restarted
(even though the blob contents may be identical).
# Nimbus (1) comes back up, syncs the blobs during startup and updates their
modTime, since it downloads the blobs again.
# If Nimbus (2) then goes down, all the workers are restarted again, because
Nimbus (1) now has a newer modTime.
# This can repeat endlessly, as the modTime will always differ between Nimbus
leaders (see the sketch below).
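For context, here is a minimal sketch of the behaviour described above (our own illustration, not the actual Storm code; {{ModTimeVersionSketch}} and its {{getVersion}} helper are hypothetical stand-ins for the modTime-based version of {{LocalFsBlobStoreFile}}):
{code:java}
import java.io.File;
import java.io.IOException;

// Hypothetical illustration: if the version is whatever lastModified() returns,
// then re-downloading the same blob on another Nimbus yields a new "version".
class ModTimeVersionSketch {
    // Assumed stand-in for a modTime-based BlobStoreFile version
    static long getVersion(File blob) {
        return blob.lastModified();
    }

    public static void main(String[] args) throws IOException {
        File blob = File.createTempFile("blob", ".jar");
        long v1 = getVersion(blob);
        // Simulate a Nimbus re-downloading the identical blob after a restart:
        blob.setLastModified(System.currentTimeMillis() + 60_000);
        long v2 = getVersion(blob);
        // v1 != v2 even though the contents never changed -> workers restart.
        System.out.println(v1 + " vs " + v2);
    }
}{code}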
We suggest a new method that obtains the file version:
{code:java}
public abstract class BlobStoreFile {
    public abstract long getModTime() throws IOException;

    // Defaults to the current behaviour: the modification time is the version.
    public long getVersion() throws IOException {
        return getModTime();
    }
}{code}
Implementations that don't override it keep the current modTime-based behaviour,
and for the local-filesystem blob store the override would be something along
the lines of:
{code:java}
public long getVersion() throws IOException {
    // Hash the blob contents so the version is identical on every Nimbus host.
    try (FileInputStream in = new FileInputStream(path)) {
        byte[] digest = DigestUtils.sha1(in);
        return Arrays.hashCode(digest);
    }
} {code}
This still needs performance testing (reading the input stream and computing the
array hashCode), but we have built a jar based on this change and it fixes the
issue: a Nimbus leader going down no longer kills workers, because the blob
version stays the same.
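For illustration only (not part of the proposed patch), a quick way to check that a content-based version stays stable across re-downloads; this assumes commons-codec's {{DigestUtils}} is on the classpath, and {{ContentVersionCheck}} is just a throwaway name:
{code:java}
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import org.apache.commons.codec.digest.DigestUtils;

class ContentVersionCheck {
    // Same idea as the proposed override: hash the contents, not the timestamp.
    static long getVersion(File blob) throws IOException {
        try (FileInputStream in = new FileInputStream(blob)) {
            return Arrays.hashCode(DigestUtils.sha1(in));
        }
    }

    public static void main(String[] args) throws IOException {
        File original = File.createTempFile("blob", ".jar");
        Files.write(original.toPath(), "same payload".getBytes());
        File redownloaded = File.createTempFile("blob-copy", ".jar");
        Files.copy(original.toPath(), redownloaded.toPath(), StandardCopyOption.REPLACE_EXISTING);
        // Different files, different modTimes, identical contents -> identical versions.
        System.out.println(getVersion(original) == getVersion(redownloaded));
    }
}{code}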
We'll open a PR when possible.
Cheers!
> Worker being reassigned when Nimbus leadership changes
> ------------------------------------------------------
>
> Key: STORM-4077
> URL: https://issues.apache.org/jira/browse/STORM-4077
> Project: Apache Storm
> Issue Type: Bug
> Affects Versions: 2.6.1
> Reporter: Pedro Azevedo
> Priority: Major
>
> Hey guys, I'm using Storm v2.6.1 and every time I restart the Nimbus leader
> (currently I have 3 for high availability) the workers get reassigned. This is
> bad behaviour, as every topology will have no workers running for a certain
> period (until new workers are assigned) due to a Nimbus leadership change.
> On another note, when stopping the Nimbus I'm getting this error, which seems
> to be impacting the graceful shutdown.
> {code:java}
> 2024-08-21T15:09:46.647Z Nimbus [INFO] Shutting down master
> 2024-08-21T15:09:46.648Z CuratorFrameworkImpl [INFO] backgroundOperationsLoop exiting
> 2024-08-21T15:09:46.752Z ClientCnxn [INFO] EventThread shut down for session: 0x4000010be7caa5d
> 2024-08-21T15:09:46.752Z ZooKeeper [INFO] Session: 0x4000010be7caa5d closed
> 2024-08-21T15:09:46.752Z CuratorFrameworkImpl [INFO] backgroundOperationsLoop exiting
> 2024-08-21T15:09:46.812Z ProcessFunction [ERROR] Internal error processing getLeader
> java.lang.IllegalStateException: Expected state [STARTED] was [STOPPED]
>     at org.apache.storm.shade.org.apache.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:835) ~[storm-shaded-deps-2.6.1.jar:2.6.1]
>     at org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkState(CuratorFrameworkImpl.java:462) ~[storm-shaded-deps-2.6.1.jar:2.6.1]
>     at org.apache.storm.shade.org.apache.curator.framework.imps.CuratorFrameworkImpl.getChildren(CuratorFrameworkImpl.java:507) ~[storm-shaded-deps-2.6.1.jar:2.6.1]
>     at org.apache.storm.zookeeper.ClientZookeeper.getChildren(ClientZookeeper.java:209) ~[storm-client-2.6.1.jar:2.6.1]
>     at org.apache.storm.cluster.ZKStateStorage.get_children(ZKStateStorage.java:155) ~[storm-client-2.6.1.jar:2.6.1]
>     at org.apache.storm.cluster.StormClusterStateImpl.nimbuses(StormClusterStateImpl.java:279) ~[storm-client-2.6.1.jar:2.6.1]
>     at org.apache.storm.daemon.nimbus.Nimbus.getLeader(Nimbus.java:4907) ~[storm-server-2.6.1.jar:2.6.1]
>     at org.apache.storm.generated.Nimbus$Processor$getLeader.getResult(Nimbus.java:5168) ~[storm-client-2.6.1.jar:2.6.1]
>     at org.apache.storm.generated.Nimbus$Processor$getLeader.getResult(Nimbus.java:5144) ~[storm-client-2.6.1.jar:2.6.1]
>     at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:40) [storm-shaded-deps-2.6.1.jar:2.6.1]
>     at org.apache.storm.thrift.TBaseProcessor.process(TBaseProcessor.java:40) [storm-shaded-deps-2.6.1.jar:2.6.1]
>     at org.apache.storm.security.auth.SimpleTransportPlugin$SimpleWrapProcessor.process(SimpleTransportPlugin.java:171) [storm-client-2.6.1.jar:2.6.1]
>     at org.apache.storm.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:492) [storm-shaded-deps-2.6.1.jar:2.6.1]
>     at org.apache.storm.thrift.server.Invocation.run(Invocation.java:19) [storm-shaded-deps-2.6.1.jar:2.6.1]
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
>     at java.base/java.lang.Thread.run(Thread.java:829) [?:?]{code}