[
https://issues.apache.org/jira/browse/STORM-3984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Masatake Iwasaki updated STORM-3984:
------------------------------------
Description:
Since [TimeOutWorkerHeartbeatsRecoveryStrategy.startTimeSecs is set on Nimbus start-up|https://github.com/apache/storm/blob/4374fdc93982db3e511ed5e8af642eef97c89aae/storm-server/src/main/java/org/apache/storm/nimbus/TimeOutWorkerHeartbeatsRecoveryStrategy.java#L48]
and never updated, [TimeOutWorkerHeartbeatsRecoveryStrategy#exceedsMaxTimeOut|https://github.com/apache/storm/blob/4374fdc93982db3e511ed5e8af642eef97c89aae/storm-server/src/main/java/org/apache/storm/nimbus/TimeOutWorkerHeartbeatsRecoveryStrategy.java#L68-L70]
always returns true once 600s (the value of
supervisor.worker.heartbeats.max.timeout.secs) have passed since Nimbus
started.
This invalid timeout in the new leader Nimbus causes an unnecessary
reassignment right after failover.
{noformat}
2023-09-25 15:16:46.538 o.a.s.n.NimbusInfo main-EventThread [INFO] Nimbus figures out its name to h02
2023-09-25 15:16:46.549 o.a.s.n.LeaderListenerCallback main-EventThread [INFO] Sync remote assignments and id-info to local
2023-09-25 15:16:46.571 o.a.s.n.LeaderListenerCallback main-EventThread [INFO] active-topology-blobs [word-count-1-1695654263] local-topology-blobs [word-count-1-1695654263-stormconf.ser,word-count-1-1695654263-stormjar.jar,word-count-1-1695654263-stormcode.ser] diff-topology-blobs []
2023-09-25 15:16:46.596 o.a.s.n.LeaderListenerCallback main-EventThread [INFO] active-topology-dependencies [] local-blobs [word-count-1-1695654263-stormconf.ser,word-count-1-1695654263-stormjar.jar,word-count-1-1695654263-stormcode.ser] diff-topology-dependencies []
2023-09-25 15:16:46.596 o.a.s.n.LeaderListenerCallback main-EventThread [INFO] Accepting leadership, all active topologies and corresponding dependencies found locally.
2023-09-25 15:16:46.596 o.a.s.z.LeaderListenerCallbackFactory main-EventThread [INFO] h02 gained leadership.
2023-09-25 15:16:46.744 o.a.s.n.TimeOutWorkerHeartbeatsRecoveryStrategy timer [WARN] Failed to recover heartbeats for nodes: [c26e72ef-b84b-4d44-820a-fec9407e38cf-172.18.0.11, 57ff205e-6d90-4305-abb8-b9ff0ff7bcc3-172.18.0.13, f10f6554-0e55-4c01-a6ce-834df068d753-172.18.0.12] with timeout 600s
2023-09-25 15:16:46.807 o.a.s.d.n.HeartbeatCache timer [INFO] Executor word-count-1-1695654263:[8, 8] not alive
2023-09-25 15:16:46.808 o.a.s.d.n.HeartbeatCache timer [INFO] Executor word-count-1-1695654263:[16, 16] not alive
...(snip)
2023-09-25 15:16:46.862 o.a.s.d.n.Nimbus timer [INFO] Reassigning word-count-1-1695654263 to 3 slots
2023-09-25 15:16:46.862 o.a.s.d.n.Nimbus timer [INFO] Reassign executors: [[20, 20], [14, 14], [12, 12], [16, 16], [18, 18], [28, 28], [26, 26], [10, 10], [8, 8], [24, 24], [6, 6], [22, 22], [2, 2], [4, 4], [13, 13], [11, 11], [7, 7], [9, 9], [19, 19], [23, 23], [21, 21], [25, 25], [27, 27], [5, 5], [1, 1], [3, 3], [15, 15], [17, 17]]
{noformat}
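The behavior described above can be replayed with a small stand-alone model. This is only a sketch under stated assumptions: RecoveryTimeoutSketch, the injected clock, and resetStartTime() are hypothetical names introduced for illustration, not Storm's actual code or the actual fix. It assumes the check compares the time elapsed since start-up against the configured maximum.

```java
import java.util.function.LongSupplier;

/** Simplified, hypothetical model of the timeout check (not Storm's actual class). */
final class RecoveryTimeoutSketch {
    private final LongSupplier clockSecs; // injected clock (assumption, for replayability)
    private final long maxTimeoutSecs;    // stands in for supervisor.worker.heartbeats.max.timeout.secs
    private long startTimeSecs;           // set once at construction, as on Nimbus start-up

    RecoveryTimeoutSketch(LongSupplier clockSecs, long maxTimeoutSecs) {
        this.clockSecs = clockSecs;
        this.maxTimeoutSecs = maxTimeoutSecs;
        this.startTimeSecs = clockSecs.getAsLong();
    }

    /** Elapsed time is measured from start-up, so it only ever grows. */
    boolean exceedsMaxTimeOut() {
        return clockSecs.getAsLong() - startTimeSecs > maxTimeoutSecs;
    }

    /** Hypothetical remedy: restart the timeout window, e.g. when leadership is gained. */
    void resetStartTime() {
        startTimeSecs = clockSecs.getAsLong();
    }

    public static void main(String[] args) {
        long[] now = {0L};
        RecoveryTimeoutSketch s = new RecoveryTimeoutSketch(() -> now[0], 600L);

        now[0] = 601L; // stand-by Nimbus has been up for more than 600s when it becomes leader
        System.out.println(s.exceedsMaxTimeOut()); // true: recovery is treated as timed out

        s.resetStartTime(); // window restarted at failover
        System.out.println(s.exceedsMaxTimeOut()); // false: heartbeats get time to recover
    }
}
```

In this model, resetting the start time when leadership is gained gives supervisors the full timeout window to report heartbeats; whether the real fix takes this shape is not stated in this issue.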
> Nimbus failover causes unnecessary reassign if 600s are passed after starting
> Nimbus
> ------------------------------------------------------------------------------------
>
> Key: STORM-3984
> URL: https://issues.apache.org/jira/browse/STORM-3984
> Project: Apache Storm
> Issue Type: Bug
> Components: storm-server
> Affects Versions: 2.5.0
> Reporter: Masatake Iwasaki
> Assignee: Masatake Iwasaki
> Priority: Major
> Fix For: 2.6.0
>
> Time Spent: 0.5h
> Remaining Estimate: 0h