[
https://issues.apache.org/jira/browse/IGNITE-21307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808605#comment-17808605
]
Mirza Aliev edited comment on IGNITE-21307 at 1/19/24 11:39 AM:
----------------------------------------------------------------
Once this ticket is implemented, we can start work on the current ticket. We
will need to call the corresponding method of the new FailureHandler when watch
processing fails.
was (Author: maliev):
Once [this|https://issues.apache.org/jira/browse/IGNITE-20452] ticket is
implemented, we can start work on the current ticket. We will need to call the
corresponding method of the new FailureHandler, which will be propagated to all
components, when watch processing fails.
> Drop the node in case of failure in watch listener
> --------------------------------------------------
>
> Key: IGNITE-21307
> URL: https://issues.apache.org/jira/browse/IGNITE-21307
> Project: Ignite
> Issue Type: Improvement
> Reporter: Denis Chudov
> Priority: Major
> Labels: ignite-3
>
> For linearized watch processing, we have
> WatchProcessor#notificationFuture, which is reassigned on each revision
> processing and each meta storage safe time advance. If some watch processor
> completes exceptionally, no further updates will be processed, because they
> require the previous updates to have been processed successfully. This is
> implemented by chaining futures like this:
>
> {code:java}
> notificationFuture = notificationFuture
>         .thenRunAsync(() -> revisionCallback.onSafeTimeAdvanced(time), watchExecutor)
>         .whenComplete((ignored, e) -> {
>             if (e != null) {
>                 LOG.error("Error occurred when notifying safe time advanced callback", e);
>             }
>         }); {code}
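
To illustrate the chaining above, here is a minimal, self-contained sketch of
how one exceptional completion poisons every later stage. It is not the actual
WatchProcessor code: the class ChainPoisoningDemo, the advance method and the
two lists are hypothetical, and thenRun is used instead of thenRunAsync so the
result is deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class ChainPoisoningDemo {
    /** Revisions whose callback actually ran. */
    final List<Integer> processed = new ArrayList<>();

    /** Stand-in for the error log: one entry per failed or skipped stage. */
    final List<String> errors = new ArrayList<>();

    private CompletableFuture<Void> notificationFuture = CompletableFuture.completedFuture(null);

    /** Chains one more notification onto the previous one (hypothetical helper). */
    void advance(int revision, boolean fail) {
        notificationFuture = notificationFuture
                .thenRun(() -> {
                    if (fail) {
                        throw new IllegalStateException("Peers are not ready");
                    }
                    processed.add(revision);
                })
                .whenComplete((ignored, e) -> {
                    if (e != null) {
                        // The exception usually arrives wrapped in a CompletionException.
                        Throwable root = e.getCause() != null ? e.getCause() : e;
                        errors.add("revision " + revision + ": " + root.getMessage());
                    }
                });
    }

    public static void main(String[] args) {
        ChainPoisoningDemo demo = new ChainPoisoningDemo();
        demo.advance(1, false);
        demo.advance(2, true);  // fails
        demo.advance(3, false); // skipped, but the old exception is logged again
        demo.advance(4, false); // same
        System.out.println("processed: " + demo.processed);
        System.out.println("errors: " + demo.errors);
    }
}
```

Only revision 1 is processed. Revisions 3 and 4 never run their callbacks, yet
each of them produces another error entry carrying the exception from
revision 2.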
> For now, we don't have any failure handling for an exceptionally completed
> notification future. This leads to endless log records with the same
> exception stack trace, triggered by meta storage safe time advances:
>
> {code:java}
> [2024-01-16T21:42:35,515][ERROR][%isot_n_0%JRaft-FSMCaller-Disruptor-metastorage-_stripe_0-0][WatchProcessor] Error occurred when notifying safe time advanced callback
> java.util.concurrent.CompletionException: org.apache.ignite.internal.lang.IgniteInternalException: IGN-CMN-65535 TraceId:3877e098-6a1b-4f30-88a8-a4c13411d573 Peers are not ready [groupId=5_part_0]
>     at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314) ~[?:?]
>     at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319) ~[?:?]
>     at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1081) ~[?:?]
>     at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) ~[?:?]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
>     at java.lang.Thread.run(Thread.java:834) [?:?]
> Caused by: org.apache.ignite.internal.lang.IgniteInternalException: Peers are not ready [groupId=5_part_0]
>     at org.apache.ignite.internal.raft.RaftGroupServiceImpl.randomNode(RaftGroupServiceImpl.java:725) ~[ignite-raft-9.0.127-SNAPSHOT.jar:?]
>     at org.apache.ignite.internal.raft.RaftGroupServiceImpl.randomNode(RaftGroupServiceImpl.java:709) ~[ignite-raft-9.0.127-SNAPSHOT.jar:?]
>     at org.apache.ignite.internal.raft.RaftGroupServiceImpl.refreshLeader(RaftGroupServiceImpl.java:234) ~[ignite-raft-9.0.127-SNAPSHOT.jar:?]
>     at org.apache.ignite.internal.raft.RaftGroupServiceImpl.start(RaftGroupServiceImpl.java:190) ~[ignite-raft-9.0.127-SNAPSHOT.jar:?]
>     at org.apache.ignite.internal.raft.client.TopologyAwareRaftGroupService.start(TopologyAwareRaftGroupService.java:187) ~[ignite-replicator-9.0.127-SNAPSHOT.jar:?]
>     at org.apache.ignite.internal.raft.client.TopologyAwareRaftGroupServiceFactory.startRaftGroupService(TopologyAwareRaftGroupServiceFactory.java:73) ~[ignite-replicator-9.0.127-SNAPSHOT.jar:?]
>     at org.apache.ignite.internal.raft.Loza.startRaftGroupService(Loza.java:350) ~[ignite-raft-9.0.127-SNAPSHOT.jar:?]
>     at org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$27(TableManager.java:917) ~[ignite-table-9.0.127-SNAPSHOT.jar:?]
>     at org.apache.ignite.internal.util.IgniteUtils.inBusyLock(IgniteUtils.java:827) ~[ignite-core-9.0.127-SNAPSHOT.jar:?]
>     at org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$28(TableManager.java:913) ~[ignite-table-9.0.127-SNAPSHOT.jar:?]
>     at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072) ~[?:?]
>     ... 4 more {code}
> So, the node can't operate properly and just produces tons of logs. Such
> nodes should be halted.
>
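
One possible shape of the requested handling, as a rough sketch only: report
the first failure to a failure handler exactly once and stop chaining further
notifications, instead of logging the same exception on every safe time
advance. The FailureHandler interface below is a hypothetical stand-in; the
real API from IGNITE-20452 may look different. The class WatchFailureSketch and
its advance method are likewise illustrative, and thenRun replaces thenRunAsync
to keep the example deterministic.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class WatchFailureSketch {
    /** Hypothetical stand-in for the failure handler from IGNITE-20452; the real API may differ. */
    interface FailureHandler {
        void onFailure(Throwable cause);
    }

    private final FailureHandler failureHandler;

    /** Set once the chain has completed exceptionally. */
    private final AtomicBoolean failed = new AtomicBoolean();

    private CompletableFuture<Void> notificationFuture = CompletableFuture.completedFuture(null);

    WatchFailureSketch(FailureHandler failureHandler) {
        this.failureHandler = failureHandler;
    }

    /** Chains one notification; returns false if the chain has already failed. */
    boolean advance(Runnable callback) {
        if (failed.get()) {
            return false; // Nothing more can be processed, so do not chain (or log) again.
        }
        notificationFuture = notificationFuture
                .thenRun(callback)
                .whenComplete((ignored, e) -> {
                    // Report only the first failure; a real handler might halt the node here.
                    if (e != null && failed.compareAndSet(false, true)) {
                        failureHandler.onFailure(e);
                    }
                });
        return true;
    }

    public static void main(String[] args) {
        AtomicInteger reports = new AtomicInteger();
        WatchFailureSketch watches = new WatchFailureSketch(cause -> reports.incrementAndGet());
        watches.advance(() -> { throw new IllegalStateException("Peers are not ready"); });
        watches.advance(() -> { });
        watches.advance(() -> { });
        System.out.println("failure reports: " + reports.get());
    }
}
```

With this shape the handler sees the failure once, regardless of how many safe
time advances follow. Whether the node is then stopped is left to the handler.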