[
https://issues.apache.org/jira/browse/IGNITE-21307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17833934#comment-17833934
]
Denis Chudov commented on IGNITE-21307:
---------------------------------------
[~slava.koptilin] LGTM.
> Call failure handler in case of failure in WatchProcessor
> ---------------------------------------------------------
>
> Key: IGNITE-21307
> URL: https://issues.apache.org/jira/browse/IGNITE-21307
> Project: Ignite
> Issue Type: Improvement
> Reporter: Denis Chudov
> Assignee: Vyacheslav Koptilin
> Priority: Major
> Labels: ignite-3
> Fix For: 3.0.0-beta2
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> For the linearized watch processing, we have
> WatchProcessor#notificationFuture that is rewritten for each revision
> processing and meta storage safe time advance. If some watch processor
> completes exceptionally, this means that no further updates will be
> processed, because they need the previous updates to be processed
> successfully. This is implemented in futures chaining like this:
>
> {code:java}
> notificationFuture = notificationFuture
> .thenRunAsync(() -> revisionCallback.onSafeTimeAdvanced(time),
> watchExecutor)
> .whenComplete((ignored, e) -> {
> if (e != null) {
> LOG.error("Error occurred when notifying safe time advanced
> callback", e);
> }
> }); {code}
> For now, we dont have any failure handing of exceptionally completed
> notification future. It leads to the endless log records with the same
> exception's stack trace, caused by meta storage safe time advances:
>
> {code:java}
> [2024-01-16T21:42:35,515][ERROR][%isot_n_0%JRaft-FSMCaller-Disruptor-metastorage-_stripe_0-0][WatchProcessor]
> Error occurred when notifying safe time advanced callback
> java.util.concurrent.CompletionException:
> org.apache.ignite.internal.lang.IgniteInternalException: IGN-CMN-65535
> TraceId:3877e098-6a1b-4f30-88a8-a4c13411d573 Peers are not ready
> [groupId=5_part_0]
> at
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
> ~[?:?]
> at
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
> ~[?:?]
> at
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1081)
> ~[?:?]
> at
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
> ~[?:?]
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> ~[?:?]
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> ~[?:?]
> at java.lang.Thread.run(Thread.java:834) [?:?]
> Caused by: org.apache.ignite.internal.lang.IgniteInternalException: Peers are
> not ready [groupId=5_part_0]
> at
> org.apache.ignite.internal.raft.RaftGroupServiceImpl.randomNode(RaftGroupServiceImpl.java:725)
> ~[ignite-raft-9.0.127-SNAPSHOT.jar:?]
> at
> org.apache.ignite.internal.raft.RaftGroupServiceImpl.randomNode(RaftGroupServiceImpl.java:709)
> ~[ignite-raft-9.0.127-SNAPSHOT.jar:?]
> at
> org.apache.ignite.internal.raft.RaftGroupServiceImpl.refreshLeader(RaftGroupServiceImpl.java:234)
> ~[ignite-raft-9.0.127-SNAPSHOT.jar:?]
> at
> org.apache.ignite.internal.raft.RaftGroupServiceImpl.start(RaftGroupServiceImpl.java:190)
> ~[ignite-raft-9.0.127-SNAPSHOT.jar:?]
> at
> org.apache.ignite.internal.raft.client.TopologyAwareRaftGroupService.start(TopologyAwareRaftGroupService.java:187)
> ~[ignite-replicator-9.0.127-SNAPSHOT.jar:?]
> at
> org.apache.ignite.internal.raft.client.TopologyAwareRaftGroupServiceFactory.startRaftGroupService(TopologyAwareRaftGroupServiceFactory.java:73)
> ~[ignite-replicator-9.0.127-SNAPSHOT.jar:?]
> at
> org.apache.ignite.internal.raft.Loza.startRaftGroupService(Loza.java:350)
> ~[ignite-raft-9.0.127-SNAPSHOT.jar:?]
> at
> org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$27(TableManager.java:917)
> ~[ignite-table-9.0.127-SNAPSHOT.jar:?]
> at
> org.apache.ignite.internal.util.IgniteUtils.inBusyLock(IgniteUtils.java:827)
> ~[ignite-core-9.0.127-SNAPSHOT.jar:?]
> at
> org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$28(TableManager.java:913)
> ~[ignite-table-9.0.127-SNAPSHOT.jar:?]
> at
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
> ~[?:?]
> ... 4 more {code}
> So, the node can't operate properly and just produces tons of logs. Such
> nodes should be halted.
> UPD:
> We decided to just add invocation of {{failureProcessor.process}} in all
> places in {{org.apache.ignite.internal.metastorage.server.WatchProcessor}}
> where exceptions happen, like
> {code:java}
> notifyWatchFuture = watchAndEvents.watch.onUpdate(event)
> .whenComplete((v, e) -> {
> if (e != null) {
> if (e instanceof CompletionException) {
> e = e.getCause();
> }
> if (!(e instanceof NodeStoppingException)) {
> // TODO: IGNITE-14693 Implement Meta
> storage exception handling
> LOG.error("Error occurred when processing
> a watch event", e);
> }
> watchAndEvents.watch.onError(e);
> }
> });
> {code}
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)