[jira] [Commented] (IGNITE-21307) Call failure handler in case of failure in WatchProcessor

Denis Chudov (Jira) Thu, 04 Apr 2024 06:44:08 -0700


    [ 
https://issues.apache.org/jira/browse/IGNITE-21307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17833934#comment-17833934
 ]


Denis Chudov commented on IGNITE-21307:
---------------------------------------

[~slava.koptilin] LGTM.

> Call failure handler in case of failure in WatchProcessor
> ---------------------------------------------------------
>
>                 Key: IGNITE-21307
>                 URL: https://issues.apache.org/jira/browse/IGNITE-21307
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Denis Chudov
>            Assignee: Vyacheslav Koptilin
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-beta2
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> For the linearized watch processing, we have 
> WatchProcessor#notificationFuture that is rewritten for each revision 
> processing and meta storage safe time advance. If some watch processor 
> completes exceptionally, this means that no further updates will be 
> processed, because they need the previous updates to be processed 
> successfully. This is implemented in futures chaining like this:
>  
> {code:java}
> notificationFuture = notificationFuture
>         .thenRunAsync(() -> revisionCallback.onSafeTimeAdvanced(time), 
> watchExecutor)
>         .whenComplete((ignored, e) -> {
>             if (e != null) {
>                 LOG.error("Error occurred when notifying safe time advanced 
> callback", e);
>             }
>         }); {code}
> For now, we dont have any failure handing of exceptionally completed 
> notification future. It leads to the endless log records with the same 
> exception's stack trace, caused by meta storage safe time advances:
>  
> {code:java}
> [2024-01-16T21:42:35,515][ERROR][%isot_n_0%JRaft-FSMCaller-Disruptor-metastorage-_stripe_0-0][WatchProcessor]
>  Error occurred when notifying safe time advanced callback
> java.util.concurrent.CompletionException: 
> org.apache.ignite.internal.lang.IgniteInternalException: IGN-CMN-65535 
> TraceId:3877e098-6a1b-4f30-88a8-a4c13411d573 Peers are not ready 
> [groupId=5_part_0]
>     at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
>  ~[?:?]
>     at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
>  ~[?:?]
>     at 
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1081)
>  ~[?:?]
>     at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
>  ~[?:?]
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  ~[?:?]
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  ~[?:?]
>     at java.lang.Thread.run(Thread.java:834) [?:?]
> Caused by: org.apache.ignite.internal.lang.IgniteInternalException: Peers are 
> not ready [groupId=5_part_0]
>     at 
> org.apache.ignite.internal.raft.RaftGroupServiceImpl.randomNode(RaftGroupServiceImpl.java:725)
>  ~[ignite-raft-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.raft.RaftGroupServiceImpl.randomNode(RaftGroupServiceImpl.java:709)
>  ~[ignite-raft-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.raft.RaftGroupServiceImpl.refreshLeader(RaftGroupServiceImpl.java:234)
>  ~[ignite-raft-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.raft.RaftGroupServiceImpl.start(RaftGroupServiceImpl.java:190)
>  ~[ignite-raft-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.raft.client.TopologyAwareRaftGroupService.start(TopologyAwareRaftGroupService.java:187)
>  ~[ignite-replicator-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.raft.client.TopologyAwareRaftGroupServiceFactory.startRaftGroupService(TopologyAwareRaftGroupServiceFactory.java:73)
>  ~[ignite-replicator-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.raft.Loza.startRaftGroupService(Loza.java:350) 
> ~[ignite-raft-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$27(TableManager.java:917)
>  ~[ignite-table-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.util.IgniteUtils.inBusyLock(IgniteUtils.java:827) 
> ~[ignite-core-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$28(TableManager.java:913)
>  ~[ignite-table-9.0.127-SNAPSHOT.jar:?]
>     at 
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
>  ~[?:?]
>     ... 4 more {code}
> So, the node can't operate properly and just produces tons of logs. Such 
> nodes should be halted.
> UPD:
> We decided to just add invocation of {{failureProcessor.process}} in all 
> places in {{org.apache.ignite.internal.metastorage.server.WatchProcessor}} 
> where exceptions happen, like 
> {code:java}
>                 notifyWatchFuture = watchAndEvents.watch.onUpdate(event)
>                         .whenComplete((v, e) -> {
>                             if (e != null) {
>                                 if (e instanceof CompletionException) {
>                                     e = e.getCause();
>                                 }
>                                 if (!(e instanceof NodeStoppingException)) {
>                                     // TODO: IGNITE-14693 Implement Meta 
> storage exception handling
>                                     LOG.error("Error occurred when processing 
> a watch event", e);
>                                 }
>                                 watchAndEvents.watch.onError(e);
>                             }
>                         });
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (IGNITE-21307) Call failure handler in case of failure in WatchProcessor

Reply via email to