[ 
https://issues.apache.org/jira/browse/IGNITE-21307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808605#comment-17808605
 ] 

Mirza Aliev edited comment on IGNITE-21307 at 1/19/24 11:40 AM:
----------------------------------------------------------------

Once IGNITE-20452 (https://issues.apache.org/jira/browse/IGNITE-20452) is 
implemented, we can start work on this ticket. We will need to call the 
corresponding method of the new FailureHandler when watch processing 
fails.


was (Author: maliev):
Once this ticket will be implemented, we can start the current ticket and we 
will need to call the corresponding method from the new FailureHandler when 
watch processing is failed.

> Drop the node in case of failure in watch listener
> --------------------------------------------------
>
>                 Key: IGNITE-21307
>                 URL: https://issues.apache.org/jira/browse/IGNITE-21307
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Denis Chudov
>            Priority: Major
>              Labels: ignite-3
>
> For linearized watch processing, we have 
> WatchProcessor#notificationFuture, which is reassigned on each revision 
> processing and on each meta storage safe time advance. If watch processing 
> completes exceptionally for some update, no further updates will be 
> processed, because each update requires the previous ones to have been 
> processed successfully. This is implemented by chaining futures like this:
>  
> {code:java}
> notificationFuture = notificationFuture
>         .thenRunAsync(() -> revisionCallback.onSafeTimeAdvanced(time), watchExecutor)
>         .whenComplete((ignored, e) -> {
>             if (e != null) {
>                 LOG.error("Error occurred when notifying safe time advanced callback", e);
>             }
>         });
> {code}
> For now, we don't have any failure handling for an exceptionally completed 
> notification future. This leads to endless log records with the same 
> exception stack trace, one for each meta storage safe time advance:
>  
> {code:java}
> [2024-01-16T21:42:35,515][ERROR][%isot_n_0%JRaft-FSMCaller-Disruptor-metastorage-_stripe_0-0][WatchProcessor]
>  Error occurred when notifying safe time advanced callback
> java.util.concurrent.CompletionException: 
> org.apache.ignite.internal.lang.IgniteInternalException: IGN-CMN-65535 
> TraceId:3877e098-6a1b-4f30-88a8-a4c13411d573 Peers are not ready 
> [groupId=5_part_0]
>     at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
>  ~[?:?]
>     at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
>  ~[?:?]
>     at 
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1081)
>  ~[?:?]
>     at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
>  ~[?:?]
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  ~[?:?]
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  ~[?:?]
>     at java.lang.Thread.run(Thread.java:834) [?:?]
> Caused by: org.apache.ignite.internal.lang.IgniteInternalException: Peers are 
> not ready [groupId=5_part_0]
>     at 
> org.apache.ignite.internal.raft.RaftGroupServiceImpl.randomNode(RaftGroupServiceImpl.java:725)
>  ~[ignite-raft-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.raft.RaftGroupServiceImpl.randomNode(RaftGroupServiceImpl.java:709)
>  ~[ignite-raft-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.raft.RaftGroupServiceImpl.refreshLeader(RaftGroupServiceImpl.java:234)
>  ~[ignite-raft-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.raft.RaftGroupServiceImpl.start(RaftGroupServiceImpl.java:190)
>  ~[ignite-raft-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.raft.client.TopologyAwareRaftGroupService.start(TopologyAwareRaftGroupService.java:187)
>  ~[ignite-replicator-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.raft.client.TopologyAwareRaftGroupServiceFactory.startRaftGroupService(TopologyAwareRaftGroupServiceFactory.java:73)
>  ~[ignite-replicator-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.raft.Loza.startRaftGroupService(Loza.java:350) 
> ~[ignite-raft-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$27(TableManager.java:917)
>  ~[ignite-table-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.util.IgniteUtils.inBusyLock(IgniteUtils.java:827) 
> ~[ignite-core-9.0.127-SNAPSHOT.jar:?]
>     at 
> org.apache.ignite.internal.table.distributed.TableManager.lambda$startPartitionAndStartClient$28(TableManager.java:913)
>  ~[ignite-table-9.0.127-SNAPSHOT.jar:?]
>     at 
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1072)
>  ~[?:?]
>     ... 4 more {code}
> So, the node can't operate properly and just produces tons of logs. Such 
> nodes should be halted.
>  
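
The behaviour described above can be reproduced with plain java.util.concurrent, without any Ignite classes. The following is a minimal sketch under stated assumptions: the class WatchChainDemo, the method simulate and the failureHandler consumer are hypothetical names used only for illustration. In particular, failureHandler is a stand-in for the FailureHandler expected from IGNITE-20452, whose actual API is not shown in this thread. The sketch fails the first callback, keeps chaining further "safe time advances" onto the same future, and counts how often the callback runs, how often the error would be logged, and how often the failure hook fires if it is guarded to fire only once.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

public final class WatchChainDemo {
    /** Returns {callbacks run, errors logged, failure-handler calls} after the given number of advances. */
    public static int[] simulate(int advances) {
        ExecutorService watchExecutor = Executors.newSingleThreadExecutor();
        AtomicInteger callbacksRun = new AtomicInteger();
        AtomicInteger errorsLogged = new AtomicInteger();
        AtomicInteger handlerCalls = new AtomicInteger();
        AtomicBoolean failureReported = new AtomicBoolean();

        // Hypothetical hook: stands in for the FailureHandler from IGNITE-20452 (not the real API).
        Consumer<Throwable> failureHandler = e -> handlerCalls.incrementAndGet();

        CompletableFuture<Void> notificationFuture = CompletableFuture.completedFuture(null);
        try {
            for (int i = 0; i < advances; i++) {
                boolean fail = (i == 0); // only the first callback throws
                notificationFuture = notificationFuture
                        .thenRunAsync(() -> {
                            callbacksRun.incrementAndGet();
                            if (fail) {
                                throw new IllegalStateException("Peers are not ready");
                            }
                        }, watchExecutor)
                        .whenComplete((ignored, e) -> {
                            if (e != null) {
                                // Corresponds to the LOG.error call in the original chain.
                                errorsLogged.incrementAndGet();
                                // Report the failure once instead of on every later advance.
                                if (failureReported.compareAndSet(false, true)) {
                                    failureHandler.accept(e);
                                }
                            }
                        });
            }
            // Wait for the whole chain; handle() swallows the exception so join() does not throw.
            notificationFuture.handle((r, e) -> null).join();
        } finally {
            watchExecutor.shutdown();
        }
        return new int[] {callbacksRun.get(), errorsLogged.get(), handlerCalls.get()};
    }

    public static void main(String[] args) {
        int[] r = simulate(3);
        System.out.println("callbacks=" + r[0] + " errors=" + r[1] + " failureHandlerCalls=" + r[2]);
    }
}
```

With three advances, only the first callback runs, the error branch is hit three times, and the guarded hook fires once. This matches the report: once the future is completed exceptionally, every later stage is skipped and the same exception is logged again. Whether the real handler should be guarded this way, or should simply stop the node on the first failure, is a design decision for the implementation.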



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
