[ 
https://issues.apache.org/jira/browse/IGNITE-17959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17634723#comment-17634723
 ] 

Sergey Uttsel commented on IGNITE-17959:
----------------------------------------

We decided to rework the current implementation to avoid creating a new executor.

# All the logic described below is supposed to be implemented inside 
ReplicaService and ReplicaManager.
# We send an invoke, which can fail with a ReplicaUnavailableException.
## If we get a ReplicaUnavailableException, then the steps below apply.
## We create a future on which all subsequent invocations to this replica will 
await the creation of the replica. Creating the future is a point of 
synchronization, because other invokes may need to wait on the same future. So 
in the ReplicaService we keep a map to which futures are added through the 
compute method.
## We add a retry of the original invoke on this future.
## After creating the future, we send a new awaitReplicaRequest request to the 
replica.
## On receiving awaitReplicaRequest, ReplicaManager thread-safely checks if the 
replica is ready, and if it is not ready, it registers a listener on it.
## When the listener is triggered, the ReplicaManager sends a response back to 
the ReplicaService. This approach could be made reactive, but we will not do 
that for now. Once the listener has fired, it is removed.
## On receiving the awaitReplicaRequest response saying that the replica is 
ready, the future on which the invokes were waiting is completed, and the 
pending invokes are sent.
## We can also get a timeout. When a timeout occurs, we send a request to 
remove the listener.
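
The steps above can be sketched roughly as follows. This is only a minimal, single-process illustration under stated assumptions, not the actual Ignite code: the classes Manager and Service stand in for ReplicaManager and ReplicaService, ReplicaUnavailable stands in for ReplicaUnavailableException, the network is simulated by direct method calls, and every method name is hypothetical. The timeout path (removing the listener on timeout) is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical sketch: class and method names are illustrative, not Ignite's API.
class ReplicaAwaitSketch {
    /** Stand-in for ReplicaUnavailableException. */
    public static class ReplicaUnavailable extends RuntimeException {
    }

    /** Stand-in for the ReplicaManager side. */
    public static class Manager {
        private final Set<String> ready = new HashSet<>();
        private final Map<String, List<Runnable>> listeners = new HashMap<>();

        public synchronized boolean isReady(String groupId) {
            return ready.contains(groupId);
        }

        /** Handles an awaitReplicaRequest: respond at once, or register a listener. */
        public synchronized void onAwaitReplicaRequest(String groupId, Runnable respond) {
            if (ready.contains(groupId)) {
                respond.run();
            } else {
                listeners.computeIfAbsent(groupId, id -> new ArrayList<>()).add(respond);
            }
        }

        /** Marks the replica ready; listeners are removed in the same step and then fired. */
        public void replicaStarted(String groupId) {
            List<Runnable> fired;
            synchronized (this) {
                ready.add(groupId);
                fired = listeners.remove(groupId);
            }
            if (fired != null) {
                fired.forEach(Runnable::run);
            }
        }
    }

    /** Stand-in for the ReplicaService side. */
    public static class Service {
        private final Manager manager;
        // One "replica is ready" future per replication group, shared by all waiting invokes.
        private final Map<String, CompletableFuture<Void>> pending = new ConcurrentHashMap<>();

        public Service(Manager manager) {
            this.manager = manager;
        }

        /** Simulated network invoke: fails while the replica is not ready. */
        private CompletableFuture<String> send(String groupId, Supplier<String> op) {
            CompletableFuture<String> fut = new CompletableFuture<>();
            if (manager.isReady(groupId)) {
                fut.complete(op.get());
            } else {
                fut.completeExceptionally(new ReplicaUnavailable());
            }
            return fut;
        }

        public CompletableFuture<String> invoke(String groupId, Supplier<String> op) {
            return send(groupId, op).<CompletableFuture<String>>handle((res, err) -> {
                Throwable cause = err instanceof CompletionException ? err.getCause() : err;
                if (!(cause instanceof ReplicaUnavailable)) {
                    // Success or an unrelated failure: pass it through unchanged.
                    CompletableFuture<String> same = new CompletableFuture<>();
                    if (err == null) {
                        same.complete(res);
                    } else {
                        same.completeExceptionally(err);
                    }
                    return same;
                }
                // compute() is the synchronization point: only one caller creates the future.
                boolean[] created = {false};
                CompletableFuture<Void> readyFut = pending.compute(groupId, (id, fut) -> {
                    if (fut != null) {
                        return fut;
                    }
                    created[0] = true;
                    return new CompletableFuture<>();
                });
                // The retry of the original invoke is chained onto the shared future.
                CompletableFuture<String> retry = readyFut.thenCompose(v -> send(groupId, op));
                if (created[0]) {
                    // Only the creator sends the (simulated) awaitReplicaRequest.
                    manager.onAwaitReplicaRequest(groupId, () -> {
                        pending.remove(groupId, readyFut);
                        readyFut.complete(null);
                    });
                }
                return retry;
            }).thenCompose(f -> f);
        }
    }
}
```

In this sketch, two invokes issued before Manager.replicaStarted is called share a single future and a single await request, and both are retried once the replica is reported ready.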

> ReplicaUnavailableException: Replica is not ready.
> --------------------------------------------------
>
>                 Key: IGNITE-17959
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17959
>             Project: Ignite
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 3.0.0-alpha5
>            Reporter: Evgeny Stanilovsky
>            Assignee: Sergey Uttsel
>            Priority: Major
>              Labels: ignite-3
>         Attachments: err.log, err2.log
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> h2. *Motivation*
> The following can frequently be observed in long-running tests (full message 
> attached):
> {noformat}
> Caused by: 
> org.apache.ignite.internal.replicator.exception.ReplicaUnavailableException: 
> IGN-REP-5 TraceId:82267e0a-aca2-47a3-806e-7922ed61d6d3 Replica is not ready 
> [replicationGroupId=b5b3a2e5-1342-4a90-97b3-a46e9509a1d6_part_5, 
> nodeName=iist_n_1]
> {noformat}
> See, for example, the test ItIndexSpoolTest.test: running it many times, or 
> until failure, reproduces the problem.
> Additionally, we can observe (err2.log attached):
> {noformat}
> 2022-10-24 13:23:52:308 +0300 
> [WARNING][%iist_n_1%Raft-Group-Client-4][RaftGroupServiceImpl] Recoverable 
> error during the request type=ActionRequestImpl occurred (will be retried on 
> the randomly selected node): 
> java.util.concurrent.CompletionException: 
> java.util.concurrent.TimeoutException
>       at 
> java.base/java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367)
>       at 
> java.base/java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:376)
>       at 
> java.base/java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:1019)
>       at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>       at 
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
>       at 
> java.base/java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2792)
>       at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>       at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>       at 
> java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>       at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.util.concurrent.TimeoutException
>       ... 7 more
> 2022-10-24 13:24:13:437 +0300 
> [WARNING][%iist_n_1%Raft-Group-Client-2][RaftGroupServiceImpl] Recoverable 
> error during the request type=ActionRequestImpl occurred (will be retried on 
> the randomly selected node): 
> java.util.concurrent.CompletionException: 
> java.util.concurrent.TimeoutException
>       at 
> java.base/java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367)
>       at 
> java.base/java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:376)
>       at 
> java.base/java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:1019)
>       at 
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>       at 
> java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
>       at 
> java.base/java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2792)
>       at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>       at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>       at 
> java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>       at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: java.util.concurrent.TimeoutException
>       ... 7 more
> {noformat}
> h2. *Implementation Notes*
> The root cause is a race between table creation and an operation on the 
> table. So it's possible that the Replica isn't yet created in ReplicaManager 
> when an rw/ro operation is invoked. The operation needs to be retried several 
> times if a ReplicaUnavailableException is thrown.
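
The bounded retry mentioned in the implementation notes above could look roughly like this. It is a sketch under assumptions: RetrySketch, withRetry and maxRetries are hypothetical names, ReplicaUnavailable stands in for ReplicaUnavailableException, and the retry is immediate, with no delay or backoff between attempts.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.function.Supplier;

// Hypothetical sketch; names are illustrative and not taken from Ignite.
class RetrySketch {
    /** Stand-in for ReplicaUnavailableException. */
    public static class ReplicaUnavailable extends RuntimeException {
    }

    /** Runs op, retrying up to maxRetries more times while it fails with ReplicaUnavailable. */
    public static <T> CompletableFuture<T> withRetry(Supplier<CompletableFuture<T>> op, int maxRetries) {
        return op.get().<CompletableFuture<T>>handle((res, err) -> {
            Throwable cause = err instanceof CompletionException ? err.getCause() : err;
            if (cause instanceof ReplicaUnavailable && maxRetries > 0) {
                // The replica is not ready yet: try again with one retry fewer.
                return withRetry(op, maxRetries - 1);
            }
            // Success, an unrelated failure, or retries exhausted: pass the outcome through.
            CompletableFuture<T> same = new CompletableFuture<>();
            if (err == null) {
                same.complete(res);
            } else {
                same.completeExceptionally(err);
            }
            return same;
        }).thenCompose(f -> f);
    }
}
```

A caller would wrap the operation in a supplier, for example withRetry(() -> someInvoke(), 3), where someInvoke is whatever method returns the operation's future.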



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
