Vladislav Pyatkov created IGNITE-20723:
------------------------------------------
Summary: Tests fail on TC because a primary replica is not
assigned or does not respond
Key: IGNITE-20723
URL: https://issues.apache.org/jira/browse/IGNITE-20723
Project: Ignite
Issue Type: Bug
Reporter: Vladislav Pyatkov
TC run is available
[here|https://ci.ignite.apache.org/buildConfiguration/ApacheIgnite3xGradle_Test_IntegrationTests_ModuleRunnerSqlLogic/7584713?hideProblemsFromDependencies=false&hideTestsFromDependencies=false&expandBuildProblemsSection=true&expandBuildChangesSection=true&expandBuildTestsSection=true].
By my brig analysis, the issue is somewhere in the assignments:
{noformat}
[2023-10-06T13:03:51,231][INFO
][%sqllogic0%Raft-Group-Client-1][PlacementDriverManager] Placement driver
active actor is starting.
[2023-10-06T13:08:38,981][INFO
][%sqllogic1%MessagingService-inbound--0][ReplicaManager] Received
LeaseGrantedMessage for replica belonging to group=291_part_3, force=false
[2023-10-06T13:08:38,981][INFO
][%sqllogic1%MessagingService-inbound--0][ReplicaManager] Waiting for actual
storage state, group=291_part_3
[2023-10-06T13:08:38,981][INFO
][%sqllogic1%JRaft-Request-Processor-5][ReplicaManager] Lease accepted,
group=291_part_3, leaseStartTime=HybridTimestamp [time=111188228107862067,
physical=1696597718931, logical=51], leaseExpirationTime=HybridTimestamp
[time=111188235972182016, physical=1696597838931, logical=0]
[2023-10-06T13:08:50,256][WARN
][CompletableFutureDelayScheduler][RaftGroupServiceImpl] Recoverable error
during the request type=ActionRequestImpl occurred (will be retried on the
randomly selected node):
java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
at
java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367)
~[?:?]
at
java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:376)
~[?:?]
at
java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:1019)
~[?:?]
at
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
[?:?]
at
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
[?:?]
at
java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2792)
[?:?]
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
[?:?]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
[?:?]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
[?:?]
at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.util.concurrent.TimeoutException
... 7 more
at
app//org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:42)
at app//org.junit.jupiter.api.Assertions.fail(Assertions.java:147)
at
app//org.apache.ignite.internal.sqllogic.Statement.execute(Statement.java:112)
at
app//org.apache.ignite.internal.sqllogic.SqlScriptRunner.run(SqlScriptRunner.java:70)
at
app//org.junit.jupiter.api.AssertTimeoutPreemptively.lambda$assertTimeoutPreemptively$0(AssertTimeoutPreemptively.java:48)
at
app//org.junit.jupiter.api.AssertTimeoutPreemptively.lambda$submitTask$3(AssertTimeoutPreemptively.java:95)
at
[email protected]/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at
[email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at
[email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at [email protected]/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.ignite.tx.TransactionException: IGN-REP-3
TraceId:0002850d-2b24-4356-a0dc-2c25af4202a1
org.apache.ignite.internal.replicator.exception.ReplicationTimeoutException:
IGN-REP-3 TraceId:0002850d-2b24-4356-a0dc-2c25af4202a1 Replication is timed out
[replicaGrpId=291_part_3]
at
[email protected]/java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:710)
at
app//org.apache.ignite.internal.util.ExceptionUtils$1.copy(ExceptionUtils.java:772)
at
app//org.apache.ignite.internal.util.ExceptionUtils$ExceptionFactory.createCopy(ExceptionUtils.java:706)
at
app//org.apache.ignite.internal.util.ExceptionUtils.copyExceptionWithCause(ExceptionUtils.java:543)
at
app//org.apache.ignite.internal.util.ExceptionUtils.copyExceptionWithCauseInternal(ExceptionUtils.java:641)
at
app//org.apache.ignite.internal.util.ExceptionUtils.copyExceptionWithCause(ExceptionUtils.java:494)
at
app//org.apache.ignite.internal.sql.AbstractSession.execute(AbstractSession.java:63)
at
app//org.apache.ignite.internal.sqllogic.ScriptContext.executeQuery(ScriptContext.java:87)
at
app//org.apache.ignite.internal.sqllogic.Statement.execute(Statement.java:110)
... 7 more
Caused by: java.util.concurrent.CompletionException:
org.apache.ignite.tx.TransactionException: IGN-REP-3
TraceId:0002850d-2b24-4356-a0dc-2c25af4202a1
org.apache.ignite.internal.replicator.exception.ReplicationTimeoutException:
IGN-REP-3 TraceId:0002850d-2b24-4356-a0dc-2c25af4202a1 Replication is timed out
[replicaGrpId=291_part_3]
at
org.apache.ignite.internal.sql.api.SessionImpl.lambda$executeAsync$3(SessionImpl.java:208)
at
java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
at
java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
at
java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
at
java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
at
org.apache.ignite.internal.sql.engine.exec.rel.AsyncRootNode.lambda$closeAsync$0(AsyncRootNode.java:157)
at
java.base/java.util.concurrent.ConcurrentLinkedQueue.forEachFrom(ConcurrentLinkedQueue.java:1037)
at
java.base/java.util.concurrent.ConcurrentLinkedQueue.forEach(ConcurrentLinkedQueue.java:1054)
at
org.apache.ignite.internal.sql.engine.exec.rel.AsyncRootNode.closeAsync(AsyncRootNode.java:157)
at
org.apache.ignite.internal.sql.engine.exec.rel.AsyncRootNode.onError(AsyncRootNode.java:112)
at
org.apache.ignite.internal.sql.engine.exec.ExecutionServiceImpl$DistributedQueryManager.lambda$onError$2(ExecutionServiceImpl.java:521)
at
java.base/java.util.concurrent.CompletableFuture.uniAcceptNow(CompletableFuture.java:753)
at
java.base/java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:731)
at
java.base/java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2108)
at
org.apache.ignite.internal.sql.engine.exec.ExecutionServiceImpl$DistributedQueryManager.onError(ExecutionServiceImpl.java:520)
at
org.apache.ignite.internal.sql.engine.exec.ExecutionServiceImpl.onMessage(ExecutionServiceImpl.java:372)
at
org.apache.ignite.internal.sql.engine.exec.ExecutionServiceImpl.lambda$start$4(ExecutionServiceImpl.java:219)
at
org.apache.ignite.internal.sql.engine.message.MessageServiceImpl.onMessageInternal(MessageServiceImpl.java:139)
at
org.apache.ignite.internal.sql.engine.message.MessageServiceImpl.lambda$onMessage$1(MessageServiceImpl.java:110)
at
org.apache.ignite.internal.sql.engine.exec.QueryTaskExecutorImpl.lambda$execute$0(QueryTaskExecutorImpl.java:81)
... 3 more
Caused by: org.apache.ignite.tx.TransactionException: IGN-REP-3
TraceId:0002850d-2b24-4356-a0dc-2c25af4202a1
org.apache.ignite.internal.replicator.exception.ReplicationTimeoutException:
IGN-REP-3 TraceId:0002850d-2b24-4356-a0dc-2c25af4202a1 Replication is timed out
[replicaGrpId=291_part_3]
at
app//org.apache.ignite.internal.util.ExceptionUtils.lambda$withCause$0(ExceptionUtils.java:378)
at
app//org.apache.ignite.internal.util.ExceptionUtils.withCauseInternal(ExceptionUtils.java:461)
at
app//org.apache.ignite.internal.util.ExceptionUtils.withCause(ExceptionUtils.java:378)
at
app//org.apache.ignite.internal.tx.impl.IgniteAbstractTransactionImpl.rollback(IgniteAbstractTransactionImpl.java:91)
at
app//org.apache.ignite.internal.sql.engine.QueryTransactionWrapper.rollbackImplicit(QueryTransactionWrapper.java:56)
at
app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.lambda$requestNextAsync$0(AsyncSqlCursorImpl.java:74)
at
[email protected]/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
at
[email protected]/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
... 20 more
Caused by: java.util.concurrent.ExecutionException:
org.apache.ignite.internal.replicator.exception.ReplicationTimeoutException:
IGN-REP-3 TraceId:0002850d-2b24-4356-a0dc-2c25af4202a1 Replication is timed out
[replicaGrpId=291_part_3]
at
java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
at
java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999)
at
org.apache.ignite.internal.tx.impl.IgniteAbstractTransactionImpl.rollback(IgniteAbstractTransactionImpl.java:89)
... 24 more
Caused by:
org.apache.ignite.internal.replicator.exception.ReplicationTimeoutException:
IGN-REP-3 TraceId:0002850d-2b24-4356-a0dc-2c25af4202a1 Replication is timed out
[replicaGrpId=291_part_3]
at
app//org.apache.ignite.internal.replicator.ReplicaService.lambda$sendToReplica$3(ReplicaService.java:97)
at
[email protected]/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
at
[email protected]/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
at
[email protected]/java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:479)
at
[email protected]/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
at
[email protected]/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
at
[email protected]/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
at
[email protected]/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
at
[email protected]/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
{noformat}
The placement driver is initiated successfully, and the active actor is chosen.
Even the lease for the 291_part_3 group is accepted. But when the test code
tries to access the primary replica, it receives a timeout exception.
Similar issue with the 425_part_1. The group has not even started (I did not
find the masses in the log). And the test code cannot find the primary replica
through the placement driver API:
{noformat}
[2023-10-06T13:47:20,009][ERROR][sqllogic1-srv-worker-12][FailureHandler]
Critical failure
org.apache.ignite.lang.IgniteException: Handshake rejected by client:
sqllogic1:6745409c-b237-45f0-80ad-66ddf52c24bc is stale, server should be
restarted so that clients can connect
at
org.apache.ignite.internal.network.recovery.RecoveryServerHandshakeManager.onMessage(RecoveryServerHandshakeManager.java:197)
[ignite-network-3.0.0-SNAPSHOT.jar:?]
at
org.apache.ignite.internal.network.netty.HandshakeHandler.channelRead(HandshakeHandler.java:92)
[ignite-network-3.0.0-SNAPSHOT.jar:?]
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at
io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346)
[netty-codec-4.1.87.Final.jar:4.1.87.Final]
at
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318)
[netty-codec-4.1.87.Final.jar:4.1.87.Final]
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
[netty-transport-4.1.87.Final.jar:4.1.87.Final]
at
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
[netty-common-4.1.87.Final.jar:4.1.87.Final]
at
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
[netty-common-4.1.87.Final.jar:4.1.87.Final]
at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
[netty-common-4.1.87.Final.jar:4.1.87.Final]
at java.lang.Thread.run(Thread.java:834) [?:?]
at
app//org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:42)
at app//org.junit.jupiter.api.Assertions.fail(Assertions.java:147)
at
app//org.apache.ignite.internal.sqllogic.Statement.execute(Statement.java:112)
at
app//org.apache.ignite.internal.sqllogic.SqlScriptRunner.run(SqlScriptRunner.java:70)
at
app//org.junit.jupiter.api.AssertTimeoutPreemptively.lambda$assertTimeoutPreemptively$0(AssertTimeoutPreemptively.java:48)
at
app//org.junit.jupiter.api.AssertTimeoutPreemptively.lambda$submitTask$3(AssertTimeoutPreemptively.java:95)
at
[email protected]/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at
[email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at
[email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at [email protected]/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.ignite.lang.IgniteException: IGN-CMN-65535
TraceId:aa32c22f-fc1b-4d04-8191-3753368c62a8 The primary replica await timed
out [replicationGroupId=425_part_1, referenceTimestamp=HybridTimestamp
[time=111188380104982529, physical=1696600038223, logical=1], currentLease=null]
at
[email protected]/java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:710)
at
app//org.apache.ignite.internal.util.ExceptionUtils$1.copy(ExceptionUtils.java:772)
at
app//org.apache.ignite.internal.util.ExceptionUtils$ExceptionFactory.createCopy(ExceptionUtils.java:706)
at
app//org.apache.ignite.internal.util.ExceptionUtils.copyExceptionWithCause(ExceptionUtils.java:543)
at
app//org.apache.ignite.internal.util.ExceptionUtils.copyExceptionWithCauseInternal(ExceptionUtils.java:641)
at
app//org.apache.ignite.internal.util.ExceptionUtils.copyExceptionWithCause(ExceptionUtils.java:494)
at
app//org.apache.ignite.internal.sql.AbstractSession.execute(AbstractSession.java:63)
at
app//org.apache.ignite.internal.sqllogic.ScriptContext.executeQuery(ScriptContext.java:87)
at
app//org.apache.ignite.internal.sqllogic.Statement.execute(Statement.java:110)
... 7 more
Caused by: java.util.concurrent.CompletionException:
org.apache.ignite.lang.IgniteException: IGN-CMN-65535
TraceId:aa32c22f-fc1b-4d04-8191-3753368c62a8 The primary replica await timed
out [replicationGroupId=425_part_1, referenceTimestamp=HybridTimestamp
[time=111188380104982529, physical=1696600038223, logical=1], currentLease=null]
at
org.apache.ignite.internal.sql.api.SessionImpl.lambda$executeAsync$3(SessionImpl.java:208)
at
java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
at
java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
at
java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
at
java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
at
org.apache.ignite.internal.sql.engine.exec.ExecutionServiceImpl$DistributedQueryManager.lambda$execute$17(ExecutionServiceImpl.java:643)
at
java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
at
java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
at
java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
at
org.apache.ignite.internal.sql.engine.exec.QueryTaskExecutorImpl.lambda$execute$0(QueryTaskExecutorImpl.java:81)
... 3 more
Caused by: org.apache.ignite.lang.IgniteException: IGN-CMN-65535
TraceId:aa32c22f-fc1b-4d04-8191-3753368c62a8 The primary replica await timed
out [replicationGroupId=425_part_1, referenceTimestamp=HybridTimestamp
[time=111188380104982529, physical=1696600038223, logical=1], currentLease=null]
at
app//org.apache.ignite.internal.lang.IgniteExceptionMapperUtil.mapToPublicException(IgniteExceptionMapperUtil.java:110)
at
app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.wrapIfNecessary(AsyncSqlCursorImpl.java:100)
at
app//org.apache.ignite.internal.sql.engine.AsyncSqlCursorImpl.lambda$requestNextAsync$0(AsyncSqlCursorImpl.java:76)
at
[email protected]/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
at
[email protected]/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
... 10 more
Caused by:
org.apache.ignite.internal.placementdriver.PrimaryReplicaAwaitTimeoutException:
IGN-PLACEMENTDRIVER-1 TraceId:aa32c22f-fc1b-4d04-8191-3753368c62a8 The primary
replica await timed out [replicationGroupId=425_part_1,
referenceTimestamp=HybridTimestamp [time=111188380104982529,
physical=1696600038223, logical=1], currentLease=null]
at
app//org.apache.ignite.internal.placementdriver.leases.LeaseTracker.lambda$awaitPrimaryReplica$2(LeaseTracker.java:229)
at
[email protected]/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
at
[email protected]/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
at
[email protected]/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
at
[email protected]/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
at
[email protected]/java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2792)
at
[email protected]/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at
[email protected]/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at
[email protected]/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
... 3 more
Caused by: java.util.concurrent.TimeoutException
... 7 more
{noformat}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)