[jira] [Updated] (IGNITE-22187) Cluster of 2 or 3 nodes doesn't work if one node is down

Igor (Jira) Wed, 08 May 2024 02:00:42 -0700


     [ 
https://issues.apache.org/jira/browse/IGNITE-22187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Igor updated IGNITE-22187:
--------------------------
    Description: 
*Steps to reproduce:*
 # Create zone with replication equals to amount of nodes (2 or 3 corresponding)
 # Create 10 tables inside the zone.
 # Insert 100 rows in every table.
 # Await all tables*partitions*nodes local state is "HEALTHY"
 # Await all tables*partitions*nodes global state is "AVAILABLE"
 # Kill first node with kill -9.
 # Assert all tables*partitions*nodes local state is "HEALTHY"
 # Await all tables*partitions*nodes global state is "READ_ONLY" for 2 nodes 
cluster or "DEGRADED" for 3 nodes cluster,
 # Execute select query using JDBC connecting to the second node (which is 
alive).

*Expected:*

Data is returned.

*Actual:*
On the step 7 it returns error by REST API:
{code:java}
{"title":"Internal Server 
Error","status":500,"code":"IGN-RECOVERY-3","type":null,"detail":"io.netty.channel.AbstractChannel$AnnotatedConnectException:
 Connection refused: 
/172.120.6.2:3344","node":null,"traceId":"2acb52fc-3275-411b-a4de-45f14873f15c","invalidParams":null}{code}
In the server logs continuous errors:
{code:java}
2024-05-08 10:37:19:796 +0200 
[ERROR][%ClusterFailover3NodesTest_cluster_1%JRaft-StepDownTimer-9][AbstractClientService]
 Fail to connect ClusterFailover3NodesTest_cluster_0, exception: 
java.net.ConnectException.
2024-05-08 10:37:19:796 +0200 
[ERROR][%ClusterFailover3NodesTest_cluster_1%JRaft-StepDownTimer-9][ReplicatorGroupImpl]
 Fail to check replicator connection to 
peer=ClusterFailover3NodesTest_cluster_0, replicatorType=Follower.
2024-05-08 10:37:19:796 +0200 
[ERROR][%ClusterFailover3NodesTest_cluster_1%JRaft-StepDownTimer-12][AbstractClientService]
 Fail to connect ClusterFailover3NodesTest_cluster_0, exception: 
java.net.ConnectException.
2024-05-08 10:37:19:796 +0200 
[ERROR][%ClusterFailover3NodesTest_cluster_1%JRaft-StepDownTimer-12][ReplicatorGroupImpl]
 Fail to check replicator connection to 
peer=ClusterFailover3NodesTest_cluster_0, replicatorType=Follower. {code}
If skip steps 7 and 8, then the exception on step 9 occurs:
{code:java}
java.sql.SQLException: Unable to send fragment 
[targetNode=ClusterFailover3NodesTest_cluster_0, fragmentId=1, 
cause=io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection 
refused: no further information: /192.168.100.5:3344]
    at 
org.apache.ignite.internal.jdbc.proto.IgniteQueryErrorCode.createJdbcSqlException(IgniteQueryErrorCode.java:57)
    at 
org.apache.ignite.internal.jdbc.JdbcStatement.execute0(JdbcStatement.java:154)
    at 
org.apache.ignite.internal.jdbc.JdbcStatement.executeQuery(JdbcStatement.java:111)
    at 
org.gridgain.ai3tests.tests.teststeps.JdbcSteps.executeQuery(JdbcSteps.java:91)
    at 
org.gridgain.ai3tests.tests.failover.ClusterFailoverTestBase.getActualResult(ClusterFailoverTestBase.java:336)
    at 
org.gridgain.ai3tests.tests.failover.ClusterFailoverTestBase.assertDataIsFilledWithoutErrors(ClusterFailoverTestBase.java:154)
    at 
org.gridgain.ai3tests.tests.failover.ClusterFailover3NodesTest.singleKillAndCheckOtherNodeWorks(ClusterFailover3NodesTest.java:96)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834) {code}

  was:
*Steps to reproduce:*
 # Create zone with replication equals to amount of nodes (2 or 3 corresponding)
 # Create 10 tables inside the zone.
 # Insert 100 rows in every table.
 # Await all tables*partitions*nodes local state is "HEALTHY"
 # Await all tables*partitions*nodes global state is "AVAILABLE"
 # Kill first node with kill -9.
 # Assert all tables*partitions*nodes local state is "HEALTHY"
 # Await all tables*partitions*nodes global state is "READ_ONLY" for 2 nodes 
cluster or "DEGRADED" for 3 nodes cluster,
 # Execute select query using JDBC connecting to the second node (which is 
alive).

*Expected:*

Data is returned.

*Actual:*
The select query at step 9 freezes forever.
The errors on the server side:
{code:java}
2024-04-30 00:04:02:965 +0200 
[ERROR][%ClusterFailover3NodesTest_cluster_1%JRaft-StepDownTimer-8][AbstractClientService]
 Fail to connect ClusterFailover3NodesTest_cluster_0, exception: 
java.util.concurrent.TimeoutException.
2024-04-30 00:04:02:965 +0200 
[ERROR][%ClusterFailover3NodesTest_cluster_1%JRaft-StepDownTimer-8][ReplicatorGroupImpl]
 Fail to check replicator connection to 
peer=ClusterFailover3NodesTest_cluster_0, replicatorType=Follower.
2024-04-30 00:04:02:980 +0200 
[ERROR][%ClusterFailover3NodesTest_cluster_1%JRaft-Response-Processor-1][AbstractClientService]
 Fail to connect ClusterFailover3NodesTest_cluster_0, exception: 
java.util.concurrent.TimeoutException.
2024-04-30 00:04:02:980 +0200 
[ERROR][%ClusterFailover3NodesTest_cluster_1%JRaft-Response-Processor-1][ReplicatorGroupImpl]
 Fail to check replicator connection to 
peer=ClusterFailover3NodesTest_cluster_0, replicatorType=Follower.
2024-04-30 00:04:02:981 +0200 
[ERROR][%ClusterFailover3NodesTest_cluster_1%JRaft-Response-Processor-1][NodeImpl]
 Fail to add a replicator, peer=ClusterFailover3NodesTest_cluster_0.
2024-04-30 00:04:02:981 +0200 
[WARNING][ClusterFailover3NodesTest_cluster_1-client-8][RaftGroupServiceImpl] 
Recoverable error during the request occurred (will be retried on the randomly 
selected node) [request=WriteActionRequestImpl [command=[0, 9, 41, -117, -128, 
-8, -15, -83, -4, -54, -57, 1], deserializedCommand=SafeTimeSyncCommandImpl 
[safeTimeLong=112356769098760202], groupId=26_part_10], peer=Peer 
[consistentId=ClusterFailover3NodesTest_cluster_0, idx=0], newPeer=Peer 
[consistentId=ClusterFailover3NodesTest_cluster_1, idx=0]].
java.util.concurrent.CompletionException: 
io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: 
no further information: /192.168.100.5:3344
  at 
java.base/java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:367)
  at 
java.base/java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:376)
  at 
java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1074)
  at 
java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
  at 
java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
  at 
org.apache.ignite.internal.network.netty.NettyUtils.lambda$toCompletableFuture$0(NettyUtils.java:74)
  at 
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:590)
  at 
io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:583)
  at 
io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:559)
  at 
io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:492)
  at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:636)
  at 
io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:629)
  at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:118)
  at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:326)
  at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:342)
  at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
  at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
  at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
  at 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
  at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
  at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
  at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: 
Connection refused: no further information: /192.168.100.5:3344
Caused by: java.net.ConnectException: Connection refused: no further information
  at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
  at 
java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:779)
  at 
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
  at 
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:339)
  at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
  at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
  at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
  at 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
  at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
  at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
  at java.base/java.lang.Thread.run(Thread.java:834)
2024-04-30 00:04:02:982 +0200 
[ERROR][%ClusterFailover3NodesTest_cluster_1%JRaft-StepDownTimer-0][AbstractClientService]
 Fail to connect ClusterFailover3NodesTest_cluster_0, exception: 
java.util.concurrent.TimeoutException. {code}


> Cluster of 2 or 3 nodes doesn't work if one node is down
> --------------------------------------------------------
>
>                 Key: IGNITE-22187
>                 URL: https://issues.apache.org/jira/browse/IGNITE-22187
>             Project: Ignite
>          Issue Type: Bug
>          Components: general, jdbc, networking, persistence
>    Affects Versions: 3.0.0-beta1
>         Environment: The 2 or 3 nodes cluster running locally.
>            Reporter: Igor
>            Priority: Major
>              Labels: ignite-3
>
> *Steps to reproduce:*
>  # Create zone with replication equals to amount of nodes (2 or 3 
> corresponding)
>  # Create 10 tables inside the zone.
>  # Insert 100 rows in every table.
>  # Await all tables*partitions*nodes local state is "HEALTHY"
>  # Await all tables*partitions*nodes global state is "AVAILABLE"
>  # Kill first node with kill -9.
>  # Assert all tables*partitions*nodes local state is "HEALTHY"
>  # Await all tables*partitions*nodes global state is "READ_ONLY" for 2 nodes 
> cluster or "DEGRADED" for 3 nodes cluster,
>  # Execute select query using JDBC connecting to the second node (which is 
> alive).
> *Expected:*
> Data is returned.
> *Actual:*
> On the step 7 it returns error by REST API:
> {code:java}
> {"title":"Internal Server 
> Error","status":500,"code":"IGN-RECOVERY-3","type":null,"detail":"io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: 
> /172.120.6.2:3344","node":null,"traceId":"2acb52fc-3275-411b-a4de-45f14873f15c","invalidParams":null}{code}
> In the server logs continuous errors:
> {code:java}
> 2024-05-08 10:37:19:796 +0200 
> [ERROR][%ClusterFailover3NodesTest_cluster_1%JRaft-StepDownTimer-9][AbstractClientService]
>  Fail to connect ClusterFailover3NodesTest_cluster_0, exception: 
> java.net.ConnectException.
> 2024-05-08 10:37:19:796 +0200 
> [ERROR][%ClusterFailover3NodesTest_cluster_1%JRaft-StepDownTimer-9][ReplicatorGroupImpl]
>  Fail to check replicator connection to 
> peer=ClusterFailover3NodesTest_cluster_0, replicatorType=Follower.
> 2024-05-08 10:37:19:796 +0200 
> [ERROR][%ClusterFailover3NodesTest_cluster_1%JRaft-StepDownTimer-12][AbstractClientService]
>  Fail to connect ClusterFailover3NodesTest_cluster_0, exception: 
> java.net.ConnectException.
> 2024-05-08 10:37:19:796 +0200 
> [ERROR][%ClusterFailover3NodesTest_cluster_1%JRaft-StepDownTimer-12][ReplicatorGroupImpl]
>  Fail to check replicator connection to 
> peer=ClusterFailover3NodesTest_cluster_0, replicatorType=Follower. {code}
> If skip steps 7 and 8, then the exception on step 9 occurs:
> {code:java}
> java.sql.SQLException: Unable to send fragment 
> [targetNode=ClusterFailover3NodesTest_cluster_0, fragmentId=1, 
> cause=io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection 
> refused: no further information: /192.168.100.5:3344]
>     at 
> org.apache.ignite.internal.jdbc.proto.IgniteQueryErrorCode.createJdbcSqlException(IgniteQueryErrorCode.java:57)
>     at 
> org.apache.ignite.internal.jdbc.JdbcStatement.execute0(JdbcStatement.java:154)
>     at 
> org.apache.ignite.internal.jdbc.JdbcStatement.executeQuery(JdbcStatement.java:111)
>     at 
> org.gridgain.ai3tests.tests.teststeps.JdbcSteps.executeQuery(JdbcSteps.java:91)
>     at 
> org.gridgain.ai3tests.tests.failover.ClusterFailoverTestBase.getActualResult(ClusterFailoverTestBase.java:336)
>     at 
> org.gridgain.ai3tests.tests.failover.ClusterFailoverTestBase.assertDataIsFilledWithoutErrors(ClusterFailoverTestBase.java:154)
>     at 
> org.gridgain.ai3tests.tests.failover.ClusterFailover3NodesTest.singleKillAndCheckOtherNodeWorks(ClusterFailover3NodesTest.java:96)
>     at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>     at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at java.base/java.lang.Thread.run(Thread.java:834) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (IGNITE-22187) Cluster of 2 or 3 nodes doesn't work if one node is down

Reply via email to