[
https://issues.apache.org/jira/browse/IGNITE-7212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292472#comment-16292472
]
Alexey Goncharuk commented on IGNITE-7212:
------------------------------------------
The issue is at
org/apache/ignite/spi/communication/tcp/TcpCommunicationSpi.java:3110
The code loops forever at that line. This is a regression introduced by IGNITE-6639.
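The code at that line is not quoted in this message, so the following is only a generic sketch of the failure mode, not the actual TcpCommunicationSpi logic or the fix. It assumes a reconnect loop that keeps retrying after "Connection refused" and shows one way such a loop could terminate: checking whether the remote node is still alive before each attempt, and capping the number of attempts. All names (ReconnectSketch, reconnect, tryConnect, nodeAlive, maxAttempts) are hypothetical.

```java
import java.util.function.BooleanSupplier;

// Hypothetical illustration only; none of these names come from TcpCommunicationSpi.
class ReconnectSketch {
    enum Result { CONNECTED, NODE_LEFT, GAVE_UP }

    /**
     * Tries to reconnect until the connection succeeds, the remote node is
     * reported as gone, or maxAttempts is exhausted.
     */
    static Result reconnect(BooleanSupplier tryConnect, BooleanSupplier nodeAlive, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            // Exit condition: stop retrying once the node is no longer alive
            // (for example, after discovery has reported it as failed).
            if (!nodeAlive.getAsBoolean())
                return Result.NODE_LEFT;

            if (tryConnect.getAsBoolean())
                return Result.CONNECTED;
        }

        // Bounded retries guarantee termination even if liveness is never updated.
        return Result.GAVE_UP;
    }

    public static void main(String[] args) {
        // Connection is always refused and the node has already failed.
        System.out.println(reconnect(() -> false, () -> false, 10)); // prints NODE_LEFT
    }
}
```

Without either exit condition, a loop of this shape would spin indefinitely once the remote node is killed, which would be consistent with the repeated reconnect messages in the quoted log.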
> Load stopped after server node kill
> -----------------------------------
>
> Key: IGNITE-7212
> URL: https://issues.apache.org/jira/browse/IGNITE-7212
> Project: Ignite
> Issue Type: Bug
> Components: general
> Affects Versions: 2.4
> Reporter: Ilya Suntsov
> Assignee: Alexey Goncharuk
> Priority: Critical
> Attachments: cfg_log_master_1.zip
>
>
> Scenario:
> * Start 4 servers
> * Start load tasks on 5 clients
> * Kill 1 server
> * Wait for rebalancing
> * Kill 1 server
> Result:
> After the second server node is killed, the load stops.
> In the server logs I see messages like this:
> {noformat}
> [2017-12-15 11:23:50][DEBUG][grid-nio-worker-tcp-comm-0-#41] Remote client
> closed connection: GridSelectorNioSessionImpl [worker=DirectNioClientWorker
> [super=AbstractNioClientWorker [idx=0, bytesRcvd=130952565,
> bytesSent=131203245, bytesRcvd0=3069200, bytesSent0=3068083, select=true,
> super=GridWorker [name=grid-nio-worker-tcp-comm-0, igniteInstanceName=null,
> finished=false, hashCode=1748650517, interrupted=false,
> runner=grid-nio-worker-tcp-comm-0-#41]]],
> writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
> readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
> inRecovery=GridNioRecoveryDescriptor [acked=1024, resendCnt=0, rcvCnt=1026,
> sentCnt=1029, reserved=true, lastAck=1024, nodeLeft=false,
> node=TcpDiscoveryNode [id=b7cfaa4e-b3b7-4485-a421-c731d9ed869d,
> addrs=[127.0.0.1, 172.31.20.3],
> sockAddrs=[ip-172-31-20-3.us-east-2.compute.internal/172.31.20.3:47500,
> /127.0.0.1:47500], discPort=47500, order=1, intOrder=1,
> lastExchangeTime=1513335739604, loc=false, ver=2.4.0#20171214-sha1:da782958,
> isClient=false], connected=false, connectCnt=1, queueLimit=4096,
> reserveCnt=1, pairedConnections=false], outRecovery=GridNioRecoveryDescriptor
> [acked=1024, resendCnt=0, rcvCnt=1026, sentCnt=1029, reserved=true,
> lastAck=1024, nodeLeft=false, node=TcpDiscoveryNode
> [id=b7cfaa4e-b3b7-4485-a421-c731d9ed869d, addrs=[127.0.0.1, 172.31.20.3],
> sockAddrs=[ip-172-31-20-3.us-east-2.compute.internal/172.31.20.3:47500,
> /127.0.0.1:47500], discPort=47500, order=1, intOrder=1,
> lastExchangeTime=1513335739604, loc=false, ver=2.4.0#20171214-sha1:da782958,
> isClient=false], connected=false, connectCnt=1, queueLimit=4096,
> reserveCnt=1, pairedConnections=false], super=GridNioSessionImpl
> [locAddr=/172.31.23.220:41732,
> rmtAddr=ip-172-31-20-3.us-east-2.compute.internal/172.31.20.3:47100,
> createTime=1513335774008, closeTime=0, bytesSent=131203245,
> bytesRcvd=130952565, bytesSent0=3068083, bytesRcvd0=3069200,
> sndSchedTime=1513335774008, lastSndTime=1513337029027,
> lastRcvTime=1513337029027, readsPaused=false,
> filterChain=FilterChain[filters=[GridNioCodecFilter
> [parser=org.apache.ignite.internal.util.nio.GridDirectParser@11ae7d3b,
> directMode=true], GridConnectionBytesVerifyFilter], accepted=false]]
> [2017-12-15 11:23:50][WARN ][tcp-disco-msg-worker-#2] Failed to send message
> to next node [msg=TcpDiscoveryConnectionCheckMessage
> [super=TcpDiscoveryAbstractMessage [sndNodeId=null,
> id=6c7f6d95061-c3cf9fe4-ab13-4d95-ace3-84a54cd73e08, verifierNodeId=null,
> topVer=0, pendingIdx=0, failedNodes=null, isClient=false]],
> next=TcpDiscoveryNode [id=b7cfaa4e-b3b7-4485-a421-c731d9ed869d,
> addrs=[127.0.0.1, 172.31.20.3],
> sockAddrs=[ip-172-31-20-3.us-east-2.compute.internal/172.31.20.3:47500,
> /127.0.0.1:47500], discPort=47500, order=1, intOrder=1,
> lastExchangeTime=1513335739604, loc=false, ver=2.4.0#20171214-sha1:da782958,
> isClient=false], errMsg=Failed to send message to next node
> [msg=TcpDiscoveryConnectionCheckMessage [super=TcpDiscoveryAbstractMessage
> [sndNodeId=null, id=6c7f6d95061-c3cf9fe4-ab13-4d95-ace3-84a54cd73e08,
> verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null,
> isClient=false]], next=ClusterNode [id=b7cfaa4e-b3b7-4485-a421-c731d9ed869d,
> order=1, addr=[127.0.0.1, 172.31.20.3], daemon=false]]]
> [2017-12-15 11:23:50][DEBUG][grid-nio-worker-tcp-comm-0-#41] Session was
> closed but there are unacknowledged messages, will try to reconnect
> [rmtNode=b7cfaa4e-b3b7-4485-a421-c731d9ed869d]
> [2017-12-15 11:23:50][DEBUG][tcp-comm-worker-#1] Recovery reconnect
> [rmtNode=b7cfaa4e-b3b7-4485-a421-c731d9ed869d]
> [2017-12-15 11:23:50][DEBUG][tcp-comm-worker-#1] Creating NIO client to node:
> TcpDiscoveryNode [id=b7cfaa4e-b3b7-4485-a421-c731d9ed869d, addrs=[127.0.0.1,
> 172.31.20.3],
> sockAddrs=[ip-172-31-20-3.us-east-2.compute.internal/172.31.20.3:47500,
> /127.0.0.1:47500], discPort=47500, order=1, intOrder=1,
> lastExchangeTime=1513335739604, loc=false, ver=2.4.0#20171214-sha1:da782958,
> isClient=false]
> [2017-12-15 11:23:50][DEBUG][tcp-comm-worker-#1] Addresses resolved from
> attributes [rmtNode=b7cfaa4e-b3b7-4485-a421-c731d9ed869d,
> addrs=[ip-172-31-20-3.us-east-2.compute.internal/172.31.20.3:47100,
> /127.0.0.1:47100], isRmtAddrsExist=true]
> [2017-12-15 11:23:50][DEBUG][tcp-comm-worker-#1] Client creation failed
> [addr=ip-172-31-20-3.us-east-2.compute.internal/172.31.20.3:47100,
> err=java.net.ConnectException: Connection refused]
> [2017-12-15 11:23:50][WARN ][tcp-comm-worker-#1] Connect timed out (consider
> increasing 'failureDetectionTimeout' configuration property)
> [addr=ip-172-31-20-3.us-east-2.compute.internal/172.31.20.3:47100,
> failureDetectionTimeout=10000]
> [2017-12-15 11:23:50][WARN ][disco-event-worker-#61] Node FAILED:
> TcpDiscoveryNode [id=b7cfaa4e-b3b7-4485-a421-c731d9ed869d, addrs=[127.0.0.1,
> 172.31.20.3],
> sockAddrs=[ip-172-31-20-3.us-east-2.compute.internal/172.31.20.3:47500,
> /127.0.0.1:47500], discPort=47500, order=1, intOrder=1,
> lastExchangeTime=1513335739604, loc=false, ver=2.4.0#20171214-sha1:da782958,
> isClient=false]
> [2017-12-15 11:23:50][DEBUG][tcp-comm-worker-#1] Skipping local address
> [addr=/127.0.0.1:47100, locAddrs=[172.31.20.3, 127.0.0.1],
> node=TcpDiscoveryNode [id=b7cfaa4e-b3b7-4485-a421-c731d9ed869d,
> addrs=[127.0.0.1, 172.31.20.3],
> sockAddrs=[ip-172-31-20-3.us-east-2.compute.internal/172.31.20.3:47500,
> /127.0.0.1:47500], discPort=47500, order=1, intOrder=1,
> lastExchangeTime=1513335739604, loc=false, ver=2.4.0#20171214-sha1:da782958,
> isClient=false]]
> [2017-12-15 11:23:50][DEBUG][tcp-comm-worker-#1] Skipping local address
> [addr=/127.0.0.1:47100, locAddrs=[172.31.20.3, 127.0.0.1],
> node=TcpDiscoveryNode [id=b7cfaa4e-b3b7-4485-a421-c731d9ed869d,
> addrs=[127.0.0.1, 172.31.20.3],
> sockAddrs=[ip-172-31-20-3.us-east-2.compute.internal/172.31.20.3:47500,
> /127.0.0.1:47500], discPort=47500, order=1, intOrder=1,
> lastExchangeTime=1513335739604, loc=false, ver=2.4.0#20171214-sha1:da782958,
> isClient=false]]
> {noformat}
> Logs and configs are attached to this ticket.