Alexey Kukushkin created IGNITE-15184:
-----------------------------------------

             Summary: Ignite server cannot reconnect client: Close incoming 
connection, unknown node
                 Key: IGNITE-15184
                 URL: https://issues.apache.org/jira/browse/IGNITE-15184
             Project: Ignite
          Issue Type: Bug
          Components: general
    Affects Versions: 2.9.1, 2.7
         Environment: Reproduced on AI 2.7 and 2.9.1

Client OS: Windows Server 2012 R2 6.3 amd64

Server OS: Linux 3.10.0-1062.9.1.el7.x86_64 amd64

See the Ignite configuration in the client and server logs in the Description
            Reporter: Alexey Kukushkin


Ignite client nodes eventually lose connectivity with the cluster and never 
reconnect. The below logs show the server rejects the client reconnection with 
a warning "{color:#de350b}Close incoming connection, unknown node{color}".

It is possible the problem was caused by a severely degraded network 
throughput. However, the server should keep the client session for 
"clientFailureDetectionTimeout" (10 seconds by default) and the client should 
keep trying to re-connect automatically during that time. Instead, we see the 
server immediately says "Close incoming connection, unknown node" and the 
client never re-connects.

Ignite should survive temporary network problems (with duration less than 
clientFailureDetectionTimeout).

 

See the server and client errors and configurations in the logs below.
h3. Server

{{[05:33:00,946][INFO][main][IgniteKernal%StaticGrid] IgniteConfiguration 
[igniteInstanceName=StaticGrid, pubPoolSize=16, svcPoolSize=16, 
callbackPoolSize=16, stripedPoolSize=16, sysPoolSize=16, mgmtPoolSize=4, 
dataStreamerPoolSize=16, utilityCachePoolSize=16, 
utilityCacheKeepAliveTime=60000, p2pPoolSize=2, qryPoolSize=16, 
buildIdxPoolSize=4, igniteHome=/ignite/apache-ignite-2.9.1-bin, 
igniteWorkDir=/ignite/apache-ignite-2.9.1-bin/work, 
mbeanSrv=com.sun.jmx.mbeanserver.JmxMBeanServer@6f94fa3e, 
nodeId=e5845e09-df08-4b85-b633-f14dec8f8106, marsh=BinaryMarshaller [], 
marshLocJobs=false, daemon=false, p2pEnabled=false, netTimeout=5000, 
netCompressionLevel=1, sndRetryDelay=1000, sndRetryCnt=3, 
metricsHistSize=10000, metricsUpdateFreq=2000, 
metricsExpTime=9223372036854775807, discoSpi=TcpDiscoverySpi [addrRslvr=null, 
sockTimeout=0, ackTimeout=0, marsh=null, reconCnt=10, reconDelay=2000, 
maxAckTimeout=600000, soLinger=0, forceSrvMode=false, 
clientReconnectDisabled=false, internalLsnr=null, 
skipAddrsRandomization=false], segPlc=RESTART_JVM, segResolveAttempts=10, 
waitForSegOnStart=true, allResolversPassReq=true, segChkFreq=20000, 
commSpi=TcpCommunicationSpi [connectGate=null, 
connPlc=org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$FirstConnectionPolicy@62ddbd7e,
 chConnPlc=null, enableForcibleNodeKill=false, enableTroubleshootingLog=false, 
locAddr=null, locHost=null, locPort=51605, locPortRange=100, shmemPort=-1, 
directBuf=true, directSndBuf=false, idleConnTimeout=600000, connTimeout=5000, 
maxConnTimeout=600000, reconCnt=10, sockSndBuf=32768, sockRcvBuf=32768, 
msgQueueLimit=0, slowClientQueueLimit=0, nioSrvr=null, shmemSrv=null, 
usePairedConnections=false, connectionsPerNode=1, tcpNoDelay=true, 
filterReachableAddresses=false, ackSndThreshold=32, unackedMsgsBufSize=0, 
sockWriteTimeout=2000, boundTcpPort=-1, boundTcpShmemPort=-1, selectorsCnt=8, 
selectorSpins=0, forceClientToSrvConnections=false, addrRslvr=null, 
ctxInitLatch=java.util.concurrent.CountDownLatch@74e52ef6[Count = 1], 
stopping=false, metricsLsnr=null, connectionRequestor=null], 
evtSpi=org.apache.ignite.spi.eventstorage.NoopEventStorageSpi@6ed3ccb2, 
colSpi=NoopCollisionSpi [], deploySpi=LocalDeploymentSpi [], 
indexingSpi=org.apache.ignite.spi.indexing.noop.NoopIndexingSpi@3c9754d8, 
addrRslvr=null, 
encryptionSpi=org.apache.ignite.spi.encryption.noop.NoopEncryptionSpi@3bf7ca37, 
tracingSpi=org.apache.ignite.spi.tracing.NoopTracingSpi@79079097, 
clientMode=false, rebalanceThreadPoolSize=4, rebalanceTimeout=10000, 
rebalanceBatchesPrefetchCnt=3, rebalanceThrottle=0, rebalanceBatchSize=524288, 
txCfg=TransactionConfiguration [txSerEnabled=false, 
dfltIsolation=REPEATABLE_READ, dfltConcurrency=PESSIMISTIC, dfltTxTimeout=0, 
txTimeoutOnPartitionMapExchange=0, deadlockTimeout=10000, 
pessimisticTxLogSize=0, pessimisticTxLogLinger=10000, tmLookupClsName=null, 
txManagerFactory=null, useJtaSync=false], cacheSanityCheckEnabled=true, 
discoStartupDelay=60000, deployMode=SHARED, p2pMissedCacheSize=100, 
locHost=10.212.120.66, timeSrvPortBase=31100, timeSrvPortRange=100, 
failureDetectionTimeout=120000, sysWorkerBlockedTimeout=null, 
clientFailureDetectionTimeout=120000, metricsLogFreq=60000, 
connectorCfg=ConnectorConfiguration [jettyPath=null, host=null, port=11211, 
noDelay=true, directBuf=false, sndBufSize=32768, rcvBufSize=32768, 
idleQryCurTimeout=600000, idleQryCurCheckFreq=60000, sndQueueLimit=0, 
selectorCnt=4, idleTimeout=7000, sslEnabled=false, sslClientAuth=false, 
sslCtxFactory=null, sslFactory=null, portRange=100, threadPoolSize=16, 
msgInterceptor=null], odbcCfg=null, warmupClos=null, 
atomicCfg=AtomicConfiguration [seqReserveSize=1000, cacheMode=PARTITIONED, 
backups=1, aff=null, grpName=null], classLdr=null, sslCtxFactory=null, 
platformCfg=null, binaryCfg=null, memCfg=null, pstCfg=null, 
dsCfg=DataStorageConfiguration [sysRegionInitSize=41943040, 
sysRegionMaxSize=104857600, pageSize=0, concLvl=0, 
dfltDataRegConf=DataRegionConfiguration [name=Static_Data_Region, 
maxSize=34359738368, initSize=1073741824, swapPath=null, 
pageEvictionMode=DISABLED, evictionThreshold=0.9, emptyPagesPoolSize=100, 
metricsEnabled=false, metricsSubIntervalCount=5, metricsRateTimeInterval=60000, 
persistenceEnabled=false, checkpointPageBufSize=0, lazyMemoryAllocation=true], 
dataRegions=null, storagePath=null, checkpointFreq=180000, lockWaitTime=10000, 
checkpointThreads=4, checkpointWriteOrder=SEQUENTIAL, walHistSize=20, 
maxWalArchiveSize=1073741824, walSegments=10, walSegmentSize=67108864, 
walPath=db/wal, walArchivePath=db/wal/archive, metricsEnabled=false, 
walMode=LOG_ONLY, walTlbSize=131072, walBuffSize=0, walFlushFreq=2000, 
walFsyncDelay=1000, walRecordIterBuffSize=67108864, alwaysWriteFullPages=false, 
fileIOFactory=org.apache.ignite.internal.processors.cache.persistence.file.AsyncFileIOFactory@2925bf5b,
 metricsSubIntervalCnt=5, metricsRateTimeInterval=60000, 
walAutoArchiveAfterInactivity=-1, writeThrottlingEnabled=false, 
walCompactionEnabled=false, walCompactionLevel=1, 
checkpointReadLockTimeout=null, walPageCompression=DISABLED, 
walPageCompressionLevel=null], snapshotPath=snapshots, activeOnStart=true, 
activeOnStartPropSetFlag=false, autoActivation=true, 
autoActivationPropSetFlag=false, clusterStateOnStart=null, sqlConnCfg=null, 
cliConnCfg=ClientConnectorConfiguration [host=null, port=10800, portRange=100, 
sockSndBufSize=0, sockRcvBufSize=0, tcpNoDelay=true, maxOpenCursorsPerConn=128, 
threadPoolSize=16, idleTimeout=0, handshakeTimeout=10000, jdbcEnabled=true, 
odbcEnabled=true, thinCliEnabled=true, sslEnabled=false, 
useIgniteSslCtxFactory=true, sslClientAuth=false, sslCtxFactory=null, 
thinCliCfg=ThinClientConfiguration [maxActiveTxPerConn=100, 
maxActiveComputeTasksPerConn=0]], mvccVacuumThreadCnt=2, mvccVacuumFreq=5000, 
authEnabled=false, failureHnd=null, commFailureRslvr=null, 
sqlCfg=SqlConfiguration [longQryWarnTimeout=4000, dfltQryTimeout=0, 
sqlQryHistSize=1000, validationEnabled=false]]}}
{{...}}
{{[15:06:29,740][WARNING][tcp-disco-msg-worker-[8817efc3 10.212.120.67:51500 
crd]-#2%StaticGrid%-#62%StaticGrid%][TcpDiscoverySpi] Failing client node due 
to not receiving metrics updates from client node within 
'IgniteConfiguration.clientFailureDetectionTimeout' (consider increasing 
configuration property) [timeout=120000, node=TcpDiscoveryNode 
[id=841c8962-775e-4109-bc74-2c003598a02d, 
consistentId=A9428E28-58B8-4860-B038-9F2B392D5FC4, addrs=ArrayList 
[10.212.120.189], sockAddrs=HashSet [AX07079/10.212.120.189:0], discPort=0, 
order=276, intOrder=140, lastExchangeTime=1613261110451, loc=false, 
ver=2.9.1#20201203-sha1:adcce517, isClient=true]]}}
{{[15:06:29,743][WARNING][disco-event-worker-#64%StaticGrid%][GridDiscoveryManager]
 Node FAILED: TcpDiscoveryNode [id=841c8962-775e-4109-bc74-2c003598a02d, 
consistentId=A9428E28-58B8-4860-B038-9F2B392D5FC4, addrs=ArrayList 
[10.212.120.189], sockAddrs=HashSet [AX07079/10.212.120.189:0], discPort=0, 
order=276, intOrder=140, lastExchangeTime=1613261110451, loc=false, 
ver=2.9.1#20201203-sha1:adcce517, isClient=true]}}
{{[15:06:29,744][INFO][disco-event-worker-#64%StaticGrid%][GridDiscoveryManager]
 Topology snapshot [ver=290, locNode=e5845e09, servers=2, clients=14, 
state=ACTIVE, CPUs=188, offheap=64.0GB, heap=97.0GB]}}
{{[15:06:29,744][INFO][disco-event-worker-#64%StaticGrid%][GridDiscoveryManager]
 ^-- Baseline [id=0, size=2, online=2, offline=0]}}
{{[15:06:29,747][INFO][grid-nio-worker-tcp-comm-7-#46%StaticGrid%][TcpCommunicationSpi]
 Accepted incoming communication connection [locAddr=/10.212.120.66:51605, 
rmtAddr=/10.212.120.189:62773]}}
{{[15:06:29,747][INFO][exchange-worker-#66%StaticGrid%][time] Started exchange 
init [topVer=AffinityTopologyVersion [topVer=290, minorTopVer=0], crd=true, 
evt=NODE_FAILED, evtNode=841c8962-775e-4109-bc74-2c003598a02d, customEvt=null, 
allowMerge=true, exchangeFreeSwitch=false]}}
{{{color:#de350b}[15:06:29,748][WARNING][grid-nio-worker-tcp-comm-7-#46%StaticGrid%][TcpCommunicationSpi]
 Close incoming connection, unknown node 
[nodeId=841c8962-775e-4109-bc74-2c003598a02d, ses=GridSelectorNioSessionImpl 
[worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=7, 
bytesRcvd=129090683, bytesSent=3949813526, bytesRcvd0=42, bytesSent0=18, 
select=true, super=GridWorker [name=grid-nio-worker-tcp-comm-7, 
igniteInstanceName=StaticGrid, finished=false, heartbeatTs=1613315189741, 
hashCode=1034460111, interrupted=false, 
runner=grid-nio-worker-tcp-comm-7-#46%StaticGrid%]]], 
writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], 
readBuf=java.nio.DirectByteBuffer[pos=38 lim=38 cap=32768], inRecovery=null, 
outRecovery=null, closeSocket=true, 
outboundMessagesQueueSizeMetric=o.a.i.i.processors.metric.impl.LongAdderMetric@69a257d1,
 super=GridNioSessionImpl [locAddr=/10.212.120.66:51605, 
rmtAddr=/10.212.120.189:62773, createTime=1613315189741, closeTime=0, 
bytesSent=18, bytesRcvd=42, bytesSent0=18, bytesRcvd0=42, 
sndSchedTime=1613315189741, lastSndTime=1613315189741, 
lastRcvTime=1613315189741, readsPaused=false, 
filterChain=FilterChain[filters=[GridNioCodecFilter 
[parser=o.a.i.i.util.nio.GridDirectParser@15582c29, directMode=true], 
GridConnectionBytesVerifyFilter], accepted=true, 
markedForClose=false]]]{color}}}
h3. Client

{{[00:42:07,828][INFO][main][IgniteKernal%ignite-instance-5a07dae2-e5c5-4bcc-b865-99b2babb4d89]
 IgniteConfiguration 
[igniteInstanceName=ignite-instance-5a07dae2-e5c5-4bcc-b865-99b2babb4d89, 
pubPoolSize=12, svcPoolSize=12, callbackPoolSize=12, stripedPoolSize=12, 
sysPoolSize=12, mgmtPoolSize=4, dataStreamerPoolSize=12, 
utilityCachePoolSize=12, utilityCacheKeepAliveTime=60000, p2pPoolSize=2, 
qryPoolSize=12, buildIdxPoolSize=3, 
igniteHome=__XXXX__\apache-ignite-2.9.1-bin, igniteWorkDir=__XXXX__\work, 
mbeanSrv=com.sun.jmx.mbeanserver.JmxMBeanServer@72d1ad2e, 
nodeId=04624fa9-4390-408c-8ca7-df0f11466f86, marsh=BinaryMarshaller [], 
marshLocJobs=false, daemon=false, p2pEnabled=false, netTimeout=5000, 
netCompressionLevel=1, sndRetryDelay=1000, sndRetryCnt=3, 
metricsHistSize=10000, metricsUpdateFreq=2000, 
metricsExpTime=9223372036854775807, discoSpi=TcpDiscoverySpi [addrRslvr=null, 
sockTimeout=0, ackTimeout=0, marsh=null, reconCnt=10, reconDelay=2000, 
maxAckTimeout=600000, soLinger=0, forceSrvMode=false, 
clientReconnectDisabled=false, internalLsnr=null, 
skipAddrsRandomization=false], segPlc=RESTART_JVM, segResolveAttempts=10, 
waitForSegOnStart=true, allResolversPassReq=true, segChkFreq=20000, 
commSpi=TcpCommunicationSpi [connectGate=null, 
connPlc=org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$FirstConnectionPolicy@53aad5d5,
 chConnPlc=null, enableForcibleNodeKill=false, enableTroubleshootingLog=false, 
locAddr=null, locHost=null, locPort=51605, locPortRange=100, shmemPort=-1, 
directBuf=true, directSndBuf=false, idleConnTimeout=600000, connTimeout=5000, 
maxConnTimeout=600000, reconCnt=10, sockSndBuf=32768, sockRcvBuf=32768, 
msgQueueLimit=0, slowClientQueueLimit=0, nioSrvr=null, shmemSrv=null, 
usePairedConnections=false, connectionsPerNode=1, tcpNoDelay=true, 
filterReachableAddresses=false, ackSndThreshold=32, unackedMsgsBufSize=0, 
sockWriteTimeout=2000, boundTcpPort=-1, boundTcpShmemPort=-1, selectorsCnt=6, 
selectorSpins=0, forceClientToSrvConnections=false, addrRslvr=null, 
ctxInitLatch=java.util.concurrent.CountDownLatch@46fa7c39[Count = 1], 
stopping=false, metricsLsnr=null, connectionRequestor=null], 
evtSpi=org.apache.ignite.spi.eventstorage.NoopEventStorageSpi@1fb700ee, 
colSpi=NoopCollisionSpi [], deploySpi=LocalDeploymentSpi [], 
indexingSpi=org.apache.ignite.spi.indexing.noop.NoopIndexingSpi@68e965f5, 
addrRslvr=null, 
encryptionSpi=org.apache.ignite.spi.encryption.noop.NoopEncryptionSpi@6f27a732, 
tracingSpi=org.apache.ignite.spi.tracing.NoopTracingSpi@6c779568, 
clientMode=true, rebalanceThreadPoolSize=4, rebalanceTimeout=10000, 
rebalanceBatchesPrefetchCnt=3, rebalanceThrottle=0, rebalanceBatchSize=524288, 
txCfg=TransactionConfiguration [txSerEnabled=false, 
dfltIsolation=REPEATABLE_READ, dfltConcurrency=PESSIMISTIC, dfltTxTimeout=0, 
txTimeoutOnPartitionMapExchange=0, deadlockTimeout=10000, 
pessimisticTxLogSize=0, pessimisticTxLogLinger=10000, tmLookupClsName=null, 
txManagerFactory=null, useJtaSync=false], cacheSanityCheckEnabled=true, 
discoStartupDelay=60000, deployMode=SHARED, p2pMissedCacheSize=100, 
locHost=10.212.120.180, timeSrvPortBase=31100, timeSrvPortRange=100, 
failureDetectionTimeout=120000, sysWorkerBlockedTimeout=null, 
clientFailureDetectionTimeout=120000, metricsLogFreq=60000, 
connectorCfg=ConnectorConfiguration [jettyPath=null, host=null, port=11211, 
noDelay=true, directBuf=false, sndBufSize=32768, rcvBufSize=32768, 
idleQryCurTimeout=600000, idleQryCurCheckFreq=60000, sndQueueLimit=0, 
selectorCnt=4, idleTimeout=7000, sslEnabled=false, sslClientAuth=false, 
sslCtxFactory=null, sslFactory=null, portRange=100, threadPoolSize=12, 
msgInterceptor=null], odbcCfg=null, warmupClos=null, 
atomicCfg=AtomicConfiguration [seqReserveSize=1000, cacheMode=PARTITIONED, 
backups=1, aff=null, grpName=null], classLdr=null, sslCtxFactory=null, 
platformCfg=PlatformDotNetConfiguration [binaryCfg=null], binaryCfg=null, 
memCfg=null, pstCfg=null, dsCfg=null, snapshotPath=snapshots, 
activeOnStart=true, activeOnStartPropSetFlag=false, autoActivation=true, 
autoActivationPropSetFlag=false, clusterStateOnStart=null, sqlConnCfg=null, 
cliConnCfg=ClientConnectorConfiguration [host=null, port=10800, portRange=100, 
sockSndBufSize=0, sockRcvBufSize=0, tcpNoDelay=true, maxOpenCursorsPerConn=128, 
threadPoolSize=12, idleTimeout=0, handshakeTimeout=10000, jdbcEnabled=true, 
odbcEnabled=true, thinCliEnabled=true, sslEnabled=false, 
useIgniteSslCtxFactory=true, sslClientAuth=false, sslCtxFactory=null, 
thinCliCfg=ThinClientConfiguration [maxActiveTxPerConn=100, 
maxActiveComputeTasksPerConn=0]], mvccVacuumThreadCnt=2, mvccVacuumFreq=5000, 
authEnabled=false, failureHnd=null, commFailureRslvr=null, 
sqlCfg=SqlConfiguration [longQryWarnTimeout=4000, dfltQryTimeout=0, 
sqlQryHistSize=1000, validationEnabled=false]]}}
{{...}}
{{{color:#de350b}[15:06:29,839][WARNING][disco-event-worker-#51%ignite-instance-5a07dae2-e5c5-4bcc-b865-99b2babb4d89%][GridDiscoveryManager]
 Node FAILED: TcpDiscoveryNode [id=841c8962-775e-4109-bc74-2c003598a02d, 
consistentId=A9428E28-58B8-4860-B038-9F2B392D5FC4, addrs=ArrayList 
[10.212.120.189], sockAddrs=HashSet [AX07079/10.212.120.189:0], discPort=0, 
order=276, intOrder=140, lastExchangeTime=1613263329640, loc=false, 
ver=2.9.1#20201203-sha1:adcce517, isClient=true]{color}}}
{{[15:06:29,839][INFO][disco-event-worker-#51%ignite-instance-5a07dae2-e5c5-4bcc-b865-99b2babb4d89%][GridDiscoveryManager]
 Topology snapshot [ver=290, locNode=04624fa9, servers=2, clients=14, 
state=ACTIVE, CPUs=188, offheap=64.0GB, heap=97.0GB]}}
{{[15:06:29,839][INFO][disco-event-worker-#51%ignite-instance-5a07dae2-e5c5-4bcc-b865-99b2babb4d89%][GridDiscoveryManager]
 ^-- Baseline [id=0, size=2, online=2, offline=0]}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to