Alexey Kukushkin created IGNITE-15184:
-----------------------------------------
Summary: Ignite server cannot reconnect client: Close incoming
connection, unknown node
Key: IGNITE-15184
URL: https://issues.apache.org/jira/browse/IGNITE-15184
Project: Ignite
Issue Type: Bug
Components: general
Affects Versions: 2.9.1, 2.7
Environment: Reproduced on AI 2.7 and 2.9.1
Client OS: Windows Server 2012 R2 6.3 amd64
Server OS: Linux 3.10.0-1062.9.1.el7.x86_64 amd64
See the Ignite configuration in the client and server logs in the Description
Reporter: Alexey Kukushkin
Ignite client nodes eventually lose connectivity with the cluster and never
reconnect. The below logs show the server rejects the client reconnection with
a warning "{color:#de350b}Close incoming connection, unknown node{color}".
It is possible the problem was caused by a severely degraded network
throughput. However, the server should keep the client session for
"clientFailureDetectionTimeout" (10 seconds by default) and the client should
keep trying to re-connect automatically during that time. Instead, we see the
server immediately says "Close incoming connection, unknown node" and the
client never re-connects.
Ignite should survive temporary network problems (with duration less than
clientFailureDetectionTimeout).
See the server and client errors and configurations in the logs below.
h3. Server
{{[05:33:00,946][INFO][main][IgniteKernal%StaticGrid] IgniteConfiguration
[igniteInstanceName=StaticGrid, pubPoolSize=16, svcPoolSize=16,
callbackPoolSize=16, stripedPoolSize=16, sysPoolSize=16, mgmtPoolSize=4,
dataStreamerPoolSize=16, utilityCachePoolSize=16,
utilityCacheKeepAliveTime=60000, p2pPoolSize=2, qryPoolSize=16,
buildIdxPoolSize=4, igniteHome=/ignite/apache-ignite-2.9.1-bin,
igniteWorkDir=/ignite/apache-ignite-2.9.1-bin/work,
mbeanSrv=com.sun.jmx.mbeanserver.JmxMBeanServer@6f94fa3e,
nodeId=e5845e09-df08-4b85-b633-f14dec8f8106, marsh=BinaryMarshaller [],
marshLocJobs=false, daemon=false, p2pEnabled=false, netTimeout=5000,
netCompressionLevel=1, sndRetryDelay=1000, sndRetryCnt=3,
metricsHistSize=10000, metricsUpdateFreq=2000,
metricsExpTime=9223372036854775807, discoSpi=TcpDiscoverySpi [addrRslvr=null,
sockTimeout=0, ackTimeout=0, marsh=null, reconCnt=10, reconDelay=2000,
maxAckTimeout=600000, soLinger=0, forceSrvMode=false,
clientReconnectDisabled=false, internalLsnr=null,
skipAddrsRandomization=false], segPlc=RESTART_JVM, segResolveAttempts=10,
waitForSegOnStart=true, allResolversPassReq=true, segChkFreq=20000,
commSpi=TcpCommunicationSpi [connectGate=null,
connPlc=org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$FirstConnectionPolicy@62ddbd7e,
chConnPlc=null, enableForcibleNodeKill=false, enableTroubleshootingLog=false,
locAddr=null, locHost=null, locPort=51605, locPortRange=100, shmemPort=-1,
directBuf=true, directSndBuf=false, idleConnTimeout=600000, connTimeout=5000,
maxConnTimeout=600000, reconCnt=10, sockSndBuf=32768, sockRcvBuf=32768,
msgQueueLimit=0, slowClientQueueLimit=0, nioSrvr=null, shmemSrv=null,
usePairedConnections=false, connectionsPerNode=1, tcpNoDelay=true,
filterReachableAddresses=false, ackSndThreshold=32, unackedMsgsBufSize=0,
sockWriteTimeout=2000, boundTcpPort=-1, boundTcpShmemPort=-1, selectorsCnt=8,
selectorSpins=0, forceClientToSrvConnections=false, addrRslvr=null,
ctxInitLatch=java.util.concurrent.CountDownLatch@74e52ef6[Count = 1],
stopping=false, metricsLsnr=null, connectionRequestor=null],
evtSpi=org.apache.ignite.spi.eventstorage.NoopEventStorageSpi@6ed3ccb2,
colSpi=NoopCollisionSpi [], deploySpi=LocalDeploymentSpi [],
indexingSpi=org.apache.ignite.spi.indexing.noop.NoopIndexingSpi@3c9754d8,
addrRslvr=null,
encryptionSpi=org.apache.ignite.spi.encryption.noop.NoopEncryptionSpi@3bf7ca37,
tracingSpi=org.apache.ignite.spi.tracing.NoopTracingSpi@79079097,
clientMode=false, rebalanceThreadPoolSize=4, rebalanceTimeout=10000,
rebalanceBatchesPrefetchCnt=3, rebalanceThrottle=0, rebalanceBatchSize=524288,
txCfg=TransactionConfiguration [txSerEnabled=false,
dfltIsolation=REPEATABLE_READ, dfltConcurrency=PESSIMISTIC, dfltTxTimeout=0,
txTimeoutOnPartitionMapExchange=0, deadlockTimeout=10000,
pessimisticTxLogSize=0, pessimisticTxLogLinger=10000, tmLookupClsName=null,
txManagerFactory=null, useJtaSync=false], cacheSanityCheckEnabled=true,
discoStartupDelay=60000, deployMode=SHARED, p2pMissedCacheSize=100,
locHost=10.212.120.66, timeSrvPortBase=31100, timeSrvPortRange=100,
failureDetectionTimeout=120000, sysWorkerBlockedTimeout=null,
clientFailureDetectionTimeout=120000, metricsLogFreq=60000,
connectorCfg=ConnectorConfiguration [jettyPath=null, host=null, port=11211,
noDelay=true, directBuf=false, sndBufSize=32768, rcvBufSize=32768,
idleQryCurTimeout=600000, idleQryCurCheckFreq=60000, sndQueueLimit=0,
selectorCnt=4, idleTimeout=7000, sslEnabled=false, sslClientAuth=false,
sslCtxFactory=null, sslFactory=null, portRange=100, threadPoolSize=16,
msgInterceptor=null], odbcCfg=null, warmupClos=null,
atomicCfg=AtomicConfiguration [seqReserveSize=1000, cacheMode=PARTITIONED,
backups=1, aff=null, grpName=null], classLdr=null, sslCtxFactory=null,
platformCfg=null, binaryCfg=null, memCfg=null, pstCfg=null,
dsCfg=DataStorageConfiguration [sysRegionInitSize=41943040,
sysRegionMaxSize=104857600, pageSize=0, concLvl=0,
dfltDataRegConf=DataRegionConfiguration [name=Static_Data_Region,
maxSize=34359738368, initSize=1073741824, swapPath=null,
pageEvictionMode=DISABLED, evictionThreshold=0.9, emptyPagesPoolSize=100,
metricsEnabled=false, metricsSubIntervalCount=5, metricsRateTimeInterval=60000,
persistenceEnabled=false, checkpointPageBufSize=0, lazyMemoryAllocation=true],
dataRegions=null, storagePath=null, checkpointFreq=180000, lockWaitTime=10000,
checkpointThreads=4, checkpointWriteOrder=SEQUENTIAL, walHistSize=20,
maxWalArchiveSize=1073741824, walSegments=10, walSegmentSize=67108864,
walPath=db/wal, walArchivePath=db/wal/archive, metricsEnabled=false,
walMode=LOG_ONLY, walTlbSize=131072, walBuffSize=0, walFlushFreq=2000,
walFsyncDelay=1000, walRecordIterBuffSize=67108864, alwaysWriteFullPages=false,
fileIOFactory=org.apache.ignite.internal.processors.cache.persistence.file.AsyncFileIOFactory@2925bf5b,
metricsSubIntervalCnt=5, metricsRateTimeInterval=60000,
walAutoArchiveAfterInactivity=-1, writeThrottlingEnabled=false,
walCompactionEnabled=false, walCompactionLevel=1,
checkpointReadLockTimeout=null, walPageCompression=DISABLED,
walPageCompressionLevel=null], snapshotPath=snapshots, activeOnStart=true,
activeOnStartPropSetFlag=false, autoActivation=true,
autoActivationPropSetFlag=false, clusterStateOnStart=null, sqlConnCfg=null,
cliConnCfg=ClientConnectorConfiguration [host=null, port=10800, portRange=100,
sockSndBufSize=0, sockRcvBufSize=0, tcpNoDelay=true, maxOpenCursorsPerConn=128,
threadPoolSize=16, idleTimeout=0, handshakeTimeout=10000, jdbcEnabled=true,
odbcEnabled=true, thinCliEnabled=true, sslEnabled=false,
useIgniteSslCtxFactory=true, sslClientAuth=false, sslCtxFactory=null,
thinCliCfg=ThinClientConfiguration [maxActiveTxPerConn=100,
maxActiveComputeTasksPerConn=0]], mvccVacuumThreadCnt=2, mvccVacuumFreq=5000,
authEnabled=false, failureHnd=null, commFailureRslvr=null,
sqlCfg=SqlConfiguration [longQryWarnTimeout=4000, dfltQryTimeout=0,
sqlQryHistSize=1000, validationEnabled=false]]}}
{{...}}
{{[15:06:29,740][WARNING][tcp-disco-msg-worker-[8817efc3 10.212.120.67:51500
crd]-#2%StaticGrid%-#62%StaticGrid%][TcpDiscoverySpi] Failing client node due
to not receiving metrics updates from client node within
'IgniteConfiguration.clientFailureDetectionTimeout' (consider increasing
configuration property) [timeout=120000, node=TcpDiscoveryNode
[id=841c8962-775e-4109-bc74-2c003598a02d,
consistentId=A9428E28-58B8-4860-B038-9F2B392D5FC4, addrs=ArrayList
[10.212.120.189], sockAddrs=HashSet [AX07079/10.212.120.189:0], discPort=0,
order=276, intOrder=140, lastExchangeTime=1613261110451, loc=false,
ver=2.9.1#20201203-sha1:adcce517, isClient=true]]}}
{{[15:06:29,743][WARNING][disco-event-worker-#64%StaticGrid%][GridDiscoveryManager]
Node FAILED: TcpDiscoveryNode [id=841c8962-775e-4109-bc74-2c003598a02d,
consistentId=A9428E28-58B8-4860-B038-9F2B392D5FC4, addrs=ArrayList
[10.212.120.189], sockAddrs=HashSet [AX07079/10.212.120.189:0], discPort=0,
order=276, intOrder=140, lastExchangeTime=1613261110451, loc=false,
ver=2.9.1#20201203-sha1:adcce517, isClient=true]}}
{{[15:06:29,744][INFO][disco-event-worker-#64%StaticGrid%][GridDiscoveryManager]
Topology snapshot [ver=290, locNode=e5845e09, servers=2, clients=14,
state=ACTIVE, CPUs=188, offheap=64.0GB, heap=97.0GB]}}
{{[15:06:29,744][INFO][disco-event-worker-#64%StaticGrid%][GridDiscoveryManager]
^-- Baseline [id=0, size=2, online=2, offline=0]}}
{{[15:06:29,747][INFO][grid-nio-worker-tcp-comm-7-#46%StaticGrid%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/10.212.120.66:51605,
rmtAddr=/10.212.120.189:62773]}}
{{[15:06:29,747][INFO][exchange-worker-#66%StaticGrid%][time] Started exchange
init [topVer=AffinityTopologyVersion [topVer=290, minorTopVer=0], crd=true,
evt=NODE_FAILED, evtNode=841c8962-775e-4109-bc74-2c003598a02d, customEvt=null,
allowMerge=true, exchangeFreeSwitch=false]}}
{{{color:#de350b}[15:06:29,748][WARNING][grid-nio-worker-tcp-comm-7-#46%StaticGrid%][TcpCommunicationSpi]
Close incoming connection, unknown node
[nodeId=841c8962-775e-4109-bc74-2c003598a02d, ses=GridSelectorNioSessionImpl
[worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=7,
bytesRcvd=129090683, bytesSent=3949813526, bytesRcvd0=42, bytesSent0=18,
select=true, super=GridWorker [name=grid-nio-worker-tcp-comm-7,
igniteInstanceName=StaticGrid, finished=false, heartbeatTs=1613315189741,
hashCode=1034460111, interrupted=false,
runner=grid-nio-worker-tcp-comm-7-#46%StaticGrid%]]],
writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
readBuf=java.nio.DirectByteBuffer[pos=38 lim=38 cap=32768], inRecovery=null,
outRecovery=null, closeSocket=true,
outboundMessagesQueueSizeMetric=o.a.i.i.processors.metric.impl.LongAdderMetric@69a257d1,
super=GridNioSessionImpl [locAddr=/10.212.120.66:51605,
rmtAddr=/10.212.120.189:62773, createTime=1613315189741, closeTime=0,
bytesSent=18, bytesRcvd=42, bytesSent0=18, bytesRcvd0=42,
sndSchedTime=1613315189741, lastSndTime=1613315189741,
lastRcvTime=1613315189741, readsPaused=false,
filterChain=FilterChain[filters=[GridNioCodecFilter
[parser=o.a.i.i.util.nio.GridDirectParser@15582c29, directMode=true],
GridConnectionBytesVerifyFilter], accepted=true,
markedForClose=false]]]{color}}}
h3. Client
{{[00:42:07,828][INFO][main][IgniteKernal%ignite-instance-5a07dae2-e5c5-4bcc-b865-99b2babb4d89]
IgniteConfiguration
[igniteInstanceName=ignite-instance-5a07dae2-e5c5-4bcc-b865-99b2babb4d89,
pubPoolSize=12, svcPoolSize=12, callbackPoolSize=12, stripedPoolSize=12,
sysPoolSize=12, mgmtPoolSize=4, dataStreamerPoolSize=12,
utilityCachePoolSize=12, utilityCacheKeepAliveTime=60000, p2pPoolSize=2,
qryPoolSize=12, buildIdxPoolSize=3,
igniteHome=__XXXX__\apache-ignite-2.9.1-bin, igniteWorkDir=__XXXX__\work,
mbeanSrv=com.sun.jmx.mbeanserver.JmxMBeanServer@72d1ad2e,
nodeId=04624fa9-4390-408c-8ca7-df0f11466f86, marsh=BinaryMarshaller [],
marshLocJobs=false, daemon=false, p2pEnabled=false, netTimeout=5000,
netCompressionLevel=1, sndRetryDelay=1000, sndRetryCnt=3,
metricsHistSize=10000, metricsUpdateFreq=2000,
metricsExpTime=9223372036854775807, discoSpi=TcpDiscoverySpi [addrRslvr=null,
sockTimeout=0, ackTimeout=0, marsh=null, reconCnt=10, reconDelay=2000,
maxAckTimeout=600000, soLinger=0, forceSrvMode=false,
clientReconnectDisabled=false, internalLsnr=null,
skipAddrsRandomization=false], segPlc=RESTART_JVM, segResolveAttempts=10,
waitForSegOnStart=true, allResolversPassReq=true, segChkFreq=20000,
commSpi=TcpCommunicationSpi [connectGate=null,
connPlc=org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$FirstConnectionPolicy@53aad5d5,
chConnPlc=null, enableForcibleNodeKill=false, enableTroubleshootingLog=false,
locAddr=null, locHost=null, locPort=51605, locPortRange=100, shmemPort=-1,
directBuf=true, directSndBuf=false, idleConnTimeout=600000, connTimeout=5000,
maxConnTimeout=600000, reconCnt=10, sockSndBuf=32768, sockRcvBuf=32768,
msgQueueLimit=0, slowClientQueueLimit=0, nioSrvr=null, shmemSrv=null,
usePairedConnections=false, connectionsPerNode=1, tcpNoDelay=true,
filterReachableAddresses=false, ackSndThreshold=32, unackedMsgsBufSize=0,
sockWriteTimeout=2000, boundTcpPort=-1, boundTcpShmemPort=-1, selectorsCnt=6,
selectorSpins=0, forceClientToSrvConnections=false, addrRslvr=null,
ctxInitLatch=java.util.concurrent.CountDownLatch@46fa7c39[Count = 1],
stopping=false, metricsLsnr=null, connectionRequestor=null],
evtSpi=org.apache.ignite.spi.eventstorage.NoopEventStorageSpi@1fb700ee,
colSpi=NoopCollisionSpi [], deploySpi=LocalDeploymentSpi [],
indexingSpi=org.apache.ignite.spi.indexing.noop.NoopIndexingSpi@68e965f5,
addrRslvr=null,
encryptionSpi=org.apache.ignite.spi.encryption.noop.NoopEncryptionSpi@6f27a732,
tracingSpi=org.apache.ignite.spi.tracing.NoopTracingSpi@6c779568,
clientMode=true, rebalanceThreadPoolSize=4, rebalanceTimeout=10000,
rebalanceBatchesPrefetchCnt=3, rebalanceThrottle=0, rebalanceBatchSize=524288,
txCfg=TransactionConfiguration [txSerEnabled=false,
dfltIsolation=REPEATABLE_READ, dfltConcurrency=PESSIMISTIC, dfltTxTimeout=0,
txTimeoutOnPartitionMapExchange=0, deadlockTimeout=10000,
pessimisticTxLogSize=0, pessimisticTxLogLinger=10000, tmLookupClsName=null,
txManagerFactory=null, useJtaSync=false], cacheSanityCheckEnabled=true,
discoStartupDelay=60000, deployMode=SHARED, p2pMissedCacheSize=100,
locHost=10.212.120.180, timeSrvPortBase=31100, timeSrvPortRange=100,
failureDetectionTimeout=120000, sysWorkerBlockedTimeout=null,
clientFailureDetectionTimeout=120000, metricsLogFreq=60000,
connectorCfg=ConnectorConfiguration [jettyPath=null, host=null, port=11211,
noDelay=true, directBuf=false, sndBufSize=32768, rcvBufSize=32768,
idleQryCurTimeout=600000, idleQryCurCheckFreq=60000, sndQueueLimit=0,
selectorCnt=4, idleTimeout=7000, sslEnabled=false, sslClientAuth=false,
sslCtxFactory=null, sslFactory=null, portRange=100, threadPoolSize=12,
msgInterceptor=null], odbcCfg=null, warmupClos=null,
atomicCfg=AtomicConfiguration [seqReserveSize=1000, cacheMode=PARTITIONED,
backups=1, aff=null, grpName=null], classLdr=null, sslCtxFactory=null,
platformCfg=PlatformDotNetConfiguration [binaryCfg=null], binaryCfg=null,
memCfg=null, pstCfg=null, dsCfg=null, snapshotPath=snapshots,
activeOnStart=true, activeOnStartPropSetFlag=false, autoActivation=true,
autoActivationPropSetFlag=false, clusterStateOnStart=null, sqlConnCfg=null,
cliConnCfg=ClientConnectorConfiguration [host=null, port=10800, portRange=100,
sockSndBufSize=0, sockRcvBufSize=0, tcpNoDelay=true, maxOpenCursorsPerConn=128,
threadPoolSize=12, idleTimeout=0, handshakeTimeout=10000, jdbcEnabled=true,
odbcEnabled=true, thinCliEnabled=true, sslEnabled=false,
useIgniteSslCtxFactory=true, sslClientAuth=false, sslCtxFactory=null,
thinCliCfg=ThinClientConfiguration [maxActiveTxPerConn=100,
maxActiveComputeTasksPerConn=0]], mvccVacuumThreadCnt=2, mvccVacuumFreq=5000,
authEnabled=false, failureHnd=null, commFailureRslvr=null,
sqlCfg=SqlConfiguration [longQryWarnTimeout=4000, dfltQryTimeout=0,
sqlQryHistSize=1000, validationEnabled=false]]}}
{{...}}
{{{color:#de350b}[15:06:29,839][WARNING][disco-event-worker-#51%ignite-instance-5a07dae2-e5c5-4bcc-b865-99b2babb4d89%][GridDiscoveryManager]
Node FAILED: TcpDiscoveryNode [id=841c8962-775e-4109-bc74-2c003598a02d,
consistentId=A9428E28-58B8-4860-B038-9F2B392D5FC4, addrs=ArrayList
[10.212.120.189], sockAddrs=HashSet [AX07079/10.212.120.189:0], discPort=0,
order=276, intOrder=140, lastExchangeTime=1613263329640, loc=false,
ver=2.9.1#20201203-sha1:adcce517, isClient=true]{color}}}
{{[15:06:29,839][INFO][disco-event-worker-#51%ignite-instance-5a07dae2-e5c5-4bcc-b865-99b2babb4d89%][GridDiscoveryManager]
Topology snapshot [ver=290, locNode=04624fa9, servers=2, clients=14,
state=ACTIVE, CPUs=188, offheap=64.0GB, heap=97.0GB]}}
{{[15:06:29,839][INFO][disco-event-worker-#51%ignite-instance-5a07dae2-e5c5-4bcc-b865-99b2babb4d89%][GridDiscoveryManager]
^-- Baseline [id=0, size=2, online=2, offline=0]}}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)