Igor Kamyshnikov created IGNITE-10469:
-----------------------------------------
Summary: TcpCommunicationSpi does not break tcp connection after
IdleConnectionTimeout seconds of inactivity
Key: IGNITE-10469
URL: https://issues.apache.org/jira/browse/IGNITE-10469
Project: Ignite
Issue Type: Bug
Components: cache
Affects Versions: 2.6, 2.5
Reporter: Igor Kamyshnikov
Attachments: GridTcpCommunicationSpiIdleCommunicationTimeoutTest.java,
ignite_idle_test.zip
TcpCommunicationSpi does not close TCP connections after they have been idle
for longer than the time configured in TcpCommunicationSpi#idleConnTimeout
(default is 10 minutes).
There are environments where idle TCP connections become unusable: connections
remain ESTABLISHED while the data to be sent piles up in Send-Q (according
to netstat). For this reason the Ignite stack does not recognize a communication
problem for a considerable amount of time (~10-15 minutes), and it does not
begin its reconnection procedure (heartbeats use different TCP connections that
are not idle and don't have this issue).
I've discovered that there is logic in the Ignite code to detect and close
idle connections, but due to a problem in the code it does not work reliably.
The attached test _sometimes_ reproduces the problem:
[^ignite_idle_test.zip] - full test project
[^GridTcpCommunicationSpiIdleCommunicationTimeoutTest.java] - just test code
What's the problem in the Ignite code?
There are two loops in the Ignite code that have a chance to close idle
connections:
1) org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.CommunicationWorker#processIdle
- this one is executed every *IdleConnectionTimeout* milliseconds. It can
close idle connections, but it typically concludes that a connection is not
idle because the second loop has reset the idle timer.
2) org.apache.ignite.internal.util.nio.GridNioServer.AbstractNioClientWorker#bodyInternal
-> org.apache.ignite.internal.util.nio.GridNioServer.AbstractNioClientWorker#checkIdle
- this loop executes:
{noformat}
filterChain.onSessionIdleTimeout(ses); // <-- does not actually close an idle connection
// Update timestamp to avoid multiple notifications within one timeout interval.
ses.resetSendScheduleTime(); // <-- resets the idle timer
ses.bytesReceived(0);
{noformat}
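The interaction between the two loops can be illustrated with a small standalone model. This is only a sketch and not Ignite code: the class IdleCheckModel, its Session type and the fixed ordering of the two checks within one tick are assumptions made for illustration, and it uses virtual time instead of real threads. It shows how a check that only notifies and then resets the idle timestamp can keep the closing check from ever seeing the connection as idle.

```java
/** Simplified, hypothetical model of the two idle checks. Not Ignite code. */
class IdleCheckModel {
    /** Stand-in for a NIO session: remembers when it was last considered active. */
    static final class Session {
        long lastActivityMs;
        boolean closed;

        Session(long nowMs) {
            lastActivityMs = nowMs;
        }
    }

    /** Models the NIO worker check: notifies about idleness but never closes the session. */
    static void nioWorkerCheckIdle(Session ses, long nowMs, long idleTimeoutMs, boolean resetOnNotify) {
        if (!ses.closed && nowMs - ses.lastActivityMs >= idleTimeoutMs) {
            // The idle notification would be fired here; the connection stays open.
            if (resetOnNotify)
                ses.lastActivityMs = nowMs; // models the timestamp reset after the notification
        }
    }

    /** Models the periodic check that is supposed to close idle connections. */
    static void processIdle(Session ses, long nowMs, long idleTimeoutMs) {
        if (!ses.closed && nowMs - ses.lastActivityMs >= idleTimeoutMs)
            ses.closed = true;
    }

    /**
     * Runs both checks over virtual time, with no traffic on the session.
     * Assumption: when both checks fall on the same tick, the NIO worker check runs first.
     *
     * @return true if the idle session was eventually closed.
     */
    static boolean simulate(long idleTimeoutMs, long nioPeriodMs, long totalMs, boolean resetOnNotify) {
        Session ses = new Session(0);

        for (long now = 1; now <= totalMs; now++) {
            if (now % nioPeriodMs == 0)
                nioWorkerCheckIdle(ses, now, idleTimeoutMs, resetOnNotify);

            if (now % idleTimeoutMs == 0)
                processIdle(ses, now, idleTimeoutMs);
        }

        return ses.closed;
    }

    public static void main(String[] args) {
        System.out.println("closed with reset:    " + simulate(1000, 100, 5000, true));
        System.out.println("closed without reset: " + simulate(1000, 100, 5000, false));
    }
}
```

In this model the session is never closed when the notifying check resets the timestamp first, and it is closed at the first timeout when it does not. With a different relative timing of the two checks the closing check can win, which would be consistent with the test reproducing the problem only sometimes.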
---
To wind up, maybe the whole approach should be reviewed:
- is it ok not to track message delivery time?
- is it ok not to do heartbeating using the same connections as for
get/put/... commands?
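As an illustration of the first question, tracking message delivery time could look roughly like the sketch below. It is not taken from Ignite: DeliveryTracker, its methods and the cumulative acknowledgement scheme are hypothetical and only show the idea of treating a connection as stalled when the oldest unacknowledged message is older than some limit.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch: flags a connection whose oldest unacknowledged message is too old. */
class DeliveryTracker {
    /** Pending messages in send order, each stored as {msgId, sentAtMs}. */
    private final Deque<long[]> pending = new ArrayDeque<>();

    /** Maximum time a message may stay unacknowledged. */
    private final long maxDeliveryMs;

    DeliveryTracker(long maxDeliveryMs) {
        this.maxDeliveryMs = maxDeliveryMs;
    }

    /** Records that a message was handed to the connection. Ids are assumed to grow monotonically. */
    void onSend(long msgId, long nowMs) {
        pending.addLast(new long[] {msgId, nowMs});
    }

    /** Cumulative acknowledgement: everything up to and including msgId was delivered. */
    void onAck(long msgId) {
        while (!pending.isEmpty() && pending.peekFirst()[0] <= msgId)
            pending.removeFirst();
    }

    /** Returns true if the oldest unacknowledged message has waited longer than the limit. */
    boolean isStalled(long nowMs) {
        long[] oldest = pending.peekFirst();

        return oldest != null && nowMs - oldest[1] > maxDeliveryMs;
    }
}
```

A caller would poll isStalled periodically and start the reconnection procedure when it returns true, instead of waiting for the TCP stack to report the failure.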