[ https://issues.apache.org/jira/browse/IGNITE-10469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stanilovsky Evgeny resolved IGNITE-10469.
-----------------------------------------
Resolution: Cannot Reproduce
Fix Version/s: 2.7
Igor, I changed the status; feel free to reopen if the problem is still present in 2.7.
> TcpCommunicationSpi does not break tcp connection after IdleConnectionTimeout
> seconds of inactivity
> ---------------------------------------------------------------------------------------------------
>
> Key: IGNITE-10469
> URL: https://issues.apache.org/jira/browse/IGNITE-10469
> Project: Ignite
> Issue Type: Bug
> Components: cache
> Affects Versions: 2.5, 2.6
> Reporter: Igor Kamyshnikov
> Assignee: Stanilovsky Evgeny
> Priority: Major
> Fix For: 2.7
>
> Attachments: 2.6.0.txt,
> GridTcpCommunicationSpiIdleCommunicationTimeoutTest.java,
> GridTcpCommunicationSpiIdleCommunicationTimeoutTest.java, ignite_idle_test.zip
>
>
> TcpCommunicationSpi does not close TCP connections that have been idle for
> longer than the time configured in TcpCommunicationSpi#idleConnTimeout
> (default is 10 minutes).
> There are environments where idle TCP connections become unusable:
> connections remain ESTABLISHED while the data to be sent piles up in
> Send-Q (according to netstat). As a result, the Ignite stack does not
> recognize the communication problem for a considerable amount of time (~10-15
> minutes) and does not begin its reconnection procedure (heartbeats use
> different TCP connections, which are not idle and don't have this issue).
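The symptom described above (the connection stays ESTABLISHED while data accumulates in Send-Q) can be imitated locally, as an analogy only: in this sketch the peer simply stops reading, whereas the report concerns environments where the idle connection itself becomes unusable. The class name, buffer size and iteration cap are made up for illustration. What it shows is that the sender's `write()` calls complete without error and the channel stays connected; nothing at this level tells the application that its data is not being delivered.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

/** Hypothetical loopback demo: writes to a peer that never reads are buffered, not rejected. */
public class SendQueueDemo {
    /** What the sender observed when the kernel stopped accepting data. */
    public static final class Result {
        public final long bytesAccepted;    // bytes the kernel took without any error
        public final boolean stillConnected;
        public final int lastWrite;         // 0 means no room left in the send buffer

        Result(long bytesAccepted, boolean stillConnected, int lastWrite) {
            this.bytesAccepted = bytesAccepted;
            this.stillConnected = stillConnected;
            this.lastWrite = lastWrite;
        }
    }

    public static Result fillUntilBlocked() {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0)); // any free port
            try (SocketChannel client = SocketChannel.open(server.getLocalAddress());
                 SocketChannel peer = server.accept()) {         // accepted, never read from
                client.configureBlocking(false);
                ByteBuffer chunk = ByteBuffer.allocate(64 * 1024);
                long total = 0;
                int written = -1;
                // A non-blocking write() returns 0 once the socket's output buffer is full.
                for (int i = 0; i < 100_000 && written != 0; i++) {
                    chunk.clear();
                    written = client.write(chunk);
                    total += written;
                }
                return new Result(total, client.isConnected(), written);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Result r = fillUntilBlocked();
        System.out.println("accepted without error: " + r.bytesAccepted + " bytes");
        System.out.println("still connected: " + r.stillConnected);
    }
}
```

The number of bytes accepted depends on the operating system's socket buffer sizes, so no specific figure is implied here.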
> I've found, though, that the Ignite code does contain logic to detect and
> close idle connections, but because of a bug in that code it does not work
> reliably.
> The following attachments contain a test that _sometimes_ reproduces the problem:
> [^ignite_idle_test.zip] - full test project
> [^GridTcpCommunicationSpiIdleCommunicationTimeoutTest.java] - just test code
> [^2.6.0.txt] - mvn clean install logs for the test run with Ignite 2.6.0
> What's the problem in the Ignite code?
> There are two loops in the Ignite code that can close idle connections:
> 1)
> org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.CommunicationWorker#processIdle
> - this one is executed every *IdleConnectionTimeout* milliseconds. It can
> close idle connections, but it typically concludes that a connection is not
> idle because of the second loop.
> 2)
> org.apache.ignite.internal.util.nio.GridNioServer.AbstractNioClientWorker#bodyInternal
> -> org.apache.ignite.internal.util.nio.GridNioServer.AbstractNioClientWorker#checkIdle
> - this loop executes:
> {noformat}
> filterChain.onSessionIdleTimeout(ses); // <-- does not actually close an idle connection
> // Update timestamp to avoid multiple notifications within one timeout interval.
> ses.resetSendScheduleTime(); // <--- resets idle timer
> ses.bytesReceived(0);
> {noformat}
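The interaction between the two loops can be illustrated with a toy model. It is not Ignite code: the class and method names are hypothetical, the timeout is scaled down to 10 s, the NIO worker is assumed to run its idle check every 500 ms, and when both loops run in the same simulated millisecond the NIO check is assumed to run first. Under these assumptions, once the NIO check resets the idle timer, processIdle() only closes the connection if it happens to run in the short window between the moment the connection becomes idle and the next NIO check, which is consistent with the test reproducing the problem only sometimes.

```java
/**
 * Toy model (hypothetical, not Ignite code) of the two idle-handling loops
 * described in IGNITE-10469. Time is simulated in 1 ms steps.
 */
public class IdleLoopsModel {
    public static final long TIMEOUT_MS = 10_000;    // scaled-down stand-in for idleConnTimeout
    public static final long CHECK_PERIOD_MS = 500;  // assumed NIO worker idle-check period
    public static final long HORIZON_MS = 5 * TIMEOUT_MS;

    /**
     * @param lastUseMs   time of the last real message on the connection
     * @param phaseMs     time of the first processIdle() run; later runs follow every TIMEOUT_MS
     * @param resetOnIdle whether the NIO idle check resets the idle timer (the reported behaviour)
     * @return the time at which processIdle() closes the connection, or -1 if it never does
     */
    public static long simulate(long lastUseMs, long phaseMs, boolean resetOnIdle) {
        long lastActivity = lastUseMs;
        for (long now = lastUseMs + 1; now <= HORIZON_MS; now++) {
            // Loop 2 (checkIdle in the report). Notifying the filter chain is not modeled
            // because it closes nothing; only the timer reset affects loop 1.
            boolean nioCheck = now % CHECK_PERIOD_MS == 0;
            if (resetOnIdle && nioCheck && now - lastActivity >= TIMEOUT_MS) {
                lastActivity = now; // resetSendScheduleTime(): the idle timer restarts
            }
            // Loop 1 (processIdle in the report): closes a connection idle for TIMEOUT_MS.
            boolean spiCheck = now >= phaseMs && (now - phaseMs) % TIMEOUT_MS == 0;
            if (spiCheck && now - lastActivity >= TIMEOUT_MS) {
                return now;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(simulate(0, 3_000, true));   // -1: never closed
        System.out.println(simulate(0, 3_000, false));  // 13000: closed on the second processIdle()
        System.out.println(simulate(200, 300, true));   // 10300: processIdle() ran before the reset
    }
}
```

In the last call the connection has been idle for 10 100 ms when processIdle() runs at 10 300 ms, while the NIO check that would reset the timer only comes at 10 500 ms; with a processIdle() phase of 700 ms instead, the model never closes the connection.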
> ---
> To sum up, maybe the whole approach should be reviewed:
> - is it OK not to track message delivery time?
> - is it OK not to send heartbeats over the same connections that are used for
> get/put/... commands?
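For the first question, one possible shape of delivery-time tracking is sketched here. It is hypothetical (not an existing Ignite class) and assumes every message carries a unique id and that the remote node acknowledges messages by id. A connection whose oldest unacknowledged message exceeds some threshold could then be treated as broken and reconnected, regardless of whether it looks idle.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;

/** Hypothetical per-connection tracker of unacknowledged messages (not an Ignite API). */
public class DeliveryTracker {
    // Insertion order == send order, so the first entry is the oldest pending message.
    private final LinkedHashMap<Long, Long> pending = new LinkedHashMap<>();

    /** Records that message {@code id} was handed to the connection at {@code nowMs}. */
    public void onSent(long id, long nowMs) {
        pending.put(id, nowMs);
    }

    /** Records an acknowledgement from the remote node for message {@code id}. */
    public void onAcked(long id) {
        pending.remove(id);
    }

    /** Age of the oldest unacknowledged message, or 0 if nothing is pending. */
    public long oldestPendingAgeMs(long nowMs) {
        Iterator<Long> it = pending.values().iterator();
        return it.hasNext() ? nowMs - it.next() : 0;
    }

    /** True if some message has been waiting for an ack longer than {@code thresholdMs}. */
    public boolean isStalled(long nowMs, long thresholdMs) {
        return oldestPendingAgeMs(nowMs) > thresholdMs;
    }

    public static void main(String[] args) {
        DeliveryTracker t = new DeliveryTracker();
        t.onSent(1, 0);
        t.onSent(2, 1_000);
        t.onAcked(1);
        System.out.println(t.oldestPendingAgeMs(5_000)); // 4000 (message 2)
        System.out.println(t.isStalled(5_000, 30_000));  // false
        System.out.println(t.isStalled(40_000, 30_000)); // true
    }
}
```

Unlike an idle timer, this check fires precisely when data is queued but not delivered, which is the situation described in the report.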
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)