[ 
https://issues.apache.org/jira/browse/IGNITE-16462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Tupitsyn updated IGNITE-16462:
------------------------------------
    Description: 
*Why*

TCP connections can enter a [half-open 
state|https://en.wikipedia.org/wiki/TCP_half-open]: the connection seems to be 
alive, but any attempt to send data will fail. Long-lived and mostly idle 
connections are especially susceptible to this behavior.

The retry mechanism ([IEP-82 Thin Client Retry 
Policy|https://cwiki.apache.org/confluence/display/IGNITE/IEP-82+Thin+Client+Retry+Policy])
 in thin client implementations partially mitigates the issue. However, not all 
operations are safe to retry, and reconnecting affects performance.

To improve connection stability and detect failures early, we can add a 
keep-alive mechanism.

*Why not TCP keepalive*

TCP has a [built-in keepalive 
mechanism|https://en.wikipedia.org/wiki/Keepalive], but it has some 
disadvantages:
* Optional (may not be present in some TCP stacks)
* May not be handled well by some routers (RFC 1122, section 4.2.3.6)
* The default timeout is too long (2 hours); adjusting it is problematic on the 
SDK versions in use in Ignite (Java 8, .NET Standard 2.0) and hard to do right 
in some languages (Python, JS).
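
To illustrate the last point, here is a minimal Python sketch of what shortening the TCP keepalive timers involves. The helper name {{enable_tcp_keepalive}} and its default values are assumptions made for this example; the socket options are the ones exposed by Python's standard {{socket}} module, and which of them exist depends on the platform and Python version.

```python
import socket
import sys


def enable_tcp_keepalive(sock, idle_s=30, interval_s=10, probes=3):
    """Enable TCP keepalive and shorten its timers where the platform allows it.

    Returns the names of the socket options that were applied.
    (Hypothetical helper, not part of any Ignite client.)
    """
    applied = []
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied.append("SO_KEEPALIVE")

    if sys.platform == "win32" and hasattr(socket, "SIO_KEEPALIVE_VALS"):
        # Windows: one ioctl taking (on/off, idle time in ms, interval in ms).
        sock.ioctl(socket.SIO_KEEPALIVE_VALS,
                   (1, idle_s * 1000, interval_s * 1000))
        applied.append("SIO_KEEPALIVE_VALS")
        return applied

    # Elsewhere the timers are separate options, and the idle timer has a
    # different name on Linux (TCP_KEEPIDLE) and macOS (TCP_KEEPALIVE).
    timers = [
        (("TCP_KEEPIDLE", "TCP_KEEPALIVE"), idle_s),
        (("TCP_KEEPINTVL",), interval_s),
        (("TCP_KEEPCNT",), probes),
    ]
    for names, value in timers:
        for name in names:
            opt = getattr(socket, name, None)
            if opt is not None:
                sock.setsockopt(socket.IPPROTO_TCP, opt, value)
                applied.append(name)
                break
    return applied
```

Even this small helper needs three platform branches, and the set of options it manages to apply can differ from one machine to another.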

Because of that, some protocols implement keepalive logic at a higher level 
(SMB, [TLS|https://datatracker.ietf.org/doc/html/rfc6520]). More details: 
https://blog.stephencleary.com/2009/05/detection-of-half-open-dropped.html

*How*

Add an OP_HEARTBEAT operation with an empty payload to the protocol. Clients can 
send heartbeats at a configurable interval and wait for the responses to confirm 
that the connection is still active.

  was:
*Why*

TCP connections can enter [half-open 
state|https://en.wikipedia.org/wiki/TCP_half-open]: seems to be alive, but any 
attempt to send data will fail. Long-living and mostly idle connections are 
especially susceptible to this behavior.

Retry mechanism ([IEP-82 Thin Client Retry 
Policy|https://cwiki.apache.org/confluence/display/IGNITE/IEP-82+Thin+Client+Retry+Policy])
 in thin client implementations partially mitigates the issue. However, not all 
operations are safe to retry, and reconnect affects performance.

To improve the connection stability and detect failures early we can add a 
keep-alive mechanism.

*Why not TCP keepalive*

TCP has a [built-in keepalive 
mechanism|https://en.wikipedia.org/wiki/Keepalive], but it has some 
disadvantages:
* Optional (may be not present in some TCP stacks)
* May be not handled well by some routers (RFC 1122, section 4.2.3.6)
* Default timeout is too long (2 hours), and is problematic to adjust on SDK 
versions that are in use in Ignite (Java 8, .NET Standard 2.0), or hard to do 
right in some languages (Python, JS).

Because of that, some protocols implement keepalive logic on a higher level 
(SMB, TCP). More details: 
https://blog.stephencleary.com/2009/05/detection-of-half-open-dropped.html


> Thin client: add keep-alive message to detect half-open connections
> -------------------------------------------------------------------
>
>                 Key: IGNITE-16462
>                 URL: https://issues.apache.org/jira/browse/IGNITE-16462
>             Project: Ignite
>          Issue Type: Improvement
>          Components: platforms, thin client
>            Reporter: Pavel Tupitsyn
>            Assignee: Pavel Tupitsyn
>            Priority: Major
>             Fix For: 2.13
>
>
> *Why*
> TCP connections can enter a [half-open 
> state|https://en.wikipedia.org/wiki/TCP_half-open]: the connection seems to 
> be alive, but any attempt to send data will fail. Long-lived and mostly idle 
> connections are especially susceptible to this behavior.
> The retry mechanism ([IEP-82 Thin Client Retry 
> Policy|https://cwiki.apache.org/confluence/display/IGNITE/IEP-82+Thin+Client+Retry+Policy])
>  in thin client implementations partially mitigates the issue. However, not 
> all operations are safe to retry, and reconnecting affects performance.
> To improve connection stability and detect failures early, we can add a 
> keep-alive mechanism.
> *Why not TCP keepalive*
> TCP has a [built-in keepalive 
> mechanism|https://en.wikipedia.org/wiki/Keepalive], but it has some 
> disadvantages:
> * Optional (may not be present in some TCP stacks)
> * May not be handled well by some routers (RFC 1122, section 4.2.3.6)
> * The default timeout is too long (2 hours); adjusting it is problematic on 
> the SDK versions in use in Ignite (Java 8, .NET Standard 2.0) and hard to do 
> right in some languages (Python, JS).
> Because of that, some protocols implement keepalive logic at a higher level 
> (SMB, [TLS|https://datatracker.ietf.org/doc/html/rfc6520]). More details: 
> https://blog.stephencleary.com/2009/05/detection-of-half-open-dropped.html
> *How*
> Add an OP_HEARTBEAT operation with an empty payload to the protocol. Clients 
> can send heartbeats at a configurable interval and wait for the responses to 
> confirm that the connection is still active.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
