[
https://issues.apache.org/jira/browse/FLINK-26080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494359#comment-17494359
]
Cai Liuyang commented on FLINK-26080:
-------------------------------------
Yeah, this looks like the same problem. Thanks [~pnowojski], I'll read that
issue carefully.
> PartitionRequest client use Netty's IdleStateHandler to monitor channel's
> status
> --------------------------------------------------------------------------------
>
> Key: FLINK-26080
> URL: https://issues.apache.org/jira/browse/FLINK-26080
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Network
> Affects Versions: 1.14.3
> Reporter: Cai Liuyang
> Priority: Major
>
> In our production environment, we encountered an abnormal case: the
> upstream task is backpressured while all of its downstream tasks are idle,
> and the job stays in this state until the checkpoint times out (we use
> aligned checkpoints). After analysing this case, we found that it is caused
> by a half-open socket (see [https://en.wikipedia.org/wiki/TCP_half-open]),
> which is already closed on the server side but still established on the
> client side. The sequence is:
> 1. NettyServer encounters a ReadTimeoutException when reading data from the
> channel. It then releases the NetworkSequenceViewReader (which is
> responsible for sending data to the PartitionRequestClient) and writes an
> ErrorResponse to the PartitionRequestClient. After the ErrorResponse is
> written successfully, the server closes the channel (the socket moves to the
> FIN_WAIT1 state).
> 2. The PartitionRequestClient receives neither the ErrorResponse nor the
> server's FIN, so the client keeps the socket in the ESTABLISHED state and
> keeps waiting for a BufferResponse from the server (a kernel bug on our
> machines may have caused the ErrorResponse and FIN to be lost).
> 3. The server machine releases the socket after it has stayed in FIN_WAIT1
> for too long, but the socket on the client machine is still in the
> ESTABLISHED state, which results in a half-open socket.
> To avoid this case, I think there are two methods:
> 1. Enable TCP keep-alive on the client (Flink already enables it). This
> also requires adjusting the machine's TCP keep-alive time (the default is
> 7200 seconds, which is too long).
> 2. Have the client use Netty's IdleStateHandler to detect whether the
> channel is idle (for reads or writes). If the channel is idle, the client
> writes a ping message to the server to check whether the channel is still
> healthy.
> Of the two, I recommend method 2, because adjusting the machine's TCP
> keep-alive time would affect other services running on the same machine.
>
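Method 2 in the description above could be sketched roughly as follows. This
is a plain-Java stand-in for the decision logic only, not Flink or Netty code:
the class IdleChannelMonitor, its Action enum, and the timeout values in the
usage note are hypothetical and chosen purely for illustration. In a real
client, Netty's IdleStateHandler would raise the idle event and a channel
handler would write the ping message.

```java
/**
 * Hypothetical sketch of "ping when idle, close when the ping goes unanswered".
 * Timestamps are passed in explicitly so the logic is easy to test.
 */
final class IdleChannelMonitor {

    /** What the caller should do after a periodic check. */
    enum Action { NONE, SEND_PING, CLOSE_CHANNEL }

    private final long idleMillis;         // silence allowed before we ping
    private final long pingTimeoutMillis;  // how long we wait for any reply to a ping
    private long lastReadMillis;           // time of the last message from the server
    private long pingSentMillis;           // valid only while pingOutstanding is true
    private boolean pingOutstanding;

    IdleChannelMonitor(long idleMillis, long pingTimeoutMillis, long nowMillis) {
        this.idleMillis = idleMillis;
        this.pingTimeoutMillis = pingTimeoutMillis;
        this.lastReadMillis = nowMillis;
    }

    /** Call whenever any message arrives from the server. */
    void onRead(long nowMillis) {
        lastReadMillis = nowMillis;
        pingOutstanding = false; // any traffic proves the connection is alive
    }

    /** Call periodically; returns the action the client should take now. */
    Action onTick(long nowMillis) {
        if (pingOutstanding) {
            // A ping was sent and nothing has come back since.
            if (nowMillis - pingSentMillis >= pingTimeoutMillis) {
                return Action.CLOSE_CHANNEL; // treat the socket as half-open
            }
            return Action.NONE;
        }
        if (nowMillis - lastReadMillis >= idleMillis) {
            pingOutstanding = true;
            pingSentMillis = nowMillis;
            return Action.SEND_PING;
        }
        return Action.NONE;
    }
}
```

With an idle threshold of 30 s and a ping timeout of 10 s, calling onTick at
10 s, 30 s, 35 s, and 40 s after the last read returns NONE, SEND_PING, NONE,
and CLOSE_CHANNEL. A call to onRead at any point resets the monitor, so a
channel that is merely quiet but healthy is never closed.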
--
This message was sent by Atlassian Jira
(v8.20.1#820001)