[ 
https://issues.apache.org/jira/browse/FLINK-26080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494359#comment-17494359
 ] 

Cai Liuyang commented on FLINK-26080:
-------------------------------------

Yes, this looks like the same problem. Thanks, [~pnowojski]; I'll read that
issue carefully.

 

> PartitionRequest client use Netty's IdleStateHandler to monitor channel's 
> status
> --------------------------------------------------------------------------------
>
>                 Key: FLINK-26080
>                 URL: https://issues.apache.org/jira/browse/FLINK-26080
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>    Affects Versions: 1.14.3
>            Reporter: Cai Liuyang
>            Priority: Major
>
> In our production environment, we encountered an abnormal case:
>     the upstream task is backpressured while all of its downstream tasks are
> idle, and the job stays in this state until the checkpoint times out (we use
> aligned checkpoints). After analysing this case, we found that it was caused
> by a half-open socket (see [https://en.wikipedia.org/wiki/TCP_half-open]),
> that is, a socket already closed on the server side but still established on
> the client side. The sequence is:
>     1. NettyServer encounters a ReadTimeoutException when reading data from
> the channel. It then releases the NetworkSequenceViewReader (which is
> responsible for sending data to the PartitionRequestClient) and writes an
> ErrorResponse to the PartitionRequestClient. Once the ErrorResponse has been
> written successfully, the server closes the channel (the socket moves to the
> FIN_WAIT1 state).
>     2. PartitionRequestClient receives neither the ErrorResponse nor the
> server's FIN, so the client keeps the socket in the established state and
> keeps waiting for a BufferResponse from the server (a kernel bug on our
> machines may have caused the ErrorResponse and FIN to be lost).
>     3. The server machine releases the socket after it has stayed in FIN_WAIT1
> for too long, but the socket on the client machine is still in the
> established state, which results in a half-open socket.
> To avoid this case, I think there are two methods:
>     1. Enable TCP keep-alive on the client (Flink already does this). This
> also requires adjusting the machine's TCP keep-alive time (the default is
> 7200 seconds, which is too long).
>     2. Have the client use Netty's IdleStateHandler to detect whether the
> channel is idle (for reads or writes). If the channel is idle, the client
> writes a ping message to the server to check whether the channel is really OK.
> Of the two methods, I recommend method 2, because adjusting the machine's
> TCP keep-alive time would affect other services running on the same
> machine.
>  
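
For illustration only, the idle check described in method 2 can be sketched
as a small JDK-only state machine. This is not Flink or Netty code: the class
name IdleChannelMonitor, the Action values, the two timeouts, and the
assumption that the server sends some reply to a ping are all hypothetical.
With Netty, the idle notification would presumably come from IdleStateHandler
rather than from an explicit check() call.

```java
import java.util.function.LongSupplier;

/** Hypothetical sketch of client-side idle detection; not Flink or Netty code. */
final class IdleChannelMonitor {
    enum Action { NONE, SEND_PING, CLOSE_CHANNEL }

    private final long idleTimeoutMs;   // silence allowed before a ping is sent
    private final long pingTimeoutMs;   // silence allowed after a ping before giving up
    private final LongSupplier clockMs; // injected clock, so the logic is testable
    private long lastReadMs;
    private long pingSentMs = -1;       // -1 means no ping is outstanding

    IdleChannelMonitor(long idleTimeoutMs, long pingTimeoutMs, LongSupplier clockMs) {
        this.idleTimeoutMs = idleTimeoutMs;
        this.pingTimeoutMs = pingTimeoutMs;
        this.clockMs = clockMs;
        this.lastReadMs = clockMs.getAsLong();
    }

    /** Call for every inbound message; any data from the server proves the channel is alive. */
    void onRead() {
        lastReadMs = clockMs.getAsLong();
        pingSentMs = -1;
    }

    /** Call periodically; returns what the client should do with the channel now. */
    Action check() {
        long now = clockMs.getAsLong();
        if (pingSentMs >= 0) {
            // A ping is outstanding: close only if nothing arrived within pingTimeoutMs.
            return (now - pingSentMs >= pingTimeoutMs) ? Action.CLOSE_CHANNEL : Action.NONE;
        }
        if (now - lastReadMs >= idleTimeoutMs) {
            // Channel has been silent for too long: probe it with a ping.
            pingSentMs = now;
            return Action.SEND_PING;
        }
        return Action.NONE;
    }
}
```

A caller would invoke onRead() for each inbound message and check() on a
timer. SEND_PING corresponds to writing the ping message to the server, and
CLOSE_CHANNEL to failing the channel so that the downstream task stops
waiting on a half-open socket.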



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
