[ https://issues.apache.org/jira/browse/FLINK-26080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cai Liuyang updated FLINK-26080:
--------------------------------
Description:
In our production environment we encountered an abnormal case: an upstream task is backpressured while all of its downstream tasks are idle, and the job stays in this state until the checkpoint times out (aligned checkpoints are in use). After analysing the case, we found a [half-open socket|https://en.wikipedia.org/wiki/TCP_half-open] that was already closed on the server side but still established on the client side:
1. The NettyServer hits a ReadTimeoutException while reading from the channel; it then releases the NetworkSequenceViewReader (which is responsible for sending data to the PartitionRequestClient) and writes an ErrorResponse to the PartitionRequestClient.
2. The PartitionRequestClient never receives the ErrorResponse (possibly because of network congestion, or because of a kernel bug on our machines).
3. After writing the ErrorResponse, the NettyServer closes the channel (the socket moves to FIN_WAIT1), but the client machine never receives the server's FIN, so it still considers the channel healthy and keeps waiting for the server's BufferResponse (even though the server has already released the corresponding NetworkSequenceViewReader).
4. The server machine eventually releases a socket that stays in FIN_WAIT1 for too long, while the socket on the client machine remains ESTABLISHED.
To avoid this case, I see two options:
1. Enable TCP keep-alive on the client (Flink already enables it): this also requires tuning the machine's tcp-keep-alive time, whose default of 7200 seconds is too long.
2. Let the client use Netty's IdleStateHandler to detect whether the channel is idle (no reads or writes); if it is, the client writes a ping message to the server to check whether the channel is really alive.
Of the two options, I recommend option 2, because adjusting the machine's tcp-keep-alive time would also affect other services running on the same machine.

was:
In our production environment we encountered an abnormal case: an upstream task is backpressured while all of its downstream tasks are idle, and the job stays in this state until the checkpoint times out (aligned checkpoints are in use). After analysing the case, we found the reason (the kernel we use may have a bug that loses socket events):
1. The NettyServer hits a ReadTimeoutException while reading from the channel; it then releases the NetworkSequenceViewReader (which is responsible for sending data to the PartitionRequestClient) and writes an ErrorResponse to the PartitionRequestClient.
2. The PartitionRequestClient never receives the ErrorResponse (possibly because of network congestion, or because of a kernel bug on our machines).
3. After writing the ErrorResponse, the NettyServer closes the channel (the socket moves to FIN_WAIT1), but the client machine never receives the server's FIN, so it still considers the channel healthy and keeps waiting for the server's BufferResponse (even though the server has already released the corresponding NetworkSequenceViewReader).
4. The server machine eventually releases a socket that stays in FIN_WAIT1 for too long, while the socket on the client machine remains ESTABLISHED.
To avoid this case, I see two options:
1. Enable TCP keep-alive on the client (Flink already enables it): this also requires tuning the machine's tcp-keep-alive time, whose default of 7200 seconds is too long.
2. Let the client use Netty's IdleStateHandler to detect whether the channel is idle (no reads or writes); if it is, the client writes a ping message to the server to check whether the channel is really alive.
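The client-side idle detection proposed in option 2 can be sketched as a small state machine: if nothing has been read for longer than the idle timeout, send a ping; if the channel is still idle on the next check while a ping is outstanding, treat the connection as half-open and close it. This is an illustrative sketch only: the class and method names below are hypothetical (not Flink or Netty APIs), and a real implementation would drive these checks from the IdleStateEvent fired by Netty's IdleStateHandler rather than from explicit timestamps.

```java
// Sketch of the idle-detection logic from option 2. All names here are
// illustrative; in Flink this would live in the client's Netty pipeline.
public class HalfOpenDetector {
    enum Action { NONE, SEND_PING, CLOSE_CHANNEL }

    private final long idleTimeoutMs;
    private long lastReadMs;          // time of the last message read from the channel
    private boolean pingOutstanding;  // a ping was sent but no response arrived yet

    HalfOpenDetector(long idleTimeoutMs, long nowMs) {
        this.idleTimeoutMs = idleTimeoutMs;
        this.lastReadMs = nowMs;
    }

    /** Called whenever any response (BufferResponse, pong, ...) arrives. */
    void onRead(long nowMs) {
        lastReadMs = nowMs;
        pingOutstanding = false;
    }

    /** Called periodically, e.g. when IdleStateHandler fires an idle event. */
    Action onIdleCheck(long nowMs) {
        if (nowMs - lastReadMs < idleTimeoutMs) {
            return Action.NONE;          // channel is healthy
        }
        if (!pingOutstanding) {
            pingOutstanding = true;      // first idle period: probe the server
            return Action.SEND_PING;
        }
        return Action.CLOSE_CHANNEL;     // ping unanswered: assume half-open
    }

    public static void main(String[] args) {
        HalfOpenDetector d = new HalfOpenDetector(1000, 0);
        System.out.println(d.onIdleCheck(500));   // → NONE (recent read)
        System.out.println(d.onIdleCheck(1500));  // → SEND_PING (idle, probe once)
        System.out.println(d.onIdleCheck(3000));  // → CLOSE_CHANNEL (probe unanswered)
        d.onRead(3100);                           // a response finally arrives
        System.out.println(d.onIdleCheck(3200));  // → NONE (timer reset)
    }
}
```

On a healthy channel any response from the server resets the timer, so only a genuinely half-open connection ever reaches CLOSE_CHANNEL.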
Of the two options, I recommend option 2, because adjusting the machine's tcp-keep-alive time would also affect other services running on the same machine.

> PartitionRequest client use Netty's IdleStateHandler to monitor channel's
> status
> --------------------------------------------------------------------------------
>
>                 Key: FLINK-26080
>                 URL: https://issues.apache.org/jira/browse/FLINK-26080
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>    Affects Versions: 1.14.3
>            Reporter: Cai Liuyang
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.20.1#820001)