[ https://issues.apache.org/jira/browse/FLINK-26080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cai Liuyang updated FLINK-26080:
--------------------------------
    Description: 
In our production environment, we encountered an abnormal case:

    an upstream task is backpressured while all of its downstream tasks are idle, and the job stays in this state until the checkpoint times out (we use aligned checkpoints). After analysing this case, we found a [half-open socket|https://en.wikipedia.org/wiki/TCP_half-open]: the connection was already closed on the server side but still established on the client side. The sequence of events:

    1. The NettyServer encounters a ReadTimeoutException while reading from the channel; it then releases the NetworkSequenceViewReader (which is responsible for sending data to the PartitionRequestClient) and writes an ErrorResponse to the PartitionRequestClient (a simplified sketch of this server-side pattern follows the list);

    2. The PartitionRequestClient never receives the ErrorResponse (possibly due to network congestion, or to a kernel bug on our machines);

    3. After writing the ErrorResponse, the NettyServer closes the channel (the socket transitions to FIN_WAIT1), but the client machine never receives the server's FIN, so it still treats the channel as healthy and keeps waiting for the server's BufferResponse (even though the server has already released the corresponding NetworkSequenceViewReader);

    4. The server machine eventually releases the socket after it has stayed in FIN_WAIT1 for too long, while the socket on the client machine remains ESTABLISHED.
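
To make the server-side pattern in step 1 concrete, here is a minimal, generic Netty sketch. It is NOT Flink's actual code: releaseReader() and buildErrorResponse() are placeholders standing in for releasing the NetworkSequenceViewReader and building the ErrorResponse message.

{code:java}
// Simplified sketch of step 1 (generic Netty code, not Flink's classes).
import io.netty.channel.ChannelFutureListener;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.handler.timeout.ReadTimeoutException;

public class ServerTimeoutSketch extends ChannelInboundHandlerAdapter {

    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
        if (cause instanceof ReadTimeoutException) {
            // Step 1: release the resource feeding this channel.
            releaseReader();
            // Write the error response, then close the channel. If the
            // response (and later the FIN) never reaches the client, the
            // client is left with a half-open connection (steps 2-4).
            ctx.writeAndFlush(buildErrorResponse(cause))
               .addListener(ChannelFutureListener.CLOSE);
        } else {
            ctx.fireExceptionCaught(cause);
        }
    }

    private void releaseReader() { /* placeholder */ }

    private Object buildErrorResponse(Throwable cause) {
        return "ERROR: " + cause.getMessage(); // placeholder payload
    }
}
// Installed after a read-timeout detector, e.g.:
// pipeline.addLast(new ReadTimeoutHandler(30), new ServerTimeoutSketch());
{code}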

To avoid this case, I think there are two methods:

    1. Enable TCP keepalive on the client (Flink already enables it): this also requires adjusting the machine's TCP keepalive time, because the kernel's default of 7200 seconds is too long (a sketch follows this list).

    2. Let the client use Netty's IdleStateHandler to detect whether the channel is idle (no reads or writes for some period); if it is, the client writes a ping message to the server to probe whether the channel is really alive (a second sketch follows the recommendation below).
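
For method 1, the socket option itself is a one-liner; the limitation is the kernel-side probe timing. A minimal sketch (the sysctl names in the comments are the standard Linux tunables):

{code:java}
// Method 1 sketch: TCP keepalive on the Netty client bootstrap. Flink
// already enables this option; the problem is the kernel-side timing.
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelOption;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioSocketChannel;

public class KeepAliveSketch {
    public static void main(String[] args) {
        Bootstrap bootstrap = new Bootstrap()
                .group(new NioEventLoopGroup())
                .channel(NioSocketChannel.class)
                .option(ChannelOption.SO_KEEPALIVE, true); // Flink already does this

        // How fast a dead peer is detected is governed by the kernel, not by
        // Netty. Linux defaults:
        //   net.ipv4.tcp_keepalive_time   = 7200   (first probe after 2h idle)
        //   net.ipv4.tcp_keepalive_intvl  = 75
        //   net.ipv4.tcp_keepalive_probes = 9
        // Lowering tcp_keepalive_time speeds up detection, but it is a
        // machine-wide setting that affects every service on the host.
    }
}
{code}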

Of the two methods, I recommend method 2, because adjusting the machine's TCP keepalive time would also affect other services running on the same machine.
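
For method 2, below is a minimal sketch of what the client-side handler could look like. It is generic Netty code under assumed names (PingMsg is a hypothetical placeholder message type, not an existing Flink message), not a finished implementation:

{code:java}
// Method 2 sketch: client-side idle detection with Netty's IdleStateHandler.
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.handler.timeout.IdleState;
import io.netty.handler.timeout.IdleStateEvent;

public class ClientIdleProbeSketch extends ChannelInboundHandlerAdapter {

    @Override
    public void userEventTriggered(ChannelHandlerContext ctx, Object evt) {
        if (evt instanceof IdleStateEvent
                && ((IdleStateEvent) evt).state() == IdleState.ALL_IDLE) {
            // Nothing was read or written for the configured period: probe
            // the connection. On a half-open socket the write eventually
            // fails, so the channel can be failed right away instead of
            // waiting for the checkpoint to time out.
            ctx.writeAndFlush(new PingMsg()).addListener(future -> {
                if (!future.isSuccess()) {
                    ctx.close();
                }
            });
        } else {
            ctx.fireUserEventTriggered(evt);
        }
    }

    static final class PingMsg {} // hypothetical probe message
}
// Installed in the client pipeline, e.g. all-idle after 60 seconds:
// pipeline.addLast(
//     new IdleStateHandler(0, 0, 60, TimeUnit.SECONDS),
//     new ClientIdleProbeSketch());
{code}

A failed write already catches the half-open case described above; if the server's reply could also be lost, the client would additionally need a timeout on the expected pong.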

 

  was:
In our production environment, we encountered an abnormal case:

    an upstream task is backpressured while all of its downstream tasks are idle, and the job stays in this state until the checkpoint times out (we use aligned checkpoints). After analysing this case, we found the reason: the kernel on our machines may have a bug that loses socket events. The sequence of events:

    1. The NettyServer encounters a ReadTimeoutException while reading from the channel; it then releases the NetworkSequenceViewReader (which is responsible for sending data to the PartitionRequestClient) and writes an ErrorResponse to the PartitionRequestClient;

    2. The PartitionRequestClient never receives the ErrorResponse (possibly due to network congestion, or to a kernel bug on our machines);

    3. After writing the ErrorResponse, the NettyServer closes the channel (the socket transitions to FIN_WAIT1), but the client machine never receives the server's FIN, so it still treats the channel as healthy and keeps waiting for the server's BufferResponse (even though the server has already released the corresponding NetworkSequenceViewReader);

    4. The server machine eventually releases the socket after it has stayed in FIN_WAIT1 for too long, while the socket on the client machine remains ESTABLISHED.

To avoid this case, I think there are two methods:

    1. Enable TCP keepalive on the client (Flink already enables it): this also requires adjusting the machine's TCP keepalive time, because the kernel's default of 7200 seconds is too long.

    2. Let the client use Netty's IdleStateHandler to detect whether the channel is idle (no reads or writes for some period); if it is, the client writes a ping message to the server to probe whether the channel is really alive.

Of the two methods, I recommend method 2, because adjusting the machine's TCP keepalive time would also affect other services running on the same machine.

 


> PartitionRequest client use Netty's IdleStateHandler to monitor channel's 
> status
> --------------------------------------------------------------------------------
>
>                 Key: FLINK-26080
>                 URL: https://issues.apache.org/jira/browse/FLINK-26080
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>    Affects Versions: 1.14.3
>            Reporter: Cai Liuyang
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
