[ https://issues.apache.org/jira/browse/FLINK-19249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204453#comment-17204453 ]

Zhijiang commented on FLINK-19249:
----------------------------------

Thanks for reporting this [~klion26], and thanks everyone for the discussion above.

I am curious why the downstream side only becomes aware of this problem after 
about ten minutes in the network stack.
As we know, when the netty server (upstream) detects a physical network 
problem, it does the following two things (a rough sketch follows the list):

* Send the ErrorResponse message to the netty client (downstream);
* Close the channel explicitly on its side after sending the above message.
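
For illustration, a minimal sketch of that upstream error path in plain Netty 
terms. The class name and the buildErrorResponse() helper are made up for this 
example; in Flink the message would be the NettyMessage.ErrorResponse mentioned 
above, and the real server handler is more involved.

{code:java}
// Minimal sketch of the upstream error path described above; only the two
// steps (send ErrorResponse, then close the channel) matter here.
import org.apache.flink.shaded.netty4.io.netty.channel.ChannelFutureListener;
import org.apache.flink.shaded.netty4.io.netty.channel.ChannelHandlerContext;
import org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter;

public class UpstreamErrorPathSketch extends ChannelInboundHandlerAdapter {

    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
        // 1) Send the logical ErrorResponse message to the netty client (downstream).
        // 2) Close the channel explicitly once that write has completed (or failed).
        ctx.writeAndFlush(buildErrorResponse(cause))
                .addListener(ChannelFutureListener.CLOSE);
    }

    private Object buildErrorResponse(Throwable cause) {
        // Placeholder: in Flink this would be something like NettyMessage.ErrorResponse(cause).
        return cause;
    }
}
{code}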

So the downstream side actually relies on two mechanisms for failure detection 
and handling (a sketch of the downstream path follows the list):
* Logical ErrorResponse message from the upstream side: if the downstream 
receives it from the network, it fails itself.
* Physical kernel mechanism: when the upstream closes its local channel, the 
downstream side will also detect the inactive channel after some time (via the 
TCP stack), and then fail itself, e.g. from the handler's `channelInactive` 
callback.
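
A corresponding sketch of the downstream side, assuming a hypothetical 
failConsumer() hook that propagates the error into the consuming task (the 
real Flink client handler does considerably more than this):

{code:java}
// Sketch of the downstream (netty client) side: once the TCP stack finally
// reports the channel as inactive, fail the consumer instead of waiting silently.
import java.io.IOException;

import org.apache.flink.shaded.netty4.io.netty.channel.ChannelHandlerContext;
import org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter;

public class DownstreamFailureSketch extends ChannelInboundHandlerAdapter {

    @Override
    public void channelInactive(ChannelHandlerContext ctx) throws Exception {
        // Hypothetical hook: propagate the error into the task / input channels.
        failConsumer(new IOException(
                "Connection to producer " + ctx.channel().remoteAddress() + " was lost"));
        super.channelInactive(ctx);
    }

    private void failConsumer(Throwable cause) {
        // Placeholder for the actual error propagation into the consuming task.
        cause.printStackTrace();
    }
}
{code}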

If the above two mechanisms are not always reliable in a bad network 
environment, or are delayed because of kernel default settings, then we might 
provide another application-level mechanism to resolve it for safety. One 
previously discussed option is to let the upstream report this network 
exception to the JobManager via RPC, so that the JobManager can decide to 
cancel/fail the related tasks (sketched below).
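
Purely as an illustration of that flow (none of these names exist in Flink), 
the RPC option would amount to something like:

{code:java}
// Hypothetical interface only, to illustrate the proposed flow: the upstream
// side reports the broken connection to the JobManager, which then decides
// whether to cancel/fail the affected tasks.
public interface ProducerFailureReporter {

    /**
     * Called by the upstream netty server when it detects a broken connection
     * to a downstream consumer.
     */
    void reportConnectionFailure(String producerTaskId, String consumerTaskId, Throwable cause);
}
{code}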

Regarding the other options such as `ReadTimeoutHandler`/`IdleStateHandler`, I 
wonder whether they might bring other side effects; they are also not always 
reliable and are limited by the network stack (see the sketch below).
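
For reference, a rough sketch of what wiring an IdleStateHandler into the 
client pipeline would look like (the 120-second timeout and the handler names 
are arbitrary examples). The concern is that an idle-but-healthy connection 
with simply no data to send would trigger the same event as a broken one, so 
it would additionally need heartbeats or similar.

{code:java}
// Rough sketch of the IdleStateHandler option on the client pipeline; timeout
// value and handler names are examples only.
import java.util.concurrent.TimeUnit;

import org.apache.flink.shaded.netty4.io.netty.channel.ChannelHandlerContext;
import org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter;
import org.apache.flink.shaded.netty4.io.netty.channel.ChannelPipeline;
import org.apache.flink.shaded.netty4.io.netty.handler.timeout.IdleState;
import org.apache.flink.shaded.netty4.io.netty.handler.timeout.IdleStateEvent;
import org.apache.flink.shaded.netty4.io.netty.handler.timeout.IdleStateHandler;

public class IdleDetectionSketch {

    static void install(ChannelPipeline pipeline) {
        // Fire an IdleStateEvent if nothing has been read for 120 seconds (example value).
        pipeline.addFirst("idleStateHandler", new IdleStateHandler(120, 0, 0, TimeUnit.SECONDS));
        pipeline.addLast("idleEventHandler", new ChannelInboundHandlerAdapter() {
            @Override
            public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
                if (evt instanceof IdleStateEvent
                        && ((IdleStateEvent) evt).state() == IdleState.READER_IDLE) {
                    // Cannot distinguish "connection broken" from "upstream has
                    // no data right now" without an extra heartbeat message.
                    ctx.close();
                } else {
                    super.userEventTriggered(ctx, evt);
                }
            }
        });
    }
}
{code}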

> Job would wait sometime(~10 min) before failover if some connection broken
> --------------------------------------------------------------------------
>
>                 Key: FLINK-19249
>                 URL: https://issues.apache.org/jira/browse/FLINK-19249
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>            Reporter: Congxian Qiu(klion26)
>            Priority: Blocker
>             Fix For: 1.12.0, 1.11.3
>
>
> {quote}I encountered this error on 1.7; after going through the master code, 
> I think the problem is still there
> {quote}
> When the network environment is not so good, the connection between the 
> server and the client may be disconnected unexpectedly. After the 
> disconnection, the server will receive an IOException such as the one below
> {code:java}
> java.io.IOException: Connection timed out
>  at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>  at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
>  at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
>  at sun.nio.ch.IOUtil.write(IOUtil.java:51)
>  at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:403)
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:367)
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:639)
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
>  at 
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
>  at java.lang.Thread.run(Thread.java:748)
> {code}
> and then releases the view reader.
> But the job will not fail until the downstream detects the disconnection via 
> {{channelInactive}} later (~10 min). During that time, the job can still 
> process data, but the broken channel cannot transfer any data or events, so 
> snapshots will fail during this period. This causes the job to replay a lot 
> of data after failover.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
