[jira] [Created] (FLINK-19249) Job would wait sometime(~10 min) before failover if some connection broken

Congxian Qiu(klion26) (Jira) Tue, 15 Sep 2020 07:58:42 -0700

Congxian Qiu(klion26) created FLINK-19249:
---------------------------------------------


             Summary: Job would wait sometime(~10 min) before failover if some 
connection broken
                 Key: FLINK-19249
                 URL: https://issues.apache.org/jira/browse/FLINK-19249
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Network
            Reporter: Congxian Qiu(klion26)


{quote}I encountered this error on 1.7. After going through the master code, I 
think the problem is still there.
{quote}
When the network environment is unreliable, the connection between the server 
and the client may be dropped unexpectedly. After the disconnection, the 
server receives an IOException such as the one below:
{code:java}
java.io.IOException: Connection timed out
 at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
 at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
 at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
 at sun.nio.ch.IOUtil.write(IOUtil.java:51)
 at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
 at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:403)
 at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
 at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:367)
 at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:639)
 at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
 at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
 at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
 at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
 at java.lang.Thread.run(Thread.java:748)
{code}
The server then releases the view reader.

But the job does not fail until the downstream task detects the disconnection 
through {{channelInactive}} some time later (~10 min). In the meantime, the job 
can still process data, but the broken channel cannot transfer any data or 
events, so snapshots fail during this window. As a result, the job has to 
replay a large amount of data after failover.
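
To make the cost of the delay concrete, here is a rough back-of-the-envelope sketch. It is only an illustration, not Flink code: the class name {{ReplayWindowSketch}}, the 1-minute checkpoint interval, and the input rate are all assumptions. Only the ~10 min detection delay comes from the report above. It assumes no checkpoint succeeds while the channel is broken, so recovery goes back to the last checkpoint taken before the break.

```java
import java.time.Duration;

// Hypothetical helper, not part of Flink: estimates how much input must be
// replayed when a broken channel is only detected after some delay.
public class ReplayWindowSketch {

    // Worst case: the last successful checkpoint was taken almost one full
    // interval before the connection broke, and nothing succeeds until the
    // failure is finally detected.
    static Duration worstCaseReplayWindow(Duration detectionDelay,
                                          Duration checkpointInterval) {
        return detectionDelay.plus(checkpointInterval);
    }

    // Rough number of records to reprocess for a given (assumed) input rate.
    static long recordsToReplay(Duration replayWindow, long recordsPerSecond) {
        return replayWindow.getSeconds() * recordsPerSecond;
    }

    public static void main(String[] args) {
        Duration detectionDelay = Duration.ofMinutes(10);    // from the report
        Duration checkpointInterval = Duration.ofMinutes(1); // assumed
        long recordsPerSecond = 1_000L;                      // assumed

        Duration window = worstCaseReplayWindow(detectionDelay, checkpointInterval);
        System.out.println("replay window (min): " + window.toMinutes());
        System.out.println("records to replay:   "
                + recordsToReplay(window, recordsPerSecond));
    }
}
```

Under these assumptions the replay window is dominated by the detection delay rather than by the checkpoint interval, which is why failing over sooner after the IOException would reduce the amount of replayed data.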



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
