[ https://issues.apache.org/jira/browse/FLINK-19249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204047#comment-17204047 ]
Piotr Nowojski commented on FLINK-19249:
----------------------------------------
I'm not sure how {{ReadTimeoutHandler}}/{{IdleStateHandler}} work, but our
code running inside the netty threads is blocking (not non-blocking), so I
would be afraid we can not use them. For the idle channels we could introduce
keep-alive messages, but this would again hit the same problem of blocking
operations in the netty threads.
A kernel-handled TCP keepalive (the {{SO_KEEPALIVE}} socket option) would have
a greater chance of working reliably.
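As a concrete illustration (not Flink's actual code): the kernel-handled keepalive is just the {{SO_KEEPALIVE}} socket option, shown below on a plain JDK socket. In Netty, the equivalent would be setting {{ChannelOption.SO_KEEPALIVE}} on the client bootstrap.

```java
import java.net.Socket;

public class KeepAliveSketch {
    public static void main(String[] args) throws Exception {
        // An unconnected socket is enough to demonstrate the option;
        // in practice it would be enabled before connecting.
        Socket socket = new Socket();
        socket.setKeepAlive(true); // ask the kernel to probe idle connections
        System.out.println("SO_KEEPALIVE enabled: " + socket.getKeepAlive());
        socket.close();
    }
}
```

Note that the probe timing is not controlled here; it is governed by kernel-level settings.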
Does this have to be a blocker issue? It's a rare pre-existing behaviour
that's triggered by an unstable cluster network. As far as I can see, it is
also working as designed, so I wouldn't even label this a bug, but rather a
feature request/improvement.
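For scale, a back-of-the-envelope sketch of the kernel keepalive's dead-peer detection bound, assuming the Linux defaults documented in {{tcp(7)}} (these are tunables, not guaranteed values): with the defaults, detection would take over two hours, so the parameters would need to be lowered to improve on the delay reported here.

```java
public class KeepAliveDetectionBound {
    public static void main(String[] args) {
        // Assumed Linux defaults from tcp(7); all three are sysctl tunables.
        int keepaliveTimeSec = 7200;  // net.ipv4.tcp_keepalive_time
        int keepaliveIntvlSec = 75;   // net.ipv4.tcp_keepalive_intvl
        int keepaliveProbes = 9;      // net.ipv4.tcp_keepalive_probes

        // Worst case until the kernel declares an idle peer dead:
        // the idle threshold plus all unanswered probes.
        int detectionSec = keepaliveTimeSec + keepaliveIntvlSec * keepaliveProbes;
        System.out.println("Dead-peer detection bound: " + detectionSec + " s"); // 7875 s
    }
}
```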
> Job would wait sometime(~10 min) before failover if some connection broken
> --------------------------------------------------------------------------
>
> Key: FLINK-19249
> URL: https://issues.apache.org/jira/browse/FLINK-19249
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Network
> Reporter: Congxian Qiu(klion26)
> Priority: Blocker
> Fix For: 1.12.0, 1.11.3
>
>
> {quote}encountered this error on 1.7, after going through the master code, I
> think the problem is still there
> {quote}
> When the network environment is not so good, the connection between the
> server and the client may be dropped unexpectedly. After the
> disconnection, the server will receive an IOException such as the one below:
> {code:java}
> java.io.IOException: Connection timed out
> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
> at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
> at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
> at sun.nio.ch.IOUtil.write(IOUtil.java:51)
> at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
> at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:403)
> at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
> at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:367)
> at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:639)
> at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
> at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
> at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
> at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> and then the server releases the view reader.
> But the job will not fail until the downstream side detects the
> disconnection via {{channelInactive}} later (~10 min). During this time, the
> job can still process data, but the broken channel can't transfer any data
> or events, so snapshots will fail. This causes the job to replay a lot of
> data after failover.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)