[ 
https://issues.apache.org/jira/browse/FLINK-15074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995784#comment-16995784
 ] 

Piotr Nowojski edited comment on FLINK-15074 at 12/13/19 5:41 PM:
------------------------------------------------------------------

Thanks [~gameking] for the bug report and sorry for the delay in responding!

I think it's unlikely that this is a bug in Flink itself. Also, this one 
taskmanager.log is unfortunately not very helpful, as a connection timeout is 
only a symptom of an issue somewhere else in the cluster. You would have to 
take a look at the task manager logs, stdout/stderr and system logs on the other 
machines to find the underlying issue. For some reason one of the other task 
managers either stopped responding entirely, or stopped responding for some 
period of time. There are a couple of common issues that I would suggest ruling 
out first (more or less in this order):
 # check whether another machine has crashed/rebooted
 # check whether another Task Manager process has crashed. This includes 
exceptions that forced the Task Manager to shut down, JVM fatal errors (OOM) 
and potential segfaults (JVM bugs, issues in native libraries like RocksDB)
 # check whether another Task Manager was killed (system OOM killer, ...)
 # check for GC pauses (make sure you are using G1GC) - a stop-the-world GC pause 
can easily cause connection timeouts towards other machines. Connect with a 
Java monitoring/profiling tool to check GC times, or print GC logs - this is my 
personal bet, as increased parallelism could easily increase GC pressure
 # check whether the JVM is blocked for other reasons, like machine swapping or 
heavy disk IO usage (long blocking IO can block arbitrary Java threads, for 
example due to logging). This might be difficult to diagnose, but you can start 
by making sure that you log something every X seconds and looking for suspicious 
gaps in the task manager logs
 # make sure that your network is stable and not overloaded (run the {{ping}} 
command in parallel and dump its output to a separate file)
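The log-gap check in step 5 can be automated. A minimal sketch (the timestamp prefix assumes Flink's default log4j layout, {{yyyy-MM-dd HH:mm:ss,SSS}}; the 10-second threshold and the sample lines are illustrative assumptions, not values from this issue):

```python
from datetime import datetime

def find_log_gaps(lines, max_gap_seconds=10.0):
    """Return (previous_line, line, gap_seconds) for consecutive log lines
    whose timestamps are more than max_gap_seconds apart."""
    gaps = []
    prev_ts, prev_line = None, None
    for line in lines:
        # Flink's default log4j layout starts each line with e.g.
        # "2019-12-13 17:41:00,123" (23 characters); skip other lines.
        try:
            ts = datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")
        except ValueError:
            continue
        if prev_ts is not None:
            gap = (ts - prev_ts).total_seconds()
            if gap > max_gap_seconds:
                gaps.append((prev_line, line, gap))
        prev_ts, prev_line = ts, line
    return gaps

if __name__ == "__main__":
    # Hypothetical heartbeat lines: a 55-second silence is flagged.
    sample = [
        "2019-12-13 17:40:00,000 INFO  heartbeat",
        "2019-12-13 17:40:05,000 INFO  heartbeat",
        "2019-12-13 17:41:00,000 INFO  heartbeat",
    ]
    for prev, cur, gap in find_log_gaps(sample):
        print(f"suspicious gap of {gap:.0f}s before: {cur}")
```

Run it over each taskmanager.log; a long gap around the time of the timeout points at that JVM (GC pause, swapping, blocked IO) rather than at the network.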



> Connection timed out, Standalone cluster
> ----------------------------------------
>
>                 Key: FLINK-15074
>                 URL: https://issues.apache.org/jira/browse/FLINK-15074
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.9.1
>         Environment: flink version : 1.5.1 , 1.9.1
> jdk version : 1.8.0_181
> Number of servers : 15
> Number of taskmanagers : 178
> Number of slots: 178
>            Reporter: gameking
>            Priority: Major
>         Attachments: flink-conf.yaml, jobmanager.log, taskmanager.log
>
>
> I am running a Flink streaming application on a standalone cluster.
> It works well when the job's parallelism is low, like 96.
> But when I try to increase the job's parallelism to a high value, like 164 or 
> more, the job fails within 10-15 minutes due to a connection timeout error.
> I have tried to solve this problem by increasing task manager configs such as 
> 'taskmanager.network.netty.server.numThreads', 
> 'taskmanager.network.netty.client.numThreads', 
> 'taskmanager.network.request-backoff.max', 'akka.ask.timeout' and so on, but it 
> doesn't work.
> I have also tried different versions of Flink, such as 1.5.1 and 1.9.1, to 
> solve this problem; it doesn't help either.
> Does anyone know how to fix this problem? I have no idea now. It looks like a 
> bug.
> I have uploaded my config and logs as attachments, and the error trace is below:
>  
> ------------------------------------------------------------------
> org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: 
> Connection timed out
>  at 
> org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:172)
>  ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
>  ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
>  ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
>  ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
>  ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
>  ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:79)
>  ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
>  ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
>  ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:835)
>  ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:87)
>  ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:162)
>  ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>  ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>  ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>  ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>  ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at 
> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>  ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at java.lang.Thread.run(Thread.java:748) [na:1.8.0_181]
> Caused by: java.io.IOException: Connection timed out
>  at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[na:1.8.0_181]
>  at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[na:1.8.0_181]
>  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[na:1.8.0_181]
>  at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[na:1.8.0_181]
>  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) 
> ~[na:1.8.0_181]
>  at 
> org.apache.flink.shaded.netty4.io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
>  ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at 
> org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
>  ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241)
>  ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at 
> org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>  ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  ... 6 common frames omitted



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
