I managed to track down the source of the error. The first error message appears in nimbus.log: "o.a.s.d.nimbus [INFO] Executor T1-1-1497424747:[8 8] not alive". Nimbus detects that some executors are not alive and therefore makes a reassignment, which causes the workers to restart.
I did not find any error in the supervisor or worker logs that would cause an executor to time out. Since my topology is data-intensive, is it possible that the heartbeat messages were not delivered in time and this caused Nimbus to time out the executors? How does an executor send its heartbeat to Nimbus? Is it written through ZooKeeper, or transmitted via Netty like a metric tuple?

I will first try to solve the problem by increasing the timeout values (a sketch of the settings I plan to change is below the quoted message) and will let you guys know if it works.

Regards,
Li Wang

On 14 June 2017 at 11:27, Li Wang <[email protected]> wrote:
> Hi all,
>
> We deployed a data-intensive topology which involves a lot of HDFS
> access via the HDFS client. We found that after the topology has been running
> for about half an hour, the topology throughput occasionally drops to zero
> for tens of seconds, and sometimes a worker is shut down without any error
> messages.
>
> I checked the logs thoroughly and found nothing wrong except an info message that
> reads "ClientCnxn [INFO] Client session timed out, have not heard from
> server in 13333ms for sessionid ...". I am not sure how this message is
> related to the weird behavior of my topology, but every time my topology
> behaves abnormally, this message happens to show up in the log.
>
> Any help or suggestion is highly appreciated.
>
> Thanks,
> Li Wang.
>
>
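For reference, this is roughly what I plan to change in storm.yaml. The keys are the standard Storm timeout settings; the values below are just my first guesses, not recommendations, and the defaults noted in the comments are from memory:

    # storm.yaml -- values are illustrative, will tune after observing the topology
    nimbus.task.timeout.secs: 60            # how long Nimbus waits for an executor heartbeat before marking it "not alive" (default 30, I believe)
    supervisor.worker.timeout.secs: 60      # how long the supervisor waits for a worker heartbeat before restarting it (default 30)
    storm.zookeeper.session.timeout: 30000  # ZooKeeper client session timeout in ms (default 20000); may be related to the ClientCnxn timeout message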
