I managed to track down the source of the error. The first error message appears
in nimbus.log: "o.a.s.d.nimbus [INFO] Executor T1-1-1497424747:[8 8] not
alive". Nimbus detects that some executors are not alive and therefore makes a
reassignment, which causes the worker to restart.

I did not find any error in the supervisor and worker logs that would
cause an executor to time out. Since my topology is data-intensive, is
it possible that the heartbeat messages were not delivered in time and thus
caused nimbus to time out the executors? How does an executor send
heartbeats to nimbus? Is it through ZooKeeper, or transmitted via
Netty like a metric tuple?

I will first try to solve the problem by increasing the timeout values and will
let you guys know if it works.
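
For reference, these are the settings I plan to raise in storm.yaml on the
cluster; the values below are just a first guess on my side, not a
recommendation:

    # nimbus marks executors "not alive" after this many seconds without a heartbeat
    nimbus.task.timeout.secs: 60
    # ZooKeeper session timeout in ms (related to the "Client session timed out" message)
    storm.zookeeper.session.timeout: 60000
    # supervisor restarts a worker after this many seconds without a worker heartbeat
    supervisor.worker.timeout.secs: 60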

Regards,
Li Wang

On 14 June 2017 at 11:27, Li Wang <[email protected]> wrote:

> Hi all,
>
> We deployed a data-intensive topology which involves a lot of HDFS
> access via the HDFS client. We found that after the topology has been running
> for about half an hour, the topology throughput occasionally drops to zero
> for tens of seconds, and sometimes the worker is shut down without any error
> messages.
>
> I checked the logs thoroughly and found nothing wrong but an info message that
> reads “ClientCnxn [INFO] Client session timed out, have not heard from
> server in 13333ms for sessionid …”. I am not sure how this message is
> related to the weird behavior of my topology, but every time my topology
> behaves abnormally, this message happens to show up in the log.
>
> Any help or suggestion is highly appreciated.
>
> Thanks,
> Li Wang.
>
