Sorry I have not responded sooner; I am trying to catch up on the mailing list.
In my experience there are only two times when a worker times out to the supervisor. One is when it has died, and the other is when it is stuck doing GC and cannot get enough CPU to write out a small file. I have also seen one really rare case where the disk filled up and a lot of bad things started to happen. Because the supervisor's kill attempts complain that the process is already dead, it indicates to me that the process crashed for some reason, and you need to look at the worker logs to understand why it crashed.

- Bobby

On 8/20/14, 5:29 AM, "DashengJu" <[email protected]> wrote:

>---------- Forwarded message ----------
>From: DashengJu <[email protected]>
>Date: Wed, Aug 20, 2014 at 6:26 PM
>Subject: worker always timeout to heartbeat and was restarted by supervisor
>To: "[email protected]" <[email protected]>
>
>
>hi, all
>
>In our production environment, we have a topology named logparser_mobile_nginx. It has 50 workers; the spout has 48 executors, bolt_parser has 1000 executors, and bolt_saver has 50 executors.
>
>The topology runs normally most of the time, but 1~5 workers get restarted every 1~2 hours. When we look at the supervisor and worker logs, we find that 1) the worker has no errors or exceptions; 2) the supervisor says the worker did not heartbeat and timed out.
>
>Because the worker leaves no log, I do not know why it did not heartbeat. Does anyone have any ideas on how to investigate?
>0) Is it caused by the worker process exiting?
>1) Is it related to a GC problem?
>2) Is it related to a memory problem? If so, I would expect the JVM to report a memory exception in the worker log.
>
>By the way, some small topologies work well in the same environment.
>
>Below is the supervisor log:
>----------------------------------------------------------------------------------------------
>2014-08-20 15:51:33 b.s.d.supervisor [INFO] 90facad7-c666-41da-b7c5-f147ebe35542 still hasn't started
>
>2014-08-20 16:01:16 b.s.d.supervisor [INFO] Shutting down and clearing state for id c7e8d375-db76-4e29-8019-e783ab3cd6de. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521645, :storm-id "logparser_mobile_nginx-259-1408518662", :executors #{[4 4] [104 104] [204 204] [54 54] [154 154] [-1 -1]}, :port 9714}
>2014-08-20 16:01:16 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:c7e8d375-db76-4e29-8019-e783ab3cd6de
>2014-08-20 16:01:17 b.s.util [INFO] Error when trying to kill 44901. Process is probably already dead.
>2014-08-20 16:01:17 b.s.util [INFO] Error when trying to kill 44921. Process is probably already dead.
>2014-08-20 16:01:17 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:c7e8d375-db76-4e29-8019-e783ab3cd6de
>
>2014-08-20 16:01:17 b.s.d.supervisor [INFO] Shutting down and clearing state for id d5a8d578-89ff-4a50-a906-75e847ac63a1. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521645, :storm-id "logparser_nginx-265-1408521077", :executors #{[50 50] [114 114] [178 178] [-1 -1]}, :port 9700}
>2014-08-20 16:01:17 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:d5a8d578-89ff-4a50-a906-75e847ac63a1
>2014-08-20 16:01:18 b.s.util [INFO] Error when trying to kill 48068. Process is probably already dead.
>2014-08-20 16:01:18 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:d5a8d578-89ff-4a50-a906-75e847ac63a1
>
>2014-08-20 16:01:18 b.s.d.supervisor [INFO] Shutting down and clearing state for id 5154f643-cd79-4119-9368-153f1bede757. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521644, :storm-id "logparser_mobile_nginx-259-1408518662", :executors #{[98 98] [198 198] [48 48] [148 148] [248 248] [-1 -1]}, :port 9720}
>2014-08-20 16:01:18 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:5154f643-cd79-4119-9368-153f1bede757
>2014-08-20 16:01:19 b.s.util [INFO] Error when trying to kill 44976. Process is probably already dead.
>2014-08-20 16:01:19 b.s.util [INFO] Error when trying to kill 44986. Process is probably already dead.
>2014-08-20 16:01:19 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:5154f643-cd79-4119-9368-153f1bede757
>
>2014-08-20 16:01:19 b.s.d.supervisor [INFO] Shutting down and clearing state for id fe9f656a-1f8b-4525-ba89-bbe65fbdb0ba. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521644, :storm-id "app_upload_urls-218-1408503096", :executors #{[8 8] [40 40] [24 24] [-1 -1]}, :port 9713}
>2014-08-20 16:01:19 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:fe9f656a-1f8b-4525-ba89-bbe65fbdb0ba
>2014-08-20 16:01:20 b.s.util [INFO] Error when trying to kill 43177. Process is probably already dead.
>2014-08-20 16:01:20 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:fe9f656a-1f8b-4525-ba89-bbe65fbdb0ba
>
>--
>dashengju
>+86 13810875910
>[email protected]
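
To follow up on the GC question above in a concrete way: you can turn on GC logging for the workers and compare pause times against the supervisor's heartbeat timeout. The snippet below is only a sketch, assuming a 0.9.x-era storm.yaml and a HotSpot JDK 7/8 worker JVM; the heap size and log path are placeholders to adapt, and the %ID% substitution (worker port) should be verified against your Storm version:

    worker.childopts: "-Xmx2048m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/storm/worker-gc-%ID%.log"

    # The supervisor marks a worker :timed-out when it misses heartbeats for this long
    # (default 30 seconds). Raising it only hides long pauses, it does not remove them.
    supervisor.worker.timeout.secs: 30

If the GC log shows stop-the-world pauses approaching supervisor.worker.timeout.secs just before the supervisor prints ":timed-out", the restarts are GC-driven, and the fix is heap/GC tuning or spreading load over more workers rather than a longer timeout.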

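On question 0) above (whether the worker process actually exited): the supervisor log records the PIDs it tried to kill (44901, 44921, 48068, 44976, 44986, 43177), so you can look for crash evidence for those PIDs right after a restart. This is a sketch assuming a Linux supervisor host; the directories searched are examples, not Storm defaults:

    # Was a worker JVM killed by the kernel OOM killer around the restart time (16:01)?
    dmesg | grep -i -E 'killed process|out of memory'

    # Did the JVM crash on its own? HotSpot writes hs_err_pid<PID>.log into the process's
    # working directory; search under your Storm install dir and storm.local.dir.
    find /opt/storm /var/storm -name 'hs_err_pid*.log' 2>/dev/null

    # For a worker that is still running but looks stuck, sample its GC activity once a second.
    jstat -gcutil <worker-pid> 1000

If none of these show anything and the worker log simply stops, that points back at a long GC pause or the machine being starved of CPU or disk, as described at the top of this thread.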