Sorry I have not responded sooner; I am trying to catch up on the mailing list.
In my experience there are only two times when a worker times out to the supervisor. One is when it has died, and the other is when it is stuck doing GC and cannot get enough CPU to write out a small file. I have also seen one really rare case where the disk filled up and a lot of bad things started to happen. Because the supervisor's kill attempts complain that the process is already dead, it indicates to me that the process crashed for some reason, and you need to look at the worker logs to understand why it crashed.

- Bobby

On 8/20/14, 5:29 AM, "DashengJu" <[email protected]> wrote:

>---------- Forwarded message ----------
>From: DashengJu <[email protected]>
>Date: Wed, Aug 20, 2014 at 6:26 PM
>Subject: worker always timeout to heartbeat and was restarted by supervisor
>To: "[email protected]" <[email protected]>
>
>
>hi, all
>
>In our production environment, we have a topology named logparser_mobile_nginx. It has 50 workers; the spout has 48 executors, bolt_parser has 1000 executors, and bolt_saver has 50 executors.
>
>The topology runs normally most of the time, but 1~5 workers get restarted every 1~2 hours. When we look at the supervisor and worker logs, we find that 1) the worker has no errors or exceptions; 2) the supervisor says the worker did not heartbeat and timed out.
>
>Because the worker leaves no log, I do not know why it did not heartbeat. Does anyone have any ideas on how to investigate?
>0) Is it caused by the worker process exiting?
>1) Is it related to a GC problem?
>2) Is it related to a memory problem? If so, I would expect the JVM to report a memory exception in the worker log.
>
>By the way, some small topologies work well in the same environment.
>
>Below is the supervisor log:
>----------------------------------------------------------------------------------------------
>2014-08-20 15:51:33 b.s.d.supervisor [INFO] 90facad7-c666-41da-b7c5-f147ebe35542 still hasn't started
>
>2014-08-20 16:01:16 b.s.d.supervisor [INFO] Shutting down and clearing state for id c7e8d375-db76-4e29-8019-e783ab3cd6de. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521645, :storm-id "logparser_mobile_nginx-259-1408518662", :executors #{[4 4] [104 104] [204 204] [54 54] [154 154] [-1 -1]}, :port 9714}
>2014-08-20 16:01:16 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:c7e8d375-db76-4e29-8019-e783ab3cd6de
>2014-08-20 16:01:17 b.s.util [INFO] Error when trying to kill 44901. Process is probably already dead.
>2014-08-20 16:01:17 b.s.util [INFO] Error when trying to kill 44921. Process is probably already dead.
>2014-08-20 16:01:17 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:c7e8d375-db76-4e29-8019-e783ab3cd6de
>
>2014-08-20 16:01:17 b.s.d.supervisor [INFO] Shutting down and clearing state for id d5a8d578-89ff-4a50-a906-75e847ac63a1. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521645, :storm-id "logparser_nginx-265-1408521077", :executors #{[50 50] [114 114] [178 178] [-1 -1]}, :port 9700}
>2014-08-20 16:01:17 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:d5a8d578-89ff-4a50-a906-75e847ac63a1
>2014-08-20 16:01:18 b.s.util [INFO] Error when trying to kill 48068. Process is probably already dead.
>2014-08-20 16:01:18 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:d5a8d578-89ff-4a50-a906-75e847ac63a1
>
>2014-08-20 16:01:18 b.s.d.supervisor [INFO] Shutting down and clearing state for id 5154f643-cd79-4119-9368-153f1bede757. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521644, :storm-id "logparser_mobile_nginx-259-1408518662", :executors #{[98 98] [198 198] [48 48] [148 148] [248 248] [-1 -1]}, :port 9720}
>2014-08-20 16:01:18 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:5154f643-cd79-4119-9368-153f1bede757
>2014-08-20 16:01:19 b.s.util [INFO] Error when trying to kill 44976. Process is probably already dead.
>2014-08-20 16:01:19 b.s.util [INFO] Error when trying to kill 44986. Process is probably already dead.
>2014-08-20 16:01:19 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:5154f643-cd79-4119-9368-153f1bede757
>
>2014-08-20 16:01:19 b.s.d.supervisor [INFO] Shutting down and clearing state for id fe9f656a-1f8b-4525-ba89-bbe65fbdb0ba. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521644, :storm-id "app_upload_urls-218-1408503096", :executors #{[8 8] [40 40] [24 24] [-1 -1]}, :port 9713}
>2014-08-20 16:01:19 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:fe9f656a-1f8b-4525-ba89-bbe65fbdb0ba
>2014-08-20 16:01:20 b.s.util [INFO] Error when trying to kill 43177. Process is probably already dead.
>2014-08-20 16:01:20 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:fe9f656a-1f8b-4525-ba89-bbe65fbdb0ba
>
>--
>dashengju
>+86 13810875910
>[email protected]
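
To follow up on the GC question above in a concrete way: you can turn on GC logging for the workers and compare pause times against the supervisor's heartbeat timeout. The snippet below is only a sketch, assuming a 0.9.x-era storm.yaml and a HotSpot JDK 7/8 worker JVM; the heap size and log path are placeholders to adapt, and the %ID% substitution (worker port) should be verified against your Storm version:

    worker.childopts: "-Xmx2048m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/storm/worker-gc-%ID%.log"

    # The supervisor marks a worker :timed-out when it misses heartbeats for this long
    # (default 30 seconds). Raising it only hides long pauses, it does not remove them.
    supervisor.worker.timeout.secs: 30

If the GC log shows stop-the-world pauses approaching supervisor.worker.timeout.secs just before the supervisor prints ":timed-out", the restarts are GC-driven, and the fix is heap/GC tuning or spreading load over more workers rather than a longer timeout.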

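On question 0) above (whether the worker process actually exited): the supervisor log records the PIDs it tried to kill (44901, 44921, 48068, 44976, 44986, 43177), so you can look for crash evidence for those PIDs right after a restart. This is a sketch assuming a Linux supervisor host; the directories searched are examples, not Storm defaults:

    # Was a worker JVM killed by the kernel OOM killer around the restart time (16:01)?
    dmesg | grep -i -E 'killed process|out of memory'

    # Did the JVM crash on its own? HotSpot writes hs_err_pid<PID>.log into the process's
    # working directory; search under your Storm install dir and storm.local.dir.
    find /opt/storm /var/storm -name 'hs_err_pid*.log' 2>/dev/null

    # For a worker that is still running but looks stuck, sample its GC activity once a second.
    jstat -gcutil <worker-pid> 1000

If none of these show anything and the worker log simply stops, that points back at a long GC pause or the machine being starved of CPU or disk, as described at the top of this thread.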