Yes you are correct, I was assuming that streaming was not being used.

On 8/29/14, 4:13 AM, "DashengJu" <[email protected]> wrote:
>"Because the supervisor's kill complains that the process is already dead, it
>indicates to me that the process crashed for some reason, and you need to
>look at the worker logs to understand why it crashed."
>
>------------------------------------------------------------------------------
>@Evans, this does not indicate that the process crashed. The worker spawns
>some Python subprocesses; when the supervisor kills the worker it may kill
>the Java process first, and the Python subprocesses exit at the same time,
>so when the supervisor then tries to kill the subprocess pids it finds the
>processes have already exited.
>
>
>On Thu, Aug 28, 2014 at 10:28 PM, Bobby Evans <[email protected]> wrote:
>
>> Sorry I have not responded sooner, I am trying to catch up on the mailing
>> list.
>>
>> In my experience there are only two times when a worker times out to the
>> supervisor. One is when it has died, and the other is when it is stuck
>> doing GC and cannot get enough CPU to write out a small file. I have seen
>> another really rare case, when the disk filled up and a lot of bad things
>> started to happen.
>>
>> Because the supervisor's kill complains that the process is already dead,
>> it indicates to me that the process crashed for some reason, and you need
>> to look at the worker logs to understand why it crashed.
>>
>> - Bobby
>>
>> On 8/20/14, 5:29 AM, "DashengJu" <[email protected]> wrote:
>>
>>> ---------- Forwarded message ----------
>>> From: DashengJu <[email protected]>
>>> Date: Wed, Aug 20, 2014 at 6:26 PM
>>> Subject: worker always timeout to heartbeat and was restarted by supervisor
>>> To: "[email protected]" <[email protected]>
>>>
>>> hi, all
>>>
>>> In our production environment we have a topology named
>>> logparser_mobile_nginx. It has 50 workers; the spout has 48 executors,
>>> bolt_parser has 1000 executors, and bolt_saver has 50 executors.
>>>
>>> The topology runs normally most of the time, but 1~5 workers are
>>> restarted every 1~2 hours. When we look at the supervisor and worker
>>> logs, we find that 1) the worker has no error or exception; 2) the
>>> supervisor says the worker did not heartbeat and a timeout happened.
>>>
>>> Because the worker has no log, I do not know why it did not heartbeat.
>>> Does anyone have ideas on how to investigate?
>>> 0) Is it caused by the worker exiting?
>>> 1) Is it related to a GC problem?
>>> 2) Is it related to a memory problem? If so, I would expect the JVM to
>>> report a memory exception in the worker log.
>>>
>>> By the way, some small topologies work well in the same environment.
>>>
>>> Below is the supervisor log:
>>> ------------------------------------------------------------------------------
>>> 2014-08-20 15:51:33 b.s.d.supervisor [INFO] 90facad7-c666-41da-b7c5-f147ebe35542 still hasn't started
>>>
>>> 2014-08-20 16:01:16 b.s.d.supervisor [INFO] Shutting down and clearing state for id c7e8d375-db76-4e29-8019-e783ab3cd6de. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521645, :storm-id "logparser_mobile_nginx-259-1408518662", :executors #{[4 4] [104 104] [204 204] [54 54] [154 154] [-1 -1]}, :port 9714}
>>> 2014-08-20 16:01:16 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:c7e8d375-db76-4e29-8019-e783ab3cd6de
>>> 2014-08-20 16:01:17 b.s.util [INFO] Error when trying to kill 44901. Process is probably already dead.
>>> 2014-08-20 16:01:17 b.s.util [INFO] Error when trying to kill 44921. Process is probably already dead.
>>> 2014-08-20 16:01:17 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:c7e8d375-db76-4e29-8019-e783ab3cd6de
>>>
>>> 2014-08-20 16:01:17 b.s.d.supervisor [INFO] Shutting down and clearing state for id d5a8d578-89ff-4a50-a906-75e847ac63a1. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521645, :storm-id "logparser_nginx-265-1408521077", :executors #{[50 50] [114 114] [178 178] [-1 -1]}, :port 9700}
>>> 2014-08-20 16:01:17 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:d5a8d578-89ff-4a50-a906-75e847ac63a1
>>> 2014-08-20 16:01:18 b.s.util [INFO] Error when trying to kill 48068. Process is probably already dead.
>>> 2014-08-20 16:01:18 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:d5a8d578-89ff-4a50-a906-75e847ac63a1
>>>
>>> 2014-08-20 16:01:18 b.s.d.supervisor [INFO] Shutting down and clearing state for id 5154f643-cd79-4119-9368-153f1bede757. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521644, :storm-id "logparser_mobile_nginx-259-1408518662", :executors #{[98 98] [198 198] [48 48] [148 148] [248 248] [-1 -1]}, :port 9720}
>>> 2014-08-20 16:01:18 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:5154f643-cd79-4119-9368-153f1bede757
>>> 2014-08-20 16:01:19 b.s.util [INFO] Error when trying to kill 44976. Process is probably already dead.
>>> 2014-08-20 16:01:19 b.s.util [INFO] Error when trying to kill 44986. Process is probably already dead.
>>> 2014-08-20 16:01:19 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:5154f643-cd79-4119-9368-153f1bede757
>>>
>>> 2014-08-20 16:01:19 b.s.d.supervisor [INFO] Shutting down and clearing state for id fe9f656a-1f8b-4525-ba89-bbe65fbdb0ba. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521644, :storm-id "app_upload_urls-218-1408503096", :executors #{[8 8] [40 40] [24 24] [-1 -1]}, :port 9713}
>>> 2014-08-20 16:01:19 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:fe9f656a-1f8b-4525-ba89-bbe65fbdb0ba
>>> 2014-08-20 16:01:20 b.s.util [INFO] Error when trying to kill 43177. Process is probably already dead.
>>> 2014-08-20 16:01:20 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:fe9f656a-1f8b-4525-ba89-bbe65fbdb0ba
>>>
>>> --
>>> dashengju
>>> +86 13810875910
>>> [email protected]
>
>
>--
>dashengju
>+86 13810875910
>[email protected]
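A practical first step toward answering questions 1) and 2) above is to make GC pauses visible in the workers themselves, since a long stop-the-world collection would delay the heartbeat write Bobby describes without leaving any error in the worker log. Below is a minimal sketch against the Storm 0.9.x Java API; the heap size and GC log path are illustrative placeholders rather than values taken from this thread, and the snippet belongs in the code that builds and submits the topology:

    import backtype.storm.Config;

    // Sketch: append GC logging options to this topology's worker JVMs so that
    // long stop-the-world pauses show up next to the heartbeat timeouts.
    Config conf = new Config();
    conf.setNumWorkers(50);  // matches the 50-worker topology described above
    conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS,
        "-Xmx2g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
        + " -XX:+HeapDumpOnOutOfMemoryError -Xloggc:/tmp/storm-worker-gc.log");

If the GC log then shows pauses longer than supervisor.worker.timeout.secs (30 seconds by default), the supervisor will mark the worker :timed-out exactly as in the log above, even though the worker never crashed.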

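On the subprocess point: with multilang the worker JVM launches the Python process itself, so the two live and die together. A minimal sketch of that wiring, assuming the Storm 0.9.x ShellBolt API; the bolt class, script name, and output field are hypothetical, chosen only to mirror the bolt_parser component mentioned above:

    import java.util.Map;
    import backtype.storm.task.ShellBolt;
    import backtype.storm.topology.IRichBolt;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.tuple.Fields;

    // Sketch: a multilang bolt. ShellBolt forks "python parser_bolt.py" inside
    // the worker process; when the supervisor kills the JVM the Python child
    // exits with it, so a later attempt to kill the recorded subprocess pid
    // logs "Process is probably already dead" without any crash having occurred.
    public class ParserBolt extends ShellBolt implements IRichBolt {
        public ParserBolt() {
            super("python", "parser_bolt.py");  // hypothetical script name
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("parsed_line"));  // illustrative field
        }

        @Override
        public Map<String, Object> getComponentConfiguration() {
            return null;
        }
    }

This matches the behavior DashengJu describes: the extra pids in the supervisor log are the worker JVM plus any multilang children, and the "already dead" message by itself is expected rather than evidence of a crash.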