Yes you are correct, I was assuming that streaming was not being used.

On 8/29/14, 4:13 AM, "DashengJu" <[email protected]> wrote:
>"Because the supervisor's kill complains that the process is already dead, it
>indicates to me that the process crashed for some reason, and you need to
>look at the worker logs to understand why it crashed."
>
>------------------------------------------------------------------------------
>@Evans, this does not indicate that the process crashed. The worker spawns
>some Python subprocesses; when the supervisor kills the worker it may kill
>the Java process first, and the Python subprocesses exit at the same time,
>so when the supervisor then tries to kill the subprocess pids it finds the
>processes have already exited.
>
>
>On Thu, Aug 28, 2014 at 10:28 PM, Bobby Evans <[email protected]> wrote:
>
>> Sorry I have not responded sooner, I am trying to catch up on the mailing
>> list.
>>
>> In my experience there are only two times when a worker times out to the
>> supervisor. One is when it has died, and the other is when it is stuck
>> doing GC and cannot get enough CPU to write out a small file. I have seen
>> another really rare case, when the disk filled up and a lot of bad things
>> started to happen.
>>
>> Because the supervisor's kill complains that the process is already dead,
>> it indicates to me that the process crashed for some reason, and you need
>> to look at the worker logs to understand why it crashed.
>>
>> - Bobby
>>
>> On 8/20/14, 5:29 AM, "DashengJu" <[email protected]> wrote:
>>
>>> ---------- Forwarded message ----------
>>> From: DashengJu <[email protected]>
>>> Date: Wed, Aug 20, 2014 at 6:26 PM
>>> Subject: worker always timeout to heartbeat and was restarted by supervisor
>>> To: "[email protected]" <[email protected]>
>>>
>>> hi, all
>>>
>>> In our production environment we have a topology named
>>> logparser_mobile_nginx. It has 50 workers; the spout has 48 executors,
>>> bolt_parser has 1000 executors, and bolt_saver has 50 executors.
>>>
>>> The topology runs normally most of the time, but 1~5 workers are
>>> restarted every 1~2 hours. When we look at the supervisor and worker
>>> logs, we find that 1) the worker has no error or exception; 2) the
>>> supervisor says the worker did not heartbeat and a timeout happened.
>>>
>>> Because the worker has no log, I do not know why it did not heartbeat.
>>> Does anyone have ideas on how to investigate?
>>> 0) Is it caused by the worker exiting?
>>> 1) Is it related to a GC problem?
>>> 2) Is it related to a memory problem? If so, I would expect the JVM to
>>> report a memory exception in the worker log.
>>>
>>> By the way, some small topologies work well in the same environment.
>>>
>>> Below is the supervisor log:
>>> ------------------------------------------------------------------------------
>>> 2014-08-20 15:51:33 b.s.d.supervisor [INFO] 90facad7-c666-41da-b7c5-f147ebe35542 still hasn't started
>>>
>>> 2014-08-20 16:01:16 b.s.d.supervisor [INFO] Shutting down and clearing state for id c7e8d375-db76-4e29-8019-e783ab3cd6de. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521645, :storm-id "logparser_mobile_nginx-259-1408518662", :executors #{[4 4] [104 104] [204 204] [54 54] [154 154] [-1 -1]}, :port 9714}
>>> 2014-08-20 16:01:16 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:c7e8d375-db76-4e29-8019-e783ab3cd6de
>>> 2014-08-20 16:01:17 b.s.util [INFO] Error when trying to kill 44901. Process is probably already dead.
>>> 2014-08-20 16:01:17 b.s.util [INFO] Error when trying to kill 44921. Process is probably already dead.
>>> 2014-08-20 16:01:17 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:c7e8d375-db76-4e29-8019-e783ab3cd6de
>>>
>>> 2014-08-20 16:01:17 b.s.d.supervisor [INFO] Shutting down and clearing state for id d5a8d578-89ff-4a50-a906-75e847ac63a1. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521645, :storm-id "logparser_nginx-265-1408521077", :executors #{[50 50] [114 114] [178 178] [-1 -1]}, :port 9700}
>>> 2014-08-20 16:01:17 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:d5a8d578-89ff-4a50-a906-75e847ac63a1
>>> 2014-08-20 16:01:18 b.s.util [INFO] Error when trying to kill 48068. Process is probably already dead.
>>> 2014-08-20 16:01:18 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:d5a8d578-89ff-4a50-a906-75e847ac63a1
>>>
>>> 2014-08-20 16:01:18 b.s.d.supervisor [INFO] Shutting down and clearing state for id 5154f643-cd79-4119-9368-153f1bede757. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521644, :storm-id "logparser_mobile_nginx-259-1408518662", :executors #{[98 98] [198 198] [48 48] [148 148] [248 248] [-1 -1]}, :port 9720}
>>> 2014-08-20 16:01:18 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:5154f643-cd79-4119-9368-153f1bede757
>>> 2014-08-20 16:01:19 b.s.util [INFO] Error when trying to kill 44976. Process is probably already dead.
>>> 2014-08-20 16:01:19 b.s.util [INFO] Error when trying to kill 44986. Process is probably already dead.
>>> 2014-08-20 16:01:19 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:5154f643-cd79-4119-9368-153f1bede757
>>>
>>> 2014-08-20 16:01:19 b.s.d.supervisor [INFO] Shutting down and clearing state for id fe9f656a-1f8b-4525-ba89-bbe65fbdb0ba. Current supervisor time: 1408521676. State: :timed-out, Heartbeat: #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521644, :storm-id "app_upload_urls-218-1408503096", :executors #{[8 8] [40 40] [24 24] [-1 -1]}, :port 9713}
>>> 2014-08-20 16:01:19 b.s.d.supervisor [INFO] Shutting down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:fe9f656a-1f8b-4525-ba89-bbe65fbdb0ba
>>> 2014-08-20 16:01:20 b.s.util [INFO] Error when trying to kill 43177. Process is probably already dead.
>>> 2014-08-20 16:01:20 b.s.d.supervisor [INFO] Shut down 6a522a57-cb0b-4a78-8b76-89f23604bf6f:fe9f656a-1f8b-4525-ba89-bbe65fbdb0ba
>>>
>>> --
>>> dashengju
>>> +86 13810875910
>>> [email protected]
>
>
>--
>dashengju
>+86 13810875910
>[email protected]
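A practical first step toward answering questions 1) and 2) above is to make GC pauses visible in the workers themselves, since a long stop-the-world collection would delay the heartbeat write Bobby describes without leaving any error in the worker log. Below is a minimal sketch against the Storm 0.9.x Java API; the heap size and GC log path are illustrative placeholders rather than values taken from this thread, and the snippet belongs in the code that builds and submits the topology:

    import backtype.storm.Config;

    // Sketch: append GC logging options to this topology's worker JVMs so that
    // long stop-the-world pauses show up next to the heartbeat timeouts.
    Config conf = new Config();
    conf.setNumWorkers(50);  // matches the 50-worker topology described above
    conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS,
        "-Xmx2g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
        + " -XX:+HeapDumpOnOutOfMemoryError -Xloggc:/tmp/storm-worker-gc.log");

If the GC log then shows pauses longer than supervisor.worker.timeout.secs (30 seconds by default), the supervisor will mark the worker :timed-out exactly as in the log above, even though the worker never crashed.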

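On the subprocess point: with multilang the worker JVM launches the Python process itself, so the two live and die together. A minimal sketch of that wiring, assuming the Storm 0.9.x ShellBolt API; the bolt class, script name, and output field are hypothetical, chosen only to mirror the bolt_parser component mentioned above:

    import java.util.Map;
    import backtype.storm.task.ShellBolt;
    import backtype.storm.topology.IRichBolt;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.tuple.Fields;

    // Sketch: a multilang bolt. ShellBolt forks "python parser_bolt.py" inside
    // the worker process; when the supervisor kills the JVM the Python child
    // exits with it, so a later attempt to kill the recorded subprocess pid
    // logs "Process is probably already dead" without any crash having occurred.
    public class ParserBolt extends ShellBolt implements IRichBolt {
        public ParserBolt() {
            super("python", "parser_bolt.py");  // hypothetical script name
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("parsed_line"));  // illustrative field
        }

        @Override
        public Map<String, Object> getComponentConfiguration() {
            return null;
        }
    }

This matches the behavior DashengJu describes: the extra pids in the supervisor log are the worker JVM plus any multilang children, and the "already dead" message by itself is expected rather than evidence of a crash.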