Thanks for your reply. We have solved the problem. The main cause appears to have been GC: we changed the worker.childopts parameters and the problem has not happened since.
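In case it is useful to anyone else hitting this, here is a minimal sketch of how such options are wired in through storm.yaml (the flag list below is abbreviated and the heap sizes are specific to our machines; Storm expands %ID% to the worker's port, and topology.worker.childopts can override this per topology):

    # storm.yaml -- JVM options applied to every worker the supervisor launches
    # (abbreviated sketch; our full flag set is listed below)
    worker.childopts: "-Xmx2600m -Xms2600m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Xloggc:logs/gc-worker-%ID%.log -verbose:gc"

The full before/after flag sets are below; there is also a short note about the supervisor-side timeout after the quoted thread.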
------------------------------------------------
before config:
-Xmx1024m
------------------------------------------------
after config:
-Xmx2600m -Xms2600m -Xss256k -XX:MaxPermSize=128m -XX:PermSize=96m
-XX:NewSize=1000m -XX:MaxNewSize=1000m -XX:MaxTenuringThreshold=1
-XX:SurvivorRatio=6 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly -server -XX:+AggressiveOpts
-XX:+UseCompressedOops -Djava.awt.headless=true
-Djava.net.preferIPv4Stack=true -Xloggc:logs/gc-worker-%ID%.log
-verbose:gc -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=10m -XX:+PrintGCDetails -XX:+PrintHeapAtGC
-XX:+PrintGCTimeStamps -XX:+PrintClassHistogram
-XX:+PrintTenuringDistribution -XX:-PrintGCApplicationStoppedTime
-XX:-PrintGCApplicationConcurrentTime -XX:+PrintCommandLineFlags
-XX:+PrintFlagsFinal

--
dashengju
+86 13810875910
[email protected]

On Thu, Aug 28, 2014 at 10:28 PM, Bobby Evans <[email protected]> wrote:

> Sorry I have not responded sooner; I am trying to catch up on the mailing
> list.
>
> In my experience there are only two times when a worker times out to the
> supervisor. One is when it has died, and the other is when it is stuck
> doing GC and cannot get enough CPU to write out a small file. I have seen
> another really rare case where the disk filled up and a lot of bad things
> started to happen.
>
> Because the supervisor's kill attempt complains that the process is
> already dead, it indicates to me that the process crashed for some reason,
> and you need to look at the worker logs to understand why it crashed.
>
> - Bobby
>
> On 8/20/14, 5:29 AM, "DashengJu" <[email protected]> wrote:
>
> >---------- Forwarded message ----------
> >From: DashengJu <[email protected]>
> >Date: Wed, Aug 20, 2014 at 6:26 PM
> >Subject: worker always timeout to heartbeat and was restarted by
> >supervisor
> >To: "[email protected]" <[email protected]>
> >
> >hi, all
> >
> >In our production environment, we have a topology named
> >logparser_mobile_nginx; it has 50 workers, the spout has 48 executors,
> >bolt_parser has 1000 executors, and bolt_saver has 50 executors.
> >
> >The topology runs normally most of the time, but 1~5 workers are
> >restarted every 1~2 hours. Looking at the supervisor and worker logs, we
> >found that 1) the worker logged no error or exception, and 2) the
> >supervisor said the worker did not heartbeat and timed out.
> >
> >Because the worker logged nothing, I do not know why it did not
> >heartbeat. Does anyone have ideas on how to investigate?
> >0) is it caused by the worker process exiting?
> >1) is it related to a GC problem?
> >2) is it related to a memory problem? If so, I would expect the JVM to
> >report an OutOfMemoryError in the worker log.
> >
> >By the way, some small topologies work well in the same environment.
> >
> >Below is the supervisor log:
> >--------------------------------------------------------------------------
> >2014-08-20 15:51:33 b.s.d.supervisor [INFO]
> >90facad7-c666-41da-b7c5-f147ebe35542 still hasn't started
> >
> >2014-08-20 16:01:16 b.s.d.supervisor [INFO] Shutting down and clearing
> >state for id c7e8d375-db76-4e29-8019-e783ab3cd6de. Current supervisor
> >time: 1408521676. State: :timed-out, Heartbeat:
> >#backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521645,
> >:storm-id "logparser_mobile_nginx-259-1408518662", :executors #{[4 4]
> >[104 104] [204 204] [54 54] [154 154] [-1 -1]}, :port 9714}
> >2014-08-20 16:01:16 b.s.d.supervisor [INFO] Shutting down
> >6a522a57-cb0b-4a78-8b76-89f23604bf6f:c7e8d375-db76-4e29-8019-e783ab3cd6de
> >2014-08-20 16:01:17 b.s.util [INFO] Error when trying to kill 44901.
> >Process is probably already dead.
> >2014-08-20 16:01:17 b.s.util [INFO] Error when trying to kill 44921.
> >Process is probably already dead.
> >2014-08-20 16:01:17 b.s.d.supervisor [INFO] Shut down
> >6a522a57-cb0b-4a78-8b76-89f23604bf6f:c7e8d375-db76-4e29-8019-e783ab3cd6de
> >
> >2014-08-20 16:01:17 b.s.d.supervisor [INFO] Shutting down and clearing
> >state for id d5a8d578-89ff-4a50-a906-75e847ac63a1. Current supervisor
> >time: 1408521676. State: :timed-out, Heartbeat:
> >#backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521645,
> >:storm-id "logparser_nginx-265-1408521077", :executors #{[50 50]
> >[114 114] [178 178] [-1 -1]}, :port 9700}
> >2014-08-20 16:01:17 b.s.d.supervisor [INFO] Shutting down
> >6a522a57-cb0b-4a78-8b76-89f23604bf6f:d5a8d578-89ff-4a50-a906-75e847ac63a1
> >2014-08-20 16:01:18 b.s.util [INFO] Error when trying to kill 48068.
> >Process is probably already dead.
> >2014-08-20 16:01:18 b.s.d.supervisor [INFO] Shut down
> >6a522a57-cb0b-4a78-8b76-89f23604bf6f:d5a8d578-89ff-4a50-a906-75e847ac63a1
> >
> >2014-08-20 16:01:18 b.s.d.supervisor [INFO] Shutting down and clearing
> >state for id 5154f643-cd79-4119-9368-153f1bede757. Current supervisor
> >time: 1408521676. State: :timed-out, Heartbeat:
> >#backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521644,
> >:storm-id "logparser_mobile_nginx-259-1408518662", :executors #{[98 98]
> >[198 198] [48 48] [148 148] [248 248] [-1 -1]}, :port 9720}
> >2014-08-20 16:01:18 b.s.d.supervisor [INFO] Shutting down
> >6a522a57-cb0b-4a78-8b76-89f23604bf6f:5154f643-cd79-4119-9368-153f1bede757
> >2014-08-20 16:01:19 b.s.util [INFO] Error when trying to kill 44976.
> >Process is probably already dead.
> >2014-08-20 16:01:19 b.s.util [INFO] Error when trying to kill 44986.
> >Process is probably already dead.
> >2014-08-20 16:01:19 b.s.d.supervisor [INFO] Shut down
> >6a522a57-cb0b-4a78-8b76-89f23604bf6f:5154f643-cd79-4119-9368-153f1bede757
> >
> >2014-08-20 16:01:19 b.s.d.supervisor [INFO] Shutting down and clearing
> >state for id fe9f656a-1f8b-4525-ba89-bbe65fbdb0ba. Current supervisor
> >time: 1408521676. State: :timed-out, Heartbeat:
> >#backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1408521644,
> >:storm-id "app_upload_urls-218-1408503096", :executors #{[8 8] [40 40]
> >[24 24] [-1 -1]}, :port 9713}
> >2014-08-20 16:01:19 b.s.d.supervisor [INFO] Shutting down
> >6a522a57-cb0b-4a78-8b76-89f23604bf6f:fe9f656a-1f8b-4525-ba89-bbe65fbdb0ba
> >2014-08-20 16:01:20 b.s.util [INFO] Error when trying to kill 43177.
> >Process is probably already dead.
> >2014-08-20 16:01:20 b.s.d.supervisor [INFO] Shut down
> >6a522a57-cb0b-4a78-8b76-89f23604bf6f:fe9f656a-1f8b-4525-ba89-bbe65fbdb0ba
> >
> >--
> >dashengju
> >+86 13810875910
> >[email protected]
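P.S. One more knob related to Bobby's point about GC pauses and heartbeats, for anyone who cannot tune the pauses away entirely: the supervisor-side timeout itself is configurable in storm.yaml. We did not need to change it (the heap and GC settings above were enough), and 30 is, as far as I know, the default, so treat this as a sketch rather than a recommendation:

    # storm.yaml -- seconds without a worker heartbeat before the supervisor
    # declares the worker :timed-out and restarts it (default 30; raising it
    # also delays recovery of genuinely dead workers)
    supervisor.worker.timeout.secs: 30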
