[jira] [Created] (STORM-145) worker and supervisor heartbeat to nimbus using socket instead of write zookeeper node

James Xu (JIRA) Sat, 14 Dec 2013 22:19:40 -0800

James Xu created STORM-145:
------------------------------

             Summary: worker and supervisor heartbeat to nimbus using socket 
instead of write zookeeper node
                 Key: STORM-145
                 URL: https://issues.apache.org/jira/browse/STORM-145
             Project: Apache Storm (Incubating)
          Issue Type: Improvement
            Reporter: James Xu
            Priority: Minor



https://github.com/nathanmarz/storm/issues/732

When a Storm cluster manages thousands of executors, ZooKeeper comes under 
great pressure and becomes very slow to respond to reads and writes. Nimbus then 
judges executors to have timed out and reassigns them, and the whole cluster 
falls into an endless reassignment loop. Having workers and supervisors 
heartbeat to nimbus over a socket would resolve this problem.

----------
d2r: Yes, ZK has trouble keeping up sometimes. See #620.

Hopefully #706 would help. We have well over 1k workers on a number of storm 
clusters with this patch, and we no longer see this mass-reassignment happening.

----------
revans2: We also had to tune the FileSystem ZK uses to not be as safe in power 
failures as it otherwise would be. On ext4 we turned off the barrier by 
remounting the disk with -o nobarrier. You also want to make sure that your 
disk's cache is enabled. Unless you have a high end RAID controller with a 
battery backed cache most admins disable the disk cache on DB boxes to be sure 
that data is not lost in the case of a power outage. For us the added 
performance is worth the risk of data loss.

----------
xiaokang: Storm daemons (nimbus, supervisor, worker, etc.) are designed to be 
stateless, so all state, including heartbeats, is stored in ZK. Workers will 
continue to work on nimbus failure. If supervisors and workers heartbeat 
directly to nimbus, it may be hard to keep this nice feature.

----------
viceyang: @xiaokang Socket heartbeats do not break the stateless feature. If a 
worker (or supervisor) fails to heartbeat to nimbus, the worker catches the 
exception and goes on working. When nimbus restarts, heartbeats will succeed again.
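
The behaviour described above (a failed heartbeat is caught and the worker keeps running) can be sketched roughly as follows. This is only an illustration, not Storm code: the function name `send_heartbeat`, the JSON payload, and the one-connection-per-heartbeat approach are all assumptions made for the example.

```python
import json
import socket
import time


def send_heartbeat(host, port, worker_id, timeout=1.0):
    """Send one heartbeat to a (hypothetical) nimbus socket endpoint.

    Returns True if the heartbeat was sent, False if nimbus was unreachable.
    The payload format here is invented for illustration.
    """
    payload = json.dumps({"worker": worker_id, "time": time.time()})
    data = payload.encode("utf-8") + b"\n"
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(data)
        return True
    except OSError:
        # Nimbus is down or unreachable: swallow the error so the worker
        # keeps processing tuples and simply retries on the next interval.
        return False
```

A worker would call this on a timer and ignore a False result, so a nimbus outage only means missed heartbeats until nimbus comes back.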

PS: nimbus HA is another feature; I am also working on it now.

@d2r #706 seems to work well, but using ZK for heartbeats does not seem 
necessary. Heartbeat and stats information is only a snapshot and does not need 
to be persisted; socket heartbeats, with the stats information held in nimbus 
memory, would meet our needs. On another note, our cluster has exceeded 100,000 
executors, and a 5-node ZK ensemble can't work well.

PS: the key reason ZK load is too high is that the executors' stats information 
becomes too large when there are too many executors. Storing heartbeat and stats 
information in nimbus' memory is a good way to handle this.
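
As a rough sketch of what keeping heartbeats and stats in nimbus' memory could look like, the table below stores only the latest snapshot per executor and reports the ones that have gone quiet. The class name `HeartbeatTable`, its methods, and the timeout handling are hypothetical and not taken from Storm.

```python
import time


class HeartbeatTable:
    """In-memory map of executor id -> (last heartbeat time, latest stats)."""

    def __init__(self, timeout_secs, clock=time.monotonic):
        self._timeout = timeout_secs
        self._clock = clock  # injectable so the table can be tested
        self._last_seen = {}

    def record(self, executor_id, stats=None):
        # Overwrite the previous snapshot; nothing is persisted.
        self._last_seen[executor_id] = (self._clock(), stats)

    def timed_out(self):
        # Executors whose last heartbeat is older than the timeout.
        now = self._clock()
        return sorted(
            executor_id
            for executor_id, (seen, _stats) in self._last_seen.items()
            if now - seen > self._timeout
        )
```

One consequence of this design is that the table is empty after a nimbus restart, so nimbus would presumably need to wait at least one timeout period before treating unseen executors as dead.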

----------
d2r:
> @d2r seems #706 work well, but using zk do heartbeart seems not nessary,

@vinceyang Yes, I agree: There is discussion about this already --> #620.




--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
