James Xu created STORM-145:
------------------------------
Summary: worker and supervisor heartbeat to nimbus using socket
instead of write zookeeper node
Key: STORM-145
URL: https://issues.apache.org/jira/browse/STORM-145
Project: Apache Storm (Incubating)
Issue Type: Improvement
Reporter: James Xu
Priority: Minor
https://github.com/nathanmarz/storm/issues/732
When a Storm cluster manages thousands of executors, ZooKeeper comes under
great pressure and responds very slowly to reads and writes. Nimbus then judges
executors to have timed out and reassigns them, and the whole cluster falls into
a dead loop. If workers and supervisors heartbeat to Nimbus over a socket, this
problem will be resolved.
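
The failure mode described above can be sketched roughly as follows. This is
only an illustration, not Storm's actual code: the function name
find_timed_out, the 30-second timeout, and the dictionary layout are all
assumptions made for the example. It shows why a heartbeat that is merely
delayed by a slow store looks the same as a heartbeat from a dead executor.

```python
import time

# Hypothetical timeout; Storm's real setting is not taken from this thread.
HEARTBEAT_TIMEOUT_SECS = 30.0


def find_timed_out(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT_SECS):
    """Return the executor ids whose last recorded heartbeat is older than `timeout`."""
    return sorted(
        executor_id
        for executor_id, stamp in last_heartbeat.items()
        if now - stamp > timeout
    )


if __name__ == "__main__":
    now = time.time()
    heartbeats = {
        "executor-1": now - 5.0,   # heartbeat written recently
        "executor-2": now - 45.0,  # alive, but its write was delayed by a slow store
    }
    # executor-2 is reported as timed out even though it is still running.
    print(find_timed_out(heartbeats, now))
```

Every executor flagged this way gets reassigned, which adds more load to the
store and can delay further heartbeats.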
----------
d2r: Yes, ZK has trouble keeping up sometimes. See #620.
Hopefully #706 would help. We have well over 1k workers on a number of storm
clusters with this patch, and we no longer see this mass-reassignment happening.
----------
revans2: We also had to tune the FileSystem ZK uses to not be as safe in power
failures as it otherwise would be. On ext4 we turned off the barrier by
remounting the disk with -o nobarrier. You also want to make sure that your
disk's cache is enabled. Unless you have a high end RAID controller with a
battery backed cache most admins disable the disk cache on DB boxes to be sure
that data is not lost in the case of a power outage. For us the added
performance is worth the risk of data loss.
----------
xiaokang: Storm daemons (nimbus, supervisor, worker, etc.) are designed to be
stateless, so all state, including heartbeats, is stored in ZK. Workers will
continue to work on nimbus failure. If supervisors and workers heartbeat
directly to nimbus, it may be hard to keep this nice feature.
----------
viceyang: @xiaokang Socket heartbeats do not break the stateless feature. If a
worker's (or supervisor's) heartbeat to nimbus fails, the worker catches the
exception and keeps working. When nimbus restarts, heartbeats will succeed again.
PS: nimbus HA is another feature; I am also working on it now.
@d2r It seems #706 works well, but using ZK for heartbeats seems unnecessary.
Heartbeat and stats information is only a snapshot and does not need to be
persisted; sending heartbeats over a socket and keeping heartbeat and stats
information in nimbus memory meets the need. On the other hand, our cluster has
exceeded 100,000 executors, and a 5-node ZK cannot keep up.
PS: the key reason ZK load is too high is that executor stats information
becomes too large when there are too many executors. Storing heartbeat and stats
information in nimbus' memory is a good way.
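
A minimal sketch of the approach proposed in this comment, under stated
assumptions: the names HeartbeatStore and send_heartbeat, the newline-delimited
JSON message format, and the 2-second socket timeout are all hypothetical and
are not part of Storm. The sender swallows connection errors so that a worker
keeps running while nimbus is down, and the store keeps only the latest
snapshot per sender in memory.

```python
import json
import socket
import threading
import time


class HeartbeatStore:
    """In-memory map of sender id -> (receive time, stats), as nimbus might keep it."""

    def __init__(self):
        self._lock = threading.Lock()
        self._beats = {}

    def record(self, sender_id, stats, now=None):
        # Only the latest snapshot per sender is kept; nothing is persisted.
        stamp = time.time() if now is None else now
        with self._lock:
            self._beats[sender_id] = (stamp, stats)

    def stale(self, timeout, now=None):
        """Return sender ids whose latest heartbeat is older than `timeout` seconds."""
        now = time.time() if now is None else now
        with self._lock:
            return sorted(s for s, (t, _) in self._beats.items() if now - t > timeout)


def send_heartbeat(host, port, sender_id, stats, timeout=2.0):
    """Try to send one heartbeat; return False instead of raising if nimbus is unreachable."""
    # Hypothetical wire format: one JSON object per line.
    message = json.dumps({"id": sender_id, "stats": stats}).encode("utf-8") + b"\n"
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(message)
        return True
    except OSError:
        # nimbus is down or restarting: skip this beat and keep working.
        return False
```

A worker loop would call send_heartbeat on a fixed interval and ignore a False
result; once nimbus is back, the next call succeeds and the store is repopulated
from fresh heartbeats.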
----------
d2r:
> @d2r It seems #706 works well, but using ZK for heartbeats seems unnecessary.
@vinceyang Yes, I agree: There is discussion about this already --> #620.