[ 
https://issues.apache.org/jira/browse/STORM-145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rick Kellogg updated STORM-145:
-------------------------------
    Component/s: storm-core

> worker and supervisor heartbeat to nimbus using socket instead of write 
> zookeeper node
> --------------------------------------------------------------------------------------
>
>                 Key: STORM-145
>                 URL: https://issues.apache.org/jira/browse/STORM-145
>             Project: Apache Storm
>          Issue Type: Improvement
>          Components: storm-core
>            Reporter: James Xu
>            Priority: Minor
>
> https://github.com/nathanmarz/storm/issues/732
> When a Storm cluster manages thousands of executors, ZooKeeper comes under 
> great pressure and responds very slowly to reads and writes. Nimbus then 
> judges executors to have timed out and reassigns them, and the whole cluster 
> falls into a dead loop. 
> If workers and supervisors heartbeat to Nimbus over a socket, this problem 
> would be resolved.
> ----------
> d2r: Yes, ZK has trouble keeping up sometimes. See #620.
> Hopefully #706 would help. We have well over 1k workers on a number of storm 
> clusters with this patch, and we no longer see this mass-reassignment 
> happening.
> ----------
> revans2: We also had to tune the file system ZK uses to not be as safe in 
> power failures as it otherwise would be. On ext4 we turned off the barrier by 
> remounting the disk with -o nobarrier. You also want to make sure that your 
> disk's cache is enabled. Unless you have a high-end RAID controller with a 
> battery-backed cache, most admins disable the disk cache on DB boxes to be 
> sure that data is not lost in the case of a power outage. For us the added 
> performance is worth the risk of data loss.
> ----------
> xiaokang: Storm daemons (nimbus, supervisor, worker, etc.) are designed to be 
> stateless, so all state, including heartbeats, is stored in ZK. Workers will 
> continue to work on nimbus failure. If supervisors and workers heartbeat 
> directly to nimbus, it may be hard to keep this nice feature.
> ----------
> viceyang: @xiaokang Socket heartbeats do not break the stateless feature. If 
> a worker's (or supervisor's) heartbeat to nimbus fails, the worker catches 
> the exception and keeps working. When nimbus restarts, heartbeats succeed 
> again.
> PS: nimbus HA is another feature; I am also working on it now.
> @d2r It seems #706 works well, but using ZK for heartbeats does not seem 
> necessary. Heartbeats and stats information are only a snapshot and do not 
> need to be persisted; keeping socket heartbeats and stats information in 
> nimbus memory meets the need. Also, our cluster has exceeded 100,000 
> executors, and a 5-node ZK ensemble can't keep up.
> PS: The key reason ZK load gets too high is that the executors' stats 
> information grows too large when there are too many executors. Storing 
> heartbeat and stats information in nimbus's memory is a good way to go.
> ----------
> d2r: "@d2r seems #706 work well, but using zk do heartbeat seems not 
> necessary,"
> @vinceyang Yes, I agree: There is discussion about this already --> #620.
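
The socket-heartbeat idea discussed in the thread can be illustrated with a minimal sketch. It is not Storm's code or API: the names HeartbeatTable, send_heartbeat and make_server, the JSON-line wire format, and the timeout values are all assumptions made for illustration. It shows the two properties viceyang describes: heartbeats are kept only as an in-memory snapshot on the nimbus side, and a worker that cannot reach nimbus catches the error and keeps working.

```python
import json
import socket
import socketserver
import threading
import time


class HeartbeatTable:
    """Hypothetical in-memory snapshot: latest heartbeat time per executor."""

    def __init__(self, timeout_secs):
        self.timeout_secs = timeout_secs
        self._last_seen = {}
        self._lock = threading.Lock()

    def record(self, executor_id, now=None):
        # Only the most recent heartbeat is kept; nothing is persisted.
        with self._lock:
            self._last_seen[executor_id] = time.time() if now is None else now

    def timed_out(self, now=None):
        # Executors whose last heartbeat is older than the timeout.
        now = time.time() if now is None else now
        with self._lock:
            return sorted(e for e, t in self._last_seen.items()
                          if now - t > self.timeout_secs)


def make_server(table, host="127.0.0.1", port=0):
    """Nimbus side (assumed): accept one JSON line per connection."""

    class Handler(socketserver.StreamRequestHandler):
        def handle(self):
            line = self.rfile.readline()
            if line:
                table.record(json.loads(line)["executor"])

    return socketserver.ThreadingTCPServer((host, port), Handler)


def send_heartbeat(host, port, executor_id, connect_timeout=1.0):
    """Worker side: send one heartbeat; never raise if nimbus is down."""
    try:
        with socket.create_connection((host, port),
                                      timeout=connect_timeout) as conn:
            conn.sendall(json.dumps({"executor": executor_id}).encode() + b"\n")
        return True
    except OSError:
        # Nimbus unreachable: keep working and retry on the next tick.
        return False
```

Because the table lives only in memory, a restarted nimbus in this sketch starts with an empty snapshot and would need a grace period before declaring any executor timed out; that is a consequence of the sketch's design, not a statement about how Storm behaves.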



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
