Github user revans2 commented on the issue:

https://github.com/apache/storm/pull/2433

@danny0405

1. The stability issues we saw occurred when a pacemaker node went down: it caused exceptions in the clients that we were not handling properly, which resulted in workers restarting. We have not seen any issues with heartbeat packets being discarded. If this does not cover the issues you were seeing, I would really love to have a JIRA so I can fix it. We are not running 2.x, or even 1.x, in production, so I cannot say whether there is some oddness in what we have pushed back, or perhaps in its interactions with HA. We are on 0.10.2++++; we pulled back a lot from 1.x. This is why we really want to get to 2.x, so we can be aligned with the community again and hopefully not have these kinds of issues. There may be bugs we don't know about right now.

2. With pacemaker HA, if you have 2+ pacemaker servers, each client randomly selects one of the servers to send heartbeats to. If the server it tries to write to is down, whether at the beginning or in the middle, the heartbeats should then start going to a different, random server. This should keep the load even between the pacemaker servers. Nimbus is supposed to work all of this out by reading from all of the servers; if it finds more than one heartbeat for a worker, it picks the one with the newest timestamp. This does not scale well on the nimbus side, and it can take more than 2 minutes to download all of the heartbeats, so we have plans to parallelize the download. The metrics don't go to the supervisor, because it does not currently need or use them. It only cares whether the worker is up and still alive, so it knows whether it needs to restart it.

3. I totally believe you that this can support a large cluster. Like I said, this is a much better solution long term, and I would love to go this route. We just need to fix the security issues and find a way to support containerized supervisors for me to give it a +1. Both should be doable.

4. There is no security between the workers and pacemaker. There is security between nimbus and pacemaker, which means that only nimbus can see the heartbeats. The worst you can do by faking heartbeats is confuse someone with bad metrics (not ideal) or trick nimbus into thinking a worker is still alive when it is not (bad, but not horrible). It is the assignment portion that is scary to me, because it says what to run. If we pull the assignment portion out, I would be OK with that. It would be best to truly fix it, though: we don't have a way to selectively turn off authorization in thrift, so to make that work we would need a separate thrift server on nimbus, which I would rather not do. I would love to see the ability to do delegation tokens in storm for authentication. This is no small task. It would take a lot of work, especially with HA, which is why I haven't done it.
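
To make the HA behaviour described in point 2 concrete, here is a minimal Java sketch. It is not Storm's actual code: the `Heartbeat` class, `pickServer`, and `newestPerWorker` are hypothetical names invented for illustration, and the real pacemaker client and nimbus types differ. It only shows the two ideas: a worker picks a random server that is not known to be down, and nimbus merges the heartbeats read from every server by keeping the newest timestamp per worker.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

class PacemakerHaSketch {

    /** Hypothetical heartbeat holder; Storm's real heartbeat types differ. */
    static final class Heartbeat {
        final String workerId;
        final long timestampMs;

        Heartbeat(String workerId, long timestampMs) {
            this.workerId = workerId;
            this.timestampMs = timestampMs;
        }
    }

    /**
     * Worker side (assumed logic): choose a random server index among those
     * not known to be down. Returns -1 when every server is down.
     */
    static int pickServer(int serverCount, Set<Integer> down, Random rng) {
        List<Integer> candidates = new ArrayList<>();
        for (int i = 0; i < serverCount; i++) {
            if (!down.contains(i)) {
                candidates.add(i);
            }
        }
        if (candidates.isEmpty()) {
            return -1;
        }
        return candidates.get(rng.nextInt(candidates.size()));
    }

    /**
     * Nimbus side (assumed logic): given the heartbeats read from each
     * server, keep only the newest heartbeat for every worker.
     */
    static Map<String, Heartbeat> newestPerWorker(List<List<Heartbeat>> perServer) {
        Map<String, Heartbeat> newest = new HashMap<>();
        for (List<Heartbeat> fromOneServer : perServer) {
            for (Heartbeat hb : fromOneServer) {
                Heartbeat seen = newest.get(hb.workerId);
                if (seen == null || hb.timestampMs > seen.timestampMs) {
                    newest.put(hb.workerId, hb);
                }
            }
        }
        return newest;
    }

    public static void main(String[] args) {
        // Worker w1 failed over from server A to server B, so both hold a heartbeat for it.
        List<Heartbeat> serverA = Arrays.asList(new Heartbeat("w1", 100L), new Heartbeat("w2", 300L));
        List<Heartbeat> serverB = Arrays.asList(new Heartbeat("w1", 250L));

        Map<String, Heartbeat> merged = newestPerWorker(Arrays.asList(serverA, serverB));
        System.out.println("w1 newest timestamp: " + merged.get("w1").timestampMs);

        // Server 0 is down, so the worker must land on server 1 or 2.
        Set<Integer> down = new HashSet<>(Arrays.asList(0));
        System.out.println("picked server: " + pickServer(3, down, new Random()));
    }
}
```

The merge is a single pass over everything nimbus downloaded, so the cost that matters is reading from every server, which is why parallelizing the download is the planned improvement rather than changing the merge itself.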
---