Github user revans2 commented on the issue:
https://github.com/apache/storm/pull/2433
@danny0405
1. The stability issues we saw occurred when a pacemaker node went down: it
caused exceptions in the clients that we were not handling properly,
resulting in workers restarting. We have not seen any issues
with heartbeat packets being discarded. If this did not cover the issues you
were seeing I really would love to have a JIRA so I can fix it.
We are not running with 2.x, or even 1.x, in production so I cannot say if
there is some oddness happening with what we have pushed back, or perhaps its
interactions with HA. We are on 0.10.2++++, we pulled back a lot from 1.x.
This is why we really want to get to 2.x so we can be aligned with the
community again and hopefully not have these kinds of issues. There may be
bugs we don't realize right now.
2. With pacemaker HA if you have 2+ pacemaker servers each of the clients
will randomly select one of the servers to send heartbeats to. If the one they
try to write to is down, at the beginning or in the middle, the heartbeats
should then start going to a different, random, server. This should hopefully
keep the load even between the pacemaker servers. Nimbus is supposed to work
all of this out by reading from all the servers, and if it finds more than one
heartbeat for a worker it will pick the one that has the newest timestamp in
it. This does not scale well on the nimbus side, and can take more than 2 minutes
to download all of the heartbeats, so we have plans to parallelize the download.
The metrics don't go to the supervisor, as it does not need/use them
currently. It only cares if the worker is up and still alive, so it knows if
it needs to restart it.
3. I totally believe you that this can support a large cluster. Like I
said this is a much better solution long term, and I would love to go this
route. We just need to fix the security issues and find a way to support
containerized supervisors for me to give it a +1. Both should be doable.
4. There is no security between the workers and pacemaker. There is
security between nimbus and pacemaker. This means that only nimbus can see the
heartbeats. The worst you can do with faking heartbeats is confuse someone
with bad metrics (not ideal) or trick nimbus into thinking a worker is still
alive when it is not, bad but not horrible. It is the assignment portion that
is scary to me, because it says what to run. If we pull the assignment portion
out I would be OK with that. It would be best to truly fix it, though: we
don't have a way to selectively turn off authorization in thrift, so to make
that work we would need a separate thrift server on nimbus, which I would
rather not do.
I would love to see the ability to do delegation tokens in storm for
authentication. This is no small task. It would take a lot of work,
especially with HA, which is why I haven't done it.
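
To make point 2 concrete, here is a rough sketch of the two halves of the pacemaker HA behavior: a client writing to a randomly chosen server and falling back to another one on failure, and nimbus merging what it reads from all servers by keeping the newest timestamp per worker. This is illustrative Python, not Storm code. `Heartbeat`, `send_with_failover`, `newest_heartbeats`, and the `send` callback are all made-up names, and it simplifies the client to "try servers in random order until one accepts".

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Sequence


@dataclass(frozen=True)
class Heartbeat:
    """Hypothetical heartbeat record; not Storm's actual data structure."""
    worker_id: str
    timestamp: int  # assumed to be a monotonically comparable time value
    payload: bytes = b""


def send_with_failover(servers: Sequence[str],
                       send: Callable[[str], None],
                       rng=random) -> str:
    """Try pacemaker servers in random order; return the one that accepted.

    `send` is a hypothetical callback that raises ConnectionError when the
    chosen server is down.
    """
    candidates = list(servers)
    rng.shuffle(candidates)  # random order keeps load roughly even
    for server in candidates:
        try:
            send(server)
            return server
        except ConnectionError:
            continue  # server is down, move on to another random one
    raise ConnectionError("no pacemaker server reachable")


def newest_heartbeats(
        per_server: Iterable[Iterable[Heartbeat]]) -> Dict[str, Heartbeat]:
    """Merge heartbeats read from every server, keeping the newest per worker."""
    newest: Dict[str, Heartbeat] = {}
    for heartbeats in per_server:
        for hb in heartbeats:
            current = newest.get(hb.worker_id)
            if current is None or hb.timestamp > current.timestamp:
                newest[hb.worker_id] = hb
    return newest
```

The merge is a single pass over every heartbeat from every server, which is why reading the servers one after another gets slow on a big cluster and why parallelizing the download is the natural next step.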
---