Github user revans2 commented on the issue:

    https://github.com/apache/storm/pull/2433
  
    @danny0405 
    
    1.  The stability issues we saw occurred when a pacemaker node went down: it 
caused exceptions in the clients that we were not handling properly, which 
resulted in workers restarting.  We have not seen any issues with heartbeat 
packets being discarded.  If this does not cover the issues you were seeing, I 
would really love to have a JIRA so I can fix it.
    
    We are not running 2.x, or even 1.x, in production, so I cannot say if 
there is some oddness happening with what we have pushed back, or perhaps with 
its interactions with HA.  We are on 0.10.2++++; we pulled back a lot from 1.x.  
This is why we really want to get to 2.x, so we can be aligned with the 
community again and hopefully not have these kinds of issues.  There may be 
bugs we don't realize right now.
    
    2. With pacemaker HA, if you have 2+ pacemaker servers, each of the clients 
will randomly select one of the servers to send heartbeats to.  If the one they 
try to write to is down, at the beginning or in the middle, the heartbeats 
should then start going to a different, random server.  This should hopefully 
keep the load even between the pacemaker servers.  Nimbus is supposed to work 
all of this out by reading from all the servers, and if it finds more than one 
heartbeat for a worker it will pick the one that has the newest timestamp in 
it.  This does not scale well on the nimbus side, and can take more than 2 
minutes to download all of the heartbeats, so we have plans to parallelize the 
download.
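    
    To make the two halves of that concrete, here is a rough sketch.  It is not 
the actual Storm code: the class, the Heartbeat type, and both method names are 
made up for illustration, and the real clients and nimbus do a lot more.  It 
only shows a client trying servers in random order until a write succeeds, and 
nimbus keeping the newest heartbeat per worker after reading from every server.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.Predicate;

// Illustrative only: none of these names exist in Storm.
public class PacemakerHaSketch {

    /** Hypothetical heartbeat holding just the fields the merge needs. */
    public static final class Heartbeat {
        public final String workerId;
        public final long timestampMs;

        public Heartbeat(String workerId, long timestampMs) {
            this.workerId = workerId;
            this.timestampMs = timestampMs;
        }
    }

    /**
     * Client side: try the servers in random order and return the first one
     * that accepts the write. tryWrite stands in for the real network call.
     */
    public static String pickServer(List<String> servers,
                                    Predicate<String> tryWrite,
                                    Random rng) {
        List<String> shuffled = new ArrayList<>(servers);
        Collections.shuffle(shuffled, rng);
        for (String server : shuffled) {
            if (tryWrite.test(server)) {
                return server;
            }
        }
        throw new IllegalStateException("no pacemaker server reachable");
    }

    /**
     * Nimbus side: given the heartbeats read from each server, keep only the
     * one with the newest timestamp for every worker.
     */
    public static Map<String, Heartbeat> newestPerWorker(
            List<List<Heartbeat>> heartbeatsPerServer) {
        Map<String, Heartbeat> newest = new HashMap<>();
        for (List<Heartbeat> fromOneServer : heartbeatsPerServer) {
            for (Heartbeat hb : fromOneServer) {
                newest.merge(hb.workerId, hb,
                        (old, cur) -> cur.timestampMs > old.timestampMs ? cur : old);
            }
        }
        return newest;
    }
}
```

    The outer loop in newestPerWorker is the part that reads one server at a 
time, which is where the sequential download cost shows up.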
    
    The metrics don't go to the supervisor, as it does not need or use them 
currently.  It only cares whether the worker is up and still alive, so it knows 
whether it needs to restart it.
    
    3. I totally believe you that this can support a large cluster.  Like I 
said, this is a much better solution long term, and I would love to go this 
route.  We just need to fix the security issues and find a way to support 
containerized supervisors for me to give it a +1. Both should be doable.
    
    4. There is no security between the workers and pacemaker.  There is 
security between nimbus and pacemaker.  This means that only nimbus can see the 
heartbeats.  The worst you can do with faking heartbeats is confuse someone 
with bad metrics (not ideal) or trick nimbus into thinking a worker is still 
alive when it is not, bad but not horrible.  It is the assignment portion that 
is scary to me, because it says what to run.  If we pull the assignment portion 
out I would be OK with that.  It would be best to truly fix it, though: we 
don't have a way to selectively turn off authorization in thrift, so to make 
that work we would need a separate thrift server on nimbus, which I would 
rather not do.
    
    I would love to see the ability to do delegation tokens in storm for 
authentication.  This is no small task.  It would take a lot of work, 
especially with HA, which is why I haven't done it.


---
