Tony,

Certainly some good points here. I definitely agree that there is some concern about resource contention between heart-beating and state management.

In testing PR 323 for heart-beating to ZooKeeper, we noticed that if we lose quorum, the nodes are no longer able to know which nodes are sending heartbeats, so the state of the cluster is effectively unknown. For example, in a 3-node NiFi cluster where every node runs an embedded ZooKeeper, if we lose 2 nodes we now report 2/3 nodes connected instead of 1/3, because we can't read from ZooKeeper to determine that the second node is missing.
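
To make the failure mode concrete, here's a rough sketch of what a reader of heartbeats is left with once quorum is gone. This is not the actual PR 323 code; the znode path and class name are made up purely for illustration:

import java.util.List;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class HeartbeatReader {

    private final ZooKeeper zooKeeper;
    private List<String> lastKnownNodeIds; // last view we successfully read

    public HeartbeatReader(final ZooKeeper zooKeeper) {
        this.zooKeeper = zooKeeper;
    }

    // Attempts to read the set of nodes currently heart-beating. If ZooKeeper
    // has lost quorum, the read fails and we are stuck with whatever view we
    // last saw -- which is how a 3-node cluster that has lost 2 nodes can
    // still be reported as 2/3 connected.
    public List<String> getHeartbeatingNodeIds() throws InterruptedException {
        try {
            lastKnownNodeIds = zooKeeper.getChildren("/nifi/cluster/heartbeats", false);
        } catch (final KeeperException e) {
            // ConnectionLoss / SessionExpired: no quorum, so there is no way to
            // know which nodes are still alive; fall back to the stale view.
        }
        return lastKnownNodeIds;
    }
}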

While I did realize that this was the way it would work, it wasn't clear how poor the usability would be: losing quorum means that the state of the cluster really is unknown.

Additionally, when the heart-beating was refactored, we dramatically trimmed Heartbeat messages down to a very small size (probably around 1-2 KB). And with no NCM running the show anymore, all nodes within a cluster will already be required to communicate with one another to send Cluster Protocol messages.
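
For a sense of scale, the heartbeat payload is now essentially just a node identifier plus a handful of summary counters, something along these lines (field names are illustrative, not the exact classes in the codebase):

import java.io.Serializable;

// Illustrative heartbeat payload: a node identifier plus a few summary
// counters. Serialized, something this small is easily in the 1-2 KB range.
public class NodeHeartbeat implements Serializable {
    private String nodeId;
    private String connectionState;  // e.g. CONNECTED, DISCONNECTED
    private int activeThreadCount;
    private int flowFileCount;
    private long flowFileBytes;
    private long timestamp;

    // getters and setters omitted for brevity
}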

To this end, I think the better solution regarding heart-beating is to simply have each node send its heartbeat to all nodes in the cluster. This allows all nodes to know the current state. Since the heartbeats are now very small, the network chatter will be pretty limited.
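
Something along the lines of the following sketch, reusing the NodeHeartbeat sketch above (the 5-second interval, the registry of node addresses, and the sending method are all made up for illustration):

import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HeartbeatBroadcaster {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final Set<String> clusterNodeAddresses; // hypothetical registry of every node's address

    public HeartbeatBroadcaster(final Set<String> clusterNodeAddresses) {
        this.clusterNodeAddresses = clusterNodeAddresses;
    }

    public void start() {
        // Every 5 seconds, send our small heartbeat to every node in the cluster
        // so that each node maintains its own view of cluster state.
        scheduler.scheduleWithFixedDelay(() -> {
            final NodeHeartbeat heartbeat = createHeartbeat();
            for (final String address : clusterNodeAddresses) {
                sendViaClusterProtocol(address, heartbeat);
            }
        }, 5, 5, TimeUnit.SECONDS);
    }

    private NodeHeartbeat createHeartbeat() {
        // populate nodeId, counters, timestamp, etc. for this node
        return new NodeHeartbeat();
    }

    private void sendViaClusterProtocol(final String address, final NodeHeartbeat heartbeat) {
        // hypothetical: serialize the heartbeat and send it over the existing
        // Cluster Protocol connection to the given node
    }
}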

This means that ZooKeeper will be required for only two things: Leader Election and State Management (with the ability to provide a different storage mechanism for State Management later, if we see the need). Leader Election would still be used to elect a 'Cluster Coordinator' that is responsible for (among other things) monitoring heartbeats. If that node does not receive a heartbeat from Node X within some amount of time, it would notify all nodes in the cluster that Node X is now disconnected. This way, even if the Leader is unable to communicate with Node X, all other nodes in the cluster know that it is disconnected and will issue Node X a Disconnection Request the next time that Node X heartbeats.
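
Roughly, the Coordinator-side monitoring could look like this sketch (the 30-second timeout and all of the names are illustrative, not taken from the code; the election of the Coordinator itself would come from ZooKeeper):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of the Cluster Coordinator's monitoring loop.
public class HeartbeatMonitor {

    private static final long HEARTBEAT_TIMEOUT_MILLIS = TimeUnit.SECONDS.toMillis(30);

    // nodeId -> wall-clock time of the last heartbeat received from that node
    private final Map<String, Long> lastHeartbeatTimes = new ConcurrentHashMap<>();

    public void onHeartbeat(final String nodeId) {
        lastHeartbeatTimes.put(nodeId, System.currentTimeMillis());
    }

    // Invoked periodically, only while this node is the elected Cluster Coordinator.
    public void checkForMissingHeartbeats() {
        final long now = System.currentTimeMillis();
        for (final Map.Entry<String, Long> entry : lastHeartbeatTimes.entrySet()) {
            if (now - entry.getValue() > HEARTBEAT_TIMEOUT_MILLIS) {
                // Tell every node in the cluster that this node is disconnected.
                // If the node later heartbeats again, whichever node receives that
                // heartbeat responds with a Disconnection Request.
                notifyAllNodesOfDisconnection(entry.getKey());
            }
        }
    }

    private void notifyAllNodesOfDisconnection(final String nodeId) {
        // hypothetical: broadcast a "Node X is disconnected" message via the Cluster Protocol
    }
}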

I have created a JIRA [1] where we can track any further concerns that may develop in the community.

Thanks
-Mark

[1] https://issues.apache.org/jira/browse/NIFI-1743 

> On Apr 4, 2016, at 12:40 PM, Tony Kurc <[email protected]> wrote:
> 
> Mark,
> Fair points!
> 
> Something an Apache Accumulo committer pointed out at the meetup is that
> the scale issues may come sooner than a couple hundred nodes due to the
> size of writes and potential frequency of writes (Joe Percivall's
> demonstration on the windowing seemed like it could write much more
> frequently than a couple times a minute).
> 
> Another point someone brought up is that if heartbeating and state
> management are competing for resources, bad things can happen.
> 
> Tony
> 
> On Fri, Apr 1, 2016 at 9:31 AM, Mark Payne <[email protected]> wrote:
> 
>> Guys,
>> 
>> Certainly some great points here and important concepts to keep in mind!
>> 
>> One thing to remember, though, that very much differentiates NiFi from
>> Storm
>> or HBase or Accumulo: those systems are typically expected to scale to
>> hundreds
>> or thousands of nodes (you mention a couple hundred node HBase cluster
>> being
>> "modest"). With NiFi, we are typically operating clusters on the order of
>> several
>> nodes to dozens of nodes. A couple hundred nodes would be a pretty massive
>> NiFi
>> cluster.
>> 
>> In terms of storing state, we could potentially get into more of a sticky
>> situation
>> if not done carefully. However, we generally expect the "State Management"
>> feature to be used for
>> occasionally storing small amounts of state, such as for ListHDFS storing
>> 2 timestamps
>> and we don't expect ListHDFS to be continually hammering HDFS asking for
>> a new listing. It may be scheduled to run once every few minutes for
>> example.
>> 
>> That said, we certainly have been designing everything using appropriate
>> interfaces
>> in such a way that if we do later decide that we want to use some other
>> mechanism
>> for storing state and/or heartbeats, it should be a very reasonable path to
>> take advantage of some other service for one or both of these tasks.
>> 
>> Thanks
>> -Mark
>> 
>> 
>> 
>>> On Mar 31, 2016, at 5:35 PM, Sean Busbey <[email protected]> wrote:
>>> 
>>> HBase has also had issues with ZK at modest (~couple hundred node)
>>> scale when using it to act as an intermediary for heartbeats and to do
>>> work assignment.
>>> 
>>> On Thu, Mar 31, 2016 at 4:33 PM, Tony Kurc <[email protected]> wrote:
>>>> My commentary that didn't accompany this - it appears Storm was using
>>>> zookeeper in a similar way as the road we're heading down, and it was such
>>>> a major bottleneck that they moved key-value storage and heartbeating out
>>>> into separate services, and then re-engineered (i.e. built Heron). Before
>>>> we get too dependent on zookeeper, it may be worth learning some lessons
>>>> from the crew that built Heron or from a team that learned zookeeper
>>>> lessons at scale, like Accumulo.
>>>> 
>>>> On Thu, Mar 24, 2016 at 6:22 PM, Tony Kurc <[email protected]> wrote:
>>>> 
>>>>> I mentioned slides I saw at the meetup about zookeeper perils at scale
>>>>> in Storm. Here are the slides; I couldn't find a video after some limited
>>>>> searching.
>>>>> 
>>>>> https://qconsf.com/system/files/presentation-slides/heron-qcon-2015.pdf
>>>>> 
>> 
>> 
