Re: Sling Topology heartbeat: reduce the amount of repo write activity

Carsten Ziegeler Fri, 07 Feb 2014 07:47:30 -0800

If 1) and 2) are easily doable, I would start with them and see where that
leads us.


Regards
Carsten


2014-02-07 16:04 GMT+01:00 Stefan Egli <[email protected]>:

> I think this is basically what I proposed - except that I would be careful
> to handle race conditions with startups/cluster-delays/observation-threading
> and thus propose a period of a few minutes in which the node assumes it is
> in a cluster, and only afterwards switches back if that's not the case.
>
> Cheers,
> Stefan
>
> On 2/7/14 3:58 PM, "Jörg Hoh" <[email protected]> wrote:
>
> >Hi,
> >
> >some ideas regarding the cluster detection:
> >
> >* When a cluster node comes up, it writes its initial "I am here"
> >announcement to the repo.
> >* The node then enters a listening mode, in which it listens for any
> >"pong" or for a new incoming "ping", but does not write any cluster
> >heartbeat information.
> >* If the node receives a "ping" or a "pong", it knows that it is indeed
> >running in a cluster (either a new partner joined or the node itself
> >joined a cluster) and then starts the regular cluster heartbeat.
> >
> >In that case you wouldn't need to handle the cluster case differently
> >from single-node mode, and you wouldn't even need a timeout.
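
A minimal Java sketch of the listen-then-heartbeat idea described in the bullets above. Everything in it is an assumption made for illustration: the class name `LazyHeartbeat`, the message strings, and the in-memory `writes` list that stands in for repository writes. It is not taken from discovery.impl.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: write one announcement, then stay silent until a peer shows up. */
final class LazyHeartbeat {

    // Stand-in for repository writes, so the write volume can be inspected.
    private final List<String> writes = new ArrayList<>();
    private boolean heartbeating;

    /** Node startup: write the single "I am here" announcement. */
    void start() {
        writes.add("announce");
    }

    /** Called for every observed message; a ping or pong proves a peer exists. */
    void onMessage(String type) {
        if ("ping".equals(type) || "pong".equals(type)) {
            heartbeating = true;
        }
    }

    /** Called periodically; writes a heartbeat only once a peer has been seen. */
    void tick() {
        if (heartbeating) {
            writes.add("heartbeat");
        }
    }

    List<String> writes() {
        return writes;
    }
}
```

In this sketch a node that never hears a ping or pong performs exactly one write, the initial announcement. It does not address the startup race raised elsewhere in the thread.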
> >
> >Jörg
> >
> >
> >
> >
> >2014-02-07 Stefan Egli <[email protected]>:
> >
> >> What could be done for level 3:
> >>
> >>  a) at startup the behavior is as is today, cluster-ready, writing
> >> repository-heartbeats as configured
> >>  b) this is done for a configured amount of time at least, eg for 5
> >> minutes (exploring phase) - the idea of this being to avoid any
> >> race-conditions of two nodes starting simultaneously
> >>  c) if after this time the node realizes that it is alone (and no-one
> >> joined or left during this time), it assumes that it is indeed in a
> >> standalone setup and stops sending heartbeats (solitude phase)
> >>  d) if another node starts up in the same cluster, it would as normal
> >> start doing these heartbeats for a few minutes (exploring phase) -
> >> giving the original node time to wake up to the idea that it was never
> >> alone (alien phase) - at which point it quickly goes back to sending
> >> heartbeats and voting and all those things (party phase)
> >>
> >> phase d) is obviously slightly tricky ..
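
The phases a) to d) above can be sketched as a small state machine. This is a hedged illustration only: the class `HeartbeatPhaseTracker`, the `tick` method, and the choice to restart the exploring phase when a peer disappears are assumptions, not existing discovery.impl code. The "alien phase" is folded into the transition from SOLITUDE to PARTY.

```java
/** Hypothetical sketch of the exploring / solitude / party phases. */
final class HeartbeatPhaseTracker {

    enum Phase { EXPLORING, SOLITUDE, PARTY }

    private final long exploringMillis;
    private long phaseStart;
    private Phase phase = Phase.EXPLORING;

    HeartbeatPhaseTracker(long startMillis, long exploringMillis) {
        this.phaseStart = startMillis;
        this.exploringMillis = exploringMillis;
    }

    /**
     * Called on every scheduler run.
     * Returns true if a repository heartbeat should be written on this run.
     */
    boolean tick(long nowMillis, boolean otherNodeSeen) {
        if (otherNodeSeen) {
            // Another node is visible: heartbeat and vote as usual.
            phase = Phase.PARTY;
        } else if (phase == Phase.PARTY) {
            // Assumption: when the peer disappears, explore again before going silent.
            phase = Phase.EXPLORING;
            phaseStart = nowMillis;
        } else if (phase == Phase.EXPLORING && nowMillis - phaseStart >= exploringMillis) {
            // Alone for the whole exploring period: stop writing heartbeats.
            phase = Phase.SOLITUDE;
        }
        return phase != Phase.SOLITUDE;
    }

    Phase phase() {
        return phase;
    }
}
```

The time is passed in rather than read from the system clock so that the transitions can be exercised deterministically. How the solitary node actually notices the newcomer (the tricky part of d) is left open here and represented only by the `otherNodeSeen` flag.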
> >>
> >> Cheers,
> >> Stefan
> >>
> >> On 2/7/14 3:00 PM, "Stefan Egli" <[email protected]> wrote:
> >>
> >> >Hi,
> >> >
> >> >I like the idea of reducing write-bandwidth used by topology. I'd sum it
> >> >up into three possible levels though:
> >> >
> >> > 1) keep the (topology-connector) announcement's lastHeartbeat as a
> >> >separate property and only update that (on receiving a
> >> >connector-heartbeat) instead of updating the entire announcement-json
> >> >as is done now.
> >> >
> >> > 2) we might even be able to avoid storing the announcement's
> >> >lastHeartbeat at all if the logic is changed such that the announcement
> >> >is valid as long as the recipient of the announcement (ie the owner) is
> >> >alive. This would lengthen the reaction time on a crash of a remote
> >> >instance, though.
> >> >
> >> > 3) avoid repository (ie cluster-local) heartbeats entirely for the
> >> >single-node case (in which case keeping the announcement in memory is
> >> >feasible).
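
As an illustration of level 1 above, here is a minimal Java sketch that writes the announcement JSON only when it changes and otherwise touches just a timestamp. The class `AnnouncementStore`, the property names `topologyAnnouncement` and `lastHeartbeat`, and the `Map` standing in for a repository node's properties are all assumptions for the sake of the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: keep lastHeartbeat separate from the announcement JSON. */
final class AnnouncementStore {

    // Stand-in for the properties of the repository node holding the announcement.
    private final Map<String, Object> props = new HashMap<>();
    private int jsonWrites;

    /** Called whenever a connector heartbeat carrying an announcement is received. */
    void onConnectorHeartbeat(String announcementJson, long nowMillis) {
        if (!announcementJson.equals(props.get("topologyAnnouncement"))) {
            // Topology changed (or first heartbeat): write the full JSON once.
            props.put("topologyAnnouncement", announcementJson);
            jsonWrites++;
        }
        // Always cheap: only the timestamp property is updated.
        props.put("lastHeartbeat", nowMillis);
    }

    int jsonWrites() {
        return jsonWrites;
    }

    long lastHeartbeat() {
        return (Long) props.get("lastHeartbeat");
    }
}
```

With an unchanged topology, repeated heartbeats in this sketch cause a single JSON write followed by timestamp-only updates.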
> >> >
> >> >I see level 1 as something we should do, level 2 to be further analyzed
> >> >(verify the implications, but I think it's possible). But I have my
> >> >reservations re level 3, as this would complicate the 'cluster first'
> >> >goal: we'd have to detect situations where a single-node is 'suddenly'
> >> >accompanied by another node to form a cluster, as this would have to be
> >> >detected by discovery.impl. And I fear that this might in the end
> >> >again result in some sort of heartbeat (maybe for a limited time after
> >> >startup only though). The question is whether it's a "problem" to have
> >> >cluster-heartbeats stored every, say, 30 sec and whether that justifies
> >> >complicating the algorithm for this case.
> >> >
> >> >Cheers,
> >> >Stefan
> >> >
> >> >On 2/7/14 2:44 PM, "Jörg Hoh" <[email protected]> wrote:
> >> >
> >> >>Hi,
> >> >>
> >> >>I am wondering whether we could reduce the amount of data persisted in
> >> >>the repository with every topology heartbeat.
> >> >>
> >> >>For example, we could just update the timestamp of the announcement
> >> >>heartbeat if the topology hasn't changed at all (instead of writing
> >> >>the complete announcement).
> >> >>
> >> >>A more radical approach would be to avoid persisting topology
> >> >>information to the repo completely if this node isn't part of a cluster
> >> >>at all. All the state could be kept in memory, and in case of a
> >> >>crash/restart the topology needs to be gathered again. Of course this
> >> >>would require some more logic for the case where a single node is
> >> >>promoted to a member of a cluster, as then the current behaviour should
> >> >>be used.
> >> >>
> >> >>WDYT?
> >> >>
> >> >>Jörg
> >> >>
> >> >>
> >> >>--
> >> >>
> >> >>http://cqdump.wordpress.com
> >> >>Twitter: @joerghoh
> >> >
> >>
> >>
> >
> >
> >--
> >Cheers,
> >Jörg Hoh,
> >
> >http://cqdump.wordpress.com
> >Twitter: @joerghoh
>
>


-- 
Carsten Ziegeler
[email protected]
