Re: Sling Topology heartbeat: reduce the amount of repo write activity

Stefan Egli Fri, 07 Feb 2014 06:41:41 -0800

What could be done for level 3:

 a) at startup the behavior is as is today, cluster-ready, writing
repository-heartbeats as configured
 b) this is done for at least a configured amount of time, e.g. for 5
minutes (exploring phase) - the idea being to avoid any race-conditions
when two nodes start simultaneously
 c) if after this time the node finds that it is alone (and no-one
joined or left during this time), it assumes that it is indeed in a
standalone setup and stops sending heartbeats (solitude phase)
 d) if another node starts up in the same cluster, it would as normal
start doing these heartbeats for a few minutes (exploring phase) - giving
the original node time to wake up to the idea that it was never alone
(alien phase) - at which point it quickly starts to go back to sending
heartbeats and voting and all those things (party phase)
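
A rough sketch of how phases a) to d) could hang together as a small state
machine. This is not existing discovery.impl code: the class, the method
names and the "otherInstances" input are made up for illustration. It also
assumes that a node in the solitude phase keeps reading the repository to
notice newcomers, and that a node left alone again simply restarts the
exploring phase.

```java
// Hypothetical sketch only; none of these names exist in discovery.impl.
class HeartbeatPhases {

    // The "alien phase" is modelled as the transition SOLITUDE -> PARTY.
    enum Phase { EXPLORING, SOLITUDE, PARTY }

    private final long exploringMillis; // configured minimum, e.g. 5 minutes
    private long phaseStart;            // when the current exploring phase began
    private Phase phase = Phase.EXPLORING;

    HeartbeatPhases(long exploringMillis, long now) {
        this.exploringMillis = exploringMillis;
        this.phaseStart = now;
    }

    /**
     * Called periodically. otherInstances is the number of other nodes
     * currently seen in the repository (assumed to be read even while
     * this node is not writing heartbeats itself).
     */
    Phase tick(long now, int otherInstances) {
        switch (phase) {
            case EXPLORING:
                if (otherInstances > 0) {
                    // someone else is there: behave as a normal cluster node
                    phase = Phase.PARTY;
                } else if (now - phaseStart >= exploringMillis) {
                    // alone for the whole exploring phase: go quiet
                    phase = Phase.SOLITUDE;
                }
                break;
            case SOLITUDE:
                if (otherInstances > 0) {
                    // alien detected: resume heartbeats and voting
                    phase = Phase.PARTY;
                }
                break;
            case PARTY:
                if (otherInstances == 0) {
                    // assumption: start over with a fresh exploring phase
                    phase = Phase.EXPLORING;
                    phaseStart = now;
                }
                break;
        }
        return phase;
    }

    /** Repository heartbeats are written in every phase except solitude. */
    boolean writesHeartbeats() {
        return phase != Phase.SOLITUDE;
    }
}
```

The new node's exploring phase has to be long enough for the quiet node to
run at least one tick and see it, which is where d) gets tricky.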


Phase d) is obviously slightly tricky.

Cheers,
Stefan

On 2/7/14 3:00 PM, "Stefan Egli" <[email protected]> wrote:

>Hi,
>
>I like the idea of reducing write-bandwidth used by topology. I'd sum it
>into three possible levels though:
>
> 1) keep the (topology-connector) announcement's lastHeartbeat as a
>separate property and only update that (on receiving a
>connector-heartbeat) instead of updating the entire announcement-json as
>is now.
>
> 2) we might even be able to avoid storing the announcement's
>lastHeartbeat if the logic is changed such that the announcement is
>valid as long as the recipient of the announcement (ie the owner) is
>alive. This would increase the reaction time on a crash of a remote
>instance though.
>
> 3) avoid repository (ie cluster-local) heartbeats entirely for the
>single-node case (in which case keeping the announcement in memory is
>feasible).
>
>I see level 1 as something we should do, level 2 to be further analyzed
>(verify the implications, but I think it's possible). But I have my
>reservations re level 3, as this would complicate the 'cluster first'
>goal: we'd have to detect situations where a single-node is 'suddenly'
>accompanied by another node to form a cluster, as this would have to be
>detected by discovery.impl. And I fear that this might in the end
>again result in some sort of heartbeat (maybe for a limited time after
>startup only though). The question is whether it's a "problem" to have
>cluster-heartbeats stored every, say, 30 sec and whether that justifies
>complicating the algorithm for this case.
>
>Cheers,
>Stefan
>
>On 2/7/14 2:44 PM, "Jörg Hoh" <[email protected]> wrote:
>
>>Hi,
>>
>>I am wondering whether we could reduce the amount of data persisted in
>>the repository with every topology heartbeat.
>>
>>For example we could just update the timestamp of the announcement
>>heartbeat if the topology hasn't changed at all (instead of writing the
>>complete announcement).
>>
>>A more radical approach would be to avoid persisting topology
>>information to the repo completely if this node isn't part of a cluster at
>>all. All the state could be kept in memory, and in case of a crash/restart
>>the topology needs to be gathered again. Of course this would require some
>>more logic in case a single node is promoted to a member of a
>>cluster, as then the current behaviour should be used.
>>
>>WDYT?
>>
>>Jörg
>>
>>
>>-- 
>>
>>http://cqdump.wordpress.com
>>Twitter: @joerghoh
>
