Looked a little further into this. The Hazelcast documentation says the
heartbeat interval is 1 second, but the Hazelcast code shows that it
actually defaults to 5 seconds:

HEARTBEAT_INTERVAL_SECONDS("hazelcast.heartbeat.interval.seconds", 5, SECONDS),

So the default configuration in CAS (a 5-second heartbeat timeout) basically
sets up a race condition in determining whether a node is dead or alive:
heartbeats go out every 5 seconds, and a member is removed after 5 seconds
without one.
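
To make the race concrete, here is a minimal sketch (not taken from the CAS
or Hazelcast code) that counts how many heartbeats can arrive before the
timeout fires. It assumes an idealized model with no network latency or
scheduling jitter; the class and method names are hypothetical.

```java
public class HeartbeatMargin {

    /**
     * Number of heartbeats expected to arrive strictly before the timeout
     * expires, counted from the last heartbeat received.
     * Assumption: idealized timing, no latency or jitter.
     */
    public static int heartbeatsBeforeTimeout(int intervalSeconds, int timeoutSeconds) {
        if (intervalSeconds <= 0 || timeoutSeconds <= 0) {
            throw new IllegalArgumentException("interval and timeout must be positive");
        }
        // Heartbeat k arrives k * interval seconds after the last one and is
        // in time only if k * interval < timeout.
        return (timeoutSeconds - 1) / intervalSeconds;
    }

    /** True when not even the very next heartbeat is guaranteed to beat the timeout. */
    public static boolean isRacy(int intervalSeconds, int timeoutSeconds) {
        return heartbeatsBeforeTimeout(intervalSeconds, timeoutSeconds) == 0;
    }

    public static void main(String[] args) {
        int interval = 5;          // hazelcast.heartbeat.interval.seconds default quoted above
        int[] timeouts = {5, 300}; // CAS default vs. the value suggested in this thread
        for (int timeout : timeouts) {
            System.out.printf("interval=%ds timeout=%ds heartbeatsBeforeTimeout=%d racy=%b%n",
                    interval, timeout,
                    heartbeatsBeforeTimeout(interval, timeout),
                    isRacy(interval, timeout));
        }
    }
}
```

With a 5-second interval and a 5-second timeout, the next heartbeat is due at
the same moment the timeout expires, so any delay at all can get a healthy
member removed. At 300 seconds there is room for many lost or late heartbeats.
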
On Thu, Jun 2, 2016 at 9:29 AM Misagh Moayyed <[email protected]> wrote:
> Probably too aggressive a default, yes, but the unit of measure is seconds:
>
> # hz.cluster.max.heartbeat.seconds=5
>
> Enable that property and set it to 300.
>
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]] On Behalf Of Tom Poage
> > Sent: Thursday, June 2, 2016 8:56 AM
> > To: CAS Community <[email protected]>
> > Subject: Re: [cas-user] Hazelcast heartbeat timeout?
> >
> > So it seems the default heartbeat timeout in Hazelcast is 5 minutes, but the
> > default heartbeat timeout in CAS is 5 seconds.
> >
> > Purposeful (rationale?), or a scaling error?
> >
> > Thanks!
> > Tom.
> >
> > > On Jun 2, 2016, at 8:12 AM, Tom Poage <[email protected]> wrote:
> > >
> > > Morning,
> > >
> > > We started running 4.2.1 w/ Hazelcast (hz.cluster.tcpip.enabled=true) on
> > > Linux VMs (RedHat variant) a couple of weeks ago with three nodes on the
> > > same subnet. Things seemed fine initially, but a couple of days ago we
> > > started getting cluster errors starting with a heartbeat timeout, several
> > > (dis)connects, attempted repartitions, and ending with the cluster frozen.
> > >
> > > Has anyone experienced this? E.g.
> > >
> > >> 2016-06-01 21:01:25,330 WARN [com.hazelcast.cluster.impl.ClusterHeartbeatManager] - [-------.50]:5701 [dev] [3.6] Removing Member [------.55]:5701 because it has not sent any heartbeats for 5000 ms. Last heartbeat time was Wed Jun 01 21:01:20 PDT 2016
> > >> 2016-06-01 21:01:25,330 INFO [com.hazelcast.cluster.ClusterService] - [------.50]:5701 [dev] [3.6] Old master Address[------.55]:5701 left the cluster, assigning new master Member [128.120.39.50]:5701 this
> > > ...
> > >> 2016-06-01 21:01:29,167 WARN [com.hazelcast.partition.InternalPartitionService] - [------.50]:5701 [dev] [3.6] This is the master node and received a PartitionRuntimeState from Address[------.55]:5701. Ignoring incoming state!
> > > ...
> > >> 2016-06-01 21:05:16,046 INFO [com.hazelcast.cluster.impl.operations.JoinCheckOperation] - [------.50]:5701 [dev] [3.6] Ignoring join check from Address[------.55]:5701, because cluster is in FROZEN state ...
> > >
> > > Interestingly enough, if we shut down one of the nodes (leaving two), the
> > > issue does not recur--at least in the time we've been monitoring.
> > >
> > > The only recourse seems to be a full cluster restart.
> > >
> > > Thanks for any advice!
> > >
> > > Tom.
> > >
> > >
> > > --
> > > You received this message because you are subscribed to the Google Groups
> > > "CAS Community" group.
> > > To unsubscribe from this group and stop receiving emails from it, send an
> > > email to [email protected].
> > > To post to this group, send email to [email protected].
> > > Visit this group at https://groups.google.com/a/apereo.org/group/cas-user/.
> > > To view this discussion on the web visit
> > > https://groups.google.com/a/apereo.org/d/msgid/cas-user/B8F20E5F-0BC3-44AE-B53F-BCFD1B181E3D%40ucdavis.edu.
> > > For more options, visit https://groups.google.com/a/apereo.org/d/optout.
> >
>