[
https://issues.apache.org/jira/browse/GEODE-9180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17337539#comment-17337539
]
Bill Burcham edited comment on GEODE-9180 at 4/30/21, 5:27 PM:
---------------------------------------------------------------
I added the blocks-1.14 tag because I believe this is a very low risk,
medium-value problem-diagnosis feature that we want in the 1.14 LTS release.
was (Author: bburcham):
I added the blocker-1.14 tag because I believe this is a very low risk,
medium-value problem-diagnosis feature that we want in the 1.14 LTS release.
> Heartbeats Are Interrupted Inexplicably
> ---------------------------------------
>
> Key: GEODE-9180
> URL: https://issues.apache.org/jira/browse/GEODE-9180
> Project: Geode
> Issue Type: Bug
> Components: membership
> Reporter: Bill Burcham
> Assignee: Bill Burcham
> Priority: Major
> Labels: blocks-1.14.0, pull-request-available
> Fix For: 1.15.0
>
>
> Sometimes we see a member force-disconnected and we see a preceding gap in
> the regular sequence of heartbeats generated by the member, but we can't
> explain why there was a gap. The explanation we are searching for is usually
> CPU saturation. We look for secondary evidence such as gaps in the regular
> sequence of statistics, e.g. StatSampler sampleCount. When we can't find such
> secondary evidence, we can't, in good conscience, rule out bugs in the
> heartbeat generation logic itself.
> The heartbeat generation logic consists mainly of a thread that loops
> forever. Each time through the loop it sleeps for member-timeout /
> logical-interval. By default that's 5s / 2 = 2.5s. When it wakes up it sends
> unreliable UDP unicast messages to the coordinator and the two
> non-coordinator members to its "left" (earlier) in the view. If that
> heartbeat generation thread oversleeps or doesn't get adequate time slices
> when it's awake, then heartbeats will be delayed and there will be gaps in
> the regular sequence.
> When this ticket is complete, a warning-level message will be logged if the
> heartbeat generation thread (see {{GMSHealthMonitor.startHeartbeatThread()}})
> oversleeps by more than the sleep interval (member-timeout /
> logical-interval), i.e. if it is asleep for more than 2 * (member-timeout /
> logical-interval). With the defaults above, that means asleep for more than 5s.
> h3. See Also
> {{HostStatSampler}} generates messages like this:
> {quote}Statistics sampling thread detected a wakeup delay of 14318 ms,
> indicating a possible resource issue. Check the GC, memory, and CPU
> statistics.{quote}
> (from {{checkElapsedSleepTime}})
> The current ticket is needed because the thread that matters for heartbeat
> generation is the heartbeat-generation thread itself, and sometimes it
> oversleeps when the stat sampler thread does not.
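The oversleep check described in the ticket can be sketched roughly as follows. This is a minimal illustration, not the actual Geode change: the class HeartbeatOversleepCheck and all of its method names are hypothetical and are not part of GMSHealthMonitor. The values 5000 ms (member-timeout) and 2 (logical-interval) are the defaults quoted in the description, and the strict "more than" comparison follows the ticket's wording.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, for illustration only; not Geode code.
public final class HeartbeatOversleepCheck {

  private HeartbeatOversleepCheck() {}

  // Sleep interval of the heartbeat loop: member-timeout / logical-interval.
  public static long sleepIntervalMillis(long memberTimeoutMillis, int logicalInterval) {
    return memberTimeoutMillis / logicalInterval;
  }

  // True when the thread was asleep for more than twice the sleep interval,
  // i.e. it overslept by more than one full interval.
  public static boolean shouldWarn(long asleepMillis, long sleepIntervalMillis) {
    return asleepMillis > 2 * sleepIntervalMillis;
  }

  // Sleeps for the interval and returns how long the thread was actually asleep,
  // measured with the monotonic clock so wall-clock adjustments do not matter.
  public static long sleepAndMeasureMillis(long sleepIntervalMillis)
      throws InterruptedException {
    long start = System.nanoTime();
    Thread.sleep(sleepIntervalMillis);
    return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
  }

  public static void main(String[] args) throws InterruptedException {
    // Defaults from the ticket description: 5 s / 2 = 2.5 s.
    long defaultInterval = sleepIntervalMillis(5000, 2);
    System.out.println("default sleep interval (ms): " + defaultInterval);

    // Use a short interval here so the demonstration finishes quickly.
    long demoInterval = 50;
    long asleep = sleepAndMeasureMillis(demoInterval);
    if (shouldWarn(asleep, demoInterval)) {
      // A real implementation would log at warning level instead of printing.
      System.out.println("WARN: heartbeat thread was asleep for " + asleep
          + " ms, expected about " + demoInterval + " ms");
    }
  }
}
```

Measuring elapsed time inside the heartbeat thread itself, rather than inferring it from the stat sampler, matches the motivation in the ticket: the two threads can oversleep independently.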