[ https://issues.apache.org/jira/browse/GEODE-9180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bill Burcham updated GEODE-9180:
--------------------------------
Description:
Sometimes we see a member force-disconnected and we see a preceding gap in the
regular sequence of heartbeats generated by the member, but we can't explain
why there was a gap. The explanation we are searching for is usually CPU
saturation. We look for secondary evidence such as gaps in the regular sequence
of statistics e.g. StatSampler sampleCount. When we can't find such secondary
evidence, we can't, in good conscience, rule out bugs in the heartbeat
generation logic itself.
The heartbeat generation logic consists mainly of a thread that loops forever.
Each time through the loop it sleeps for member-timeout / logical-interval. By
default that's 5s / 2 = 2.5s. When it wakes up it sends unreliable UDP unicast
messages to the coordinator and the two non-coordinator members to its "left"
(earlier) in the view. If that heartbeat generation thread oversleeps or
doesn't get adequate time slices when it's awake then heartbeats will be
delayed. There will be gaps in the regular sequence.
When this ticket is complete, a warning-level message will be logged whenever
the heartbeat generation thread (see
{{GMSHealthMonitor.startHeartbeatThread()}}) oversleeps by more than the sleep
interval (member-timeout / logical-interval), i.e. whenever it is asleep for
more than 2 * (member-timeout / logical-interval). With the defaults above,
that means a sleep longer than 5s.
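The oversleep check described above could be sketched roughly as follows. This
is only an illustration under stated assumptions, not the code in
{{GMSHealthMonitor}}: the class {{HeartbeatOversleepCheck}} and its method
names are hypothetical, and elapsed time is assumed to be measured with
{{System.nanoTime()}} immediately before and after the sleep.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for illustration only; not part of Geode. */
final class HeartbeatOversleepCheck {
  private final long sleepIntervalNanos;

  HeartbeatOversleepCheck(long memberTimeoutMillis, int logicalInterval) {
    // Sleep interval is member-timeout / logical-interval (5000 ms / 2 = 2500 ms by default).
    this.sleepIntervalNanos =
        TimeUnit.MILLISECONDS.toNanos(memberTimeoutMillis / logicalInterval);
  }

  long sleepIntervalMillis() {
    return TimeUnit.NANOSECONDS.toMillis(sleepIntervalNanos);
  }

  /** Milliseconds slept beyond the intended interval, or 0 if the thread did not oversleep. */
  long oversleepMillis(long beforeSleepNanos, long afterSleepNanos) {
    long elapsedNanos = afterSleepNanos - beforeSleepNanos;
    return Math.max(0L, TimeUnit.NANOSECONDS.toMillis(elapsedNanos - sleepIntervalNanos));
  }

  /** True when the thread was asleep for more than 2 * (member-timeout / logical-interval). */
  boolean shouldWarn(long beforeSleepNanos, long afterSleepNanos) {
    return (afterSleepNanos - beforeSleepNanos) > 2 * sleepIntervalNanos;
  }

  /** One loop iteration: sleep, then warn if the wakeup was late enough. */
  void sleepOnce() throws InterruptedException {
    long before = System.nanoTime();
    Thread.sleep(sleepIntervalMillis());
    long after = System.nanoTime();
    if (shouldWarn(before, after)) {
      // A real implementation would use the product's logger at warning level.
      System.err.println("Heartbeat thread overslept by "
          + oversleepMillis(before, after) + " ms");
    }
  }
}
```

Keeping the threshold test separate from the actual sleep means the arithmetic
can be checked with fixed timestamps, without waiting on a real clock.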
h3. See Also
{{HostStatSampler}} generates messages like this:
{quote}Statistics sampling thread detected a wakeup delay of 14318 ms,
indicating a possible resource issue. Check the GC, memory, and CPU
statistics.{quote}
(from {{checkElapsedSleepTime}})
The current ticket is needed because the actual thread of interest for
heartbeat generation is the heartbeat-generation thread and sometimes it
oversleeps when the stat sampler thread does not.
was:
Sometimes we see a member force-disconnected and we see a preceding gap in the
regular sequence of heartbeats generated by the member, but we can't explain
why there was a gap. The explanation we are searching for is usually CPU
saturation. We look for secondary evidence such as gaps in the regular sequence
of statistics e.g. StatSampler sampleCount. When we can't find such secondary
evidence, we can't, in good conscience, rule out bugs in the heartbeat
generation logic itself.
The heartbeat generation logic consists mainly of a thread that loops forever.
Each time through the loop it sleeps for member-timeout / logical-interval. By
default that's 5s / 2 = 2.5s. When it wakes up it sends unreliable UDP unicast
messages to the coordinator and the two non-coordinator members to its "left"
(earlier) in the view. If that heartbeat generation thread oversleeps or
doesn't get adequate time slices when it's awake then heartbeats will be
delayed. There will be gaps in the regular sequence.
When this ticket is complete, a warning-level message will be logged if the
heartbeat generation thread (see {{GMSHealthMonitor.startHeartbeatThread()}})
oversleeps by more than the sleep interval (member-timeout / logical-interval),
i.e. if it is asleep for more than 2 * (member-timeout / logical-interval), the
warning will be logged.
h3. See Also
{{HostStatSample}} generates messages like this:
{quote}Statistics sampling thread detected a wakeup delay of 14318 ms,
indicating a possible resource issue
. Check the GC, memory, and CPU statistics.{quote}
(from {{checkElapsedSleepTime}})
The current ticket is needed because the actual thread of interest for
heartbeat generation is the heartbeat-generation thread and sometimes it
oversleeps when the stat sampler thread does not.
> Heartbeats Are Interrupted Inexplicably
> ---------------------------------------
>
> Key: GEODE-9180
> URL: https://issues.apache.org/jira/browse/GEODE-9180
> Project: Geode
> Issue Type: Bug
> Components: membership
> Reporter: Bill Burcham
> Priority: Major
> Labels: pull-request-available
>
> Sometimes we see a member force-disconnected and we see a preceding gap in
> the regular sequence of heartbeats generated by the member, but we can't
> explain why there was a gap. The explanation we are searching for is usually
> CPU saturation. We look for secondary evidence such as gaps in the regular
> sequence of statistics e.g. StatSampler sampleCount. When we can't find such
> secondary evidence, we can't, in good conscience, rule out bugs in the
> heartbeat generation logic itself.
> The heartbeat generation logic consists mainly of a thread that loops
> forever. Each time through the loop it sleeps for member-timeout /
> logical-interval. By default that's 5s / 2 = 2.5s. When it wakes up it sends
> unreliable UDP unicast messages to the coordinator and the two
> non-coordinator members to its "left" (earlier) in the view. If that
> heartbeat generation thread oversleeps or doesn't get adequate time slices
> when it's awake then heartbeats will be delayed. There will be gaps in the
> regular sequence.
> When this ticket is complete, a warning-level message will be logged
> whenever the heartbeat generation thread (see
> {{GMSHealthMonitor.startHeartbeatThread()}}) oversleeps by more than the
> sleep interval (member-timeout / logical-interval), i.e. whenever it is
> asleep for more than 2 * (member-timeout / logical-interval). With the
> defaults above, that means a sleep longer than 5s.
> h3. See Also
> {{HostStatSampler}} generates messages like this:
> {quote}Statistics sampling thread detected a wakeup delay of 14318 ms,
> indicating a possible resource issue. Check the GC, memory, and CPU
> statistics.{quote}
> (from {{checkElapsedSleepTime}})
> The current ticket is needed because the actual thread of interest for
> heartbeat generation is the heartbeat-generation thread and sometimes it
> oversleeps when the stat sampler thread does not.