[ https://issues.apache.org/jira/browse/GEODE-8809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17297654#comment-17297654 ]

Bill Burcham commented on GEODE-8809:
-------------------------------------

In diagnosing this problem, we rapidly moved through the decision tree (see 
description above) and found ourselves focused on CPU saturation as the likely 
culprit. Indeed, while this investigation was going on, at least one of the 
tests exhibiting the failure had its instance size increased on the assumption 
that CPU saturation was the root cause.

It would have been nice if load average could have been used to prove CPU 
saturation here. While “high load average” has often been cited as 
justification for adding resources (CPU, memory) to tests, it became apparent 
that there is no agreed-upon, decidable criterion for doing so. We have 
many tests that regularly operate with peak load averages per core in the 40s 
(!). If a seemingly outlandish (absolute) load average indicated definitively 
that more CPU was needed, we would have to scale up many of our test 
configurations by hundreds of percent. That may end up being the answer, but it 
wasn’t one we entertained in this effort. So we kept searching…
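
For reference, the “load average per core” figure above is just the system 
load average divided by the number of available processors. A minimal sketch 
of that calculation using the standard OperatingSystemMXBean (the 1.0 
comparison in the comment is only a rule of thumb for illustration, not an 
agreed-upon criterion):

    import java.lang.management.ManagementFactory;
    import java.lang.management.OperatingSystemMXBean;

    public class LoadPerCore {
        public static void main(String[] args) {
            OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
            // 1-minute load average as reported by the OS; -1.0 if unavailable
            double loadAverage = os.getSystemLoadAverage();
            int cores = os.getAvailableProcessors();
            if (loadAverage < 0) {
                System.out.println("load average not available on this platform");
                return;
            }
            double perCore = loadAverage / cores;
            // A per-core value above 1.0 means runnable tasks outnumber CPUs on
            // average, but as noted above we routinely see healthy tests in the
            // 40s, so this is not by itself a decidable saturation criterion.
            System.out.printf("load average %.2f over %d cores = %.2f per core%n",
                loadAverage, cores, perCore);
        }
    }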

Through many test runs we saw plenty of forced-disconnects, but we weren’t able 
to reliably reproduce them in the absence of drop-outs of StatSampler 
sampleCount. (All the test failures associated with this ticket show a 
forced-disconnect in the absence of a sampleCount disruption.) What we did see 
over many runs with forced-disconnects was that the drop-out duration varied 
over a range from about 2 seconds to as high as 20 seconds or so. Extra logging 
in the heartbeat-generation task showed that when a member gets 
force-disconnected, its heartbeat generation thread is slowed way down. A log 
statement like this showed a 20-second disparity between the timestamp in the 
message and the timestamp on the log entry:

logger.info("BGB woke from sleep " + System.nanoTime());
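
To turn that eyeballed disparity into a number, the same measurement can be 
made directly in a heartbeat-style loop: record when the thread intended to 
wake and compare it with when it actually woke. This is only a standalone 
sketch of the probe (it is not the GMSHealthMonitor code; the 1-second 
interval and 2-second threshold are illustrative):

    public class SleepOvershootProbe {
        public static void main(String[] args) throws InterruptedException {
            final long intervalMillis = 1000; // illustrative heartbeat interval
            while (!Thread.currentThread().isInterrupted()) {
                long before = System.nanoTime();
                Thread.sleep(intervalMillis);
                long after = System.nanoTime();
                // How far past the requested interval did we actually wake up?
                long overshootMillis = (after - before) / 1_000_000 - intervalMillis;
                if (overshootMillis > 2000) {
                    // A multi-second overshoot is the same signature seen in the
                    // "woke from sleep" log line: the thread was ready to run but
                    // either its sleep was delayed or it was starved of CPU.
                    System.out.println("heartbeat sleep overshot by "
                        + overshootMillis + " ms");
                }
            }
        }
    }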

The consensus theory is that, because the sampleCount drop-out accompanying 
forced-disconnects varies over such a wide range, the presence of a sampleCount 
drop-out indicates CPU saturation, but the absence of a drop-out does not 
indicate the absence of CPU saturation. There is certainly a distribution of 
error magnitudes in Thread.sleep() wakeup times, and similarly a distribution 
of time-slice frequency and duration across threads. So while we expect 
sampleCount to slow down when the heartbeat generation thread slows down, 
there is no simple, direct relationship between the two.
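
For clarity, a “drop-out” here means a gap between consecutive StatSampler 
samples that is much larger than the nominal 1-second sample interval. A rough 
sketch of the gap scan we did by inspecting stat archives (the class name, the 
sampleTimesMillis input, and the 1000 ms nominal interval are assumptions for 
illustration; this is not a Geode API):

    import java.util.List;

    public class SampleGapFinder {
        // Nominal StatSampler interval; assumed 1000 ms for this illustration.
        private static final long NOMINAL_INTERVAL_MILLIS = 1000;

        /** Prints every gap between consecutive sample timestamps exceeding the threshold. */
        public static void findDropOuts(List<Long> sampleTimesMillis, long thresholdMillis) {
            for (int i = 1; i < sampleTimesMillis.size(); i++) {
                long gap = sampleTimesMillis.get(i) - sampleTimesMillis.get(i - 1);
                if (gap > thresholdMillis) {
                    System.out.printf("drop-out: %d ms gap after sample %d (expected ~%d ms)%n",
                        gap, i - 1, NOMINAL_INTERVAL_MILLIS);
                }
            }
        }

        public static void main(String[] args) {
            // A 20-second drop-out like the worst ones we observed.
            findDropOuts(List.of(0L, 1000L, 2000L, 22000L, 23000L),
                2 * NOMINAL_INTERVAL_MILLIS);
        }
    }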

We don’t have a satisfying answer, though, for why we weren’t able to reproduce 
the forced-disconnect with no apparent sampleCount drop-out at all. Maybe we 
should have tried more runs.

In an effort to pursue the CPU saturation theory further, a new statistic type, 
LinuxThreadScheduler, was added (on a branch) to capture the Linux 
/proc/schedstat statistics. A ticket has since been created for that branch: 
GEODE-9002 (Add New Statistic Type For /proc/schedstat), PR #6090. 
Unfortunately, we were not able to show, through that statistic, the CPU 
saturation we believed was the root cause. The processes for the members 
receiving the forced-disconnect were seeing meanTaskQueuedTimeNanos in a fairly 
consistent range throughout the test (5 ms to 20 ms), with no significant 
increase before or during the time when heartbeats stopped.
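
For context, meanTaskQueuedTimeNanos comes from the per-CPU lines of 
/proc/schedstat. A minimal sketch of the calculation, assuming the version-15 
field layout described in the kernel’s sched-stats documentation, where the 
8th and 9th numeric fields after the cpu label are total time tasks spent 
waiting to run (ns) and the number of timeslices run (this is my reading of 
that format, not the GEODE-9002 branch code; a real sampler would also diff 
two snapshots rather than report the cumulative mean since boot):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;

    public class SchedstatSnapshot {
        public static void main(String[] args) throws IOException {
            long totalWaitNanos = 0;
            long totalTimeslices = 0;
            List<String> lines = Files.readAllLines(Paths.get("/proc/schedstat"));
            for (String line : lines) {
                if (!line.startsWith("cpu")) {
                    continue; // skip the version/timestamp and per-domain lines
                }
                String[] fields = line.trim().split("\\s+");
                // fields[0] is the cpu label; assuming the version-15 layout:
                // fields[8] = total time tasks spent waiting to run (ns)
                // fields[9] = number of timeslices run on this cpu
                totalWaitNanos += Long.parseLong(fields[8]);
                totalTimeslices += Long.parseLong(fields[9]);
            }
            if (totalTimeslices > 0) {
                // In the tests discussed above this stayed in roughly the 5-20 ms range.
                System.out.printf("meanTaskQueuedTimeNanos ~ %d ns%n",
                    totalWaitNanos / totalTimeslices);
            }
        }
    }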

At this point it seems that we have spent an appropriate amount of effort on 
this ticket. I believe that the GMSHealthMonitor heartbeat generation task is 
functioning properly, and that when it stops generating heartbeats it is 
because the heartbeat generation thread’s Thread.sleep() is delayed and/or 
that thread is not getting sufficient, periodic time slices. While I wasn’t 
able to show that with /proc/schedstat, it might be possible to show it with a 
tool like Brendan Gregg’s Linux bcc/BPF run queue (scheduler) latency tool, 
runqlat, which captures the distribution of delays that ready-to-run threads 
spend waiting for a CPU, rather than merely the average captured by the new 
statistic mentioned above.

While this answer is not fully satisfying it’s going to have to suffice for 
now. I found no product bug, so I am closing this ticket as “won’t fix”.

> Servers are missing heartbeats from a member
> --------------------------------------------
>
>                 Key: GEODE-8809
>                 URL: https://issues.apache.org/jira/browse/GEODE-8809
>             Project: Geode
>          Issue Type: Bug
>          Components: messaging
>            Reporter: Nabarun Nag
>            Assignee: Bill Burcham
>            Priority: Major
>              Labels: blocks-1.14.0
>
> We see this characteristic failure in a number of proprietary applications:
>  * a member stops sending heartbeats
>  * the coordinator requests an availability test from that member
>  * the member receives the request after a delay
>  * the delay causes the server to be kicked out (receives a 
> ForcedDisconnectException)
>  * operations fail
>  * the server reconnects
> Usually when the failure detector/health monitor kicks a member out of the 
> distributed system it is for one of these reasons:
> 1. Member really was malfunctioning or unreachable (i.e. something outside of 
> health monitoring had a problem)
>   a. Network problems
>     i. Partition: 2-way, N-way
>     ii. Slowdown or error rate increase
>   b. CPU was over-taxed in faulty member. We see gaps on the order of 10s or 
> more in heartbeat generation on that member.
>     i. Geode was running in a virtualized environment and the virtualization 
> system didn’t give the Geode process sufficient CPU
>     ii. JVM memory was over-utilized so garbage collection (pauses) took too 
> long
>     iii. There was simply too much CPU demand and the product failed to 
> reserve enough CPU capacity to keep the heartbeat going
> This ticket captures situations where the failure detector causes a member to 
> be kicked out *but we cannot definitively prove any of these as the root 
> cause*.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
