Detect slow GC loops of death
-----------------------------
Key: HBASE-3206
URL: https://issues.apache.org/jira/browse/HBASE-3206
Project: HBase
Issue Type: Improvement
Reporter: Jean-Daniel Cryans
Fix For: 0.92.0
Something that has been bothering me for a while is figuring out when a
region server is slow because of frequent, small GC pauses. I usually go into
that RS's GC output, watch it for a while, and then decide whether it's under
some kind of memory pressure. Here's an example (grepped "Full" from the
GC log):
{noformat}
12:03:42.460-0800: [Full GC [CMS2010-11-08T12:03:43.081-0800:
[CMS-concurrent-mark: 4.381/5.819 secs] [Times: user=60.51 sys=2.54, real=5.82
secs]
12:04:06.916-0800: [Full GC [CMS2010-11-08T12:04:07.316-0800:
[CMS-concurrent-mark: 4.006/5.080 secs] [Times: user=55.16 sys=2.13, real=5.08
secs]
12:04:32.559-0800: [Full GC [CMS2010-11-08T12:04:33.286-0800:
[CMS-concurrent-mark: 4.133/5.303 secs] [Times: user=53.61 sys=2.40, real=5.30
secs]
12:05:24.299-0800: [Full GC [CMS2010-11-08T12:05:25.397-0800:
[CMS-concurrent-sweep: 1.325/1.388 secs] [Times: user=4.66 sys=0.15, real=1.38
secs]
12:05:50.069-0800: [Full GC [CMS2010-11-08T12:05:50.240-0800:
[CMS-concurrent-mark: 4.831/6.346 secs] [Times: user=69.43 sys=2.76, real=6.35
secs]
12:06:16.146-0800: [Full GC [CMS2010-11-08T12:06:16.631-0800:
[CMS-concurrent-mark: 4.942/7.010 secs] [Times: user=69.25 sys=2.69, real=7.01
secs]
12:07:08.899-0800: [Full GC [CMS2010-11-08T12:07:10.033-0800:
[CMS-concurrent-sweep: 1.197/1.202 secs] [Times: user=1.96 sys=0.04, real=1.20
secs]
12:08:01.871-0800: [Full GC [CMS2010-11-08T12:08:01.949-0800:
[CMS-concurrent-mark: 4.154/5.443 secs] [Times: user=61.11 sys=2.29, real=5.44
secs]
12:08:53.343-0800: [Full GC [CMS2010-11-08T12:08:53.549-0800:
[CMS-concurrent-mark: 4.447/5.713 secs] [Times: user=65.19 sys=2.42, real=5.72
secs]
12:09:42.841-0800: [Full GC [CMS2010-11-08T12:09:43.664-0800:
[CMS-concurrent-mark: 4.025/5.053 secs] [Times: user=51.40 sys=2.02, real=5.06
secs]
{noformat}
In this case, that RS's TT was down, so at the end of the job it was getting all
the non-local maps at the same time... generating >1000% CPU usage. With
scanner caching set to 10k, it's easy to understand why there's memory
pressure, since all of those in-flight objects aren't accounted for.
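For context, the client-side setting in question looks roughly like this; the
class name is just for illustration, and 10,000 mirrors the caching value
mentioned above:
{code:java}
import org.apache.hadoop.hbase.client.Scan;

public class AggressiveScanExample {
  public static void main(String[] args) {
    // A caching value of 10,000 asks the region server to buffer that many rows
    // per next() RPC, which is where the unaccounted-for memory pressure comes from.
    Scan scan = new Scan();
    scan.setCaching(10000);
    System.out.println("Scanner caching: " + scan.getCaching());
    // The scan would then be passed to HTable.getScanner(scan) by the client.
  }
}
{code}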
One solution I was thinking of is a sleeper thread that sleeps for 1 sec in a
loop and logs whenever it notices it actually slept for a bit more than 1 sec.
Then, say, the region server records that it saw a few of those within x
minutes and decides to somehow throttle the traffic.
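A minimal sketch of that sleeper-thread idea follows. The class name,
thresholds, and the "throttle" hook are all illustrative, not taken from any
existing HBase code:
{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Sketch of the sleeper-thread idea: sleep for a fixed interval and, when the
 * actual elapsed time is noticeably longer, assume the JVM was stalled (most
 * likely by GC) and record it. Names and thresholds here are illustrative.
 */
public class GcPauseDetector implements Runnable {

  private static final long SLEEP_MS = 1000;     // nominal sleep time
  private static final long THRESHOLD_MS = 500;  // extra delay counted as a "pause"
  private static final long WINDOW_MS = 60000;   // look-back window for counting pauses
  private static final int PAUSES_BEFORE_ACTION = 3;

  private final Deque<Long> recentPauses = new ArrayDeque<Long>();

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      long start = System.nanoTime();
      try {
        Thread.sleep(SLEEP_MS);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      long elapsedMs = (System.nanoTime() - start) / 1000000L;
      long extraMs = elapsedMs - SLEEP_MS;
      if (extraMs > THRESHOLD_MS) {
        long now = System.currentTimeMillis();
        recentPauses.addLast(now);
        // Forget pauses that have fallen out of the look-back window.
        while (!recentPauses.isEmpty() && now - recentPauses.peekFirst() > WINDOW_MS) {
          recentPauses.removeFirst();
        }
        System.err.println("Detected a JVM pause of ~" + extraMs + "ms; "
            + recentPauses.size() + " pause(s) in the last " + (WINDOW_MS / 1000) + "s");
        if (recentPauses.size() >= PAUSES_BEFORE_ACTION) {
          // This is where a region server could decide to throttle traffic.
          System.err.println("Repeated pauses detected, throttling would kick in here");
        }
      }
    }
  }

  public static void main(String[] args) {
    // In a region server this would be started once at startup.
    new Thread(new GcPauseDetector(), "gc-pause-detector").start();
  }
}
{code}
The look-back window is the important part: a single long pause is normal, but
several of them within a short interval is the "GC loop of death" signature
this issue is about.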
What I often saw is that if this situation goes unnoticed, we end up GCing
more and more, and in some cases I saw a region server go almost zombie for 2
hours before finally getting its lease expired.