Detect slow GC loops of death
-----------------------------

                 Key: HBASE-3206
                 URL: https://issues.apache.org/jira/browse/HBASE-3206
             Project: HBase
          Issue Type: Improvement
            Reporter: Jean-Daniel Cryans
             Fix For: 0.92.0


Something that has been bothering me for a while is figuring out when a 
region server is slow because of frequent, small GC pauses. I usually go into 
that RS's GC output, watch it for a while, and then decide whether it's under 
some kind of memory pressure. Here's an example (grepped "Full" from the 
GC log):

{noformat}
12:03:42.460-0800: [Full GC [CMS2010-11-08T12:03:43.081-0800: [CMS-concurrent-mark: 4.381/5.819 secs] [Times: user=60.51 sys=2.54, real=5.82 secs]
12:04:06.916-0800: [Full GC [CMS2010-11-08T12:04:07.316-0800: [CMS-concurrent-mark: 4.006/5.080 secs] [Times: user=55.16 sys=2.13, real=5.08 secs]
12:04:32.559-0800: [Full GC [CMS2010-11-08T12:04:33.286-0800: [CMS-concurrent-mark: 4.133/5.303 secs] [Times: user=53.61 sys=2.40, real=5.30 secs]
12:05:24.299-0800: [Full GC [CMS2010-11-08T12:05:25.397-0800: [CMS-concurrent-sweep: 1.325/1.388 secs] [Times: user=4.66 sys=0.15, real=1.38 secs]
12:05:50.069-0800: [Full GC [CMS2010-11-08T12:05:50.240-0800: [CMS-concurrent-mark: 4.831/6.346 secs] [Times: user=69.43 sys=2.76, real=6.35 secs]
12:06:16.146-0800: [Full GC [CMS2010-11-08T12:06:16.631-0800: [CMS-concurrent-mark: 4.942/7.010 secs] [Times: user=69.25 sys=2.69, real=7.01 secs]
12:07:08.899-0800: [Full GC [CMS2010-11-08T12:07:10.033-0800: [CMS-concurrent-sweep: 1.197/1.202 secs] [Times: user=1.96 sys=0.04, real=1.20 secs]
12:08:01.871-0800: [Full GC [CMS2010-11-08T12:08:01.949-0800: [CMS-concurrent-mark: 4.154/5.443 secs] [Times: user=61.11 sys=2.29, real=5.44 secs]
12:08:53.343-0800: [Full GC [CMS2010-11-08T12:08:53.549-0800: [CMS-concurrent-mark: 4.447/5.713 secs] [Times: user=65.19 sys=2.42, real=5.72 secs]
12:09:42.841-0800: [Full GC [CMS2010-11-08T12:09:43.664-0800: [CMS-concurrent-mark: 4.025/5.053 secs] [Times: user=51.40 sys=2.02, real=5.06 secs]
{noformat}

In this case, that RS's TT was down, so it was getting all the non-local maps 
at the end of the job at the same time... generating >1000% CPU usage. With 
scanner caching set to 10k, it's easy to understand that there's memory 
pressure, since we have all those objects in flight that we don't account for.
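As an illustration of that pressure, a client-side Scan configured with a much 
smaller caching value keeps far fewer rows in flight per scanner. This is just 
a hedged sketch using the standard Scan.setCaching() client call; the class 
name and the 100-row value are arbitrary examples, not a recommendation for a 
specific setting:

{code}
import org.apache.hadoop.hbase.client.Scan;

public class ScanCachingExample {
  // Hypothetical helper: returns a Scan with a modest caching value so the
  // server buffers far fewer rows per next() call than with caching=10000.
  public static Scan newModestScan() {
    Scan scan = new Scan();
    scan.setCaching(100); // 100 rows per RPC; 10k keeps ~100x more in flight
    return scan;
  }
}
{code}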

One solution I was thinking of is a sleeper thread that sleeps for 1 sec in a 
loop and logs whenever it sees that it actually slept for a bit more than 1 
sec. Then, say, the region server records that it saw a few of those within x 
minutes and decides to somehow throttle the traffic. A rough sketch follows.
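Here's a minimal sketch of that detector. The class name, the 500 ms slack 
before warning, and the throttling hook are all made-up illustrations of the 
idea, not an actual implementation:

{code}
public class SleeperPauseDetector extends Thread {
  private static final long SLEEP_MS = 1000;      // sleep 1 sec at a time
  private static final long WARN_EXTRA_MS = 500;  // assumed slack before we complain

  private volatile int recentLongSleeps = 0;      // could be a sliding window in practice

  public SleeperPauseDetector() {
    setDaemon(true);
    setName("SleeperPauseDetector");
  }

  @Override
  public void run() {
    while (!isInterrupted()) {
      long start = System.currentTimeMillis();
      try {
        Thread.sleep(SLEEP_MS);
      } catch (InterruptedException e) {
        return;
      }
      long elapsed = System.currentTimeMillis() - start;
      if (elapsed > SLEEP_MS + WARN_EXTRA_MS) {
        recentLongSleeps++;
        System.err.println("Slept " + elapsed + "ms instead of " + SLEEP_MS
            + "ms, likely a GC pause (" + recentLongSleeps + " seen recently)");
        // If enough of these pile up within x minutes, the region server
        // could react here, e.g. by throttling incoming traffic.
      }
    }
  }
}
{code}

The point of measuring our own sleep rather than parsing GC logs is that any 
stall of the whole JVM shows up, regardless of which collector or log format 
is in use.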

What I often saw is that if this situation goes unnoticed, we end up GCing 
more and more, and in some cases I saw a region server go almost zombie for 2 
hours before finally getting its lease expired.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
