This problem is so mysterious to me that I am hesitant to make any definitive statements about its scope. In the staging environment where the problem was originally seen, the GC logs were not enabled. In that environment the session timeouts persisted for a long time (and operations tried several restarts to clear the issue). Here are some sample logs:
$ grep -e starting -e "session time" uverse-staging-admin-0-20150423150103-kafka.log
[2015-04-23 14:23:53,837] INFO Client session timed out, have not heard from server in 10066ms for sessionid 0x14ce69c464f0009, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2015-04-23 14:31:56,133] INFO [Kafka Server 0], starting (kafka.server.KafkaServer)
[2015-04-23 14:32:08,636] INFO Client session timed out, have not heard from server in 10422ms for sessionid 0x14ce6b1209f0002, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2015-04-23 14:32:21,641] INFO Client session timed out, have not heard from server in 10643ms for sessionid 0x14ce6b1209f0003, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2015-04-23 14:32:33,614] INFO Client session timed out, have not heard from server in 9842ms for sessionid 0x14ce6b1209f0004, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2015-04-23 14:32:45,589] INFO Client session timed out, have not heard from server in 9878ms for sessionid 0x14ce6b1209f0005, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2015-04-23 14:32:57,906] INFO Client session timed out, have not heard from server in 9823ms for sessionid 0x14ce6b1209f0006, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2015-04-23 14:33:10,701] INFO Client session timed out, have not heard from server in 10365ms for sessionid 0x14ce6b1209f0007, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2015-04-23 14:33:22,475] INFO Client session timed out, have not heard from server in 10006ms for sessionid 0x14ce6b1209f0008, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2015-04-23 14:33:46,372] INFO Client session timed out, have not heard from server in 11062ms for sessionid 0x14ce6b1209f0009, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2015-04-23 14:42:11,942] INFO Client session timed out, have not heard from server in 12582ms for sessionid 0x14ce6b1209f000a, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2015-04-23 14:55:47,273] INFO Client session timed out, have not heard from server in 12935ms for sessionid 0x14ce6b1209f000b, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)

The specific configuration of Kafka (heap size, number of topics, ...) seems to have a big influence on the behavior. Using the test program I have generally seen the ParNew GC times eventually recover, but it can take a long time (and sometimes I kill the process before it recovers). Here is an example of one that persists for 2 ParNew GC cycles:

[ec2-user@ip-10-136-75-2 ~]$ jdk1.8.0_45/bin/java -Xmx4G -Xms4G -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails LongParNewPause $((500*1024*1024)) 100 10
0.289: [GC (Allocation Failure) 0.289: [ParNew: 272640K->27707K(306688K), 0.0322014 secs] 784640K->539707K(4160256K), 0.0322659 secs] [Times: user=0.10 sys=0.03, real=0.04 secs]
0.360: [GC (Allocation Failure) 0.360: [ParNew: 300347K->34048K(306688K), 3.5932171 secs] 812347K->572709K(4160256K), 3.5932638 secs] [Times: user=14.12 sys=0.05, real=3.60 secs]
3.992: [GC (Allocation Failure) 3.992: [ParNew: 306688K->34048K(306688K), 53.1317959 secs] 845349K->599722K(4160256K), 53.1318434 secs] [Times: user=212.40 sys=0.00, real=53.13 secs]
57.162: [GC (Allocation Failure) 57.162: [ParNew: 306688K->34048K(306688K), 0.0500041 secs] 872362K->626660K(4160256K), 0.0500528 secs] [Times: user=0.17 sys=0.01, real=0.05 secs]
57.250: [GC (Allocation Failure) 57.250: [ParNew: 306688K->34048K(306688K), 0.0503560 secs] 899300K->658186K(4160256K), 0.0504036 secs] [Times: user=0.17 sys=0.01, real=0.05 secs]
57.339: [GC (Allocation Failure)
57.339: [ParNew: 306688K->34048K(306688K), 0.0510655 secs] 930826K->687994K(4160256K), 0.0511131 secs] [Times: user=0.17 sys=0.00, real=0.05 secs]
57.428: [GC (Allocation Failure) 57.428: [ParNew: 306688K->34048K(306688K), 0.0516372 secs] 960634K->720213K(4160256K), 0.0516853 secs] [Times: user=0.17 sys=0.01, real=0.05 secs]

Just changing the fraction of objects saved from "1 out of 10" to "1 out of 100" can dramatically increase the recovery time to 11 ParNew GCs:

[ec2-user@ip-10-136-75-2 ~]$ jdk1.8.0_45/bin/java -Xmx4G -Xms4G -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails LongParNewPause $((500*1024*1024)) 100 100
0.290: [GC (Allocation Failure) 0.290: [ParNew: 272640K->3052K(306688K), 0.0161574 secs] 784640K->515052K(4160256K), 0.0162239 secs] [Times: user=0.05 sys=0.02, real=0.01 secs]
0.346: [GC (Allocation Failure) 0.346: [ParNew: 275692K->7135K(306688K), 0.0156640 secs] 787692K->519135K(4160256K), 0.0157112 secs] [Times: user=0.06 sys=0.00, real=0.02 secs]
0.401: [GC (Allocation Failure) 0.401: [ParNew: 279775K->10158K(306688K), 0.0162892 secs] 791775K->522158K(4160256K), 0.0163360 secs] [Times: user=0.06 sys=0.00, real=0.02 secs]
0.457: [GC (Allocation Failure) 0.457: [ParNew: 282798K->15495K(306688K), 0.0174881 secs] 794798K->527495K(4160256K), 0.0175358 secs] [Times: user=0.07 sys=0.00, real=0.02 secs]
0.513: [GC (Allocation Failure) 0.513: [ParNew: 288135K->21121K(306688K), 0.0187323 secs] 800135K->533121K(4160256K), 0.0187802 secs] [Times: user=0.08 sys=0.01, real=0.02 secs]
0.571: [GC (Allocation Failure) 0.571: [ParNew: 293761K->26920K(306688K), 0.0193083 secs] 805761K->538920K(4160256K), 0.0193544 secs] [Times: user=0.08 sys=0.00, real=0.02 secs]
0.630: [GC (Allocation Failure) 0.630: [ParNew: 299560K->21225K(306688K), 0.8494763 secs] 811560K->536148K(4160256K), 0.8495230 secs] [Times: user=3.38 sys=0.00, real=0.85 secs]
1.519: [GC (Allocation Failure) 1.519: [ParNew: 293865K->21322K(306688K),
4.5105759 secs] 808788K->538970K(4160256K), 4.5106276 secs] [Times: user=18.02 sys=0.01, real=4.51 secs]
6.069: [GC (Allocation Failure) 6.069: [ParNew: 293962K->21348K(306688K), 9.0200268 secs] 811610K->541719K(4160256K), 9.0200810 secs] [Times: user=36.04 sys=0.00, real=9.02 secs]
15.128: [GC (Allocation Failure) 15.128: [ParNew: 293988K->21357K(306688K), 14.4123763 secs] 814359K->544451K(4160256K), 14.4124320 secs] [Times: user=57.61 sys=0.00, real=14.41 secs]
29.580: [GC (Allocation Failure) 29.580: [ParNew: 293997K->21833K(306688K), 19.8231772 secs] 817091K->547646K(4160256K), 19.8232320 secs] [Times: user=79.23 sys=0.01, real=19.82 secs]
49.442: [GC (Allocation Failure) 49.442: [ParNew: 294473K->21827K(306688K), 24.5076227 secs] 820286K->550367K(4160256K), 24.5076707 secs] [Times: user=97.96 sys=0.00, real=24.51 secs]
73.989: [GC (Allocation Failure) 73.989: [ParNew: 294467K->21825K(306688K), 29.8300475 secs] 823007K->553082K(4160256K), 29.8300951 secs] [Times: user=119.25 sys=0.01, real=29.83 secs]
103.858: [GC (Allocation Failure) 103.858: [ParNew: 294465K->25919K(306688K), 36.1503266 secs] 825722K->559903K(4160256K), 36.1503960 secs] [Times: user=144.50 sys=0.00, real=36.15 secs]
140.048: [GC (Allocation Failure) 140.048: [ParNew: 298559K->30008K(306688K), 42.2171055 secs] 832543K->566717K(4160256K), 42.2171619 secs] [Times: user=168.78 sys=0.00, real=42.21 secs]
182.305: [GC (Allocation Failure) 182.305: [ParNew: 302648K->22540K(306688K), 48.9293150 secs] 839357K->561976K(4160256K), 48.9293635 secs] [Times: user=195.62 sys=0.01, real=48.93 secs]
231.273: [GC (Allocation Failure) 231.273: [ParNew: 295180K->31702K(306688K), 54.8912986 secs] 834616K->573850K(4160256K), 54.8913474 secs] [Times: user=219.46 sys=0.01, real=54.89 secs]
286.204: [GC (Allocation Failure) 286.204: [ParNew: 304342K->21761K(306688K), 0.0212730 secs] 846490K->566635K(4160256K), 0.0213218 secs] [Times: user=0.07 sys=0.00, real=0.02 secs]
286.264: [GC (Allocation Failure) 286.264:
[ParNew: 294401K->31528K(306688K), 0.0204699 secs] 839275K->579129K(4160256K), 0.0205177 secs] [Times: user=0.08 sys=0.00, real=0.02 secs]
286.324: [GC (Allocation Failure) 286.324: [ParNew: 304168K->21717K(306688K), 0.0199914 secs] 851769K->572045K(4160256K), 0.0200406 secs] [Times: user=0.07 sys=0.00, real=0.02 secs]
286.383: [GC (Allocation Failure) 286.383: [ParNew: 294357K->31518K(306688K), 0.0202584 secs] 844685K->584571K(4160256K), 0.0203057 secs] [Times: user=0.08 sys=0.00, real=0.02 secs]

Interestingly, the last slow pause in both cases is about the same: 53 to 54 seconds. I have seen that type of pattern hold up with different heap sizes (although the max time increases with heap size). That is probably a clue to understanding the underlying problem. I just tried 1 out of 1000 and it is still running:

[ec2-user@ip-10-136-75-2 ~]$ jdk1.8.0_45/bin/java -Xmx4G -Xms4G -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails LongParNewPause $((500*1024*1024)) 100 1000
0.285: [GC (Allocation Failure) 0.285: [ParNew: 272640K->548K(306688K), 0.0134350 secs] 784640K->512548K(4160256K), 0.0134952 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]
0.335: [GC (Allocation Failure) 0.335: [ParNew: 273188K->1141K(306688K), 0.0122108 secs] 785188K->513141K(4160256K), 0.0122587 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
0.383: [GC (Allocation Failure) 0.383: [ParNew: 273781K->1443K(306688K), 0.0124031 secs] 785781K->513443K(4160256K), 0.0124503 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
0.431: [GC (Allocation Failure) 0.431: [ParNew: 274083K->1754K(306688K), 0.0122479 secs] 786083K->513754K(4160256K), 0.0122919 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
0.479: [GC (Allocation Failure) 0.479: [ParNew: 274394K->2553K(306688K), 0.0125259 secs] 786394K->514553K(4160256K), 0.0125721 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]
0.527: [GC (Allocation Failure) 0.527: [ParNew: 275193K->3127K(306688K), 0.0130623 secs]
787193K->515127K(4160256K), 0.0131086 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]
0.575: [GC (Allocation Failure) 0.576: [ParNew: 275767K->3219K(306688K), 0.6621319 secs] 787767K->515744K(4160256K), 0.6621784 secs] [Times: user=2.64 sys=0.00, real=0.66 secs]
... [ 38 ParNew logs deleted ] ...
405.220: [GC (Allocation Failure) 405.220: [ParNew: 274901K->2669K(306688K), 20.4550033 secs] 798938K->526979K(4160256K), 20.4550536 secs] [Times: user=81.77 sys=0.01, real=20.45 secs]
425.711: [GC (Allocation Failure) 425.711: [ParNew: 275309K->2669K(306688K), 20.9100008 secs] 799619K->527251K(4160256K), 20.9100495 secs] [Times: user=83.59 sys=0.01, real=20.91 secs]
446.657: [GC (Allocation Failure) 446.657: [ParNew: 275309K->2669K(306688K), 22.5119299 secs] 799891K->527524K(4160256K), 22.5119871 secs] [Times: user=89.90 sys=0.02, real=22.51 secs]
469.204: [GC (Allocation Failure) 469.204: [ParNew: 275309K->2669K(306688K), 21.8446029 secs] 800164K->527797K(4160256K), 21.8446524 secs] [Times: user=87.31 sys=0.01, real=21.84 secs]
491.085: [GC (Allocation Failure) 491.085: [ParNew

James,

Thanks for reporting this. So the issue only shows up at Kafka startup time? After that, the GC time is back to normal again?

Jun

On Tue, Apr 28, 2015 at 1:48 PM, James Lent <jl...@tivo.com> wrote:

I have filed a bug report with Oracle on this issue today. I have not yet been assigned a bug number. I thought the issue might be of general interest to the Kafka community, and I have a suggestion for a configuration change that works around the issue in case other people run into it.

First, the issue I am about to describe will be hard to believe, but I do provide a short Java program that demonstrates the JVM GC issue that triggers the long ParNew GC pauses under certain conditions. Before you dismiss the analysis out of hand, I suggest you try the test program below. I have tested it in multiple environments: a Linux dev box, a Linux laptop, and a Linux EC2 instance.
Other engineers have also done so in their own environments. I would be interested to know what you see.

First, the conditions:

1) More than a 1G heap (we ran into it when we increased the heap from 1G to 2G).
2) At least some topics defined. Not sure of the exact number; we don't see this issue in all our environments.
3) Log compaction enabled.
4) A JVM other than Oracle 6. I have reproduced the underlying issue (with the test program below) using OpenJDK 6, Oracle 7, OpenJDK 7, Oracle 8, and OpenJDK 8.

Under these conditions we can see 9 second or longer pauses due to individual ParNew GC events (longer than the default 6 second ZooKeeper session timeout). It tends to be very reproducible in a given environment (i.e. it either always happens or it never happens). We are using Kafka 0.8.1.1.

The long ParNew GC delays are triggered by the allocation of a large static object during startup, specifically the SkimpyOffsetMap byte buffer. There is no reason why this should be the case, and that is why I have opened a JVM bug with Oracle. The even more unbelievable part is that this bug only seems to trigger a problem for static objects in the 480M to 512M range. The default size of this byte buffer is 500M. Our current workaround is to configure this buffer to use 400M instead. We have verified that this makes the issue go away in multiple "real" Kafka environments. Using the G1 GC is another option we are considering (and it seems to work).

Here is the test program (note that the default size of the static object matches the default size of the SkimpyOffsetMap byte buffer):

import java.util.ArrayList;

public class LongParNewPause {
    static byte[] bigStaticObject;

    public static void main(String[] args) throws Exception {
        int bigObjSize    = args.length > 0 ? Integer.parseInt(args[0]) : 524288000;
        int littleObjSize = args.length > 1 ? Integer.parseInt(args[1]) : 100;
        int saveFraction  = args.length > 2 ?
                            Integer.parseInt(args[2]) : 15;

        bigStaticObject = new byte[bigObjSize];
        ArrayList<byte[]> holder = new ArrayList<byte[]>();

        int i = 0;
        while (true) {
            byte[] local = new byte[littleObjSize];
            if (i++ % saveFraction == 0) {
                holder.add(local);
            }
        }
    }
}

You should run it with the following JVM options:

-verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Xmx2G -Xms2G

Here is what I see using Oracle 8:

0.355: [GC (Allocation Failure) 0.355: [ParNew: 272640K->18193K(306688K), 0.0265380 secs] 784640K->530193K(4160256K), 0.0266064 secs] [Times: user=0.07 sys=0.00, real=0.03 secs]
0.420: [GC (Allocation Failure) 0.420: [ParNew: 290833K->23124K(306688K), 11.4522887 secs] 802833K->552573K(4160256K), 11.4523296 secs] [Times: user=42.51 sys=0.18, real=11.45 secs]
11.902: [GC (Allocation Failure) 11.902: [ParNew: 295764K->33361K(306688K), 25.6937427 secs] 825213K->580830K(4160256K), 25.6937906 secs] [Times: user=100.63 sys=0.11, real=25.70 secs]
37.627: [GC (Allocation Failure) 37.627: [ParNew: 306001K->25620K(306688K), 0.0219097 secs] 853470K->591124K(4160256K), 0.0219448 secs] [Times: user=0.08 sys=0.00, real=0.02 secs]
...
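[Editor's note: a quick arithmetic sanity check, not part of the original report, of how the sizes in this thread relate to the reported 480M to 512M "bad" window. The default buffer of 500*1024*1024 = 524288000 bytes lands inside the window; a decimal 500,000,000 bytes falls just below it, and the 400M workaround is well below it.]

```java
// Relates the buffer sizes discussed in this thread to the reported
// 480M-512M "bad" window (sizes in bytes, M meaning 1024*1024 here).
public class BadRangeCheck {
    static boolean inBadRange(long bytes) {
        long mb = 1024L * 1024L;
        return bytes >= 480L * mb && bytes <= 512L * mb;
    }

    public static void main(String[] args) {
        // 500*1024*1024 = 524288000, the SkimpyOffsetMap default: inside the window
        System.out.println(inBadRange(500L * 1024L * 1024L)); // true
        // 500,000,000 decimal bytes (about 476.8 MiB): just below the window
        System.out.println(inBadRange(500000000L));           // false
        // 400*1024*1024 = 419430400, the workaround size: well below the window
        System.out.println(inBadRange(400L * 1024L * 1024L)); // false
    }
}
```

This also explains why the 500,000,000-byte run below behaves normally even though it is numerically close to the default.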
If I pass in 500,000,000 for the size:

0.344: [GC (Allocation Failure) 0.344: [ParNew: 272640K->18181K(306688K), 0.0253515 secs] 760921K->506462K(4160256K), 0.0253957 secs] [Times: user=0.04 sys=0.01, real=0.02 secs]
0.408: [GC (Allocation Failure) 0.408: [ParNew: 290821K->31804K(306688K), 0.0499212 secs] 779102K->537534K(4160256K), 0.0499556 secs] [Times: user=0.08 sys=0.01, real=0.05 secs]
0.491: [GC (Allocation Failure) 0.491: [ParNew: 304444K->27084K(306688K), 0.0286135 secs] 810174K->550834K(4160256K), 0.0286590 secs] [Times: user=0.08 sys=0.00, real=0.03 secs]
0.553: [GC (Allocation Failure) 0.554: [ParNew: 299724K->27617K(306688K), 0.0387482 secs] 823474K->569401K(4160256K), 0.0387790 secs] [Times: user=0.07 sys=0.01, real=0.04 secs]
0.625: [GC (Allocation Failure) 0.625: [ParNew: 300257K->21936K(306688K), 0.0368052 secs] 842041K->584850K(4160256K), 0.0368367 secs] [Times: user=0.08 sys=0.00, real=0.03 secs]
...

Playing with the other two parameters can vary the results, but they are all bad if the size of the static object is in the "bad" range.

James Lent
Senior Software Engineer
Digitalsmiths
A TiVo Company
www.digitalsmiths.com
jl...@digitalsmiths.com | office 919.460.4747

________________________________
This email and any attachments may contain confidential and privileged material for the sole use of the intended recipient. Any review, copying, or distribution of this email (or any attachments) by others is prohibited. If you are not the intended recipient, please contact the sender immediately and permanently delete this email and any attachments. No employee or agent of TiVo Inc. is authorized to conclude any binding agreement on behalf of TiVo Inc. by email. Binding agreements with TiVo Inc. may only be made by a signed written agreement.
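[Editor's note: for anyone wanting to apply the 400M workaround described in the thread, the SkimpyOffsetMap buffer is sized by the broker property shown below. This is a sketch; the property name is from Kafka 0.8.x and should be verified against the documentation for your broker version.]

```properties
# server.properties override implementing the 400M workaround
# (400 * 1024 * 1024 = 419430400 bytes), which keeps the log cleaner's
# dedupe buffer out of the reported 480M-512M "bad" range.
log.cleaner.dedupe.buffer.size=419430400
```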