This problem is so mysterious to me that I am hesitant to make any definitive statements about its scope. In the staging environment where the problem was originally seen, the GC logs were not enabled. In that environment the session timeouts persisted for a long time (and operations tried several restarts to clear the issue). Here are some sample logs:
$ grep -e starting -e "session time" uverse-staging-admin-0-20150423150103-kafka.log
[2015-04-23 14:23:53,837] INFO Client session timed out, have not heard from server in 10066ms for sessionid 0x14ce69c464f0009, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2015-04-23 14:31:56,133] INFO [Kafka Server 0], starting (kafka.server.KafkaServer)
[2015-04-23 14:32:08,636] INFO Client session timed out, have not heard from server in 10422ms for sessionid 0x14ce6b1209f0002, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2015-04-23 14:32:21,641] INFO Client session timed out, have not heard from server in 10643ms for sessionid 0x14ce6b1209f0003, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2015-04-23 14:32:33,614] INFO Client session timed out, have not heard from server in 9842ms for sessionid 0x14ce6b1209f0004, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2015-04-23 14:32:45,589] INFO Client session timed out, have not heard from server in 9878ms for sessionid 0x14ce6b1209f0005, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2015-04-23 14:32:57,906] INFO Client session timed out, have not heard from server in 9823ms for sessionid 0x14ce6b1209f0006, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2015-04-23 14:33:10,701] INFO Client session timed out, have not heard from server in 10365ms for sessionid 0x14ce6b1209f0007, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2015-04-23 14:33:22,475] INFO Client session timed out, have not heard from server in 10006ms for sessionid 0x14ce6b1209f0008, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2015-04-23 14:33:46,372] INFO Client session timed out, have not heard from server in 11062ms for sessionid 0x14ce6b1209f0009, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2015-04-23 14:42:11,942] INFO Client session timed out, have not heard from server in 12582ms for sessionid 0x14ce6b1209f000a, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2015-04-23 14:55:47,273] INFO Client session timed out, have not heard from server in 12935ms for sessionid 0x14ce6b1209f000b, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)

The specific configuration of Kafka (heap size, number of topics, ...) seems to have a big influence on the behavior. Using the test program I have generally seen the ParNew GC times eventually recover, but it can take a long time (and sometimes I kill the process before it recovers). Here is an example of one that persists for 2 ParNew GC cycles:

[ec2-user@ip-10-136-75-2 ~]$ jdk1.8.0_45/bin/java -Xmx4G -Xms4G -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails LongParNewPause $((500*1024*1024)) 100 10
0.289: [GC (Allocation Failure) 0.289: [ParNew: 272640K->27707K(306688K), 0.0322014 secs] 784640K->539707K(4160256K), 0.0322659 secs] [Times: user=0.10 sys=0.03, real=0.04 secs]
0.360: [GC (Allocation Failure) 0.360: [ParNew: 300347K->34048K(306688K), 3.5932171 secs] 812347K->572709K(4160256K), 3.5932638 secs] [Times: user=14.12 sys=0.05, real=3.60 secs]
3.992: [GC (Allocation Failure) 3.992: [ParNew: 306688K->34048K(306688K), 53.1317959 secs] 845349K->599722K(4160256K), 53.1318434 secs] [Times: user=212.40 sys=0.00, real=53.13 secs]
57.162: [GC (Allocation Failure) 57.162: [ParNew: 306688K->34048K(306688K), 0.0500041 secs] 872362K->626660K(4160256K), 0.0500528 secs] [Times: user=0.17 sys=0.01, real=0.05 secs]
57.250: [GC (Allocation Failure) 57.250: [ParNew: 306688K->34048K(306688K), 0.0503560 secs] 899300K->658186K(4160256K), 0.0504036 secs] [Times: user=0.17 sys=0.01, real=0.05 secs]
57.339: [GC (Allocation Failure)
57.339: [ParNew: 306688K->34048K(306688K), 0.0510655 secs] 930826K->687994K(4160256K), 0.0511131 secs] [Times: user=0.17 sys=0.00, real=0.05 secs]
57.428: [GC (Allocation Failure) 57.428: [ParNew: 306688K->34048K(306688K), 0.0516372 secs] 960634K->720213K(4160256K), 0.0516853 secs] [Times: user=0.17 sys=0.01, real=0.05 secs]

Just changing the fraction of objects saved from "1 out of 10" to "1 out of 100" can dramatically increase the recovery time to 11 ParNew GCs:

[ec2-user@ip-10-136-75-2 ~]$ jdk1.8.0_45/bin/java -Xmx4G -Xms4G -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails LongParNewPause $((500*1024*1024)) 100 100
0.290: [GC (Allocation Failure) 0.290: [ParNew: 272640K->3052K(306688K), 0.0161574 secs] 784640K->515052K(4160256K), 0.0162239 secs] [Times: user=0.05 sys=0.02, real=0.01 secs]
0.346: [GC (Allocation Failure) 0.346: [ParNew: 275692K->7135K(306688K), 0.0156640 secs] 787692K->519135K(4160256K), 0.0157112 secs] [Times: user=0.06 sys=0.00, real=0.02 secs]
0.401: [GC (Allocation Failure) 0.401: [ParNew: 279775K->10158K(306688K), 0.0162892 secs] 791775K->522158K(4160256K), 0.0163360 secs] [Times: user=0.06 sys=0.00, real=0.02 secs]
0.457: [GC (Allocation Failure) 0.457: [ParNew: 282798K->15495K(306688K), 0.0174881 secs] 794798K->527495K(4160256K), 0.0175358 secs] [Times: user=0.07 sys=0.00, real=0.02 secs]
0.513: [GC (Allocation Failure) 0.513: [ParNew: 288135K->21121K(306688K), 0.0187323 secs] 800135K->533121K(4160256K), 0.0187802 secs] [Times: user=0.08 sys=0.01, real=0.02 secs]
0.571: [GC (Allocation Failure) 0.571: [ParNew: 293761K->26920K(306688K), 0.0193083 secs] 805761K->538920K(4160256K), 0.0193544 secs] [Times: user=0.08 sys=0.00, real=0.02 secs]
0.630: [GC (Allocation Failure) 0.630: [ParNew: 299560K->21225K(306688K), 0.8494763 secs] 811560K->536148K(4160256K), 0.8495230 secs] [Times: user=3.38 sys=0.00, real=0.85 secs]
1.519: [GC (Allocation Failure) 1.519: [ParNew: 293865K->21322K(306688K),
4.5105759 secs] 808788K->538970K(4160256K), 4.5106276 secs] [Times: user=18.02 sys=0.01, real=4.51 secs]
6.069: [GC (Allocation Failure) 6.069: [ParNew: 293962K->21348K(306688K), 9.0200268 secs] 811610K->541719K(4160256K), 9.0200810 secs] [Times: user=36.04 sys=0.00, real=9.02 secs]
15.128: [GC (Allocation Failure) 15.128: [ParNew: 293988K->21357K(306688K), 14.4123763 secs] 814359K->544451K(4160256K), 14.4124320 secs] [Times: user=57.61 sys=0.00, real=14.41 secs]
29.580: [GC (Allocation Failure) 29.580: [ParNew: 293997K->21833K(306688K), 19.8231772 secs] 817091K->547646K(4160256K), 19.8232320 secs] [Times: user=79.23 sys=0.01, real=19.82 secs]
49.442: [GC (Allocation Failure) 49.442: [ParNew: 294473K->21827K(306688K), 24.5076227 secs] 820286K->550367K(4160256K), 24.5076707 secs] [Times: user=97.96 sys=0.00, real=24.51 secs]
73.989: [GC (Allocation Failure) 73.989: [ParNew: 294467K->21825K(306688K), 29.8300475 secs] 823007K->553082K(4160256K), 29.8300951 secs] [Times: user=119.25 sys=0.01, real=29.83 secs]
103.858: [GC (Allocation Failure) 103.858: [ParNew: 294465K->25919K(306688K), 36.1503266 secs] 825722K->559903K(4160256K), 36.1503960 secs] [Times: user=144.50 sys=0.00, real=36.15 secs]
140.048: [GC (Allocation Failure) 140.048: [ParNew: 298559K->30008K(306688K), 42.2171055 secs] 832543K->566717K(4160256K), 42.2171619 secs] [Times: user=168.78 sys=0.00, real=42.21 secs]
182.305: [GC (Allocation Failure) 182.305: [ParNew: 302648K->22540K(306688K), 48.9293150 secs] 839357K->561976K(4160256K), 48.9293635 secs] [Times: user=195.62 sys=0.01, real=48.93 secs]
231.273: [GC (Allocation Failure) 231.273: [ParNew: 295180K->31702K(306688K), 54.8912986 secs] 834616K->573850K(4160256K), 54.8913474 secs] [Times: user=219.46 sys=0.01, real=54.89 secs]
286.204: [GC (Allocation Failure) 286.204: [ParNew: 304342K->21761K(306688K), 0.0212730 secs] 846490K->566635K(4160256K), 0.0213218 secs] [Times: user=0.07 sys=0.00, real=0.02 secs]
286.264: [GC (Allocation Failure) 286.264:
[ParNew: 294401K->31528K(306688K), 0.0204699 secs] 839275K->579129K(4160256K), 0.0205177 secs] [Times: user=0.08 sys=0.00, real=0.02 secs]
286.324: [GC (Allocation Failure) 286.324: [ParNew: 304168K->21717K(306688K), 0.0199914 secs] 851769K->572045K(4160256K), 0.0200406 secs] [Times: user=0.07 sys=0.00, real=0.02 secs]
286.383: [GC (Allocation Failure) 286.383: [ParNew: 294357K->31518K(306688K), 0.0202584 secs] 844685K->584571K(4160256K), 0.0203057 secs] [Times: user=0.08 sys=0.00, real=0.02 secs]

Interestingly, the last slow pause in both cases is about the same: 53 to 54 seconds. I have seen that type of pattern hold up with different heap sizes (although the max time increases with heap size). That is probably a clue to understanding the underlying problem. I just tried 1 out of 1000 and it is still running:

[ec2-user@ip-10-136-75-2 ~]$ jdk1.8.0_45/bin/java -Xmx4G -Xms4G -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails LongParNewPause $((500*1024*1024)) 100 1000
0.285: [GC (Allocation Failure) 0.285: [ParNew: 272640K->548K(306688K), 0.0134350 secs] 784640K->512548K(4160256K), 0.0134952 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]
0.335: [GC (Allocation Failure) 0.335: [ParNew: 273188K->1141K(306688K), 0.0122108 secs] 785188K->513141K(4160256K), 0.0122587 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
0.383: [GC (Allocation Failure) 0.383: [ParNew: 273781K->1443K(306688K), 0.0124031 secs] 785781K->513443K(4160256K), 0.0124503 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
0.431: [GC (Allocation Failure) 0.431: [ParNew: 274083K->1754K(306688K), 0.0122479 secs] 786083K->513754K(4160256K), 0.0122919 secs] [Times: user=0.05 sys=0.00, real=0.01 secs]
0.479: [GC (Allocation Failure) 0.479: [ParNew: 274394K->2553K(306688K), 0.0125259 secs] 786394K->514553K(4160256K), 0.0125721 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]
0.527: [GC (Allocation Failure) 0.527: [ParNew: 275193K->3127K(306688K), 0.0130623 secs]
787193K->515127K(4160256K), 0.0131086 secs] [Times: user=0.05 sys=0.00, real=0.02 secs]
0.575: [GC (Allocation Failure) 0.576: [ParNew: 275767K->3219K(306688K), 0.6621319 secs] 787767K->515744K(4160256K), 0.6621784 secs] [Times: user=2.64 sys=0.00, real=0.66 secs]
... [ 38 ParNew logs deleted ] ...
405.220: [GC (Allocation Failure) 405.220: [ParNew: 274901K->2669K(306688K), 20.4550033 secs] 798938K->526979K(4160256K), 20.4550536 secs] [Times: user=81.77 sys=0.01, real=20.45 secs]
425.711: [GC (Allocation Failure) 425.711: [ParNew: 275309K->2669K(306688K), 20.9100008 secs] 799619K->527251K(4160256K), 20.9100495 secs] [Times: user=83.59 sys=0.01, real=20.91 secs]
446.657: [GC (Allocation Failure) 446.657: [ParNew: 275309K->2669K(306688K), 22.5119299 secs] 799891K->527524K(4160256K), 22.5119871 secs] [Times: user=89.90 sys=0.02, real=22.51 secs]
469.204: [GC (Allocation Failure) 469.204: [ParNew: 275309K->2669K(306688K), 21.8446029 secs] 800164K->527797K(4160256K), 21.8446524 secs] [Times: user=87.31 sys=0.01, real=21.84 secs]
491.085: [GC (Allocation Failure) 491.085: [ParNew

James,

Thanks for reporting this. So the issue only shows up at Kafka startup time? After that, the GC time is back to normal again?

Jun

On Tue, Apr 28, 2015 at 1:48 PM, James Lent <jl...@tivo.com> wrote:

I have filed a bug report with Oracle on this issue today. I have not yet been assigned a bug number. I thought the issue might be of general interest to the Kafka community, and I have a suggestion for a configuration change that works around the issue in case other people run into it.

First, the issue I am about to describe will be hard to believe, but I do provide a short Java program that demonstrates the JVM GC issue that triggers the long ParNew GC pauses under certain conditions. Before you dismiss the analysis out of hand, I suggest you try the test program below. I have tested it in multiple environments: a Linux dev box, a Linux laptop, and a Linux EC2 instance.
Other engineers have also done so in their own environments. I would be interested to know what you see.

First, the conditions:

1) More than a 1G heap (we ran into it when we increased the heap from 1G to 2G).
2) At least some topics defined. Not sure of the exact number; we don't see this issue in all our environments.
3) Log compaction enabled.
4) A JVM other than Oracle 6. I have reproduced the underlying issue (with the test program below) using OpenJDK 6, Oracle 7, OpenJDK 7, Oracle 8, and OpenJDK 8.

Under these conditions we can see 9 second or longer pauses due to individual ParNew GC events (longer than the default 6 second ZooKeeper session timeout). It tends to be very reproducible in a given environment (i.e. it either always happens or it never happens). We are using Kafka 0.8.1.1.

The long ParNew GC delays are triggered by the allocation of a large static object during startup, specifically the SkimpyOffsetMap byte buffer. There is no reason why this should be the case, and that is why I have opened a JVM bug with Oracle. The even more unbelievable part is that this bug only seems to trigger a problem for static objects in the 480M to 512M range. The default size of this byte buffer is 500M. Our current workaround is to configure this buffer to use 400M instead. We have verified that this makes the issue go away in multiple "real" Kafka environments. Using the G1 GC is another option we are considering (and it seems to work).

Here is the test program (note that the default size of the static object matches the default size of the SkimpyOffsetMap byte buffer):

import java.util.ArrayList;

public class LongParNewPause {
    static byte[] bigStaticObject;

    public static void main(String[] args) throws Exception {
        int bigObjSize    = args.length > 0 ? Integer.parseInt(args[0]) : 524288000;
        int littleObjSize = args.length > 1 ? Integer.parseInt(args[1]) : 100;
        int saveFraction  = args.length > 2 ?
                            Integer.parseInt(args[2]) : 15;

        bigStaticObject = new byte[bigObjSize];
        ArrayList<byte[]> holder = new ArrayList<byte[]>();

        int i = 0;
        while (true) {
            byte[] local = new byte[littleObjSize];
            if (i++ % saveFraction == 0) {
                holder.add(local);
            }
        }
    }
}

You should run it with the following JVM options:

-verbose:gc -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Xmx2G -Xms2G

Here is what I see using Oracle 8:

0.355: [GC (Allocation Failure) 0.355: [ParNew: 272640K->18193K(306688K), 0.0265380 secs] 784640K->530193K(4160256K), 0.0266064 secs] [Times: user=0.07 sys=0.00, real=0.03 secs]
0.420: [GC (Allocation Failure) 0.420: [ParNew: 290833K->23124K(306688K), 11.4522887 secs] 802833K->552573K(4160256K), 11.4523296 secs] [Times: user=42.51 sys=0.18, real=11.45 secs]
11.902: [GC (Allocation Failure) 11.902: [ParNew: 295764K->33361K(306688K), 25.6937427 secs] 825213K->580830K(4160256K), 25.6937906 secs] [Times: user=100.63 sys=0.11, real=25.70 secs]
37.627: [GC (Allocation Failure) 37.627: [ParNew: 306001K->25620K(306688K), 0.0219097 secs] 853470K->591124K(4160256K), 0.0219448 secs] [Times: user=0.08 sys=0.00, real=0.02 secs]
...
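[Editor's note: a quick arithmetic sanity check, not part of the original report, of how the sizes in this thread relate to the reported 480M to 512M "bad" window. The default buffer of 500*1024*1024 = 524288000 bytes lands inside the window; a decimal 500,000,000 bytes falls just below it, and the 400M workaround is well below it.]

```java
// Relates the buffer sizes discussed in this thread to the reported
// 480M-512M "bad" window (sizes in bytes, M meaning 1024*1024 here).
public class BadRangeCheck {
    static boolean inBadRange(long bytes) {
        long mb = 1024L * 1024L;
        return bytes >= 480L * mb && bytes <= 512L * mb;
    }

    public static void main(String[] args) {
        // 500*1024*1024 = 524288000, the SkimpyOffsetMap default: inside the window
        System.out.println(inBadRange(500L * 1024L * 1024L)); // true
        // 500,000,000 decimal bytes (about 476.8 MiB): just below the window
        System.out.println(inBadRange(500000000L));           // false
        // 400*1024*1024 = 419430400, the workaround size: well below the window
        System.out.println(inBadRange(400L * 1024L * 1024L)); // false
    }
}
```

This also explains why the 500,000,000-byte run below behaves normally even though it is numerically close to the default.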
If I pass in 500,000,000 for the size:

0.344: [GC (Allocation Failure) 0.344: [ParNew: 272640K->18181K(306688K), 0.0253515 secs] 760921K->506462K(4160256K), 0.0253957 secs] [Times: user=0.04 sys=0.01, real=0.02 secs]
0.408: [GC (Allocation Failure) 0.408: [ParNew: 290821K->31804K(306688K), 0.0499212 secs] 779102K->537534K(4160256K), 0.0499556 secs] [Times: user=0.08 sys=0.01, real=0.05 secs]
0.491: [GC (Allocation Failure) 0.491: [ParNew: 304444K->27084K(306688K), 0.0286135 secs] 810174K->550834K(4160256K), 0.0286590 secs] [Times: user=0.08 sys=0.00, real=0.03 secs]
0.553: [GC (Allocation Failure) 0.554: [ParNew: 299724K->27617K(306688K), 0.0387482 secs] 823474K->569401K(4160256K), 0.0387790 secs] [Times: user=0.07 sys=0.01, real=0.04 secs]
0.625: [GC (Allocation Failure) 0.625: [ParNew: 300257K->21936K(306688K), 0.0368052 secs] 842041K->584850K(4160256K), 0.0368367 secs] [Times: user=0.08 sys=0.00, real=0.03 secs]
...

Playing with the other two parameters can vary the results, but they are all bad if the size of the static object is in the "bad" range.

James Lent
Senior Software Engineer
Digitalsmiths
A TiVo Company
www.digitalsmiths.com
jl...@digitalsmiths.com | office 919.460.4747

________________________________
This email and any attachments may contain confidential and privileged material for the sole use of the intended recipient. Any review, copying, or distribution of this email (or any attachments) by others is prohibited. If you are not the intended recipient, please contact the sender immediately and permanently delete this email and any attachments. No employee or agent of TiVo Inc. is authorized to conclude any binding agreement on behalf of TiVo Inc. by email. Binding agreements with TiVo Inc. may only be made by a signed written agreement.
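[Editor's note: for anyone wanting to apply the 400M workaround described in the thread, the SkimpyOffsetMap buffer is sized by the broker property shown below. This is a sketch; the property name is from Kafka 0.8.x and should be verified against the documentation for your broker version.]

```properties
# server.properties override implementing the 400M workaround
# (400 * 1024 * 1024 = 419430400 bytes), which keeps the log cleaner's
# dedupe buffer out of the reported 480M-512M "bad" range.
log.cleaner.dedupe.buffer.size=419430400
```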