[ https://issues.apache.org/jira/browse/KAFKA-6423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388711#comment-16388711 ]
Ivan Babrou commented on KAFKA-6423: ------------------------------------ [~doli], Is this solved by KAFKA-6175 and KAFKA-6172? > Slow shutdown with many open files > ---------------------------------- > > Key: KAFKA-6423 > URL: https://issues.apache.org/jira/browse/KAFKA-6423 > Project: Kafka > Issue Type: Bug > Affects Versions: 0.11.0.1 > Reporter: Ivan Babrou > Priority: Major > > We have brokers with 20k open files and shutdown is extremely slow, > progressing usually at around 60 closed file descriptors per second: > {noformat} > $ while true; do echo $(date) $(sudo ls /proc/6363/fd | wc -l); sleep 1; done > Thu Jan 4 23:00:51 UTC 2018 9770 > Thu Jan 4 23:00:52 UTC 2018 9691 > Thu Jan 4 23:00:53 UTC 2018 9616 > Thu Jan 4 23:00:54 UTC 2018 9561 > Thu Jan 4 23:00:55 UTC 2018 9509 > Thu Jan 4 23:00:56 UTC 2018 9427 > Thu Jan 4 23:00:57 UTC 2018 9350 > Thu Jan 4 23:00:58 UTC 2018 9260 > Thu Jan 4 23:00:59 UTC 2018 9208 > {noformat} > If you strace the process, you can see: > {noformat} > $ sudo strace -f -c -p 6363 > strace: Process 6363 attached with 97 threads > ^Cstrace: Process 6363 detached > strace: Process 6604 detached > strace: Process 6605 detached > strace: Process 6606 detached > strace: Process 6607 detached > strace: Process 6608 detached > strace: Process 6609 detached > strace: Process 6610 detached > strace: Process 6611 detached > strace: Process 6612 detached > strace: Process 6613 detached > strace: Process 6614 detached > strace: Process 6615 detached > strace: Process 6616 detached > strace: Process 6617 detached > strace: Process 6618 detached > strace: Process 6619 detached > strace: Process 6620 detached > strace: Process 6621 detached > strace: Process 6622 detached > strace: Process 6623 detached > strace: Process 6624 detached > strace: Process 6625 detached > strace: Process 6626 detached > strace: Process 6627 detached > strace: Process 6628 detached > strace: Process 6629 detached > strace: Process 6630 detached > strace: Process 6631 detached > strace: Process 6632 detached > strace: Process 6633 detached > strace: Process 6634 detached > strace: Process 6635 detached > strace: Process 6636 detached > strace: Process 6637 detached > strace: Process 6638 detached > strace: Process 6639 detached > strace: Process 6640 detached > strace: Process 6641 detached > strace: Process 6642 detached > strace: Process 6643 detached > strace: Process 6644 detached > strace: Process 6645 detached > strace: Process 6646 detached > strace: Process 6647 detached > strace: Process 6648 detached > strace: Process 6649 detached > strace: Process 6650 detached > strace: Process 6651 detached > strace: Process 6652 detached > strace: Process 6653 detached > strace: Process 6654 detached > strace: Process 6655 detached > strace: Process 6656 detached > strace: Process 6657 detached > strace: Process 6658 detached > strace: Process 6659 detached > strace: Process 6660 detached > strace: Process 6661 detached > strace: Process 6662 detached > strace: Process 6663 detached > strace: Process 6716 detached > strace: Process 6717 detached > strace: Process 6718 detached > strace: Process 6719 detached > strace: Process 6720 detached > strace: Process 6721 detached > strace: Process 6722 detached > strace: Process 6723 detached > strace: Process 6724 detached > strace: Process 6725 detached > strace: Process 6726 detached > strace: Process 6727 detached > strace: Process 6728 detached > strace: Process 6729 detached > strace: Process 6730 detached > strace: Process 6731 detached > strace: Process 6732 detached > strace: Process 6733 detached > strace: Process 6734 detached > strace: Process 6735 detached > strace: Process 6736 detached > strace: Process 6737 detached > strace: Process 6738 detached > strace: Process 6739 detached > strace: Process 6740 detached > strace: Process 6741 detached > strace: Process 6760 detached > strace: Process 6779 detached > strace: Process 6781 detached > strace: Process 6783 detached > strace: Process 6892 detached > strace: Process 2339 detached > strace: Process 2340 detached > strace: Process 5122 detached > strace: Process 5123 detached > strace: Process 5652 detached > % time seconds usecs/call calls errors syscall > ------ ----------- ----------- --------- --------- ---------------- > 65.19 0.859302 35804 24 19 restart_syscall > 26.60 0.350656 507 692 190 futex > 5.17 0.068142 2271 30 epoll_wait > 1.22 0.016141 56 287 ftruncate > 0.66 0.008679 20 432 close > 0.38 0.005054 35 144 fsync > 0.26 0.003489 12 288 open > 0.19 0.002564 4 720 fstat > 0.15 0.001914 3 576 lseek > 0.14 0.001885 13 144 mmap > 0.01 0.000114 7 16 getrusage > 0.00 0.000060 15 4 write > 0.00 0.000028 5 6 read > 0.00 0.000027 2 12 mprotect > 0.00 0.000019 5 4 epoll_ctl > ------ ----------- ----------- --------- --------- ---------------- > 100.00 1.318074 3379 209 total > ivan@36s116:~$ htop > ivan@36s116:~$ sudo strace -f -c -p 6363 > strace: Process 6363 attached with 97 threads > ^Cstrace: Process 6363 detached > strace: Process 6604 detached > strace: Process 6605 detached > strace: Process 6606 detached > strace: Process 6607 detached > strace: Process 6608 detached > strace: Process 6609 detached > strace: Process 6610 detached > strace: Process 6611 detached > strace: Process 6612 detached > strace: Process 6613 detached > strace: Process 6614 detached > strace: Process 6615 detached > strace: Process 6616 detached > strace: Process 6617 detached > strace: Process 6618 detached > strace: Process 6619 detached > strace: Process 6620 detached > strace: Process 6621 detached > strace: Process 6622 detached > strace: Process 6623 detached > strace: Process 6624 detached > strace: Process 6625 detached > strace: Process 6626 detached > strace: Process 6627 detached > strace: Process 6628 detached > strace: Process 6629 detached > strace: Process 6630 detached > strace: Process 6631 detached > strace: Process 6632 detached > strace: Process 6633 detached > strace: Process 6634 detached > strace: Process 6635 detached > strace: Process 6636 detached > strace: Process 6637 detached > strace: Process 6638 detached > strace: Process 6639 detached > strace: Process 6640 detached > strace: Process 6641 detached > strace: Process 6642 detached > strace: Process 6643 detached > strace: Process 6644 detached > strace: Process 6645 detached > strace: Process 6646 detached > strace: Process 6647 detached > strace: Process 6648 detached > strace: Process 6649 detached > strace: Process 6650 detached > strace: Process 6651 detached > strace: Process 6652 detached > strace: Process 6653 detached > strace: Process 6654 detached > strace: Process 6655 detached > strace: Process 6656 detached > strace: Process 6657 detached > strace: Process 6658 detached > strace: Process 6659 detached > strace: Process 6660 detached > strace: Process 6661 detached > strace: Process 6662 detached > strace: Process 6663 detached > strace: Process 6716 detached > strace: Process 6717 detached > strace: Process 6718 detached > strace: Process 6719 detached > strace: Process 6720 detached > strace: Process 6721 detached > strace: Process 6722 detached > strace: Process 6723 detached > strace: Process 6724 detached > strace: Process 6725 detached > strace: Process 6726 detached > strace: Process 6727 detached > strace: Process 6728 detached > strace: Process 6729 detached > strace: Process 6730 detached > strace: Process 6731 detached > strace: Process 6732 detached > strace: Process 6733 detached > strace: Process 6734 detached > strace: Process 6735 detached > strace: Process 6736 detached > strace: Process 6737 detached > strace: Process 6738 detached > strace: Process 6739 detached > strace: Process 6740 detached > strace: Process 6741 detached > strace: Process 6760 detached > strace: Process 6779 detached > strace: Process 6781 detached > strace: Process 6783 detached > strace: Process 6892 detached > strace: Process 2339 detached > strace: Process 2340 detached > strace: Process 5122 detached > strace: Process 5123 detached > strace: Process 5652 detached > % time seconds usecs/call calls errors syscall > ------ ----------- ----------- --------- --------- ---------------- > 61.71 73.863265 9074 8140 3060 futex > 32.70 39.143004 65 599436 sched_yield > 3.02 3.610581 16337 221 epoll_wait > 1.93 2.314673 89026 26 5 restart_syscall > 0.18 0.211991 211991 1 accept > 0.13 0.157244 45 3492 ftruncate > 0.11 0.127528 24 5285 close > 0.06 0.072592 41 1760 fsync > 0.04 0.047197 13 3538 open > 0.03 0.036148 4 8776 fstat > 0.03 0.035549 6 5715 stat > 0.02 0.026905 317 85 getdents > 0.02 0.026150 4 6988 2 lseek > 0.02 0.025100 14 1742 mmap > 0.00 0.001132 15 75 2 read > 0.00 0.000662 14 46 write > 0.00 0.000639 23 28 msync > 0.00 0.000580 6 100 getrusage > 0.00 0.000443 5 95 mprotect > 0.00 0.000212 6 34 epoll_ctl > 0.00 0.000099 12 8 sysinfo > 0.00 0.000050 13 4 times > 0.00 0.000036 9 4 getrlimit > 0.00 0.000023 12 2 getsockname > 0.00 0.000015 8 2 uname > 0.00 0.000013 4 3 fcntl > 0.00 0.000013 7 2 sched_getaffinity > 0.00 0.000007 4 2 setsockopt > 0.00 0.000000 0 2 writev > 0.00 0.000000 0 1 dup2 > 0.00 0.000000 0 2 1 shutdown > 0.00 0.000000 0 2 clock_gettime > ------ ----------- ----------- --------- --------- ---------------- > 100.00 119.701851 645617 3070 total > {noformat} > In jstack I can see: > {noformat} > "pool-6-thread-1" #219 prio=5 os_prio=0 tid=0x00007f44d80a2800 nid=0x1614 > runnable [0x00007f45690a6000] > java.lang.Thread.State: RUNNABLE > at java.io.RandomAccessFile.close0(Native Method) > at java.io.RandomAccessFile.access$000(RandomAccessFile.java:59) > at java.io.RandomAccessFile$1.close(RandomAccessFile.java:619) > at java.io.FileDescriptor.closeAll(FileDescriptor.java:212) > - locked <0x000000064c651c10> (a java.io.FileDescriptor) > at java.io.RandomAccessFile.close(RandomAccessFile.java:617) > at sun.nio.ch.FileChannelImpl.implCloseChannel(FileChannelImpl.java:139) > at > java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:115) > - locked <0x000000064c651be0> (a java.lang.Object) > at > org.apache.kafka.common.record.FileRecords.close(FileRecords.java:171) > at kafka.log.LogSegment.$anonfun$close$4(LogSegment.scala:490) > at kafka.log.LogSegment$$Lambda$1176/84191269.apply$mcV$sp(Unknown > Source) > at kafka.utils.CoreUtils$.swallow(CoreUtils.scala:78) > at kafka.utils.Logging.swallowWarn(Logging.scala:94) > at kafka.utils.Logging.swallowWarn$(Logging.scala:93) > at kafka.utils.CoreUtils$.swallowWarn(CoreUtils.scala:48) > at kafka.utils.Logging.swallow(Logging.scala:96) > at kafka.utils.Logging.swallow$(Logging.scala:96) > at kafka.utils.CoreUtils$.swallow(CoreUtils.scala:48) > at kafka.log.LogSegment.close(LogSegment.scala:490) > at kafka.log.Log.$anonfun$close$2(Log.scala:537) > at kafka.log.Log.$anonfun$close$2$adapted(Log.scala:537) > at kafka.log.Log$$Lambda$1172/1976013682.apply(Unknown Source) > at scala.collection.Iterator.foreach(Iterator.scala:929) > at scala.collection.Iterator.foreach$(Iterator.scala:929) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1417) > at scala.collection.IterableLike.foreach(IterableLike.scala:71) > at scala.collection.IterableLike.foreach$(IterableLike.scala:70) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at kafka.log.Log.close(Log.scala:537) > - locked <0x0000000640488d10> (a java.lang.Object) > at kafka.log.LogManager.$anonfun$shutdown$7(LogManager.scala:280) > at kafka.log.LogManager$$Lambda$1166/165872442.apply$mcV$sp(Unknown > Source) > at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:57) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {noformat} > CPU is barely used at ~2%. > There is some IO, but it's very far from saturating RAID0 from 12 spinning > disks: > https://i.imgur.com/jRuW3pp.png > This happens across brokers and clusters. -- This message was sent by Atlassian JIRA (v7.6.3#76005)