[ https://issues.apache.org/jira/browse/KAFKA-7148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
wang updated KAFKA-7148:
------------------------
Priority: Blocker (was: Critical)
> Kafka loads logs very slowly after going down with OutOfMemoryError
> -------------------------------------------------------------------
>
> Key: KAFKA-7148
> URL: https://issues.apache.org/jira/browse/KAFKA-7148
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 0.10.1.1
> Reporter: wang
> Priority: Blocker
>
> Two questions:
> 1、Is there any way to avoid the IOException: Map failed, or is there any
> limit on VM memory size?
> 2、Is it normal for Kafka to take 20+ seconds to load one partition's log?
>
> Detailed info:
> 1、Linux version:
> kafka_2.11-0.10.1.1> cat /etc/SuSE-release
> SUSE Linux Enterprise Server 11 (x86_64)
> VERSION = 11
> PATCHLEVEL = 3
>
> 2、VM info: 4C32G (4 cores / 32 GB RAM)
>
> 3、java -version
> java version "1.8.0_131"
> Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
>
>
> 4、Start command:
> java -Xmx16G -Xms16G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20
> -XX:InitiatingHeapOccupancyPercent=35 -XX:+DisableExplicitGC
> -Djava.awt.headless=true
> -Xloggc:/opt/***/kafka_2.11-0.10.1.1/bin/../logs/kafkaServer-gc.log
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps
> -Dkafka.logs.dir=/opt/***/kafka_2.11-0.10.1.1/bin/../logs
> -Dlog4j.configuration=file:./../config/log4j.properties
> -cp :………………………………
>
>
> 5、Brokers & topics
> we have 3 brokers and 3 ZooKeeper nodes
> we have 4 topics in this Kafka cluster
> __consumer_offsets: 50 partitions, replication factor 3
> topic1: 5 partitions, replication factor 3
> topic2: 160 partitions, replication factor 3
> topic3: 160 partitions, replication factor 3
> Total data disk usage: 32G
> du -sh data/
> 32G data/
>
>
> 6、logs
> [2018-07-10 17:23:59,728] FATAL [Replica Manager on Broker 1]: Halting due to
> unrecoverable I/O error while handling produce request:
> (kafka.server.ReplicaManager)
> kafka.common.KafkaStorageException: I/O exception in append to log
> '**************-13'
> at kafka.log.Log.append(Log.scala:349)
> at kafka.cluster.Partition$$anonfun$10.apply(Partition.scala:443)
> at kafka.cluster.Partition$$anonfun$10.apply(Partition.scala:429)
> at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
> at kafka.utils.CoreUtils$.inReadLock(CoreUtils.scala:240)
> at kafka.cluster.Partition.appendMessagesToLeader(Partition.scala:429)
> at
> kafka.server.ReplicaManager$$anonfun$appendToLocalLog$2.apply(ReplicaManager.scala:407)
> at
> kafka.server.ReplicaManager$$anonfun$appendToLocalLog$2.apply(ReplicaManager.scala:393)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
> at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:393)
> at kafka.server.ReplicaManager.appendMessages(ReplicaManager.scala:330)
> at kafka.server.KafkaApis.handleProducerRequest(KafkaApis.scala:436)
> at kafka.server.KafkaApis.handle(KafkaApis.scala:78)
> at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:60)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Map failed
> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:940)
> at kafka.log.AbstractIndex.<init>(AbstractIndex.scala:61)
> at kafka.log.OffsetIndex.<init>(OffsetIndex.scala:52)
> at kafka.log.LogSegment.<init>(LogSegment.scala:67)
> at kafka.log.Log.roll(Log.scala:778)
> at kafka.log.Log.maybeRoll(Log.scala:744)
> at kafka.log.Log.append(Log.scala:405)
> ... 22 more
> Caused by: java.lang.OutOfMemoryError: Map failed
> at sun.nio.ch.FileChannelImpl.map0(Native Method)
> at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:937)
> ... 28 more
>
>
> 7、Then I followed this
> (https://issues.apache.org/jira/browse/KAFKA-6165?jql=project%20%3D%20KAFKA%20AND%20text%20~%20%22sun.nio.ch.FileChannelImpl.map%20Map%20failed%22)
> and raised vm.max_map_count from the default value (65536) to 262144:
> /sbin/sysctl -a |grep map
> vm.max_map_count = 262144
> cat /proc/2860/maps |wc -l
> 1195
> and in kafka-run-class.sh replaced `-XX:+DisableExplicitGC` with
> `-XX:+ExplicitGCInvokesConcurrent` (see the sketch below)
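>
> For context, a minimal Java sketch of the JDK 8 behaviour that flag swap targets
> (the file name and size below are illustrative, not taken from this cluster).
> Internally, FileChannelImpl.map catches the first OutOfMemoryError from map0,
> calls System.gc() so that unreferenced MappedByteBuffers can be released, sleeps
> about 100 ms, retries once, and only then throws the "java.io.IOException: Map
> failed" seen in the stack trace above. With -XX:+DisableExplicitGC that
> System.gc() is a no-op, so the retry cannot reclaim anything;
> -XX:+ExplicitGCInvokesConcurrent lets it run as a concurrent cycle instead of
> being ignored.
>
> import java.io.IOException;
> import java.io.RandomAccessFile;
> import java.nio.MappedByteBuffer;
> import java.nio.channels.FileChannel;
>
> public class MapRetrySketch {
>     // User-level imitation of the retry JDK 8 performs inside FileChannelImpl.map.
>     // (The real FileChannel.map already does this internally around map0 and then
>     // throws IOException "Map failed" if the second attempt also fails.)
>     static MappedByteBuffer mapWithRetry(FileChannel ch, long size) throws IOException {
>         try {
>             return ch.map(FileChannel.MapMode.READ_WRITE, 0, size);
>         } catch (OutOfMemoryError first) {
>             System.gc();   // a no-op under -XX:+DisableExplicitGC, so nothing is reclaimed
>             try { Thread.sleep(100); } catch (InterruptedException ie) {
>                 Thread.currentThread().interrupt();
>             }
>             return ch.map(FileChannel.MapMode.READ_WRITE, 0, size); // second and last attempt
>         }
>     }
>
>     public static void main(String[] args) throws IOException {
>         // Hypothetical index file: Kafka mmaps every .index/.timeindex file up front
>         // (10 MB is the default maximum index size), and each mapping counts against
>         // vm.max_map_count and virtual address space, not against the -Xmx heap.
>         try (RandomAccessFile raf = new RandomAccessFile("/tmp/demo.index", "rw");
>              FileChannel ch = raf.getChannel()) {
>             MappedByteBuffer idx = mapWithRetry(ch, 10 * 1024 * 1024);
>             idx.putInt(0, 0);  // touch the mapping
>         }
>     }
> }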
>
>
> 8、But the "IOException: Map failed" problem still existed,
> so we grew the server VM to 4C64G (4 cores / 64 GB) and changed -Xmx16G -Xms16G
> to -Xmx4G -Xms4G
>
>
> 9、Log entries showing slow log loading
> [2018-07-10 17:35:33,481] INFO Completed load of log ***-37 with 2 log
> segments and log end offset 2441365 in 15678 ms (kafka.log.Log)
> [2018-07-10 17:35:33,484] WARN Found a corrupted index file due to
> requirement failed: Corrupt index found, index file
> (/opt/***/data/***-34/00000000000002451611.index) has non-zero size but the
> last offset is 2451611 which is no larger than the base offset 2451611.}.
> deleting /opt/***/data/***-34/00000000000002451611.timeindex,
> /opt/***/data/***-34/00000000000002451611.index and rebuilding index...
> (kafka.log.Log)
> [2018-07-10 17:35:33,485] INFO Recovering unflushed segment 1086746 in log
> ***-34. (kafka.log.Log)
> [2018-07-10 17:35:49,070] INFO Recovering unflushed segment 2451611 in log
> ***-34. (kafka.log.Log)
> [2018-07-10 17:35:49,071] INFO Completed load of log ***-34 with 2 log
> segments and log end offset 2451613 in 15589 ms (kafka.log.Log)
> [2018-07-10 17:35:49,073] WARN Found a corrupted index file due to
> requirement failed: Corrupt index found, index file
> (/opt/***/data/***-31/00000000000002437998.index) has non-zero size but the
> last offset is 2437998 which is no larger than the base offset 2437998.}.
> deleting /opt/***/data/***-31/00000000000002437998.timeindex,
> /opt/***/data/***-31/00000000000002437998.index and rebuilding index...
> (kafka.log.Log)
> [2018-07-10 17:35:49,090] INFO Recovering unflushed segment 1080802 in log
> ***-31. (kafka.log.Log)
> [2018-07-10 17:36:04,806] INFO Recovering unflushed segment 2437998 in log
> ***-31. (kafka.log.Log)
> [2018-07-10 17:36:04,807] INFO Completed load of log ***-31 with 2 log
> segments and log end offset 2438000 in 15735 ms (kafka.log.Log)
> [2018-07-10 17:36:04,810] WARN Found a corrupted index file due to
> requirement failed: Corrupt index found, index file
> (/opt/***/data/***-28/00000000000002432077.index) has non-zero size but the
> last offset is 2432077 which is no larger than the base offset 2432077.}.
> deleting /opt/***/data/***-28/00000000000002432077.timeindex,
> /opt/***/data/***-28/00000000000002432077.index and rebuilding index...
> (kafka.log.Log)
> [2018-07-10 17:36:04,811] INFO Recovering unflushed segment 1081522 in log
> ***-28. (kafka.log.Log)
> [2018-07-10 17:36:19,537] INFO Recovering unflushed segment 2432077 in log
> ***-28. (kafka.log.Log)
> [2018-07-10 17:36:19,538] INFO Completed load of log ***-28 with 2 log
> segments and log end offset 2432079 in 14730 ms (kafka.log.Log)
> [2018-07-10 17:36:19,539] WARN Found a corrupted index file due to
> requirement failed: Corrupt index found, index file
> (/opt/***/data/***-25/00000000000001080046.index) has non-zero size but the
> last offset is 1080046 which is no larger than the base offset 1080046.}.
> deleting /opt/***/data/***-25/00000000000001080046.timeindex,
> /opt/***/data/***-25/00000000000001080046.index and rebuilding index...
> (kafka.log.Log)
> [2018-07-10 17:36:34,232] INFO Recovering unflushed segment 1080046 in log
> ***-25. (kafka.log.Log)
> [2018-07-10 17:36:48,261] INFO Completed load of log ***-25 with 1 log
> segments and log end offset 2435166 in 28721 ms (kafka.log.Log)
> [2018-07-10 17:36:48,266] WARN Found a corrupted index file due to
> requirement failed: Corrupt index found, index file
> (/opt/***/data/***-22/00000000000002438178.index) has non-zero size but the
> last offset is 2438178 which is no larger than the base offset 2438178.}.
> deleting /opt/***/data/***-22/00000000000002438178.timeindex,
> /opt/***/data/***-22/00000000000002438178.index and rebuilding index...
> (kafka.log.Log)
> [2018-07-10 17:36:48,267] INFO Recovering unflushed segment 1081500 in log
> ***-22. (kafka.log.Log)
> [2018-07-10 17:37:02,803] INFO Recovering unflushed segment 2438178 in log
> ***-22. (kafka.log.Log)
> [2018-07-10 17:37:02,803] INFO Completed load of log ***-22 with 2 log
> segments and log end offset 2438180 in 14540 ms (kafka.log.Log)
> [2018-07-10 17:37:02,807] WARN Found a corrupted index file due to
> requirement failed: Corrupt index found, index file
> (/opt/***/data/***-19/00000000000001081412.index) has non-zero size but the
> last offset is 1081412 which is no larger than the base offset 1081412.}.
> deleting /opt/***/data/***-19/00000000000001081412.timeindex,
> /opt/***/data/***-19/00000000000001081412.index and rebuilding index...
> (kafka.log.Log)
> [2018-07-10 17:37:17,536] INFO Recovering unflushed segment 1081412 in log
> ***-19. (kafka.log.Log)
> [2018-07-10 17:37:31,570] INFO Completed load of log ***-19 with 1 log
> segments and log end offset 2443668 in 28764 ms (kafka.log.Log)
> [2018-07-10 17:37:31,573] WARN Found a corrupted index file due to
> requirement failed: Corrupt index found, index file
> (/opt/***/data/***-35/00000000000002440999.index) has non-zero size but the
> last offset is 2440999 which is no larger than the base offset 2440999.}.
> deleting /opt/***/data/***-35/00000000000002440999.timeindex,
> /opt/***/data/***-35/00000000000002440999.index and rebuilding index...
> (kafka.log.Log)
>
>
> 10、Then I checked the CPU; it was at 100% while loading logs. I used top -H -p
> and jstack to find the busy thread (see the note after the stack):
> "pool-3-thread-1" #20 prio=5 os_prio=0 tid=0x00007f7041537000 nid=0x5c3a
> runnable [0x00007f703042e000]
> java.lang.Thread.State: RUNNABLE
> at sun.nio.ch.FileDispatcherImpl.pread0(Native Method)
> at sun.nio.ch.FileDispatcherImpl.pread(FileDispatcherImpl.java:52)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:220)
> at sun.nio.ch.IOUtil.read(IOUtil.java:197)
> at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:741)
> at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:727)
> at kafka.log.FileMessageSet$$anon$1.makeNext(FileMessageSet.scala:319)
> at kafka.log.FileMessageSet$$anon$1.makeNext(FileMessageSet.scala:308)
> at kafka.utils.IteratorTemplate.maybeComputeNext(IteratorTemplate.scala:64)
> at kafka.utils.IteratorTemplate.hasNext(IteratorTemplate.scala:56)
> at kafka.log.LogSegment.recover(LogSegment.scala:211)
> at kafka.log.Log$$anonfun$loadSegments$4.apply(Log.scala:218)
> at kafka.log.Log$$anonfun$loadSegments$4.apply(Log.scala:179)
> at
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
> at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
> at
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
> at kafka.log.Log.loadSegments(Log.scala:179)
> at kafka.log.Log.<init>(Log.scala:108)
> at
> kafka.log.LogManager$$anonfun$loadLogs$2$$anonfun$3$$anonfun$apply$10$$anonfun$apply$1.apply$mcV$sp(LogManager.scala:151)
> at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:58)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:748)
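>
> One note that may help with question 2 (a suggestion, not something already
> tried above): the jstack shows a single worker, pool-3-thread-1, re-reading
> segment data in LogSegment.recover to rebuild the deleted indexes, and with the
> default num.recovery.threads.per.data.dir=1 the partitions of a data directory
> are recovered one after another. A server.properties sketch, assuming a single
> entry in log.dirs:
>
> # server.properties - more recovery threads per data directory speeds up
> # startup recovery after an unclean shutdown (and flushing at shutdown)
> num.recovery.threads.per.data.dir=4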