[ 
https://issues.apache.org/jira/browse/KAFKA-7148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wang updated KAFKA-7148:
------------------------
    Priority: Blocker  (was: Critical)

> Kafka loads logs very slowly after going down with OutOfMemoryError
> -------------------------------------------------------------------
>
>                 Key: KAFKA-7148
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7148
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.10.1.1
>            Reporter: wang
>            Priority: Blocker
>
> Two questions:
> 1、Is there any way to avoid the "IOException: Map failed", or is there a
> limit on VM memory size?
> 2、Is it normal for Kafka to take 20+ seconds to load one partition's log
> file?
>  
> Detailed info:
> 1、Linux version:
>  kafka_2.11-0.10.1.1> cat /etc/SuSE-release
>  SUSE Linux Enterprise Server 11 (x86_64)
>  VERSION = 11
>  PATCHLEVEL = 3
>  
> 2、VM info: 4C32G (4 CPU cores, 32 GB RAM)
>  
> 3、java -version
>  java version "1.8.0_131"
>  Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
>  Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
>  
>  
> 4、Start command:
> java -Xmx16G -Xms16G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 
> -XX:InitiatingHeapOccupancyPercent=35 -XX:+DisableExplicitGC 
> -Djava.awt.headless=true 
> -Xloggc:/opt/***/kafka_2.11-0.10.1.1/bin/../logs/kafkaServer-gc.log 
> -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps 
> -Dkafka.logs.dir=/opt/***/kafka_2.11-0.10.1.1/bin/../logs 
> -Dlog4j.configuration=file:./../config/log4j.properties
>  -cp :………………………………
>  
>  
> 5、Brokers & topics:
> We have 3 brokers and 3 ZooKeeper nodes.
> We have 4 topics in this Kafka cluster:
> __consumer_offsets: 50 partitions, 3 replicas
> topic1: 5 partitions, 3 replicas
> topic2: 160 partitions, 3 replicas
> topic3: 160 partitions, 3 replicas
> Total data disk usage: 32G
> du -sh data/
> 32G data/
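>
> (A rough count, assuming replicas are spread evenly across the brokers:
> 50 + 5 + 160 + 160 = 375 partitions, times 3 replicas = 1125 partition
> replicas, divided across 3 brokers ≈ 375 logs per broker. Each active log
> segment mmaps an offset index and a time index, so the number of mapped
> index files per broker scales with logs × segments × 2.)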
>  
>  
> 6、Logs:
> [2018-07-10 17:23:59,728] FATAL [Replica Manager on Broker 1]: Halting due to 
> unrecoverable I/O error while handling produce request: 
> (kafka.server.ReplicaManager)
>  kafka.common.KafkaStorageException: I/O exception in append to log 
> '**************-13'
>  at kafka.log.Log.append(Log.scala:349)
>  at kafka.cluster.Partition$$anonfun$10.apply(Partition.scala:443)
>  at kafka.cluster.Partition$$anonfun$10.apply(Partition.scala:429)
>  at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
>  at kafka.utils.CoreUtils$.inReadLock(CoreUtils.scala:240)
>  at kafka.cluster.Partition.appendMessagesToLeader(Partition.scala:429)
>  at 
> kafka.server.ReplicaManager$$anonfun$appendToLocalLog$2.apply(ReplicaManager.scala:407)
>  at 
> kafka.server.ReplicaManager$$anonfun$appendToLocalLog$2.apply(ReplicaManager.scala:393)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>  at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
>  at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
>  at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
>  at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>  at kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:393)
>  at kafka.server.ReplicaManager.appendMessages(ReplicaManager.scala:330)
>  at kafka.server.KafkaApis.handleProducerRequest(KafkaApis.scala:436)
>  at kafka.server.KafkaApis.handle(KafkaApis.scala:78)
>  at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:60)
>  at java.lang.Thread.run(Thread.java:748)
>  Caused by: java.io.IOException: Map failed
>  at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:940)
>  at kafka.log.AbstractIndex.<init>(AbstractIndex.scala:61)
>  at kafka.log.OffsetIndex.<init>(OffsetIndex.scala:52)
>  at kafka.log.LogSegment.<init>(LogSegment.scala:67)
>  at kafka.log.Log.roll(Log.scala:778)
>  at kafka.log.Log.maybeRoll(Log.scala:744)
>  at kafka.log.Log.append(Log.scala:405)
>  ... 22 more
>  Caused by: java.lang.OutOfMemoryError: Map failed
>  at sun.nio.ch.FileChannelImpl.map0(Native Method)
>  at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:937)
>  ... 28 more
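>
> (For illustration only: a minimal standalone Java sketch that triggers the
> same "Map failed" error by exhausting vm.max_map_count with many small
> mmap'ed regions, the same FileChannel.map path that kafka.log.AbstractIndex
> uses. Exhausting vm.max_map_count is one common cause of this error, not
> necessarily the one here; the class and file names below are made up.)
>
> import java.io.IOException;
> import java.nio.MappedByteBuffer;
> import java.nio.channels.FileChannel;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.StandardOpenOption;
> import java.util.ArrayList;
> import java.util.List;
>
> public class MapFailedDemo {
>     public static void main(String[] args) throws IOException {
>         Path file = Files.createTempFile("map-failed-demo", ".bin");
>         // Keep every buffer referenced so its mapping is never released.
>         List<MappedByteBuffer> buffers = new ArrayList<>();
>         try (FileChannel ch = FileChannel.open(file,
>                 StandardOpenOption.READ, StandardOpenOption.WRITE)) {
>             // Each map() call adds one mapping that counts against the
>             // per-process vm.max_map_count limit; when mmap finally fails,
>             // the JDK surfaces "java.io.IOException: Map failed" caused by
>             // "java.lang.OutOfMemoryError: Map failed", as in the trace above.
>             for (int i = 0; ; i++) {
>                 buffers.add(ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096));
>                 if (i % 10000 == 0) {
>                     System.out.println("mapped regions so far: " + i);
>                 }
>             }
>         }
>     }
> }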
>  
>  
> 7、Then I followed
> (https://issues.apache.org/jira/browse/KAFKA-6165?jql=project%20%3D%20KAFKA%20AND%20text%20~%20%22sun.nio.ch.FileChannelImpl.map%20Map%20failed%22)
> and increased vm.max_map_count from the default value (65536) to 262144:
> /sbin/sysctl -a | grep map
> vm.max_map_count = 262144
> cat /proc/2860/maps | wc -l
> 1195
> and in kafka-run-class.sh replaced `-XX:+DisableExplicitGC` with
> `-XX:+ExplicitGCInvokesConcurrent`.
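>
> (Why the GC flag swap matters: on a failed mmap, the JDK's
> sun.nio.ch.FileChannelImpl.map calls System.gc() once so that unreferenced
> MappedByteBuffers can be unmapped, then retries; with -XX:+DisableExplicitGC
> that System.gc() is a no-op, so nothing is freed before the retry. A
> paraphrased sketch of the JDK 8 logic, not the exact source and not runnable
> on its own:)
>
> try {
>     addr = map0(imode, mapPosition, mapSize);
> } catch (OutOfMemoryError x) {
>     // First failure: ask the GC to release unreferenced mapped buffers,
>     // then retry once. With -XX:+DisableExplicitGC this call does nothing.
>     System.gc();
>     try {
>         Thread.sleep(100);
>     } catch (InterruptedException y) {
>         Thread.currentThread().interrupt();
>     }
>     try {
>         addr = map0(imode, mapPosition, mapSize);
>     } catch (OutOfMemoryError y) {
>         // Second failure surfaces as the "IOException: Map failed" above.
>         throw new IOException("Map failed", y);
>     }
> }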
>  
>  
> 8、But the "IOException: Map failed" problem still exists.
> We then upgraded the VM to 4C64G and changed -Xmx16G -Xms16G to -Xmx4G
> -Xms4G.
>  
>  
> 9、Slow log loading:
> [2018-07-10 17:35:33,481] INFO Completed load of log ***-37 with 2 log 
> segments and log end offset 2441365 in 15678 ms (kafka.log.Log)
>  [2018-07-10 17:35:33,484] WARN Found a corrupted index file due to 
> requirement failed: Corrupt index found, index file 
> (/opt/***/data/***-34/00000000000002451611.index) has non-zero size but the 
> last offset is 2451611 which is no larger than the base offset 2451611.}. 
> deleting /opt/***/data/***-34/00000000000002451611.timeindex, 
> /opt/***/data/***-34/00000000000002451611.index and rebuilding index... 
> (kafka.log.Log)
>  [2018-07-10 17:35:33,485] INFO Recovering unflushed segment 1086746 in log 
> ***-34. (kafka.log.Log)
>  [2018-07-10 17:35:49,070] INFO Recovering unflushed segment 2451611 in log 
> ***-34. (kafka.log.Log)
>  [2018-07-10 17:35:49,071] INFO Completed load of log ***-34 with 2 log 
> segments and log end offset 2451613 in 15589 ms (kafka.log.Log)
>  [2018-07-10 17:35:49,073] WARN Found a corrupted index file due to 
> requirement failed: Corrupt index found, index file 
> (/opt/***/data/***-31/00000000000002437998.index) has non-zero size but the 
> last offset is 2437998 which is no larger than the base offset 2437998.}. 
> deleting /opt/***/data/***-31/00000000000002437998.timeindex, 
> /opt/***/data/***-31/00000000000002437998.index and rebuilding index... 
> (kafka.log.Log)
>  [2018-07-10 17:35:49,090] INFO Recovering unflushed segment 1080802 in log 
> ***-31. (kafka.log.Log)
>  [2018-07-10 17:36:04,806] INFO Recovering unflushed segment 2437998 in log 
> ***-31. (kafka.log.Log)
>  [2018-07-10 17:36:04,807] INFO Completed load of log ***-31 with 2 log 
> segments and log end offset 2438000 in 15735 ms (kafka.log.Log)
>  [2018-07-10 17:36:04,810] WARN Found a corrupted index file due to 
> requirement failed: Corrupt index found, index file 
> (/opt/***/data/***-28/00000000000002432077.index) has non-zero size but the 
> last offset is 2432077 which is no larger than the base offset 2432077.}. 
> deleting /opt/***/data/***-28/00000000000002432077.timeindex, 
> /opt/***/data/***-28/00000000000002432077.index and rebuilding index... 
> (kafka.log.Log)
>  [2018-07-10 17:36:04,811] INFO Recovering unflushed segment 1081522 in log 
> ***-28. (kafka.log.Log)
>  [2018-07-10 17:36:19,537] INFO Recovering unflushed segment 2432077 in log 
> ***-28. (kafka.log.Log)
>  [2018-07-10 17:36:19,538] INFO Completed load of log ***-28 with 2 log 
> segments and log end offset 2432079 in 14730 ms (kafka.log.Log)
>  [2018-07-10 17:36:19,539] WARN Found a corrupted index file due to 
> requirement failed: Corrupt index found, index file 
> (/opt/***/data/***-25/00000000000001080046.index) has non-zero size but the 
> last offset is 1080046 which is no larger than the base offset 1080046.}. 
> deleting /opt/***/data/***-25/00000000000001080046.timeindex, 
> /opt/***/data/***-25/00000000000001080046.index and rebuilding index... 
> (kafka.log.Log)
>  [2018-07-10 17:36:34,232] INFO Recovering unflushed segment 1080046 in log 
> ***-25. (kafka.log.Log)
>  [2018-07-10 17:36:48,261] INFO Completed load of log ***-25 with 1 log 
> segments and log end offset 2435166 in 28721 ms (kafka.log.Log)
>  [2018-07-10 17:36:48,266] WARN Found a corrupted index file due to 
> requirement failed: Corrupt index found, index file 
> (/opt/***/data/***-22/00000000000002438178.index) has non-zero size but the 
> last offset is 2438178 which is no larger than the base offset 2438178.}. 
> deleting /opt/***/data/***-22/00000000000002438178.timeindex, 
> /opt/***/data/***-22/00000000000002438178.index and rebuilding index... 
> (kafka.log.Log)
>  [2018-07-10 17:36:48,267] INFO Recovering unflushed segment 1081500 in log 
> ***-22. (kafka.log.Log)
>  [2018-07-10 17:37:02,803] INFO Recovering unflushed segment 2438178 in log 
> ***-22. (kafka.log.Log)
>  [2018-07-10 17:37:02,803] INFO Completed load of log ***-22 with 2 log 
> segments and log end offset 2438180 in 14540 ms (kafka.log.Log)
>  [2018-07-10 17:37:02,807] WARN Found a corrupted index file due to 
> requirement failed: Corrupt index found, index file 
> (/opt/***/data/***-19/00000000000001081412.index) has non-zero size but the 
> last offset is 1081412 which is no larger than the base offset 1081412.}. 
> deleting /opt/***/data/***-19/00000000000001081412.timeindex, 
> /opt/***/data/***-19/00000000000001081412.index and rebuilding index... 
> (kafka.log.Log)
>  [2018-07-10 17:37:17,536] INFO Recovering unflushed segment 1081412 in log 
> ***-19. (kafka.log.Log)
>  [2018-07-10 17:37:31,570] INFO Completed load of log ***-19 with 1 log 
> segments and log end offset 2443668 in 28764 ms (kafka.log.Log)
>  [2018-07-10 17:37:31,573] WARN Found a corrupted index file due to 
> requirement failed: Corrupt index found, index file 
> (/opt/***/data/***-35/00000000000002440999.index) has non-zero size but the 
> last offset is 2440999 which is no larger than the base offset 2440999.}. 
> deleting /opt/***/data/***-35/00000000000002440999.timeindex, 
> /opt/***/data/***-35/00000000000002440999.index and rebuilding index... 
> (kafka.log.Log)
>  
>  
> 10、Then I checked the CPU: it was at 100% usage while loading logs. Using
> top -H -p and jstack I found the busy thread:
> "pool-3-thread-1" #20 prio=5 os_prio=0 tid=0x00007f7041537000 nid=0x5c3a 
> runnable [0x00007f703042e000]
> java.lang.Thread.State: RUNNABLE
> at sun.nio.ch.FileDispatcherImpl.pread0(Native Method)
> at sun.nio.ch.FileDispatcherImpl.pread(FileDispatcherImpl.java:52)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:220)
> at sun.nio.ch.IOUtil.read(IOUtil.java:197)
> at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:741)
> at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:727)
> at kafka.log.FileMessageSet$$anon$1.makeNext(FileMessageSet.scala:319)
> at kafka.log.FileMessageSet$$anon$1.makeNext(FileMessageSet.scala:308)
> at kafka.utils.IteratorTemplate.maybeComputeNext(IteratorTemplate.scala:64)
> at kafka.utils.IteratorTemplate.hasNext(IteratorTemplate.scala:56)
> at kafka.log.LogSegment.recover(LogSegment.scala:211)
> at kafka.log.Log$$anonfun$loadSegments$4.apply(Log.scala:218)
> at kafka.log.Log$$anonfun$loadSegments$4.apply(Log.scala:179)
> at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
> at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
> at kafka.log.Log.loadSegments(Log.scala:179)
> at kafka.log.Log.<init>(Log.scala:108)
> at 
> kafka.log.LogManager$$anonfun$loadLogs$2$$anonfun$3$$anonfun$apply$10$$anonfun$apply$1.apply$mcV$sp(LogManager.scala:151)
> at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:58)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:748)
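>
> (A rough estimate, assuming most logs need recovery after the unclean
> shutdown: the busy "pool-3-thread-1" belongs to the log-recovery pool that
> LogManager.loadLogs creates, whose size is controlled by the broker setting
> num.recovery.threads.per.data.dir (default 1). With roughly 375 logs per
> broker (see the count under 5、Brokers & topics) and ~15 s of recovery per
> log as the timestamps above show, a single recovery thread needs on the
> order of 375 × 15 s ≈ 90 minutes to reload everything. Raising the setting,
> e.g. num.recovery.threads.per.data.dir=4 in server.properties, lets segments
> be rescanned in parallel.)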



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
