[ https://issues.apache.org/jira/browse/KAFKA-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731119#comment-17731119 ]

Jun Rao commented on KAFKA-15046:
---------------------------------

[~ocadaruma] : Thanks for identifying the problem. Another way to improve this 
is to move the LeaderEpochFile flushing logic to be part of the flushing of 
rolled segments. Currently, we flush each rolled segment in the background 
thread already. We could just make sure that we also flush the LeaderEpochFile 
there before advancing the recovery point. This way, we don't need to introduce 
a separate background task.
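
To make the intended ordering concrete, here is a minimal, self-contained sketch 
(the Segment/Checkpoint interfaces and method names are stand-ins, not the actual 
UnifiedLog / LeaderEpochFileCache API):
{code:java}
import java.io.IOException;

// Sketch of the proposed ordering: flush the rolled segment and the leader
// epoch checkpoint file in the existing background flush task, and only then
// advance the recovery point, so no separate background task is needed.
class RolledSegmentFlusher {
    interface Segment { void flush() throws IOException; }    // fsync of log + indexes
    interface Checkpoint { void flush() throws IOException; } // fsync of leader-epoch file

    private final Checkpoint leaderEpochCheckpoint;
    private volatile long recoveryPoint;

    RolledSegmentFlusher(Checkpoint leaderEpochCheckpoint, long initialRecoveryPoint) {
        this.leaderEpochCheckpoint = leaderEpochCheckpoint;
        this.recoveryPoint = initialRecoveryPoint;
    }

    // Runs in the background thread that already flushes rolled segments,
    // not on the request-handler path.
    void flushRolledSegment(Segment rolledSegment, long newRecoveryPoint) throws IOException {
        rolledSegment.flush();            // flush the rolled segment, as today
        leaderEpochCheckpoint.flush();    // also flush the leader epoch file here
        recoveryPoint = newRecoveryPoint; // only then advance the recovery point
    }
}
{code}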

> Produce performance issue under high disk load
> ----------------------------------------------
>
>                 Key: KAFKA-15046
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15046
>             Project: Kafka
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 3.3.2
>            Reporter: Haruki Okada
>            Assignee: Haruki Okada
>            Priority: Major
>              Labels: performance
>         Attachments: image-2023-06-01-12-46-30-058.png, 
> image-2023-06-01-12-52-40-959.png, image-2023-06-01-12-54-04-211.png, 
> image-2023-06-01-12-56-19-108.png
>
>
> * Phenomenon:
>  ** !image-2023-06-01-12-46-30-058.png|width=259,height=236!
>  ** The producer response time 99th percentile got much worse while we performed 
> replica reassignment on the cluster
>  *** The RequestQueue scope was the significant part
>  ** Request-time throttling also happened at the time of the incident, causing 
> producers to delay sending messages.
>  ** The disk I/O latency was higher than usual due to the high load for 
> replica reassignment.
>  *** !image-2023-06-01-12-56-19-108.png|width=255,height=128!
>  * Analysis:
>  ** The request-handler utilization was much higher than usual.
>  *** !image-2023-06-01-12-52-40-959.png|width=278,height=113!
>  ** Also, thread-time utilization was much higher than usual for almost all 
> users
>  *** !image-2023-06-01-12-54-04-211.png|width=276,height=110!
>  ** From taking jstack several times, we found that in most of the dumps a 
> request handler was doing fsync to flush the ProducerState snapshot while other 
> request handlers were waiting on Log#lock to append messages.
>  * 
>  ** 
>  *** 
> {code:java}
> "data-plane-kafka-request-handler-14" #166 daemon prio=5 os_prio=0 
> cpu=51264789.27ms elapsed=599242.76s tid=0x00007efdaeba7770 nid=0x1e704 
> runnable  [0x00007ef9a12e2000]
>    java.lang.Thread.State: RUNNABLE
>         at sun.nio.ch.FileDispatcherImpl.force0(java.base@11.0.17/Native 
> Method)
>         at 
> sun.nio.ch.FileDispatcherImpl.force(java.base@11.0.17/FileDispatcherImpl.java:82)
>         at 
> sun.nio.ch.FileChannelImpl.force(java.base@11.0.17/FileChannelImpl.java:461)
>         at 
> kafka.log.ProducerStateManager$.kafka$log$ProducerStateManager$$writeSnapshot(ProducerStateManager.scala:451)
>         at 
> kafka.log.ProducerStateManager.takeSnapshot(ProducerStateManager.scala:754)
>         at kafka.log.UnifiedLog.roll(UnifiedLog.scala:1544)
>         - locked <0x000000060d75d820> (a java.lang.Object)
>         at kafka.log.UnifiedLog.maybeRoll(UnifiedLog.scala:1523)
>         - locked <0x000000060d75d820> (a java.lang.Object)
>         at kafka.log.UnifiedLog.append(UnifiedLog.scala:919)
>         - locked <0x000000060d75d820> (a java.lang.Object)
>         at kafka.log.UnifiedLog.appendAsLeader(UnifiedLog.scala:760)
>         at 
> kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:1170)
>         at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:1158)
>         at 
> kafka.server.ReplicaManager.$anonfun$appendToLocalLog$6(ReplicaManager.scala:956)
>         at 
> kafka.server.ReplicaManager$$Lambda$2379/0x0000000800b7c040.apply(Unknown 
> Source)
>         at 
> scala.collection.StrictOptimizedMapOps.map(StrictOptimizedMapOps.scala:28)
>         at 
> scala.collection.StrictOptimizedMapOps.map$(StrictOptimizedMapOps.scala:27)
>         at scala.collection.mutable.HashMap.map(HashMap.scala:35)
>         at 
> kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:944)
>         at kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:602)
>         at kafka.server.KafkaApis.handleProduceRequest(KafkaApis.scala:666)
>         at kafka.server.KafkaApis.handle(KafkaApis.scala:175)
>         at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:75)
>         at java.lang.Thread.run(java.base@11.0.17/Thread.java:829) {code}
>  * 
>  ** Also, there were a bunch of log entries showing that writing producer 
> snapshots took hundreds of milliseconds.
>  *** 
> {code:java}
> ...
> [2023-05-01 11:08:36,689] INFO [ProducerStateManager partition=xxx-4] Wrote producer snapshot at offset 1748817854 with 8 producer ids in 809 ms. (kafka.log.ProducerStateManager)
> [2023-05-01 11:08:37,319] INFO [ProducerStateManager partition=yyy-34] Wrote producer snapshot at offset 247996937813 with 0 producer ids in 547 ms. (kafka.log.ProducerStateManager)
> [2023-05-01 11:08:38,887] INFO [ProducerStateManager partition=zzz-9] Wrote producer snapshot at offset 226222355404 with 0 producer ids in 576 ms. (kafka.log.ProducerStateManager)
> ...
> {code}
>  * From the analysis, we summarized the issue as below:
>  * 
>  ** 1. Disk write latency got worse due to the replica reassignment
>  *** We already use a replication quota, and lowering the quota further is not 
> acceptable because the reassignment would then take too long
>  ** 2. ProducerStateManager#takeSnapshot started to take a long time due to fsync 
> latency
>  *** This is done at every log segment roll.
>  *** In our case, the broker hosts high-load partitions, so log rolls occur very 
> frequently.
>  ** 3. While ProducerStateManager#takeSnapshot is doing fsync, all subsequent 
> produce requests to the partition are blocked on Log#lock
>  ** 4. While produce requests wait for the lock, they consume request-handler 
> thread time, which is accounted as thread-time utilization and triggered the 
> throttling (a simplified sketch of this blocking pattern follows the list)
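>
> A simplified, self-contained model of points 2-4 above (not Kafka's actual 
> UnifiedLog code; the class, method, and field names here are stand-ins): one 
> thread fsyncs a snapshot while holding the per-log lock, and every other append 
> to the same partition blocks on that lock for the full duration of the fsync, 
> still consuming request-handler thread time while it waits.
> {code:java}
> import java.nio.ByteBuffer;
> import java.nio.channels.FileChannel;
> import java.nio.file.Files;
> import java.nio.file.StandardOpenOption;
>
> // Illustration only: an fsync performed while holding the per-log lock
> // blocks every concurrent append to the same partition.
> public class FsyncUnderLockDemo {
>     private final Object logLock = new Object(); // stands in for Log#lock
>
>     // Called on a request-handler thread when a segment rolls.
>     void rollAndSnapshot(FileChannel snapshotChannel) throws Exception {
>         synchronized (logLock) {
>             snapshotChannel.write(ByteBuffer.wrap(new byte[64])); // write the snapshot
>             snapshotChannel.force(true); // fsync; can take 100s of ms on a loaded disk
>         }
>     }
>
>     // Called on other request-handler threads for produce requests.
>     void append() {
>         synchronized (logLock) {
>             // Blocked for the whole duration of the fsync above; the wait is
>             // still accounted as request-handler thread time.
>         }
>     }
>
>     public static void main(String[] args) throws Exception {
>         FsyncUnderLockDemo demo = new FsyncUnderLockDemo();
>         try (FileChannel ch = FileChannel.open(
>                 Files.createTempFile("producer-snapshot", ".tmp"),
>                 StandardOpenOption.WRITE)) {
>             Thread roller = new Thread(() -> {
>                 try {
>                     demo.rollAndSnapshot(ch);
>                 } catch (Exception e) {
>                     throw new RuntimeException(e);
>                 }
>             });
>             Thread producer = new Thread(demo::append);
>             roller.start();
>             producer.start();
>             roller.join();
>             producer.join();
>         }
>     }
> }
> {code}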
>  * Suggestion:
>  ** We didn't see this phenomenon when we used Kafka 2.4.1.
>  *** ProducerState fsync was introduced in 2.8.0 by this: 
> https://issues.apache.org/jira/browse/KAFKA-9892
>  ** The reason why the ProducerState snapshot needs to be fsync'ed is not well 
> described in the above ticket, but we think the fsync is not really necessary 
> here, because:
>  * 
>  ** 
>  *** If the ProducerState snapshot file was not written to the disk before a 
> power failure, it will just be rebuilt from the logs.
>  *** Also, even if the ProducerState snapshot was corrupted by a power failure, 
> it will be rebuilt from the logs thanks to the CRC check (a minimal sketch of 
> this recovery path follows).
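>
> A minimal sketch of this recovery reasoning (the [4-byte CRC][payload] layout 
> below is illustrative only, not Kafka's actual snapshot file format): a snapshot 
> that is missing, truncated, or fails its CRC check after a power failure is 
> simply discarded and the state is rebuilt from the log, so an fsync is not 
> needed for correctness.
> {code:java}
> import java.io.IOException;
> import java.nio.ByteBuffer;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.util.zip.CRC32;
>
> // Illustration only: load a checksummed snapshot if it is intact,
> // otherwise fall back to rebuilding the state from the log.
> class SnapshotLoader {
>     byte[] loadOrRebuild(Path snapshotFile) throws IOException {
>         if (Files.exists(snapshotFile)) {
>             byte[] bytes = Files.readAllBytes(snapshotFile);
>             if (bytes.length >= 4) {
>                 ByteBuffer buf = ByteBuffer.wrap(bytes);
>                 long storedCrc = Integer.toUnsignedLong(buf.getInt());
>                 byte[] payload = new byte[bytes.length - 4];
>                 buf.get(payload);
>                 CRC32 crc = new CRC32();
>                 crc.update(payload);
>                 if (crc.getValue() == storedCrc) {
>                     return payload; // snapshot is intact, use it
>                 }
>             }
>             // Missing data or CRC mismatch (e.g. a torn write after power
>             // loss): treat it as if no snapshot exists.
>         }
>         return rebuildFromLog();
>     }
>
>     private byte[] rebuildFromLog() {
>         // In Kafka this means replaying log segments to reconstruct the
>         // producer state; modeled here as an empty state.
>         return new byte[0];
>     }
> }
> {code}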


