[ 
https://issues.apache.org/jira/browse/KAFKA-15046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17728275#comment-17728275
 ] 

Divij Vaidya commented on KAFKA-15046:
--------------------------------------

Yes, that is right; leaderEpochCheckpoint is another I/O operation that Kafka 
performs while holding the global partition lock. 

IMO, we need to move to async disk I/O using 
[io_uring|https://unixism.net/loti/what_is_io_uring.html] to prevent thread 
blocking (and lock contention) while performing disk I/O. That will be a single 
solution for all I/O operations we perform from Kafka.

Alternative solutions are:
1. Delegate all disk flush operations to a separate thread pool so that the 
lock can be released while the flush runs (this is the strategy that the 
currently open PR and the existing disk-flush mechanism follow); see the 
sketch after this list.
2. Replace the coarse-grained lock over a single partition with fine-grained 
locks. As an example, flushing the leader epoch should not require the same 
lock as writing to a segment, since the two do not depend on each other.
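
To make alternative 1 concrete, here is a minimal sketch in Java (the class, 
field, and method names are hypothetical; this is not Kafka's actual code or 
the open PR): the snapshot bytes are written while the log lock is held, but 
the blocking force() is handed to a dedicated flush pool so the lock is 
released before the fsync runs.

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class DeferredFlushSketch {
    private final Object logLock = new Object();   // stands in for Log#lock
    private final ExecutorService flushPool = Executors.newFixedThreadPool(2);

    public CompletableFuture<Void> takeSnapshot(Path snapshotFile, byte[] snapshotBytes)
            throws IOException {
        FileChannel channel = FileChannel.open(snapshotFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        synchronized (logLock) {
            // Cheap work (serialization, writing into the page cache) stays under the lock...
            channel.write(ByteBuffer.wrap(snapshotBytes));
        }
        // ...while the blocking fsync runs on the flush pool after the lock is released.
        return CompletableFuture.runAsync(() -> {
            try (channel) {
                channel.force(true);
            } catch (IOException e) {
                throw new RuntimeException("snapshot flush failed", e);
            }
        }, flushPool);
    }
}
{code}

The caller can either ignore the returned future (making the flush best-effort, 
in line with the argument in the issue description that the fsync is not 
strictly required for correctness) or complete the produce response only after 
it finishes.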

[~junrao] do you have any thoughts on how to proceed with fixing this?

> Produce performance issue under high disk load
> ----------------------------------------------
>
>                 Key: KAFKA-15046
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15046
>             Project: Kafka
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 3.3.2
>            Reporter: Haruki Okada
>            Priority: Major
>              Labels: performance
>         Attachments: image-2023-06-01-12-46-30-058.png, 
> image-2023-06-01-12-52-40-959.png, image-2023-06-01-12-54-04-211.png, 
> image-2023-06-01-12-56-19-108.png
>
>
> * Phenomenon:
>  ** !image-2023-06-01-12-46-30-058.png|width=259,height=236!
>  ** The 99th-percentile producer response time degraded badly while we 
> performed replica reassignment on the cluster
>  *** The RequestQueue scope was the significant part
>  ** Request-time throttling also kicked in during the incident, which caused 
> producers to delay sending messages in the meantime.
>  ** The disk I/O latency was higher than usual due to the high load from 
> replica reassignment.
>  *** !image-2023-06-01-12-56-19-108.png|width=255,height=128!
>  * Analysis:
>  ** The request-handler utilization was much higher than usual.
>  *** !image-2023-06-01-12-52-40-959.png|width=278,height=113!
>  ** Thread time utilization was also much higher than usual for almost all 
> users
>  *** !image-2023-06-01-12-54-04-211.png|width=276,height=110!
>  ** From taking jstack several times, we found in most of the dumps that one 
> request handler was doing fsync to flush the ProducerState while other 
> request handlers were waiting on Log#lock to append messages.
>  * 
>  ** 
>  *** 
> {code:java}
> "data-plane-kafka-request-handler-14" #166 daemon prio=5 os_prio=0 
> cpu=51264789.27ms elapsed=599242.76s tid=0x00007efdaeba7770 nid=0x1e704 
> runnable  [0x00007ef9a12e2000]
>    java.lang.Thread.State: RUNNABLE
>         at sun.nio.ch.FileDispatcherImpl.force0(java.base@11.0.17/Native 
> Method)
>         at 
> sun.nio.ch.FileDispatcherImpl.force(java.base@11.0.17/FileDispatcherImpl.java:82)
>         at 
> sun.nio.ch.FileChannelImpl.force(java.base@11.0.17/FileChannelImpl.java:461)
>         at 
> kafka.log.ProducerStateManager$.kafka$log$ProducerStateManager$$writeSnapshot(ProducerStateManager.scala:451)
>         at 
> kafka.log.ProducerStateManager.takeSnapshot(ProducerStateManager.scala:754)
>         at kafka.log.UnifiedLog.roll(UnifiedLog.scala:1544)
>         - locked <0x000000060d75d820> (a java.lang.Object)
>         at kafka.log.UnifiedLog.maybeRoll(UnifiedLog.scala:1523)
>         - locked <0x000000060d75d820> (a java.lang.Object)
>         at kafka.log.UnifiedLog.append(UnifiedLog.scala:919)
>         - locked <0x000000060d75d820> (a java.lang.Object)
>         at kafka.log.UnifiedLog.appendAsLeader(UnifiedLog.scala:760)
>         at 
> kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:1170)
>         at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:1158)
>         at 
> kafka.server.ReplicaManager.$anonfun$appendToLocalLog$6(ReplicaManager.scala:956)
>         at 
> kafka.server.ReplicaManager$$Lambda$2379/0x0000000800b7c040.apply(Unknown 
> Source)
>         at 
> scala.collection.StrictOptimizedMapOps.map(StrictOptimizedMapOps.scala:28)
>         at 
> scala.collection.StrictOptimizedMapOps.map$(StrictOptimizedMapOps.scala:27)
>         at scala.collection.mutable.HashMap.map(HashMap.scala:35)
>         at 
> kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:944)
>         at kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:602)
>         at kafka.server.KafkaApis.handleProduceRequest(KafkaApis.scala:666)
>         at kafka.server.KafkaApis.handle(KafkaApis.scala:175)
>         at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:75)
>         at java.lang.Thread.run(java.base@11.0.17/Thread.java:829) {code}
>  * 
>  ** There were also a bunch of log lines showing that writing producer 
> snapshots took hundreds of milliseconds.
>  *** 
> {code:java}
> ...
> [2023-05-01 11:08:36,689] INFO [ProducerStateManager partition=xxx-4] Wrote 
> producer snapshot at offset 1748817854 with 8 producer ids in 809 ms. 
> (kafka.log.ProducerStateManager)
> [2023-05-01 11:08:37,319] INFO [ProducerStateManager partition=yyy-34] Wrote 
> producer snapshot at offset 247996937813 with 0 producer ids in 547 ms. 
> (kafka.log.ProducerStateManager)
> [2023-05-01 11:08:38,887] INFO [ProducerStateManager partition=zzz-9] Wrote 
> producer snapshot at offset 226222355404 with 0 producer ids in 576 ms. 
> (kafka.log.ProducerStateManager)
> ... {code}
>  * From the analysis, we summarized the issue as below:
>  * 
>  ** 1. Disk write latency got worse due to the replica reassignment
>  *** We already use a replication quota, and lowering the quota further may 
> not be acceptable because the reassignment would then take too long
>  ** 2. ProducerStateManager#takeSnapshot started to take time due to fsync 
> latency
>  *** This is done at every log segment roll.
>  *** In our case, the broker hosts high-load partitions, so log rolls occur 
> very frequently.
>  ** 3. While ProducerStateManager#takeSnapshot is doing fsync, all 
> subsequent produce requests to the partition are blocked on Log#lock
>  ** 4. While produce requests wait for the lock, they consume request-handler 
> thread time, so it is accounted against the thread-time quota and causes 
> throttling
>  * Suggestion:
>  ** We didn't see this phenomenon when we used Kafka 2.4.1.
>  *** ProducerState fsync was introduced in 2.8.0 by this: 
> https://issues.apache.org/jira/browse/KAFKA-9892
>  ** Although the reason why the ProducerState needs to be fsync'd is not 
> well described in the above ticket, we think fsync is not really necessary 
> here, because:
>  * 
>  ** 
>  *** If the ProducerState snapshot file was not written to the disk before a 
> power failure, it will simply be rebuilt from the logs.
>  *** Even if the ProducerState snapshot was corrupted by a power failure, it 
> will be rebuilt from the logs thanks to the CRC check; a rough sketch of that 
> recovery path follows below.
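>  *** For illustration, a rough sketch in Java of that recovery path (the 
> file layout and names below are made up for the example, not Kafka's actual 
> snapshot format or code): the loader validates the snapshot's CRC and tells 
> the caller to rebuild the state from the log when the file is missing, 
> truncated, or corrupted.
> {code:java}
> import java.io.IOException;
> import java.nio.ByteBuffer;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.Paths;
> import java.util.zip.CRC32C;
> 
> public class SnapshotRecoverySketch {
> 
>     /** Returns the snapshot payload, or null if the file is absent or corrupted. */
>     static byte[] loadSnapshotOrNull(Path snapshotFile) throws IOException {
>         if (!Files.exists(snapshotFile))
>             return null;                  // never reached disk -> rebuild from the log
>         byte[] raw = Files.readAllBytes(snapshotFile);
>         if (raw.length < Long.BYTES)
>             return null;                  // truncated -> rebuild from the log
>         ByteBuffer buf = ByteBuffer.wrap(raw);
>         long storedCrc = buf.getLong();   // checksum stored next to the payload
>         byte[] payload = new byte[raw.length - Long.BYTES];
>         buf.get(payload);
>         CRC32C crc = new CRC32C();
>         crc.update(payload, 0, payload.length);
>         if (crc.getValue() != storedCrc)
>             return null;                  // corrupted by the crash -> rebuild from the log
>         return payload;
>     }
> 
>     public static void main(String[] args) throws IOException {
>         byte[] payload = loadSnapshotOrNull(Paths.get(args[0]));
>         if (payload == null)
>             System.out.println("Snapshot unusable; rebuild producer state from the log");
>         else
>             System.out.println("Snapshot loaded: " + payload.length + " bytes");
>     }
> } {code}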



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
