shenwenbing created KAFKA-10672:
-----------------------------------

             Summary: Restarting Kafka always takes a lot of time
                 Key: KAFKA-10672
                 URL: https://issues.apache.org/jira/browse/KAFKA-10672
             Project: Kafka
          Issue Type: Improvement
          Components: core
    Affects Versions: 2.0.0
         Environment: A cluster of 21 Kafka nodes;
                      Each node has 12 disks;
                      Each node has about 1500 partitions;
                      There are approximately 700 leader partitions per node;
                      Slow-loading partitions have about 1000 log segments;
            Reporter: shenwenbing
         Attachments: server.log

When the producer snapshot file does not exist, or the latest snapshot file is older than the current active segment, restoring producer state requires traversing the log segments and reading every batch in them. When a single broker node hosts many partitions, and therefore many log segments, this causes a very large number of I/O operations, because each read loads only one batch. A log segment usually holds tens of thousands of batches, and from the code I found that each batch needs at least two I/O operations. With the default batch size of 16 KB and a 1 GB log segment, the segment holds 65,536 batches, so at least 65,536 * 2 = 131,072 I/O operations are needed. As a result, a lot of time is spent in the Kafka startup process.

We configured 15 log recovery threads in the production environment, and it still took more than 2 hours to load a partition. Can the community suggest a workaround or an improvement for this situation? For detailed logs, see the entries for the test-perf-18 partitions in the attached server.log.
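The estimate above can be reproduced with a small sketch. It is not Kafka code: the class and method names are hypothetical, it assumes every batch is exactly 16 KB and every segment is exactly 1 GB, and the last figure further assumes all of the roughly 1000 segments of a slow partition are full and all have to be scanned, which the report does not state.

```java
public class RecoveryIoEstimate {

    // Batches in one segment, assuming every batch is exactly batchBytes long.
    static long batchesPerSegment(long segmentBytes, long batchBytes) {
        return segmentBytes / batchBytes;
    }

    // Lower bound on reads for one segment, given a fixed number of reads per batch.
    static long minReads(long segmentBytes, long batchBytes, int readsPerBatch) {
        return batchesPerSegment(segmentBytes, batchBytes) * readsPerBatch;
    }

    public static void main(String[] args) {
        long segmentBytes = 1L << 30;   // 1 GB segment, as in the report
        long batchBytes = 16L * 1024;   // 16 KB batch, as in the report
        int readsPerBatch = 2;          // "at least two IO operations" per batch
        int segments = 1000;            // assumption: every segment is full and scanned

        long perSegment = minReads(segmentBytes, batchBytes, readsPerBatch);
        System.out.println("batches per segment: " + batchesPerSegment(segmentBytes, batchBytes));
        System.out.println("reads per segment:   " + perSegment);
        System.out.println("reads per partition: " + perSegment * segments);
    }
}
```

Under these assumptions the per-segment figures match the ones in the description (65,536 batches and 131,072 reads). Real batches vary in size, so the numbers are only an order-of-magnitude guide.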