Zhifeng Chen created KAFKA-16259:
------------------------------------

             Summary: Immutable MetadataCache to improve client performance
                 Key: KAFKA-16259
                 URL: https://issues.apache.org/jira/browse/KAFKA-16259
             Project: Kafka
          Issue Type: Improvement
          Components: clients
    Affects Versions: 2.8.0
            Reporter: Zhifeng Chen
         Attachments: image-2024-02-14-12-11-07-366.png

TL;DR: We identified a produce latency issue in the native Kafka producer, caused by synchronized lock contention between reads and writes of the metadata cache.

Trigger condition: A producer needs to produce to a large number of topics, such as in Kafka rest-proxy.

What is the producer metadata cache?

The Kafka producer maintains an in-memory copy of cluster metadata so that it does not have to fetch metadata every time it produces a message, which reduces latency.

What is the synchronized lock contention problem?

The Kafka producer metadata cache is a *mutable* object, and reads and writes are isolated by a synchronized lock. This means that while the metadata cache is being updated, all read requests are blocked.

The frequency of topic metadata expiration increases linearly with the number of topics. In a Kafka cluster with a large number of topic partitions, topic metadata expiration and refresh trigger very frequent metadata updates. When reads are blocked by an update, the producer threads are blocked, which causes high produce latency.

*Proposed solution*

TL;DR: Optimize the performance of metadata cache reads in the native Kafka producer with a copy-on-write strategy.

What is the copy-on-write strategy?

It reduces synchronized lock contention by making the object immutable and always creating a new instance on update. Because the object is immutable, reads never have to wait, so produce latency is reduced significantly.

Besides performance, it also protects the metadata cache from unexpected modification and reduces the occurrence of bugs caused by incorrect synchronization.

Test result:

Environment: Kafka-rest-proxy
Client version: 2.8.0
Number of topic partitions: 250k

The test results show a 90%+ latency reduction on the test cluster.

!image-2024-02-14-12-11-07-366.png!

P99 produce latency on deployed instances dropped from 200ms to 5ms (the upper part shows latency after the improvement, the lower part shows latency before the improvement).
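For illustration only, the copy-on-write idea described under the proposed solution can be sketched roughly as follows. This is not the actual patch and not Kafka's real MetadataCache API: the class names MetadataSnapshot, CopyOnWriteMetadataCache and CopyOnWriteDemo, and the simplified topic-to-partition-count map, are hypothetical stand-ins. The sketch assumes that readers only need a consistent snapshot and can tolerate seeing slightly stale metadata during an update.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified stand-in for cluster metadata: topic -> partition count.
// Instances are immutable, so any number of threads can read them without locking.
final class MetadataSnapshot {
    private final Map<String, Integer> partitionCounts;

    // Private: the map passed in must never be shared with callers.
    private MetadataSnapshot(Map<String, Integer> partitionCounts) {
        this.partitionCounts = Collections.unmodifiableMap(partitionCounts);
    }

    static MetadataSnapshot empty() {
        return new MetadataSnapshot(new HashMap<>());
    }

    // Returns null when the topic is not in this snapshot.
    Integer partitionCountFor(String topic) {
        return partitionCounts.get(topic);
    }

    // "Copy on write": build a new snapshot instead of mutating this one.
    MetadataSnapshot withTopic(String topic, int partitions) {
        Map<String, Integer> copy = new HashMap<>(partitionCounts);
        copy.put(topic, partitions);
        return new MetadataSnapshot(copy);
    }
}

// Hypothetical cache holder: readers take a reference, writers swap it atomically.
final class CopyOnWriteMetadataCache {
    private final AtomicReference<MetadataSnapshot> current =
            new AtomicReference<>(MetadataSnapshot.empty());

    // Lock-free read: a single volatile read of the current snapshot.
    MetadataSnapshot snapshot() {
        return current.get();
    }

    // The update function has no side effects, so it is safe if it is retried.
    void update(String topic, int partitions) {
        current.updateAndGet(old -> old.withTopic(topic, partitions));
    }
}

class CopyOnWriteDemo {
    public static void main(String[] args) {
        CopyOnWriteMetadataCache cache = new CopyOnWriteMetadataCache();
        cache.update("orders", 12);
        MetadataSnapshot before = cache.snapshot();
        cache.update("payments", 6);
        // A snapshot taken earlier is unaffected by later updates.
        System.out.println(before.partitionCountFor("payments"));           // prints "null"
        System.out.println(cache.snapshot().partitionCountFor("payments")); // prints "6"
    }
}
```

The trade-off in this sketch is that every update copies the whole map, so writes become more expensive while reads become wait-free. That fits the workload described above, where produce threads read the cache far more often than it is refreshed.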