li4wang commented on PR #2086: URL: https://github.com/apache/zookeeper/pull/2086#issuecomment-1853151200
We also looked into how to enable Prometheus metrics in production, and recently did a lot of performance testing and profiling.

1. The metrics queue size is 1M by default, and it can be tuned. A queue size of 1M seems too large. When we reduced the queue size from 1M to 100K, the max GC pause dropped by 78% and the GC count dropped by 80%.

2. We also noticed that when the thread pool queue is full, a large number of `RejectedExecutionException` instances are created, which adds more GC overhead. This is because `ThreadPoolExecutor` uses `AbortPolicy` as its default `RejectedExecutionHandler`. `AbortPolicy` instantiates a `RejectedExecutionException` object and makes two fairly involved `toString` calls:

   ```
   public void rejectedExecution(Runnable r, ThreadPoolExecutor e) {
       throw new RejectedExecutionException("Task " + r.toString() +
                                            " rejected from " +
                                            e.toString());
   }
   ```

3. We created a patch that uses `DiscardPolicy` instead of `AbortPolicy`. `DiscardPolicy` silently drops the rejected task instead of throwing `RejectedExecutionException`. With the patch, for a 100K queue size, the max GC pause was reduced by about a further 7% and the GC count by about 61%. As a result, the latency of read operations was reduced by 59% and throughput increased by 140%.
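For readers who want to see the rejection behavior in isolation, here is a minimal sketch of a bounded `ThreadPoolExecutor` that uses `ThreadPoolExecutor.DiscardPolicy`. It is not the code from the patch: the class name `DiscardPolicySketch`, the helper `submitWhileBlocked`, the single worker thread, and the tiny queue capacity are all assumptions made for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; not part of ZooKeeper or of the patch.
final class DiscardPolicySketch {

    // Builds a single-threaded executor with a bounded queue.
    // DiscardPolicy drops rejected tasks without allocating an exception.
    static ThreadPoolExecutor newExecutor(int queueCapacity) {
        return new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.DiscardPolicy());
    }

    // Keeps the only worker busy, then submits queueCapacity + extra tasks.
    // Returns how many of those tasks actually ran.
    static int submitWhileBlocked(int queueCapacity, int extra)
            throws InterruptedException {
        ThreadPoolExecutor executor = newExecutor(queueCapacity);
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        AtomicInteger ran = new AtomicInteger();

        // Occupy the single worker so later tasks must go to the queue.
        executor.execute(() -> {
            started.countDown();
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        started.await();

        // The first queueCapacity tasks are queued; the rest are rejected
        // and silently discarded, so no exception reaches the caller.
        for (int i = 0; i < queueCapacity + extra; i++) {
            executor.execute(ran::incrementAndGet);
        }

        release.countDown();
        executor.shutdown();
        executor.awaitTermination(10, TimeUnit.SECONDS);
        return ran.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("tasks run: " + submitWhileBlocked(4, 10));
    }
}
```

In this sketch only as many tasks as fit in the queue are executed, and the overflow is dropped without any `RejectedExecutionException` being constructed. The trade-off is that dropped tasks are invisible unless the caller counts them separately.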