Jason918 opened a new issue #12812:
URL: https://github.com/apache/pulsar/issues/12812


   **Is your enhancement request related to a problem? Please describe.**
   
   Optimize zookeeper client performance when loading a large number of topics.
   Our use case is described in #12651.
   
   **Describe the solution you'd like**
   
   Introduce [zk multi ops](http://zookeeper.apache.org/doc/r3.7.0/apidocs/zookeeper-server/org/apache/zookeeper/ZooKeeper.html#multi-java.lang.Iterable-org.apache.zookeeper.AsyncCallback.MultiCallback-java.lang.Object-) to optimize the zk client performance.
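
   For reference, here is a minimal, self-contained sketch of what a batched call through the multi API could look like. The connect string and znode paths are placeholders, and the sketch sticks to write ops, since read ops (e.g. `Op.getData`) are only accepted by multi on newer ZooKeeper releases such as the 3.6+/3.7 API linked above:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class MultiOpExample {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string; real code would also wait for SyncConnected.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> { });

        // Several ops packed into one request, i.e. one round trip to the server.
        // 256-byte payloads match the node size used in the perf test above.
        List<Op> batch = Arrays.asList(
                Op.create("/multi-demo-a", new byte[256], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT),
                Op.create("/multi-demo-b", new byte[256], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT),
                Op.setData("/multi-demo-a", new byte[256], -1));

        // Async overload linked above: multi(Iterable<Op>, MultiCallback, Object).
        CountDownLatch done = new CountDownLatch(1);
        zk.multi(batch, (rc, path, ctx, results) -> {
            System.out.println("multi completed with rc=" + KeeperException.Code.get(rc));
            done.countDown();
        }, null);
        done.await();

        zk.close();
    }
}
```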
   
   Our team ran a perf test on zk multi ops; here are the results.
   
   |Op type | Single Op (ops/s) | Batch 8 with Multi (ops/s)|
   |--|--|--|
   |Create|25k|60k|
   |Write (with 5M pre-created nodes)|25|90k|
   |Read|25k|200k|
   
   Here are the settings:
   - Single node size: 256 bytes
   - ZK server version: 3.4.6
   - CPU usage: 6 of 40 cores on a bare-metal server
   - OS: CentOS 7
   - ZK server disk: 8 TB NVMe SSD
   
   
   It's clear that with multi we can achieve much higher throughput at the same CPU usage, especially for read operations.
   
   The basic idea of the implementation is to add two queues (one for read ops and one for write ops) to PulsarZooKeeperClient. All zk ops will be added to a queue first, and a background thread will batch these requests and send them to the zk server in one "multi" op (see the sketch after this list).
   This implementation will introduce the following parameters in the broker config:
   - **enableAutoBatchZookeeperOps**: this feature is optional, as it may increase metadata latency when there is only a small number of topics.
   - **autoBatchZookeeperOpsMaxNum** and **autoBatchZookeeperOpsMaxDelayMills**: just like the auto-batching parameters in the Pulsar producer, these limit the max number of ops in one batch and the max delay to wait for a batch.
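
   To make the batching loop concrete, here is a rough sketch assuming a plain ZooKeeper handle and the two config values proposed above. The class, field, and method names (`ZkWriteOpBatcher`, `submit`, etc.) are illustrative only, not the actual PulsarZooKeeperClient API; a real implementation would also have to map each `OpResult` back to the caller's future and handle batches that grow larger than the server's request size limit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooKeeper;

/** Rough sketch of the proposed write-op batching loop (names are illustrative). */
class ZkWriteOpBatcher implements Runnable {

    private final ZooKeeper zk;
    private final LinkedBlockingQueue<Op> writeQueue = new LinkedBlockingQueue<>();

    // The two settings proposed in this issue.
    private final int autoBatchZookeeperOpsMaxNum;
    private final long autoBatchZookeeperOpsMaxDelayMills;

    ZkWriteOpBatcher(ZooKeeper zk, int maxNum, long maxDelayMills) {
        this.zk = zk;
        this.autoBatchZookeeperOpsMaxNum = maxNum;
        this.autoBatchZookeeperOpsMaxDelayMills = maxDelayMills;
    }

    /** Callers enqueue ops here instead of sending them to ZooKeeper directly. */
    void submit(Op op) {
        writeQueue.add(op);
    }

    @Override
    public void run() {
        List<Op> batch = new ArrayList<>();
        while (!Thread.currentThread().isInterrupted()) {
            try {
                // Block for the first op, then wait at most the configured delay
                // (or until the batch is full) before flushing.
                Op first = writeQueue.poll(1, TimeUnit.SECONDS);
                if (first == null) {
                    continue;
                }
                batch.add(first);
                long deadline = System.currentTimeMillis() + autoBatchZookeeperOpsMaxDelayMills;
                while (batch.size() < autoBatchZookeeperOpsMaxNum) {
                    long remaining = deadline - System.currentTimeMillis();
                    if (remaining <= 0) {
                        break;
                    }
                    Op next = writeQueue.poll(remaining, TimeUnit.MILLISECONDS);
                    if (next == null) {
                        break;
                    }
                    batch.add(next);
                }
                // One round trip to the server for the whole batch.
                zk.multi(batch);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (KeeperException e) {
                // Real code would complete each caller's future exceptionally.
                e.printStackTrace();
            } finally {
                batch.clear();
            }
        }
    }
}
```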
   
   **Describe alternatives you've considered**
   Add a rate limiter for topic loading; see #12651.
   
   **Additional context**
   Here is the explanation of why multi ops work much better for read ops:
   - All write ops need to go through a single-threaded processor on the master, and we had already reached its max qps in the previous perf test.
   - Read ops can be handled by the slave nodes, so we can get more qps with a larger batch size, as the previous perf test shows.
   
   
   Here is the flame graph of the bottleneck thread on the master.
   
![image](https://user-images.githubusercontent.com/2770146/141787613-80659ec7-1b0a-450e-8249-44d3ff57e867.png)
   

