Hi folks, I ran some rough benchmarks to get an idea of where Zaqar currently stands with respect to latency and throughput for Juno. These results are by no means conclusive, but I wanted to publish what I had so far for the sake of discussion.
Note that these tests do not include results for our new Redis driver, but I hope to make those available soon. As always, the usual disclaimers apply (i.e., benchmarks mostly amount to lies; these numbers are only intended to provide a ballpark reference; you should perform your own tests, simulating your specific scenarios and using your own hardware; etc.).

## Setup ##

Rather than VMs, I provisioned some Rackspace OnMetal servers to mitigate noisy-neighbor effects while running the performance tests:

* 1x Load Generator
    * Hardware
        * 1x Intel Xeon E5-2680 v2 2.8 GHz
        * 32 GB RAM
        * 10Gbps NIC
        * 32GB SATADOM
    * Software
        * Debian Wheezy
        * Python 2.7.3
        * zaqar-bench from trunk with some extra patches
* 1x Web Head
    * Hardware
        * 1x Intel Xeon E5-2680 v2 2.8 GHz
        * 32 GB RAM
        * 10Gbps NIC
        * 32GB SATADOM
    * Software
        * Debian Wheezy
        * Python 2.7.3
        * zaqar server from trunk @47e07cad
            * storage=mongodb
            * partitions=4
            * MongoDB URI configured with w=majority
        * uWSGI + gevent
            * config: http://paste.openstack.org/show/100592/
            * app.py: http://paste.openstack.org/show/100593/
* 3x MongoDB Nodes
    * Hardware
        * 2x Intel Xeon E5-2680 v2 2.8 GHz
        * 128 GB RAM
        * 10Gbps NIC
        * 2x LSI Nytro WarpDrive BLP4-1600
    * Software
        * Debian Wheezy
        * mongod 2.6.4
            * Default config, except setting replSet and enabling periodic logging of CPU and I/O
            * Journaling enabled
            * Profiling on message DBs enabled for requests over 10ms

For generating the load, I used the zaqar-bench tool we created during Juno as a stepping stone toward integration with Rally. Although the tool is still fairly rough, I thought it good enough to provide some useful data. The tool uses the python-zaqarclient library.

Note that I didn't push the servers particularly hard for these tests; the web head's CPUs averaged around 20% utilization, while the mongod primary's CPU usage peaked at around 10%, with DB locking peaking at 5%.

Several different messaging patterns were tested, taking inspiration from: https://wiki.openstack.org/wiki/Use_Cases_(Zaqar)

Each test was executed three times and the best time recorded. A ~1K sample message (1398 bytes) was used for all tests.

## Results ##

### Event Broadcasting (Read-Heavy) ###

OK, so let's say you have a somewhat low-volume source, but tons of event observers. In this case, the observers easily outpace the producer, making this a read-heavy workload.

Options

* 1 producer process with 5 gevent workers
    * 1 message posted per request
* 2 observer processes with 25 gevent workers each
    * 5 messages listed per request by the observers
* Load distributed across 4 queues
* 10-second duration

Results

* Producer: 2.2 ms/req, 454 req/sec
* Observer: 1.5 ms/req, 1224 req/sec

### Event Broadcasting (Balanced) ###

This test uses the same number of producer and observer processes, but note that the observers are still listing (up to) 5 messages at a time, so they still outpace the producers, though not as quickly as before.

Options

* 2 producer processes with 10 gevent workers each
    * 1 message posted per request
* 2 observer processes with 25 gevent workers each
    * 5 messages listed per request by the observers
* Load distributed across 4 queues
* 10-second duration

Results

* Producer: 2.2 ms/req, 883 req/sec
* Observer: 2.8 ms/req, 348 req/sec

### Point-to-Point Messaging ###

In this scenario I simulated one client sending messages directly to a different client. Only one queue is required in this case.

Note the higher latency. While running the test, there were 1-2 message posts that skewed the average by taking much longer (~100 ms) than the others to complete.
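As an aside, for anyone who hasn't used the client library: here is roughly what the producer, observer, and consumer worker roles in these tests boil down to when expressed with python-zaqarclient. This is only a minimal sketch, not the actual zaqar-bench code; the endpoint, queue name, message body, client UUID, and claim TTL/grace values are made-up placeholders, and the noauth conf keys and client calls are from memory of the v1 client API, so double-check them against the python-zaqarclient docs before relying on them.

    # Rough sketch of the three worker roles exercised by zaqar-bench,
    # using python-zaqarclient. All literal values below are placeholders,
    # and the conf keys are from memory -- check the zaqarclient docs.
    from zaqarclient.queues.v1 import client

    cli = client.Client('http://web-head:8888', conf={
        'auth_opts': {
            'backend': 'noauth',  # Keystone middleware was not enabled for these tests
            'options': {'client_uuid': 'bench-worker-1'},
        },
    })

    queue = cli.queue('bench-queue-1')

    # Producer: post a single ~1K message per request
    queue.post([{'body': {'event': 'sample', 'padding': 'x' * 1024}, 'ttl': 300}])

    # Observer: list up to 5 messages per request (without claiming them)
    for msg in queue.messages(limit=5, echo=True):
        print(msg.body)

    # Consumer (used in the task-distribution tests further down): claim a
    # batch of messages, then delete each one after it has been processed
    claim = queue.claim(ttl=300, grace=60, limit=5)
    for msg in claim:
        msg.delete()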
Such outliers are probably present in the other tests as well, and further investigation is needed to discover the root cause.

Options

* 1 producer process with 1 gevent worker
    * 1 message posted per request
* 1 observer process with 1 gevent worker
    * 1 message listed per request
* All load sent to a single queue
* 10-second duration

Results

* Producer: 5.5 ms/req, 179 req/sec
* Observer: 3.5 ms/req, 278 req/sec

### Task Distribution ###

This test uses several producers and consumers in order to simulate distributing tasks to a worker pool. In contrast to the observer worker type, consumers claim and delete messages in such a way that each message is processed once and only once.

Options

* 2 producer processes with 25 gevent workers each
    * 1 message posted per request
* 2 consumer processes with 25 gevent workers each
    * 5 messages claimed per request, then deleted one by one before claiming the next batch of messages
* Load distributed across 4 queues
* 10-second duration

Results

* Producer: 2.5 ms/req, 798 req/sec
* Consumer
    * Claim: 8.4 ms/req
    * Delete: 2.5 ms/req
    * 813 req/sec (overall)

### Auditing / Diagnostics ###

This test is the same as Task Distribution, but also adds a few observers to the mix.

Options

* 2 producer processes with 25 gevent workers each
    * 1 message posted per request
* 2 consumer processes with 25 gevent workers each
    * 5 messages claimed per request, then deleted one by one before claiming the next batch of messages
* 1 observer process with 5 gevent workers
    * 5 messages listed per request
* Load distributed across 4 queues
* 10-second duration

Results

* Producer: 2.2 ms/req, 878 req/sec
* Consumer
    * Claim: 8.2 ms/req
    * Delete: 2.3 ms/req
    * 876 req/sec (overall)
* Observer: 7.4 ms/req, 133 req/sec

## Conclusions ##

While more testing is needed to track performance against increasing load (spoiler: latency will increase), these initial results are encouraging; turning around requests in ~10 (or even ~20) ms is fast enough for a variety of use cases. I anticipate that enabling the Keystone middleware will add 1-2 ms per request (assuming tokens are cached).

Let's keep digging and see what we can learn, and what needs to be improved.

@kgriffs

--------

* https://review.openstack.org/#/c/116384/
* Yes, I know that's some crazy IOPS, but there is plenty of RAM to avoid paging, so you should be able to get similar results with some regular disks, assuming they are decent enough to support enabling journaling (if you need that level of durability).
* It would be interesting to verify the results presented here using Tsung and/or JMeter; zaqar-bench isn't particularly efficient, but it does provide the potential to do some interesting reporting, such as measuring the total end-to-end time of enqueuing and subsequently dequeuing each message (TODO). In any case, I'd love to see the team set up a benchmarking cluster that runs 2-3 tools regularly (or as part of every patch) and reports the results so we always know where we stand.
* Yes, I know this is a short duration; I'll try to do some longer tests in my next round of benchmarking.
* In a real app, messages will usually be requested in batches.
* In this test, the target client does not send a response message back to the sender. However, if it did, the test would still only require a single queue, since in Zaqar queues are duplex.
* Chosen somewhat arbitrarily.
* One might argue that the only thing these performance tests show is that *OnMetal* is fast.
However, as I pointed out, there was plenty of headroom left on these servers during the tests, so similar results should be achievable using more modest hardware.