[GitHub] davisp commented on issue #610: Optimize ddoc cache

git Fri, 21 Jul 2017 10:52:36 -0700

davisp commented on issue #610: Optimize ddoc cache
URL: https://github.com/apache/couchdb/pull/610#issuecomment-317068757

Performance Benchmarking Update: New one is faster!

And now for some graphs. For each of these graphs max_dbs_open was set
to 5,000. That's important to remember as I go through the parameter sets,
since hitting that limit would put an artificial cap on the performance of
the cache. Each of these tests is also just querying an empty
view, which leads to one ddoc_cache lookup for the coordinator and then Q
lookups, one for each RPC worker. So the requests per second numbers have to
be multiplied out to get actual cache performance. Granted, we really don't
care given that it's a constant factor.

These first two runs are 1,000 workers hitting 1,000 different design
documents in 1,000 different databases with two different Q values (i.e., 1,000
workers each using its own ddoc and db). For the most part their ops/sec is
basically identical while latencies for the new ddoc_cache are slightly better.
This suggests that something else was bottlenecking performance: possibly the
basho_bench driver, possibly something else in CouchDB.

Old ddoc_cache Q=4:

![old-ddoc-cache-multi-conc-1000-q-4](https://user-images.githubusercontent.com/19929/28473937-484ccfe2-6e0c-11e7-9b9b-bbb75c122d79.png)

New ddoc_cache Q=4:

![new-ddoc-cache-multi-conc-1000-q-4](https://user-images.githubusercontent.com/19929/28473949-51a22c4a-6e0c-11e7-8ec2-5da1a9aca6ca.png)

Old ddoc_cache Q=8:

![old-ddoc-cache-multi-1000-q-8](https://user-images.githubusercontent.com/19929/28473962-59e8c404-6e0c-11e7-89de-52cdbab4c8ab.png)

New ddoc_cache Q=8:

![new-ddoc-cache-multi-1000-q-8](https://user-images.githubusercontent.com/19929/28473982-64b34526-6e0c-11e7-8310-9f8f1e611114.png)

So basically, trying to hit a bunch of different databases at once now has
slightly better latencies but both are fast enough to not be the bottleneck.

Next up was a shoot-the-moon test to find the limits of what sort of
performance we could get trying to crush a single entry. These two graphs show
1,000 workers hitting a single design document in a single database. The Q for
both of these is 128, which means that the ddoc_cache is going to have to try
and maintain 129,000 lookups/sec (129 lookups per request across 1,000
workers).
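
The 129,000 figure follows from the lookup pattern described earlier: one ddoc_cache lookup on the coordinator plus one on each of the Q RPC workers. Here is a minimal sketch of that arithmetic. The function names are made up for illustration, and the rate of one request per worker per second is an assumption used only to reproduce the number, not a measured value.

```python
def lookups_per_request(q):
    """ddoc_cache lookups triggered by one view request.

    One lookup happens on the coordinator and one on each of the Q RPC
    workers, as described for these benchmarks.
    """
    return 1 + q


def lookups_per_second(requests_per_second, q):
    """Scale a request rate to the implied ddoc_cache lookup rate."""
    return requests_per_second * lookups_per_request(q)


# Assumption for illustration: 1,000 workers each finishing one request/sec.
print(lookups_per_second(1000, 128))  # 129000
```

The same helper can be applied to the ops/sec read off any of the graphs to estimate the cache lookup rate for that run.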

Old ddoc_cache:

![old-ddoc-cache-single-conc-1000-q-128](https://user-images.githubusercontent.com/19929/28474080-cab83e80-6e0c-11e7-8895-38e8d03d89ea.png)

New ddoc_cache:

![new-ddoc-cache-single-conc-1000-q-128](https://user-images.githubusercontent.com/19929/28474085-cf969618-6e0c-11e7-8b86-1f63781f8bf8.png)

The immediate thing to note here is that the old ddoc_cache flatlines halfway
into its test. This was because couch_log_server's message queue spiked under
the huge flood of fabric_rpc_worker timeout messages being logged. If you look
at the graph for the old ddoc_cache you can see that it has two cliffs, one at
60s and one at 120s, after which it flatlines because it pushed
couch_log_server off a cliff. These drops are precisely what motivated this
work in the first place: the old ddoc_cache would evict entries every 60s and
the thundering herd would knock things off their rocker.
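
To make the thundering herd concrete, here is a toy simulation of a single cache entry that is evicted on a fixed interval while many workers keep requesting it. It is only a sketch of the general failure mode, not the actual ddoc_cache code: the function name, the one-lookup-per-worker-per-second rate, and the 2 second reload time are all assumptions picked for illustration.

```python
def simulate_herd(workers, ttl_s, load_s, duration_s):
    """Return backend loads per simulated second for a TTL-evicted entry.

    Every worker issues one lookup per second. The entry is evicted ttl_s
    seconds after it was stored, a reload takes load_s seconds, and every
    lookup that arrives while the entry is absent triggers its own load
    (there is no request coalescing).
    """
    loads_per_second = []
    cached_until = -1       # entry is valid while t < cached_until
    reload_done_at = None   # when the in-flight reload finishes, if any
    for t in range(duration_s):
        if reload_done_at is not None and t >= reload_done_at:
            # The reload finished: store the entry and restart its TTL.
            cached_until = t + ttl_s
            reload_done_at = None
        if t < cached_until:
            loads_per_second.append(0)        # every lookup is a hit
        else:
            if reload_done_at is None:
                reload_done_at = t + load_s   # first miss starts a reload
            loads_per_second.append(workers)  # every worker misses at once
    return loads_per_second


loads = simulate_herd(workers=1000, ttl_s=60, load_s=2, duration_s=130)
spikes = [t for t, n in enumerate(loads) if n > 0]
print(spikes)  # [0, 1, 62, 63, 124, 125]
```

In this toy model the backend sees nothing for most of each minute and then all 1,000 workers at once right after every eviction, which is the same shape as the periodic cliffs in the old ddoc_cache graph.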

For the new ddoc_cache you'll see that it's still kinda crap performance even
though it doesn't totally lose its mind. In the background this was because
rexi was unable to keep up with the workload and spiked pretty badly. I'm gonna
be investigating that area for more optimizations once this work is wrapped up.

And finally, this last graph is a comparison between the old and new
ddoc_caches sweeping through Q=8, 16, and 32 with the same 1,000 workers
against a single db and ddoc. Red is old ddoc_cache, green is new ddoc_cache.
Yes, I know Tufte would kill me, but those are the colors that get picked and
I don't care enough to go fiddle with it.

![comparison-conc-100-q-8-16-32](https://user-images.githubusercontent.com/19929/28474305-bd14845e-6e0d-11e7-9e20-ae7615fa8772.png)

As you can see, the new ddoc_cache is consistently faster as well as much
less variable. Comparing the best Q=64 run for the old ddoc_cache against the
worst Q=64 run for the new ddoc_cache gives a good example of the variance of
the old approach. For these two runs I've also included what the rexi server
message queue is doing so we can see its effect on performance.

Old ddoc_cache:

![old-ddoc-cache-single-conc-1000-q-64](https://user-images.githubusercontent.com/19929/28475592-93cebc68-6e12-11e7-9ea9-4c6f4ad26c22.png)

![old-ddoc-cache-single-rexi-conc-1000-q-64](https://user-images.githubusercontent.com/19929/28475651-cdf43a44-6e12-11e7-9d0f-115f18cc075f.png)

New ddoc_cache:

![new-ddoc-cache-single-conc-1000-q-64](https://user-images.githubusercontent.com/19929/28475599-995e9400-6e12-11e7-898e-c88130ca4e46.png)

![new-ddoc-cache-single-rexi-conc-1000-q-64](https://user-images.githubusercontent.com/19929/28475656-d1680d54-6e12-11e7-9ab1-21922eb626c8.png)

Again we can see how bad that 60s eviction policy is when we have
sustained load against a single design document. This leads to some fairly
massive spikes in the system. For the old ddoc_cache, on the third eviction at
180s we see it flatline again, which is why those runs are so variable.

For the new ddoc_cache we can see that db3's rex has sustained elevated
message counts which are holding the benchmark back from meeting some of
the old ddoc_cache spikes when rexi had a chance to clear out. Now that I can
duplicate that rexi issue easily enough, though, I'll be working on trying to
figure out why it's being slow and try and optimize around it.

Hopefully this data is as convincing to everyone else as it is to me. If
anyone wants me to check into anything else, I certainly have the data and/or
can design runs for some other configuration if requested.
However, having poked both ends of the spectrum (lots of clients against
separate design docs and lots of clients against a single design doc), I'm
fairly confident that we're winning across the spectrum, though most
specifically for the single design doc case (which was the motivation for
this work).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



With regards,
Apache Git Services
