davisp commented on issue #610: Optimize ddoc cache
URL: https://github.com/apache/couchdb/pull/610#issuecomment-317068757
 
 
   Performance Benchmarking Update: New one is faster!
   
   And now for some graphs. For each of these graphs max_dbs_open was set 
to 5,000. That's important to remember as I go through the parameter sets, 
because hitting that limit would put an artificial cap on the performance of 
the cache. Each of these tests is also just querying an empty view, which 
leads to one ddoc_cache lookup for the coordinator and then Q lookups, one 
for each RPC worker. So the requests per second numbers have to be multiplied 
out to get actual cache performance. Granted, we really don't care given that 
it's a constant factor.
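   
   To make that multiplication concrete, here is a minimal sketch in Python. 
The function names and the 1,000 requests/sec figure are made up for 
illustration and are not taken from the benchmark output; the only thing 
taken from the description above is one lookup for the coordinator plus one 
per RPC worker.

```python
def lookups_per_request(q):
    """One coordinator lookup plus one lookup per RPC worker (Q of them)."""
    return 1 + q


def cache_lookups_per_sec(requests_per_sec, q):
    """Scale a measured request rate up to ddoc_cache lookups per second."""
    return requests_per_sec * lookups_per_request(q)


# Hypothetical request rate of 1,000 req/sec, for illustration only.
for q in (4, 8, 128):
    print(q, lookups_per_request(q), cache_lookups_per_sec(1000, q))
```

   Since the factor only depends on Q, comparing old vs new at the same Q on 
raw requests per second is still a fair comparison.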
   
   These first two runs are 1,000 workers hitting 1,000 different design 
documents in 1,000 different databases with two different Q values (i.e., 1,000 
workers using their own ddoc and db). For the most part their ops/sec is 
basically identical while latencies for the new ddoc_cache are slightly better. 
This suggests that something else was bottlenecking performance. Possibly the 
basho_bench driver and possibly something else in CouchDB.
   
   Old ddoc_cache Q=4:
   
   
![old-ddoc-cache-multi-conc-1000-q-4](https://user-images.githubusercontent.com/19929/28473937-484ccfe2-6e0c-11e7-9b9b-bbb75c122d79.png)
   
   New ddoc_cache Q=4:
   
   
![new-ddoc-cache-multi-conc-1000-q-4](https://user-images.githubusercontent.com/19929/28473949-51a22c4a-6e0c-11e7-8ec2-5da1a9aca6ca.png)
   
   Old ddoc_cache Q=8:
   
   
![old-ddoc-cache-multi-1000-q-8](https://user-images.githubusercontent.com/19929/28473962-59e8c404-6e0c-11e7-89de-52cdbab4c8ab.png)
   
   New ddoc_cache Q=8:
   
   
![new-ddoc-cache-multi-1000-q-8](https://user-images.githubusercontent.com/19929/28473982-64b34526-6e0c-11e7-8310-9f8f1e611114.png)
   
   So basically, trying to hit a bunch of different databases at once now has 
slightly better latencies but both are fast enough to not be the bottleneck.
   
   Next up was a shoot-the-moon test to try and find the limits of what sort 
of performance we could get trying to crush a single entry. These two graphs 
show 1,000 workers hitting a single design document in a single database. The 
Q for both of these is 128, which means each request costs 129 lookups (one 
for the coordinator plus 128 for the RPC workers), so the ddoc_cache is going 
to have to try and maintain 129,000 lookups/sec if every worker completes one 
request per second.
   
   Old ddoc_cache:
   
   
![old-ddoc-cache-single-conc-1000-q-128](https://user-images.githubusercontent.com/19929/28474080-cab83e80-6e0c-11e7-8895-38e8d03d89ea.png)
   
   New ddoc_cache:
   
   
![new-ddoc-cache-single-conc-1000-q-128](https://user-images.githubusercontent.com/19929/28474085-cf969618-6e0c-11e7-8b86-1f63781f8bf8.png)
   
   The immediate thing to note here is that the old ddoc_cache flat lines 
halfway into its test. This was because couch_log_server spiked its message 
queue under the huge flood of fabric_rpc_worker timeout messages being logged. 
If you look at the graph for the old ddoc_cache you can see that it has two 
cliffs, one at 60s and one at 120s, at which point it flat lines because it 
pushed couch_log_server off a cliff. These drops are precisely what motivated 
this work in the first place: the old ddoc_cache would evict entries every 60s 
and the thundering herd would knock things off their rocker.
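   
   For anyone who hasn't run into the thundering herd before, here is a toy 
model in Python. It is not how either ddoc_cache is implemented. It assumes 
every worker looks up the same key once per second at the same instant, and 
that every worker that sees a miss starts its own load. The refresh variant 
is just one generic way to avoid the herd (a single loader refreshing the 
entry in place) and is an assumption for illustration, not a description of 
the new code.

```python
def backend_loads_with_eviction(workers, duration_s, ttl_s):
    """Count backend loads when the entry is dropped every ttl_s seconds."""
    loads = 0
    cached_at = None  # time the entry was last loaded, None if absent
    for now in range(duration_s):
        if cached_at is not None and now - cached_at >= ttl_s:
            cached_at = None  # periodic eviction
        if cached_at is None:
            loads += workers  # every worker misses and loads on its own
            cached_at = now
    return loads


def backend_loads_with_refresh(duration_s, refresh_s):
    """Count backend loads when a single loader refreshes the entry in place."""
    loads = 0
    refreshed_at = None
    for now in range(duration_s):
        if refreshed_at is None or now - refreshed_at >= refresh_s:
            loads += 1  # one load, readers keep hitting the cached entry
            refreshed_at = now
    return loads


print(backend_loads_with_eviction(1000, 180, 60))  # 3000
print(backend_loads_with_refresh(180, 60))         # 3
```

   In this toy model the loads under eviction arrive in bursts of 1,000 at 
0s, 60s, and 120s, which is the same shape as the cliffs in the old graph.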
   
   For the new ddoc_cache you'll see that it's still kinda crap performance 
even though it doesn't totally lose its mind. In the background this was 
because rexi was unable to keep up with the workload and its message queue 
spiked pretty badly. I'm gonna be investigating that area for more 
optimizations once this work is wrapped up.
   
   And finally, this last graph is a comparison between the old and new 
ddoc_caches sweeping through Q=8, 16, and 32 with the same 1,000 workers 
against a single db and ddoc. Red is old ddoc_cache, green is new ddoc_cache. 
Yes, I know Tufte would kill me, but those are the colors that got picked and 
I don't care enough to go fiddle with it.
   
   
![comparison-conc-100-q-8-16-32](https://user-images.githubusercontent.com/19929/28474305-bd14845e-6e0d-11e7-9e20-ae7615fa8772.png)
   
   As you can see, the new ddoc_cache is consistently faster as well as much 
less variable. Comparing the best Q=64 run for the old ddoc_cache against the 
worst Q=64 run for the new ddoc_cache gives a good example of the variance of 
the old approach. For these two runs I've also included what the rexi server 
message queue is doing so we can see its effect on performance.
   
   Old ddoc_cache:
   
   
![old-ddoc-cache-single-conc-1000-q-64](https://user-images.githubusercontent.com/19929/28475592-93cebc68-6e12-11e7-9ea9-4c6f4ad26c22.png)
   
   
![old-ddoc-cache-single-rexi-conc-1000-q-64](https://user-images.githubusercontent.com/19929/28475651-cdf43a44-6e12-11e7-9d0f-115f18cc075f.png)
   
   New ddoc_cache:
   
   
![new-ddoc-cache-single-conc-1000-q-64](https://user-images.githubusercontent.com/19929/28475599-995e9400-6e12-11e7-898e-c88130ca4e46.png)
   
   
![new-ddoc-cache-single-rexi-conc-1000-q-64](https://user-images.githubusercontent.com/19929/28475656-d1680d54-6e12-11e7-9ab1-21922eb626c8.png)
   
   Again we can see how bad that 60s eviction policy is when we have 
sustained load against a single design document. This leads to some fairly 
massive spikes in the system. For the old ddoc_cache, on the third eviction 
at 180s we see it flat line again, which is why those runs are so variable.
   
   For the new ddoc_cache we can see that db3's rexi has sustained elevated 
message counts which are holding the benchmark back from matching some of 
the old ddoc_cache spikes when rexi had a chance to clear out. Now that I can 
reproduce that rexi issue easily enough, though, I'll be working on trying to 
figure out why it's being slow and try to optimize around it.
   
   Hopefully this data is as convincing to everyone else as it is to me. If 
anyone wants me to check into anything else, I certainly have the data and/or 
can design runs for some other configuration if requested. However, having 
poked both ends of the spectrum (lots of clients against separate design docs 
and lots of clients against a single design doc), I'm fairly confident that 
we're winning across the spectrum, though most specifically for the single 
design doc case (which was the motivation for this work).
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
