davisp commented on issue #610: Optimize ddoc cache
URL: https://github.com/apache/couchdb/pull/610#issuecomment-317068757

Performance Benchmarking Update: New one is faster!

And now for some graphs.

For each of these graphs max_dbs_open was set to 5,000. That's important to remember as I go through the parameter sets: if we hit that limit, it becomes an artificial cap on the performance of the cache. Each of these tests is also just querying an empty view, which leads to one ddoc_cache lookup for the coordinator and then Q lookups, one for each RPC worker. So the requests per second numbers have to be multiplied out to get actual cache performance. Granted, we really don't care given that it's a constant factor.

These first two sets of runs are 1,000 workers hitting 1,000 different design documents in 1,000 different databases (i.e., each worker using its own ddoc and db), with two different Q values. For the most part their ops/sec is basically identical, while latencies for the new ddoc_cache are slightly better. This suggests that something else was bottlenecking performance: possibly the basho_bench driver, possibly something else in CouchDB.

[graph: old ddoc_cache, Q=4]
[graph: new ddoc_cache, Q=4]
[graph: old ddoc_cache, Q=8]
[graph: new ddoc_cache, Q=8]

So basically, trying to hit a bunch of different databases at once now has slightly better latencies, but both are fast enough to not be the bottleneck.

Next up was a shoot-the-moon test to try and find the limits of what sort of performance we could get trying to crush a single entry. These two graphs show 1,000 workers hitting a single design document in a single database. The Q for both of these is 128, which means that the ddoc_cache is going to have to try and maintain 129,000 lookups/sec (1,000 workers × 129 lookups per request).

[graph: old ddoc_cache]
[graph: new ddoc_cache]

The immediate thing to note here is that the old ddoc_cache flatlines halfway into its test. This was because couch_log_server spiked its message queue.
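
For reference, here's a quick sketch of the lookup fan-out behind the 129,000 figure. The function names are made up for illustration, and it assumes every benchmark client keeps exactly one request outstanding, which only turns into a per-second rate if each request takes about a second.

```python
def lookups_per_request(q: int) -> int:
    """ddoc_cache lookups caused by one view request: 1 coordinator + q RPC workers."""
    return 1 + q


def lookups_in_flight(clients: int, q: int) -> int:
    """Lookups generated when every client has one request outstanding (assumption)."""
    return clients * lookups_per_request(q)


if __name__ == "__main__":
    # Q values taken from the runs described above; 1,000 clients as in the tests.
    for q in (4, 8, 128):
        print(f"Q={q}: {lookups_per_request(q)} lookups/request, "
              f"{lookups_in_flight(1000, q)} with 1,000 clients")
```

The multiplier only depends on Q, which is why it can be treated as a constant factor when comparing old and new at the same Q.
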
The reason couch_log_server's message queue spiked is the huge flood of fabric_rpc_worker timeout messages being logged. If you look at the graph for the old ddoc_cache you can see that it has two cliffs, one at 60s and one at 120s, after which it flatlines because it pushed couch_log_server off a cliff. These drops are precisely what motivated this work in the first place: the old ddoc_cache would evict entries every 60s and the thundering herd would knock things off their rocker.

For the new ddoc_cache you'll see that it's still kinda crap performance even though it doesn't totally lose its mind. In the background this was because rexi was unable to keep up with the workload and spiked pretty badly. I'm gonna be investigating that area for more optimizations once this work is wrapped up.

And finally, this last graph is a comparison between the old and new ddoc_caches sweeping through Q=8, 16, and 32 with the same 1,000 workers against a single db and ddoc. Red is old ddoc_cache, green is new ddoc_cache. Yes, I know Tufte would kill me, but those are the colors that get picked and I don't care enough to go fiddle with it.

[graph: old vs new ddoc_cache across Q values]

As you can see, the new ddoc_cache is consistently faster as well as much less variable. Comparing the best Q=64 run for the old ddoc_cache with the worst Q=64 run for the new ddoc_cache gives a good example of the variance of the old approach. For these two runs I've also included what the rexi server message queue is doing so we can see its effect on performance.

[graphs: old ddoc_cache, with rexi server message queue]
[graphs: new ddoc_cache, with rexi server message queue]

Again we can see how bad that 60s eviction policy is when we have sustained load against a single design document. This leads to some fairly massive spikes in the system. For the old ddoc_cache, on the third eviction at 180s we see it flatline again, which is why those runs are so variable.
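
To put rough numbers on those eviction cliffs, here's a toy count of backing-store loads when a single hot entry is dropped on a fixed interval. It's not a model of either ddoc_cache implementation: the function name, the 300s run length, and the "every waiting worker reloads on a miss" behaviour are all assumptions picked for illustration.

```python
def loads_per_run(workers: int, duration_s: int, ttl_s: int, coalesce: bool) -> int:
    """Count backing-store loads for one hot entry in a toy eviction model.

    Assumptions (hypothetical): the entry starts cold, it is dropped every
    ttl_s seconds, and every worker has a lookup in flight at each drop.
    """
    # One cold start at t=0, plus one eviction at each multiple of ttl_s
    # that falls before the end of the run.
    miss_windows = 1 + (duration_s - 1) // ttl_s
    # Without coalescing, each worker that misses triggers its own load.
    # With coalescing, a single load serves every waiting worker.
    loads_per_window = 1 if coalesce else workers
    return miss_windows * loads_per_window


if __name__ == "__main__":
    # 1,000 workers and the 60s interval come from the runs above; 300s is assumed.
    print("herd:     ", loads_per_run(1000, 300, 60, coalesce=False))
    print("coalesced:", loads_per_run(1000, 300, 60, coalesce=True))
```

The point is only the ratio: under sustained load against one entry, every eviction costs a burst proportional to the number of workers unless something collapses the misses into a single load.
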
For the new ddoc_cache we can see that db3's rexi has sustained elevated message counts, which are holding the benchmark back from matching some of the old ddoc_cache spikes when rexi had a chance to clear out. Now that I can duplicate that rexi issue easily enough, though, I'll be working on figuring out why it's being slow and trying to optimize around it.

Hopefully this data is as convincing to everyone else as it is to me. If anyone wants me to check into anything else, I certainly have the data and/or can design runs for some other configuration if requested. However, having poked both ends of the spectrum (lots of clients against separate design docs and lots of clients against a single design doc), I'm fairly confident that we're winning across the spectrum, most specifically for the single design doc case (which was the motivation for this work).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: [email protected]
With regards, Apache Git Services
