> yes, the DB query was in serial, after some investigation, it seems that we
> are unable to perform eventlet.monkey_patch in uWSGI mode, so
> Yikun made this fix:
>
> https://review.openstack.org/#/c/592285/

Cool, good catch :)

> After making this change, we tested again, and we got this kind of data:
>
>                        total    collect  sort    view
> before monkey_patch    13.5745  11.7012  1.1511  0.5966
> after monkey_patch     12.8367  10.5471  1.5642  0.6041
>
> The performance improved a little, and from the log we can see:

Since these all took ~1s when done in series, but now take ~10s in
parallel, I think you must be hitting some performance bottleneck in
either case, which is why the overall time barely changes. Some ideas:

1. In the real world, I think you really need to have 10x database
   servers or at least a DB server with plenty of cores loading from a
   very fast (or separate) disk in order to really ensure you're getting
   full parallelism of the DB work. However, because these queries all
   took ~1s in your serialized case, I expect this is not your problem.

2. What does the network look like between the api machine and the DB?

3. What do the memory and CPU usage of the api process look like while
   this is happening?

Related to #3, even though we issue the requests to the DB in parallel,
we still process the result of those calls in series in a single python
thread on the API. That means all the work of reading the data from the
socket, constructing the SQLA objects, turning those into nova objects,
etc, all happens serially. It could be that the DB query is really a
small part of the overall time and our serialized python handling of the
result is the slow part. If you see the api process pegging a single
core at 100% for ten seconds, I think that's likely what is happening.

> so, now the queries are in parallel, but the whole thing still seems
> serial.

In your table, you show the time for "1 cell, 1000 instances" as ~3s and
"10 cells, 1000 instances" as 10s. The problem with comparing those
directly is that in the latter, you're actually pulling 10,000 records
over the network, into memory, processing them, and then just returning
the first 1000 from the sort.
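
To put numbers on that, here is a toy sketch of the pattern. It is not
nova's actual code: list_instances, cells and sort_key are made-up names,
and the "instances" are plain integers. Each cell hands back up to limit
rows in sorted order, the per-cell results are merged, and only the first
limit rows are kept.

```python
import heapq
from itertools import islice


def list_instances(cells, limit, sort_key):
    """Toy multi-cell listing: returns (first `limit` rows, rows loaded)."""
    # Every cell contributes up to `limit` rows, sorted by `sort_key`,
    # because the first `limit` rows overall could all live in one cell.
    per_cell = [sorted(rows, key=sort_key)[:limit] for rows in cells]
    loaded = sum(len(rows) for rows in per_cell)
    # Merge the pre-sorted per-cell results and keep only the first `limit`.
    merged = list(islice(heapq.merge(*per_cell, key=sort_key), limit))
    return merged, loaded


# 10 cells holding 1000 fake instances each (integers stand in for records).
cells = [list(range(c, 10000, 10)) for c in range(10)]
result, loaded = list_instances(cells, limit=1000, sort_key=lambda x: x)
print(len(result), loaded)
```

With limit=1000 and 10 cells, the sketch loads 10,000 rows in order to
return 1000, which is the extra work the "10 cells, 1000 instances" number
is paying for.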

A closer comparison would be the "10 cells, 100 instances" with "1 cell,
1000 instances". In both of those cases, you pull 1000 instances total
from the db, into memory, and return 1000 from the sort. In that case,
the multi-cell situation is faster (~2.3s vs. ~3.1s). You could also
compare the "10 cells, 1000 instances" case to "1 cell, 10,000
instances" just to confirm at the larger scale that it's better or at
least the same.

We _have_ to pull $limit instances from each cell, in case (according to
the sort key) the first $limit instances are all in one cell. We _could_
try to batch the results from each cell to avoid loading so many that we
don't need, but we punted this as an optimization to be done later. I'm
not sure it's really worth the complexity at this point, but it's
something we could investigate.

--Dan

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev