mlsorensen opened a new pull request, #8012:
URL: https://github.com/apache/cloudstack/pull/8012

   ### Description
   
   This PR addresses an issue found while scale testing CloudStack's 
`listVirtualMachines` API. It was observed that as the VM record count grows, 
the list APIs take progressively more time, even using paging to a smaller 
result set. With as little as 50k records the API starts to become noticeably 
slow, and by 100k records it becomes almost impossible to page through the VMs 
in the system in any reasonable amount of time. 100k records is not much in 
terms of MySQL, so I began trying to find the cause.
   
   <img width="581" alt="Screenshot 2023-09-27 at 12 45 55 PM" 
src="https://github.com/apache/cloudstack/assets/1047709/d19e2bd7-cbf6-4f21-b78b-d316e1643f5b";>
   
   It seems to mostly affect large result sets, but in some cases also affects 
queries with small result sets that require searches (like keyword search).
   
   The UserVm view is joined to many other tables, and left joined so we get 
multiplication of rows.  In order to page, the full potential result set needs 
to be generated, and then the relevant page in the result set is selected. The 
current code fetches the relevant Vm IDs for the page from the view, then calls 
the view again to generate the API response for these VM IDs. 
   
   I was unable to find any meaningful way to increase performance of the 
UserVM view, maybe I missed something.  However, I found that the UserVM table 
itself is much faster. As an alternative, I decided to modify the 
listVirtualMachines query to avoid the view and just join tables as necessary 
based on the input params.  This works because the search to fetch the page 
only needs the IDs, so no need to search the view.
   
   The result is anywhere from a **2x to 8x improvement in list call times**.
   
   One aspect of this PR is a change to how joins are processed. I added unit 
tests to ensure the existing behavior was unmodified, but I needed to add the 
ability to join the same table twice and generate an SQL table alias to do so. 
This was relevant for the performance of SSH key search, as with older VMs the 
only way to perform this search is via the value of user vm details, there is 
no direct map from VM to SSH key. To optimize this search a bit I added an 
additional join to narrow the scope by account, but to do so I needed to join 
the same table twice to other joined tables.
   
   Ok, **this code is ugly**. I don't think it's much worse than the existing 
code was, but it's a big method with a lot of string binding per the way 
`SearchBuilder/SearchCriteria` work. If there's a much better way to refactor 
this, I'm all ears, but without cloudstack having some sort of benchmark suite 
it's a significant effort to test and re-test this. I'd be happy to have some 
help.
   
   ### Types of changes
   
   - [ ] Breaking change (fix or feature that would cause existing 
functionality to change)
   - [ ] New feature (non-breaking change which adds functionality)
   - [ ] Bug fix (non-breaking change which fixes an issue)
   - [x] Enhancement (improves an existing feature and functionality)
   - [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
   
   ### Feature/Enhancement Scale or Bug Severity
   
   #### Feature/Enhancement Scale
   
   - [ ] Major
   - [x] Minor
   
   #### Bug Severity
   
   - [ ] BLOCKER
   - [ ] Critical
   - [ ] Major
   - [x] Minor
   - [ ] Trivial
   
   
   ### Screenshots (if appropriate):
   
   
   ### How Has This Been Tested?
   
   This is probably the most used API, so I'm very cautious about trying to 
make a change here where I'm not super familiar with the code. There are a 
number of tests that cover listing virtual machines, but I needed an explicit 
comparison to old vs new.
   
   In testing I focused on regressions, assuming the existing state is correct. 
If there is a bug in what listVirtualMachines returns, it is replicated.
   
   I created a scale environment with two management servers, and a separate 
database server. One mgmt server was patched with this PR, and the other is 
unpatched. I then ran through varieties of listVirtualMachines calls and 
compared the result output to see if they were a perfect match between the 
patched and unpatched, as well as collected timing.
   
   I tested some combo queries (two or more parameters) to see if multiple 
joins were behaving properly.
   <img width="891" alt="Screenshot 2023-09-27 at 12 44 39 PM" 
src="https://github.com/apache/cloudstack/assets/1047709/3583fd2d-f0a1-470a-bfe3-6d12a5c32534";>
   
   <img width="886" alt="Screenshot 2023-09-27 at 12 44 29 PM" 
src="https://github.com/apache/cloudstack/assets/1047709/14d14526-680c-4fad-93de-60154460b9d5";>
   
   I also tested with basic user API keys, mostly focusing on exercising that 
they don't see anything they shouldn't, or nothing has changed. The result sets 
are smaller, but there is still a noticeable performance boost.
   
   <img width="884" alt="Screenshot 2023-09-27 at 12 44 57 PM" 
src="https://github.com/apache/cloudstack/assets/1047709/8c662ab3-1400-4f9c-8443-bfe059e3002e";>
   <img width="895" alt="Screenshot 2023-09-27 at 12 44 48 PM" 
src="https://github.com/apache/cloudstack/assets/1047709/ff6144f9-6070-4c55-9a31-6cc33084667c";>
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to