Recently I worked on some performance-related issues and noticed a pattern in
the code that led to increased latency for some APIs in a scaled-up
environment (> 10K user VMs, > 10K hosts). The pattern looks like this:
    List<Host> hosts = listHosts(); // based on some filter
    for (Host h : hosts) {
        // do some processing
    }
You can replace Host with other entities such as user VMs. Functionally there
is nothing wrong with this, and for smaller environments it works perfectly
fine. But as the size of the deployment grows the effect becomes much more
visible, because the loop runs once per entity. Think about the entities being
operated upon and how their number grows with the size of the environment. In
these scenarios looping should be avoided to the extent possible by offloading
the computation to the database. If required, modify the database schema to
handle such scenarios.
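To illustrate offloading the computation to the database, here is a minimal
sketch using plain JDBC. It assumes a hypothetical table host(id, cluster_id,
status); the class, method and column names are made up for illustration and
are not taken from the actual code base. Instead of fetching every host and
counting in a Java loop, a single aggregate query returns one row per cluster.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;

class HostCounts {
    // Hypothetical schema: host(id, cluster_id, status).
    // The database does the filtering and counting; only the
    // per-cluster totals travel back to the application.
    static final String COUNT_HOSTS_PER_CLUSTER =
        "SELECT cluster_id, COUNT(*) AS host_count "
      + "FROM host WHERE status = ? GROUP BY cluster_id";

    // Returns a map of cluster id to the number of hosts in the given status.
    static Map<Long, Long> countHostsPerCluster(Connection conn, String status)
            throws SQLException {
        Map<Long, Long> counts = new HashMap<>();
        try (PreparedStatement ps = conn.prepareStatement(COUNT_HOSTS_PER_CLUSTER)) {
            ps.setString(1, status);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    counts.put(rs.getLong("cluster_id"), rs.getLong("host_count"));
                }
            }
        }
        return counts;
    }
}
```

With an index covering the filter column, the work stays inside the database
and the result size depends on the number of clusters rather than the number
of hosts.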
Another aspect is the various synchronisation constructs present in the code.
It is best if these can be avoided, but there are situations where they need to
be used. They should be used carefully, and only the minimal amount of code
should be guarded by them; otherwise they can kill performance in a scaled-up
environment. For example, I came across a scenario like the one below in the
code:
    lock() {
        // read something from db
        // check the state and based on that do some update
        // update such that it is synchronised
    }
In the above logic all threads wait on the lock, irrespective of whether an
update is needed or not. It can be optimised as below, so that the lock is
taken only when an update looks necessary:
    // read from db
    // check the state
    if (updateRequired) {
        lock() {
            // again read to ensure state not changed since last read
            if (updateRequired) {
                // do actual update
            }
        }
    }
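The check, lock, re-check sequence above can be sketched in Java as follows.
This is only an illustration under assumptions: a volatile field stands in for
the state read from the db, an in-process ReentrantLock stands in for lock(),
and the names GuardedUpdater, refreshIfNeeded and runDemo are hypothetical. If
the real lock spans several processes, the same shape applies but the lock
primitive would differ.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

class GuardedUpdater {
    private final ReentrantLock lock = new ReentrantLock();
    // Stands in for state read from the db; volatile so the
    // unlocked read observes the most recent write.
    private volatile String state = "Stale";
    // Counts how many times the guarded update actually ran.
    final AtomicInteger updates = new AtomicInteger();

    private boolean updateRequired() {
        return "Stale".equals(state);
    }

    void refreshIfNeeded() {
        // First check without the lock: callers that need no update never wait.
        if (!updateRequired()) {
            return;
        }
        lock.lock();
        try {
            // Re-check under the lock: another thread may have updated already.
            if (updateRequired()) {
                state = "Fresh";
                updates.incrementAndGet();
            }
        } finally {
            lock.unlock();
        }
    }

    String state() {
        return state;
    }

    // Runs refreshIfNeeded from several threads and returns the update count.
    static int runDemo(int threadCount) {
        GuardedUpdater updater = new GuardedUpdater();
        Thread[] workers = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            workers[i] = new Thread(updater::refreshIfNeeded);
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return updater.updates.get();
    }
}
```

However many threads call refreshIfNeeded, the update runs once; threads that
arrive after it has completed return without touching the lock.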
These are simple things to watch out for when adding new code or working on
bugs. Also feel free to raise bugs or fix them if you come across code that can
cause latency.
Thanks,
Koushik