> On 30 May 2017, at 09:15, Tony Sarajärvi <[email protected]> wrote:
>
> Hi all!
>
> I was due to write you something about the CI. This time I’ll cover performance as well as the upcoming hardware and software changes.
>
> Performance:
>
> You have all noticed that the CI system has been performing poorly. Sometimes autotests take over 30 times longer to run than they would in a normal situation. Why is that?
>
> We actually have several kinds of bottlenecks. One is the bandwidth with which the virtual machines (VMs) store data to their virtual hard drives. The servers our VMs run on have no local hard drives. They instead store all their data on a centralized storage system called the Compellent (https://en.wikipedia.org/wiki/Dell_Compellent). So when a VM wants to store data on its virtual hard drive, the host it runs on actually writes the data to the centralized storage.
>
> We have several generations of hardware installed, and they differ in the speed of the SAN interface that connects them to the Compellent. When your build picks up a server, it can be a new rack server or an older-generation Blade (https://en.wikipedia.org/wiki/Blade_server). All the other VMs on these servers also share the same bandwidth, so depending on what the other builds are doing, your SAN connection can be affected. Sadly, our attempt at prioritizing these VMs didn’t produce the expected results, and in fact didn’t change much at all.
>
> The generation and type of hardware also determine how many other VMs the hardware can run simultaneously. We have Mac minis running our macOS builds; those generally run 1 VM per physical Mac mini. Then we have old Blades that run around 4 VMs each. The latest additions to our hardware pool are dual-socket, 20-core rack servers. Those run up to 26 VMs simultaneously.
> Running more VMs on the same hardware reduces costs for us, but it also increases the odds of one build affecting another.
>
> Another bottleneck is the Compellent itself. The storage system has 120 hard drives (plus 10 spares) spinning at 15K RPM. However great the IOPS performance of that system is, when we decide to start 200+ VMs in the CI, it is brought to its knees. And when that happens, all builds and autotest runs are affected. You can think of this as having two computers at home sharing the same spinning disk.
>
> Now that all the grim and morbid stuff is covered, let’s continue with the good news.
>
> Upcoming hardware changes:
>
> We’re replacing the current hardware stack with a completely new one. The parts have arrived and are being installed as I type. Not only did we acquire new hardware that is faster, but we also redesigned the build concepts so that they utilize the hardware differently. The new hardware can easily be expanded, and we designed the system so that we don’t create bottlenecks even when expanding it.
>
> Before I go into details, I need to explain a bit about how the CI system generally works. So, heading off on a tangent here! When a developer stages a commit in Gerrit (codereview.qt-project.org), the CI picks it (or multiple commits) up. The CI, or Coin as the software is called, generates work items based on the data received. If, say, the commit was for QtDeclarative, Coin produces work items for building QtDeclarative on top of circa 30 different target platforms. Each of these work items depends on QtBase being built, so Coin also creates circa 30 work items for QtBase. As QtDeclarative is built on top of the current qt5.git’s QtBase, in normal situations QtBase has already been built previously. Those artifacts have been stored by Coin and can now be reused.
> So instead of rebuilding QtBase for QtDeclarative, Coin simply checks its storage, links the previous build to the current one, and promptly continues with building QtDeclarative. This is the major change in how we build Qt nowadays compared to the old days with Jenkins, where every build always rebuilt the entire stack up to that point.
>
> Continuing into more detail: whenever Coin starts to build something, it needs a VM for the task. We have “templates” in vSphere that represent different operating systems. They are virtual machines with an operating system installed and some basics set up, such as user accounts and SSHD, and they have then been shut down, ready to be used. When a build needs a VM, it clones a template VM and launches the clone. The clone is actually only a linked clone. This means we don’t really clone anything; we only create a new virtual machine that links, or points, to the original one. When the new clone is powered on, it _reads_ from the original template, but all changes are _written_ to its own file called the ‘delta image’. This way a new virtual machine only takes up space equal to the amount of data it has written.
>
> Going back to the template again: I said it only contained basic things like user accounts and SSHD. A build surely needs more than that. We need Visual Studio, MinGW, CMake, OpenSSH, Xcode, MySQL, and so on installed as well. Those things are ‘provisioned’. In qt5.git we have a folder structure under /coin/provisioning that contains scripts which install these things. As there is no point in running them every time for every VM, we create yet another set of templates that have these pre-installed. We call these TIER2 images (or templates), as opposed to TIER1 images, which are the vanilla distros containing only the basics that enable us to use them at all.
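The linked-clone idea described above can be illustrated with a small sketch. This is a toy model, not Coin's or vSphere's actual code: reads fall through to the shared base template, writes land in a per-clone delta, and a clone only occupies as much space as it has written.

```python
class LinkedClone:
    """Toy model of a linked clone: reads fall through to the shared
    base template, writes go to a private 'delta image'."""

    def __init__(self, base):
        self.base = base      # shared, read-only template blocks
        self.delta = {}       # this clone's own written blocks

    def read(self, block):
        # A block written by this clone shadows the template's copy.
        return self.delta.get(block, self.base.get(block))

    def write(self, block, data):
        # Writes never touch the template; they land in the delta image.
        self.delta[block] = data

    def space_used(self):
        # The clone only occupies space for data it has written itself.
        return len(self.delta)


template = {"mbr": b"boot", "etc/passwd": b"root:x:0:0"}
vm1 = LinkedClone(template)
vm2 = LinkedClone(template)

vm1.write("etc/hostname", b"builder-01")
print(vm1.read("etc/passwd"))   # falls through to the shared template
print(vm1.space_used())         # 1: only the block vm1 wrote
print(vm2.space_used())         # 0: an untouched clone costs nothing
```

Many clones can share one template this way, which is why cloning a TIER2 image is nearly free until the VM starts writing.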
>
> TIER2 images work pretty much the same way QtBase works as a dependency for QtDeclarative. Each build we trigger checks out the current provisioning scripts from qt5.git and computes a SHA over the folder structure. This SHA is used in naming the TIER2 image. If the content we want to install has changed, we have to generate a new TIER2 image. This is called provisioning, and it is triggered automatically whenever the requested TIER2 image doesn’t exist.
>
> Now, let’s get back on track and talk about the hardware changes.
>
> The new servers have local SSDs that serve as the storage for the VMs instead of a centralized storage system. This removes the bottleneck of the SAN network and reduces latencies while at it. And being SSDs, they are by design faster than the Compellent was with its rotating disks. We still have a Compellent, but this time it’s filled with SSDs.
>
> While the VMs use the hosts’ local SSDs to store data, reading is a more complicated matter. The TIER1 and TIER2 images described earlier are still stored centrally on the Compellent. This saves us from transferring the images to each server that hosts VMs. These TIER2 images are cloned as usual, and the read operations then point to the source. This would cause the same situation as with the old system, where everything is read from the Compellent, but here we are relying on caches to work in our favor. The TIER2 images are shared via NFS, and the host OS on each server is equipped with a 500 GB NFS cache. So whenever something is read from the TIER2 image, it is in fact read from the local NFS cache. All this obviously assumes the data has been read once before. In practice, if a TIER2 image gets updated, the data has to be read from the centralized storage once, and then it’s in the cache for the rest of the builds.
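The content-addressed naming of TIER2 images can be sketched as follows. This is a minimal illustration, not Coin's actual code; the hash function and naming scheme are assumptions. Every provisioning script's relative path and contents feed the digest, so any change to the scripts produces a new image name, and a missing image name is what triggers re-provisioning.

```python
import hashlib
from pathlib import Path

def provisioning_sha(root: Path) -> str:
    """Digest over a provisioning folder: every file's relative path and
    contents feed the hash, so renaming, editing, adding or removing a
    script all change the result."""
    h = hashlib.sha1()
    for path in sorted(root.rglob("*")):   # sorted => deterministic order
        if path.is_file():
            h.update(str(path.relative_to(root)).encode())
            h.update(path.read_bytes())
    return h.hexdigest()

def tier2_image_name(platform: str, root: Path) -> str:
    # Hypothetical naming scheme: platform plus the provisioning digest.
    return f"{platform}-{provisioning_sha(root)[:12]}"
```

If an image with the computed name already exists, it is simply reused; otherwise the provisioning scripts are run once and the result is stored as a new TIER2 template under that name.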
>
> We also have to remember that the entire TIER2 image is not read every time data is read. If a build requests openssh.so, only the blocks containing that file are read.
>
> We also need the Compellent to provide redundancy for critical systems and a huge data store for data that can’t be stored in a distributed fashion. Critical systems include our own infrastructure, and the storage is needed for all kinds of data, including our release packages, install packages, distro ISO images, and so on. So even if we had a good mechanism for distributing the entire TIER1 and TIER2 load to the servers themselves, there is currently no need for it, and the Compellent serves this purpose more than well right now.
>
> The new hardware infrastructure will include new switches and firewalls as well. All of this is being set up on new premises, so everything is new. Because of this, expect a few maintenance breaks during the upcoming months as services are handed over from one site to the other. The downtimes should be relatively short, since all data is transferred beforehand rather than during the breaks.
>
> Software changes:
>
> Currently Coin uses VMware’s vSphere technology to create and run VMs. That’s about to change: our new backbone will be based on OpenNebula (https://opennebula.org/). The swap to this new technology will happen at the same time we move to the new facility with the new hardware. We’ve been working hard to bring the robustness and reliability up to a level matching, or even exceeding, that of VMware’s products. With open-source, non-proprietary code we can dig into the root causes of problems and fix drivers if that’s what it takes to make our VMs run smoothly without hiccups. With OpenNebula being KVM-based, we can also expect new distro support to be available sooner. No longer do we need to fall back on saying that a new macOS can’t be installed because VMware doesn’t support it.
> Let’s hope I can live up to that promise 😉
>
> Performance-wise, the comparison between VMware and OpenNebula is a bit unfair, since they run on different underlying hardware, but by the looks of it we can say that builds aren’t going to get any slower.
>
> We’re also working on getting all of our distros provisioned by scripts. This will make it a lot easier for anyone (yes, this includes you) to upgrade the software that runs on the VMs. Anyone can access qt5.git/coin/provisioning and modify or add scripts there. Normal code review procedures apply, and the TIER2 images get updated.
>
> Internally, we’ve had three different Jenkins instances in the past. One was for CI and was replaced by Coin a year ago. Of the remaining two, one was for release package creation, Creator builds, and a few other things, and the other was for RTA, Release Test Automation, where we verified that the packages really install something, that the examples work, and so on. Those two Jenkins instances are planned to be merged into Coin at some point, but for the time being they will stay as they are. However, we’re improving the backend through which they receive their VMs. They currently compete with Coin for hardware resources. In the next few weeks this will be changed so that Coin creates these VMs. This removes the race condition between the two backends, and it also gives our Jenkins instances “support” for OpenNebula VMs. Even if this isn’t directly visible to you as CI users, it should show up as slightly more reliable VM allocation and more effective cleanup of VMs, and, at least from a technical perspective, we should be able to produce packages faster.
>
> I hope you found this interesting. If you have any questions, feel free to ask in public or in private. I’ll be happy to answer if I can 😊
>
> Have a great summer!
> -Tony
>
> Tony Sarajärvi
> CI Tech Lead
>
> _______________________________________________
> Development mailing list
> [email protected]
> http://lists.qt-project.org/mailman/listinfo/development
Hi,

Thank you very much for the detailed post! It’s pretty great to be able to understand what’s going on behind the "Merge Patch Set To Staging" button.

Cheers
Samuel
