So, I got access to a bunch of fast machines through Yandex. Big kudos to them. It allowed me to continue working on dpb optimizations for fast clusters, after the tantalizing glimpse into big clusters I got a few months ago thanks to an experiment led by Florian Obser.
First remark is that we don't really scale all that well on a lot of cpus on the same machine (duh). It's probably faster to build things on a cluster with 5 to 8 machines with 4 cores each than on 3 machines with 16 cores each... It looks like you really really want to disable hyper-threading. In my test, disabling it amounts to a 20% performance increase. Having memory helps... I guess the turning point is somewhere around 20GB-50GB per box... around there, you can build all ports in memory, and also have other interesting parts in tmpfs as well, such as /usr/local. It doesn't help as much as I would have guessed. I've experimented with storing packages locally. Doesn't help all that much. Building things directly on the NFS server doesn't hurt.
One thing that does hurt, though, is computing dependencies while building packages. I have a set of patches to enable a "global depends cache" that seems to shave an extra minute or so per package (to divide by 12 or so, since this is wall-clock time on a single cpu). I've already committed the src/ part (tweaks to PkgCreate) and the rest will go in when ports unlock. Specifically, it just requires creating cache entries in tmp files and renaming them to their final destination, so that several packages may be built simultaneously without stepping on each other's toes. I also discovered that thanks to some buggy code reorganization of mine, I had inadvertently nullified an important optimization (BUILD_ONCE) I implemented a few years ago... which was fairly easy to restore, fortunately.
Another interesting improvement was smarter scheduling of available cores. The initial dpb code is very crude, and just grabs the first core available to run things. If all machines fire up simultaneously, this ends up in a "thundering herd" stampede, as 12 jobs are started on the first host, then 12 on the next one, etc... most specifically, the LISTING job starts up slow, as it competes with another 11 jobs almost right away... and dpb tends to empty its whole queue right away when you've got over 40 cpus to play with... faster LISTING == bigger queue == greater chance of full cpu utilisation.
One thing I haven't solved yet is that apparently, ssh in master mode tends to refuse shared connections when you go to 16 jobs per machine... I haven't investigated, so I'm not sure whether this is a limitation of ssh, or some machine limit I haven't found. (important note: dpb with lots of cores gobbles resources like crazy... you want to seriously crank up the limits on number of processes, number of file descriptors, and memory usage).
With all this, the biggest problem left is that dpb will lose a lot of time in "waiting-for-lock" states... I've done a quick patch that helps a common case: when a job is run in parallel, it would release lots of cores (maxcores/2) at the end of packaging, and hence... lots of waiting-for-lock. The job can actually regurgitate those cores at the end of fake... this still leads to loads of waiting-for-locks, but those will hopefully be solved by the end of packaging. I'm currently experimenting with further tricky patches to make jobs in depends aware of other jobs waiting for the same lock on the same machine. This entails trying to wake other jobs in order, trying to solve depends for several jobs at once, and preventing junk from removing depends for stuff that's already been analyzed, but not solved yet. These patches are somewhat necessary: the code is rather complicated, but it yields some impressive performance benefits: without them, the end of a normal bulk wastes over 2 hours with most cores being in "waiting-for-lock" state.
(colors for dpb -DCOLOR mode will also be adjusted after unlock... turns out yellow-over-red is very hard to read (duh) when you've got to shrink the font to fit it all on a single screen, and it pays to distinguish waiting-for-lock states from other frozen states). I hope to have all this in decent enough shape by the time we unlock ports.
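For readers who haven't met the tmp-file-then-rename trick mentioned for the global depends cache, here is a minimal sketch of the idea. It is not the PkgCreate code (which is Perl); it is a Python illustration, and the function names, cache layout, and entry contents are all made up for the example. The only thing it relies on is that renaming a file within one filesystem is atomic, so a concurrent reader sees either the old entry, the new entry, or nothing, never a half-written file.

```python
import os
import tempfile


def write_cache_entry(cache_dir, key, data):
    """Atomically publish `data` as cache_dir/key (hypothetical layout).

    `key` is assumed to be a plain file name without slashes.
    """
    os.makedirs(cache_dir, exist_ok=True)
    final_path = os.path.join(cache_dir, key)
    # Create the temp file in the same directory, so the rename below
    # never crosses a filesystem boundary (which would make it non-atomic).
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir, prefix=key + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(data)
        # Atomic replace: concurrent writers simply overwrite each other
        # with complete files, last one wins.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Don't leave stray temp files behind if anything went wrong.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    return final_path


def read_cache_entry(cache_dir, key):
    """Return the cached text, or None if no entry has been published yet."""
    try:
        with open(os.path.join(cache_dir, key)) as f:
            return f.read()
    except FileNotFoundError:
        return None
```

Two builds that compute the same entry at the same time just do redundant work; neither can corrupt what the other reads, which is all that's needed for several packages to be built simultaneously against a shared cache.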
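To make the "thundering herd" at startup concrete, the sketch below contrasts the crude "grab the first core available" order with one possible alternative, handing out cores round-robin across hosts. The round-robin policy is only an assumption about what a smarter order could look like, not a description of what dpb actually does, and the host names and function names are invented for the illustration.

```python
from collections import Counter


def first_available(hosts, cores_per_host, njobs):
    """Crude order: fill every core of the first host, then the next host."""
    slots = [(host, core) for host in hosts for core in range(cores_per_host)]
    return slots[:njobs]


def spread_across_hosts(hosts, cores_per_host, njobs):
    """Assumed alternative: take one core per host on each pass."""
    slots = [(host, core) for core in range(cores_per_host) for host in hosts]
    return slots[:njobs]


def jobs_per_host(assignment):
    """Count how many jobs landed on each host."""
    return Counter(host for host, _core in assignment)


if __name__ == "__main__":
    hosts = ["host1", "host2", "host3"]  # hypothetical cluster
    for policy in (first_available, spread_across_hosts):
        print(policy.__name__, dict(jobs_per_host(policy(hosts, 12, 12))))
```

With three 12-core hosts and 12 jobs to start, the crude order puts all 12 jobs on the first host, while the round-robin order puts 4 on each, so an early job such as LISTING has far fewer neighbours to compete with.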
All in all, after a few weeks of tweaks and fun, I've come up with impressive speed-ups... That setup went down from 21 hours to 14 hours and a half for a full bulk. (as for particulars, we're talking three Xeon E5 machines with 128G of ram each, 16 cpus per machine, hyper-threading disabled... 12 cpus actually useful... running -j16 didn't help... parallel set to 4 seems to yield better results than the default 6 in that case...) Of course, the form factor of the cluster is very important. A lot of these problems don't even show up on smaller clusters, so it would have been impossible to achieve without the donation. So thanks again, especially to Anton Karpov.
