So, I got access to a bunch of fast machines through Yandex. Big kudos
to them.  It allowed me to continue working on dpb optimizations for fast
clusters, after the tantalizing glimpse into big clusters I got a few months
ago thanks to an experiment led by Florian Obser.

The first remark is that we don't really scale all that well on a lot of cpus
on the same machine (duh). It's probably faster to build things on a
cluster of 5~8 machines with 4 cores each than on 3 machines with 16
cores each...

It looks like you really, really want to disable hyper-threading. In my
tests, it amounts to a +20% performance increase.

Having memory helps... I guess the turning point is somewhere around
20GB-50GB per box... around there, you can build all ports in memory, and
also keep other interesting parts in tmpfs as well, such as /usr/local.
It doesn't help as much as I would have guessed.
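
For reference, that's just a couple of fstab lines; the sizes and paths
below are made-up examples, not my exact setup, so adjust to whatever your
boxes can afford:

    # illustrative only: build area and /usr/local in tmpfs
    swap /usr/ports/pobj tmpfs rw,nodev,nosuid,-s=32G 0 0
    swap /usr/local tmpfs rw,-s=8G 0 0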

I've experimented with storing packages locally. Doesn't help all that much.
Building things directly on the NFS server doesn't hurt.

One thing that does hurt, though, is computing dependencies while building
packages. I have a set of patches to enable a "global depends cache" that
seems to shave an extra minute or so per package (divide that by 12 or so,
since this is wall-clock time on a single cpu).  I've already committed
the src/ part (tweaks to PkgCreate) and the rest will go in when ports
unlock. Specifically, it just requires creating cache entries in tmp files
and renaming them to their final destination, so that several packages may
be built simultaneously without stepping on each other's toes.
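
The trick is nothing fancy: write the entry to a temporary file, then
rename(2) it into place. A rough sketch of the idea in perl (made-up names,
not the actual PkgCreate/dpb code):

    use File::Temp;

    sub write_cache_entry
    {
        my ($cachedir, $key, $contents) = @_;
        # write the entry to a temporary file in the same directory...
        my $tmp = File::Temp->new(DIR => $cachedir, UNLINK => 0);
        print $tmp $contents;
        close $tmp;
        # ...then rename it into place: since the rename is atomic,
        # concurrent builders see either a complete entry or no entry
        # at all, never a partial one.
        rename($tmp->filename, "$cachedir/$key") or die "rename: $!";
    }

Keeping the temporary file in the same directory (hence the same
filesystem) is what makes the final rename atomic.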

I also discovered that, thanks to a buggy code reorganization of mine,
I had inadvertently nullified an important optimization (BUILD_ONCE) I
implemented a few years ago... which was fairly easy to restore, fortunately.

Another interesting improvement was smarter scheduling of available cores.
The initial dpb code is very crude, and just grabs the first core available
to run things. If all machines fire up simultaneously, this ends up as
a "thundering herd" stampede, as 12 jobs are started on the first host,
then 12 on the next one, etc...   most importantly, the LISTING job starts
up slowly, as it competes with another 11 jobs almost right away... and
dpb tends to empty its whole queue right away when you've got over 40 cpus
to play with... faster LISTING == bigger queue == greater chance of full
cpu utilisation.
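
One simple way to spread the starts around (a sketch of the general idea,
not necessarily the exact heuristic dpb ends up using) is to hand out cores
from the host currently running the fewest jobs, rather than from the first
host with a free core:

    # $hosts is a hashref: hostname => { free => [cores...], busy => count }
    sub pick_core
    {
        my ($hosts) = @_;
        # among hosts that still have a free core, prefer the least loaded
        my ($best) = sort { $hosts->{$a}{busy} <=> $hosts->{$b}{busy} }
            grep { @{$hosts->{$_}{free}} > 0 } keys %$hosts;
        return undef unless defined $best;
        $hosts->{$best}{busy}++;
        return shift @{$hosts->{$best}{free}};
    }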

One thing I haven't solved yet is that apparently, ssh in master mode tends
to refuse shared connections when you go to 16 jobs per machine... I haven't
investigated, not sure whether this is a limitation of ssh, or some machine
limit I haven't found.
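
One plausible suspect I haven't verified against that setup: sshd's
MaxSessions limit defaults to 10 sessions per network connection, which is
exactly the kind of ceiling a master connection multiplexing 16 jobs would
bump into. Raising it on the build hosts would be the first thing to try:

    # /etc/ssh/sshd_config on the build machines (untested guess)
    MaxSessions 32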

(important note:  dpb with lots of cores gobbles resources like crazy...
you want to seriously crank up the process, file descriptor, and memory
limits).
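
On OpenBSD that mostly means bumping the limits of the login class dpb and
the build users run under; something like this in login.conf (class name
and numbers are guesses, tune to taste):

    pbuild:\
        :maxproc-max=1024:\
        :maxproc-cur=512:\
        :openfiles-max=4096:\
        :openfiles-cur=2048:\
        :datasize-max=infinity:\
        :datasize-cur=8192M:\
        :tc=default: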

With all this, the biggest offender left is that dpb will lose a lot of
time in "waiting-for-lock" states...  I've done a quick patch that helps
a common case: when a job runs in parallel, it would only release its extra
cores (maxcores/2) at the end of packaging, and hence... lots of
waiting-for-lock.

The job can actually regurgitate those cores at the end of fake... this still
leads to loads of waiting-for-locks, but those will hopefully be solved by
the end of packaging.
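
In other words, give the extra cores back when fake finishes instead of
sitting on them until the package is built; roughly this shape (made-up
names, not the actual dpb code):

    sub end_of_fake
    {
        my ($job, $scheduler) = @_;
        # keep one core for the remaining, mostly sequential, packaging
        # work and hand the others back right away
        my @extra = splice(@{$job->{cores}}, 1);
        $scheduler->release($_) for @extra;
    }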


I'm currently experimenting with further tricky patches to make jobs in
depends aware of other jobs waiting for the same lock on the same machine.
This entails trying to wake other jobs in order, trying to solve
depends for several jobs at once, and preventing junk from removing depends
for stuff that has already been analyzed but not yet solved.  These patches
are somewhat necessary: the code is rather complicated, but it yields some
impressive performance benefits: without them, the end of a normal bulk
wastes over 2 hours with most cores sitting in the "waiting-for-lock" state.

(colors for dpb -DCOLOR mode will also be adjusted after unlock... turns out
yellow-over-red is very hard to read (duh) when you've got to shrink the font
to fit it all on a single screen, and it pays to distinguish waiting-for-lock
states from frozen states).

I hope to have all this in decent enough shape by the time we unlock ports.


All in all, after a few weeks of tweaks and fun, I've come up with impressive
speed-ups...   That setup went down from 21 hours to 14 and a half hours for
a full bulk.

(as for particulars, we're talking three Xeon E5 machines with 128G of
ram each, 16 cpus per machine, hyper-threading disabled... 12 cpus actually
useful... running -j16 didn't help...  parallel set to 4 seems to yield
better results than the default 6 in that case...)


Of course, the form factor of the cluster is very important. A lot of
these problems don't even show up on smaller clusters, so this would have
been impossible to achieve without the donation.   So thanks again,
especially to Anton Karpov.
