On Thu, Aug 26, 2021 at 4:03 AM Ed W <li...@wildgooses.com> wrote:
> Hi All
> Consider this a tentative first email to test the water, but I have started
> to look at performance of particularly the install phase of the emerge
> utility and I could use some guidance on where to go next

To clarify: the 'install' phase installs the package into ${D}; the
'qmerge' phase is the one that merges it into the livefs.

> Firstly, to define the "problem": I have found gentoo to be a great base for
> building custom distributions and I use it to build a small embedded distro
> which runs on a couple of different architectures. (Essentially just a
> "ROOT=/something emerge $some_packages"). However, I use some packaging
> around binpackages to avoid unnecessary rebuilds, and this highlights that
> "building" a complete install using only binary packages rarely gets over a
> load of 1. Can we do better than this? Seems to be highly serialised on the
> install phase of copying the files to the disk?

In terms of parallelism, it's not safe to run multiple phase functions
simultaneously. This is a problem in theory and occasionally in
practice (recently discussed in #gentoo-dev).
The phase functions run arbitrary code that modifies the livefs (the
pre / post install and rm phases can touch $ROOT). As an example we
observed recently: font ebuilds generate font-related metadata. If two
ebuilds try to generate that metadata at the same time, they can race
and cause unexpected results. Sometimes this is caught in the ebuild
(e.g. they wrote code like rebuild_indexes || die and the indexer
returned non-zero), but when the race goes undetected it can simply
result in silent data corruption instead.
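
To make the race concrete, here is a minimal sketch of the read-modify-write
pattern that goes wrong when two hooks update a shared index at once, and of
how an exclusive lock serialises it. This is not how Portage or any font
eclass actually works; add_to_index, simulate and the fonts.index file name
are hypothetical names made up for the illustration.

```python
import fcntl
import os
import tempfile
import threading


def add_to_index(index_path, entry):
    """Append one entry to a shared index file while holding an exclusive lock."""
    lock_path = index_path + ".lock"
    with open(lock_path, "w") as lock_file:
        # Blocks until no other writer holds the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            # Read-modify-write: the step that races when it is not serialised.
            entries = []
            if os.path.exists(index_path):
                with open(index_path) as f:
                    entries = f.read().split()
            entries.append(entry)
            with open(index_path, "w") as f:
                f.write("\n".join(sorted(entries)) + "\n")
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def simulate(n_writers=8):
    """Run n_writers concurrent hypothetical hooks and return the final index."""
    with tempfile.TemporaryDirectory() as tmp:
        index_path = os.path.join(tmp, "fonts.index")
        threads = [
            threading.Thread(target=add_to_index, args=(index_path, f"font-{i}"))
            for i in range(n_writers)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        with open(index_path) as f:
            return f.read().split()


if __name__ == "__main__":
    print(simulate())
```

Without the flock calls, two writers can both read the old index and the
later write silently drops the earlier entry, which is the undetected kind of
corruption described above.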

> (Note I use parallel build and parallel-install flags, plus --jobs=N. If
> there is code to compile then load will shoot up, but simply installing
> binpackages struggles to get the load over about 0.7-1.1, so presumably
> single threaded in all parts?)
> Now, this is particularly noticeable where I cheated to build my arm install
> and just used qemu user-mode on an amd64 host (rather than using
> cross-compile). Here it's very noticeable that the install/merge phase of
> the build is consuming much/most of the install time.
> eg, random example (under qemu user mode)

I think perhaps a simpler test is to use qmerge (from portage-utils).
If you can use emerge (e.g. in --pretend mode) to generate a list of
packages to merge, you can simply merge them with qmerge. I suspect
qmerge will be both (a) faster and (b) less safe than emerge, as emerge
is doing a bunch of extra work you may or may not care about. You can
also consider running N qmerge instances in parallel (again, I'm less
sure how safe this is, as the writes by qmerge may be racy). Note again
that this speed may not come for free, and you may end up with a corrupt
image afterwards.
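
As a rough sketch of the "generate a package list" step, something like the
following could pull the category/package-version strings out of captured
emerge --pretend output. The regular expression assumes the usual
"[binary ...] cat/pkg-ver::repo" line layout, SAMPLE is hand-written rather
than captured from a real run, and packages_from_pretend is a hypothetical
helper, so please check it against your own output before trusting it.

```python
import re

# Assumed layout of a merge line, for example:
#   [binary  N     ] dev-libs/openssl-1.1.1k-r1::gentoo  USE="..."
# The capture stops at the first colon or whitespace, which drops any
# ":slot" and "::repo" suffix.
_MERGE_LINE = re.compile(r"^\[(?:ebuild|binary)[^\]]*\]\s+([^\s:]+)")


def packages_from_pretend(output):
    """Return the category/package-version strings in merge order."""
    packages = []
    for line in output.splitlines():
        match = _MERGE_LINE.match(line.strip())
        if match:
            packages.append(match.group(1))
    return packages


# Hand-written sample, not real emerge output.
SAMPLE = """\
These are the packages that would be merged, in order:

[binary  N     ] sys-libs/zlib-1.2.11-r4::gentoo to /tmp/timetest/ USE="minizip"
[binary  N     ] dev-libs/openssl-1.1.1k-r1:0/1.1::gentoo to /tmp/timetest/ USE="asm zlib"

Total: 2 packages (2 new, 2 binaries), Size of downloads: 0 KiB
"""

if __name__ == "__main__":
    for pkg in packages_from_pretend(SAMPLE):
        print(pkg)
```

The resulting list is what you would hand to qmerge; I have deliberately left
the qmerge invocation itself out, so check qmerge --help for the options you
want.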

I'm not sure if folks are running qmerge in production like this
(maybe others on the list have experience).

> # time ROOT=/tmp/timetest emerge -1k --nodeps openssl
> >>> Emerging binary (1 of 1) dev-libs/openssl-1.1.1k-r1::gentoo for /tmp/timetest/
> ...
> real    0m30.145s
> user    0m29.066s
> sys    0m1.685s
> Running the same on the native host is about 5-6sec, (and I find this ratio
> fairly consistent for qemu usermode, about 5-6x slower than native)
> If I pick another package with fewer files, then I will see this 5-6 secs
> drop, suggesting (without offering proof) that the bulk of the time here is
> some "per file" processing.
> Note this machine is a 12 core AMD ryzen 3900x with SSDs that bench around
> the 4GB/s+. So really 5-6 seconds to install a few files is relatively
> "slow". Random benchmark on this machine might be that I can backup 4.5GB of
> chroot with tar+zstd in about 4 seconds.
> So the question is: I assume that further parallelisation of the install
> phase will be difficult, therefore the low hanging fruit here seems to be
> the install/merge phase and why there seems to be quite a bit of CPU "per
> file installed"? Can anyone give me a leg up on how I could benchmark this
> further and look for the hotspot? Perhaps someone understands the
> architecture of this point more intimately and could point at whether there
> are opportunities to do some of the processing en masse, rather than per
> file?
> I'm not really a python guru, but interested to poke further to see where
> the time is going.
> Many thanks
> Ed W
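
On looking for the hotspot: since emerge is Python, the standard library's
cProfile and pstats modules are one way to see where per-file time goes. The
sketch below profiles a stand-in loop, not Portage itself; install_one_file,
fake_merge and profile_report are hypothetical names, and the per-file work
is only a guess at the kind of thing a merge does.

```python
import cProfile
import io
import os
import pstats
import tempfile


def install_one_file(dest_dir, name):
    """Stand-in for per-file work: write, stat and chmod a small file."""
    path = os.path.join(dest_dir, name)
    with open(path, "wb") as f:
        f.write(b"x" * 1024)
    os.stat(path)
    os.chmod(path, 0o644)


def fake_merge(n_files=200):
    """Hypothetical merge loop; a real profile would target emerge itself."""
    with tempfile.TemporaryDirectory() as dest:
        for i in range(n_files):
            install_one_file(dest, f"file-{i}")


def profile_report(func, top=10):
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(top)
    return buf.getvalue()


if __name__ == "__main__":
    print(profile_report(fake_merge))
```

Sorting by cumulative time tends to surface the loop that dominates, which is
a reasonable starting point if the cost really is "per file". The same
cProfile calls could in principle be wrapped around whichever emerge entry
point you use, though I have not tried that.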
