Hello everybody,
those of us using overlays might have noticed that they can seriously slow
down dependency calculation. This is mostly because of the lack of a metadata
cache.
For overlay maintainers, providing a metadata cache is quite tricky: to be
really consistent and useful, it would have to be regenerated after every
commit. That's quite easy to forget or get wrong.
So I sat down, brained some thoughts and played around a bit. Here's what I
came up with:
* server-side, each overlay is checked out
* for every overlay in our list:
  - we add it to make.conf explicitly (avoids any spillover effects)
  - we let egencache generate a metadata cache for that repository
* we rsync the repositories with metadata to a different directory
The last step is just there to get rid of all the "unneeded" data like .svn
directories, and it can be used to selectively exclude other data that is in
the repo but not needed by end users. It also makes it less likely that a
client copies inconsistent data while the metadata cache is being generated.
egencache creates the per-repository cache in metadata/cache, so it is nicely
bundled and won't interfere with anything else.
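
To make the server-side steps a bit more concrete, here is a minimal sketch
that only builds the two commands per overlay and prints them. The paths and
overlay names are made up for illustration. It also assumes that each
overlay's repo_name matches its directory name, and that passing
--portdir-overlay to egencache can stand in for the explicit make.conf entry;
neither assumption is something I have verified across Portage versions.

```python
from __future__ import annotations

import shlex
from pathlib import Path

# Hypothetical locations; the post does not name any paths.
CHECKOUT_ROOT = Path("/var/overlays/checkout")
EXPORT_ROOT = Path("/var/overlays/export")

# VCS bookkeeping directories that end users do not need.
VCS_EXCLUDES = (".svn/", ".git/", ".hg/", "_darcs/", "CVS/")


def egencache_cmd(overlay: str) -> list[str]:
    """Build the cache-generation command for a single overlay."""
    return [
        "egencache",
        "--update",
        # Assumption: repo_name equals the directory name.
        f"--repo={overlay}",
        # Assumption: this replaces the explicit make.conf entry.
        f"--portdir-overlay={CHECKOUT_ROOT / overlay}",
    ]


def export_cmd(overlay: str) -> list[str]:
    """Build the rsync command that copies one overlay to the export tree."""
    cmd = ["rsync", "-a", "--delete"]
    cmd += [f"--exclude={pattern}" for pattern in VCS_EXCLUDES]
    # The trailing slash on the source copies the directory contents.
    cmd += [f"{CHECKOUT_ROOT / overlay}/", f"{EXPORT_ROOT / overlay}/"]
    return cmd


if __name__ == "__main__":
    # Placeholder overlay names, not a real list.
    for name in ("overlay-a", "overlay-b"):
        print(shlex.join(egencache_cmd(name)))
        print(shlex.join(export_cmd(name)))
```

The export directory is what the rsync daemon would serve, so clients never
see a checkout that is in the middle of a cache update.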
So now we have all repositories, with metadata, in one place. We can start an
rsync daemon sharing the parent directory. For users this makes things easier:
instead of needing cvs, svn, git, darcs, hg, etc., they only need rsync
(which they already have installed!)
Layman gets easier too - it just needs to understand the rsync protocol and
select the right directories.
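
As a rough idea of what the client side could shrink to, here is a small
sketch that builds one rsync call per selected overlay. The mirror host, the
rsync module name and the local storage directory are all placeholders I made
up; this is not how layman currently works.

```python
from __future__ import annotations

# Hypothetical mirror and rsync module; no such server is named in the post.
MIRROR = "rsync://overlays.example.org/overlays"


def sync_cmd(overlay: str, dest_root: str = "/var/lib/layman") -> list[str]:
    """Build the rsync command that fetches one overlay, cache included."""
    src = f"{MIRROR}/{overlay}/"
    # Assumption: overlays are stored as subdirectories of dest_root.
    dest = f"{dest_root.rstrip('/')}/{overlay}/"
    return ["rsync", "-a", "--delete", src, dest]


if __name__ == "__main__":
    for name in ("overlay-a", "overlay-b"):
        print(" ".join(sync_cmd(name)))
```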
The only issue I have found with this idea relates to eclasses, or overriding
in-tree eclasses to be precise. The problem is that an override invalidates
the in-tree metadata and potentially affects other overlays too. So that's a
bit of a bummer, but then I wonder how common that case is.
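
One way to get a feel for how common eclass overrides are would be to compare
file names in each overlay's eclass directory against the main tree. The
sketch below does only that name comparison; the function names are mine, and
it assumes the usual layout with an eclass/ directory at the top of each
repository.

```python
from __future__ import annotations

from pathlib import Path


def eclass_names(repo: Path) -> set[str]:
    """Return the file names of all eclasses a repository ships."""
    eclass_dir = repo / "eclass"
    if not eclass_dir.is_dir():
        return set()
    return {p.name for p in eclass_dir.glob("*.eclass")}


def overridden_eclasses(main_tree: Path, overlay: Path) -> set[str]:
    """Return eclass names present in both the main tree and the overlay."""
    return eclass_names(main_tree) & eclass_names(overlay)
```

Running overridden_eclasses() over every checked-out overlay would give a list
of the overlays that need special handling, or that should be excluded from
the shared cache.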
For performance, the difference is noticeable. As a very rough pointer,
"emerge -puNDv world" with three overlays takes me ~15 minutes without a
metadata cache and about 75 seconds with one (roughly 12x faster). That is of
course a "worst case" scenario.
Generating the metadata cache isn't that expensive: it took about 45 minutes
to initially check out almost everything layman provides, and then about an
hour for the first cache generation run. Subsequent runs should be much faster
and can be run in parallel per overlay (at least in theory). So unless I
missed something really big and really obvious, it should be "small enough" to
run every hour or even more often.
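
The "in parallel per overlay" part could look roughly like the sketch below.
It is untested against a real set of overlays; the worker count is arbitrary,
and it assumes each overlay is already known to Portage under the name that
is passed to --repo.

```python
from __future__ import annotations

import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_egencache(overlay: str) -> int:
    """Run egencache for one overlay and return its exit status."""
    # Assumption: the overlay is configured under this repository name.
    cmd = ["egencache", "--update", f"--repo={overlay}"]
    return subprocess.run(cmd).returncode


def update_all(
    overlays: list[str],
    worker: Callable[[str], int] = run_egencache,
    max_workers: int = 4,
) -> dict[str, int]:
    """Update several overlays concurrently; map each name to its status."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in input order, so zip pairs them correctly.
        statuses = pool.map(worker, overlays)
        return dict(zip(overlays, statuses))
```

Threads are enough here because the real work happens in the egencache
subprocesses. The worker is a parameter so the scheduling can be tried out
with a dummy function before pointing it at real overlays.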
Advantages are:
- fewer deps for layman (if it is adapted)
- less complexity client-side
- faster sync performance: svn and git especially transfer way too much data;
  the initial checkout of one overlay was >35M for a few dozen ebuilds
- less load server-side: rsync is easy to replicate and relatively cheap.
  Popular overlays will appreciate the reduced traffic :)
- faster dependency calculation
and a few I have already forgotten.
Disadvantages are:
- syncing the main tree can invalidate most of the metadata cache (changed
  eclasses etc.), so you need to sync the overlays at the same time
- the eclass override situation I mentioned earlier
- slower updates (right now users can check out immediately after a commit;
  with this indirection there would be a delay of 30+ minutes)
If I don't get distracted I might set up a proof of concept public rsync
server providing the main repo plus all overlays I can throw in, but it'd have
a low initial update frequency (6h to daily).
Your thoughts, opinions and other input are appreciated.
Patrick