Hi
But, for me, even a trimmed-down Gentoo is still too large
(it has to contain all the base packages, from portage to the
toolchain, headers, etc). I'd prefer having only the essential
runtime stuff within the containers.
I'm just building some embedded devices on the side using gentoo and my
minimal builds are only a few MB? Curious why you feel you need to move
from Gentoo to get the size smaller?
Seems like your complaint is that you have gentoo installs which are
full featured with a toolchain and portage, which you are comparing to
an installation you built with a different tool that doesn't have a
toolchain installed? However, you can do the same using gentoo if you
wish? (you just need a lightweight package installer to avoid installing
portage)
I think your main options are:
1) Build your base images without a toolchain or portage and use a
minimal package installer to install pre-built binary packages. This
seems fraught with issues long term though...
2) Build your base images without a toolchain, but with portage (and
perhaps a very minimal python). This gives you full dependency tracking
and obviously bind mount/nfs mount the actual portage tree to avoid
space used there. This seems workable and minimal?
3) If we are talking virtual machines then who cares if your containers
are individually quite large, if the files in them are duplicated across
all containers? Simply use an appropriate de-duplication strategy to
coalesce the space and most of the disadvantages disappear? eg with
linux-vserver you can simply hardlink all the common files across your
installations and allow the COW patch to break hardlinks if anyone
alters a file in a single instance. Or you could use aufs to mount a
writeable layer over your common base VM instance? Or you could use one
of the filesystems which de-duplicates files in the background (some
caveats apply here to avoid memory still being used multiple times in
each VM). Or under KVM there is the memory coalescing feature which
merges identical pages (I forget its name - KSM perhaps?)
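A quick sketch of the hardlink approach from option 3 - the paths here are hypothetical, and the copy-on-write behaviour assumes the linux-vserver COW-link patch:

```shell
#!/bin/sh
# Clone new guests from a built base image by hardlinking instead of
# copying: identical files share one inode, so both disk space and
# page cache are shared across guests. Paths are illustrative only.
cp -al /vservers/base /vservers/guest1
cp -al /vservers/base /vservers/guest2

# With the linux-vserver COW-link patch, a write inside one guest
# breaks the hardlink for that file only; the other guests keep
# sharing the untouched common copy.
```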
Personally I think option 3) is quite interesting for a medium number
of virtual machines, ie in the 10s to hundreds: simply don't worry
about it, let the OS do the work. In the hundreds to thousands plus
level I guess you have unique challenges and I would be wrong to try and
suggest a solution from the comfort of a laptop without having that
responsibility, but I would have thought there was some advantage in a
very rigidly deployed base OS generated and updated very precisely?
For this we need a different approach (strictly separating build
and production environments). Binary distros (eg. Debian) might
be one option, but they're lacking the configurability and mostly
are still too large. So I'm going a different route using my own
buildsystem - called Briegel - which originally was designed for
embedded/small-device targets.
So far I haven't had the spare time to port all the packages
required for complete server systems (most of the work is making
them all cleanly cross-compilable, as this is a fundamental
concept of Briegel). But maybe you'd like to join in and try it :)
Sounds like an interesting challenge, but I'm unconvinced you can't
solve 90% of your problem within the constraints of Gentoo? This saves
you a bunch of time that could be invested in the last 10% through more
traditional means?
It does appear like managing large numbers of virtual machines is one
area where gentoo could score very well? Interested to see any chatter on
how others solve this problem, or any general advocacy? Probably we
should start a new thread though...
I'm not sure if Gentoo really is the right distro for that purpose,
as it's targeted at very different systems (e.g. Gentoo boxes are
expected to be quite unique, beginning with different per-package
useflags, even down to cflags, etc). But it might still be a good
basis for building specific system images (let's call them stage5 ;-))
I won't disagree on your "where it's targeted", but just to re-iterate
why I think Gentoo works well is that it does have a very workable
binary package feature!
My way of working is to use (several) shared binary package repos and
the guests largely pull from those shared package directories. In fact
what I do is have a minimal number of shared "/usr/portage/package"
directories and I mount an appropriate one to the guest type at boot
time. At the moment my main two options are "32bit" and "64bit" for the
package mounts, but I recently introduced a new machine type which is
held back to perl 5.8 and that guest gets its own package mount since
it's obviously linking a lot of binaries differently
So, my process is to test an update on a small number of guests, either
dedicated test guests or less important live guests. If this looks good
then I run the upgrade against all other VMs of the same type and they
will update quickly from package binaries
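In case it helps, the shared-package-mount setup above might be sketched like this - the build host name, export path and per-type directories are my own illustrative choices, not my real layout:

```shell
# On the build host, in /etc/portage/make.conf: keep a binary of
# everything that gets compiled, per machine type:
#   FEATURES="buildpkg"
#   PKGDIR="/srv/binpkgs/64bit"

# On a 64bit guest at boot: mount the matching shared package
# directory (a 32bit guest would mount /srv/binpkgs/32bit instead).
mount -o ro buildhost:/srv/binpkgs/64bit /usr/portage/packages

# Guest upgrade: take pre-built binaries where available, compile
# only what is missing.
emerge --update --deep --newuse --usepkg world
```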
Now, the icing is that this works extremely well even once you decide to
lightly customise machine types. So for example my binary packages are
very high level (eg 32/64bit), my "profiles" would be fairly high level,
eg I have www-apache and www-nginx base profiles. However, a specific
virtual machine running say nginx might itself need a specific PHP
application installed, and that itself might need some dependencies,
which in turn might require a specific set of customisation of use flags
and versions.
Now, the neat thing is that the binary upgrade options are *either* to
use *only* binary packages, OR to use binary packages *if* they were
built with the correct USE flags. So for example I haven't bothered to
split out my packages directory to be specific to the nginx/apache
machines, however, this causes the PHP package to be regularly rebuilt
depending on whether it was last used to upgrade an nginx or apache
guest (different use flags needed for each guest). I could fix this
easily enough, but it's not a problem for me and it's automatically
handled through the portage binary package updates
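Concretely, the two modes described are emerge's -K and -k switches (a world update is shown as the example):

```shell
# --usepkgonly (-K): install *only* from binary packages, with
# whatever USE flags they were built with; abort if a required
# binary package is missing.
emerge -K --update --deep world

# --usepkg (-k): use a binary package *if* its USE flags match
# this machine's configuration, otherwise fall back to compiling
# from source - which is exactly why PHP keeps getting rebuilt
# when one package directory serves both nginx and apache guests.
emerge -k --update --deep world
```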
So the end result is that you can make efficient use of binary updates,
but portage will still customise the odd package here or there where a
local machine requires something which differs from the norm. To my eye
this keeps most of the benefits of an RPM/DEB style binary updater, with
the flexibility of a per machine, customised USE flag gentoo installation?
A setup for 100 identical webserver VMs could look like this:
* run a normal Gentoo vm (tailored for the webserver appliance),
where you do regular updates (emerge, revdep-rebuild, etc, etc)
* from time to time take a snapshot, strip off the buildtime-only
stuff (hmm, could turn out to be a bit tricky ;-o)
* this stripped snapshot now goes into testing VMs
* when approved, the individual production VMs are switched over
to the new image (maybe using some mount magic, unionfs, etc)
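That workflow might look something like the sketch below - the paths and the exact list of build-time packages to strip are guesses on my part:

```shell
# 1) keep one "golden" VM updated normally:
#      emerge --sync && emerge -uDN world && revdep-rebuild

# 2) snapshot it, then strip build-time-only pieces from the copy
#    (the package list here is purely illustrative):
rsync -aHx --delete /vservers/golden/ /vservers/snapshot/
ROOT=/vservers/snapshot emerge --unmerge sys-devel/gcc sys-apps/portage
rm -rf /vservers/snapshot/usr/portage /vservers/snapshot/var/tmp/*

# 3) boot the stripped snapshot as a testing VM; when approved,
# 4) point the production VMs' root at the new image.
```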
This could work and perhaps for 100 identical VMs you have enough meat
to work on something quite customised anyway?
Personally for 20-80 identical VMs running a very limited variety of web
software I would go for:
- Slightly cut down gentoo VM
- Hardlinked across all instances OR single installation which is read only
- Writeable data areas mounted to their own space (/var/www, /tmp,
/home, etc)
By separating the data from the OS you have a lot of flexibility to
upgrade the base webserver install and mount the data back on the new
VM? With linux-vservers or other container style, you will find that
the OS shares code segments across all virtual machines (due to all
files sharing the same inode) and the memory usage should be much lower
and nearer to firing up an instance of the shared app and it then
forking (ie data is duplicated, but the code segment is shared)
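The read-only-OS-plus-writable-data split could be sketched roughly as follows - the device and mount names are invented for the example:

```shell
# One shared, read-only OS image mounted as each guest's root:
mount -o ro /dev/vg0/webserver-os /vservers/guest1

# Per-guest writable areas layered on top of it:
mount /dev/vg0/guest1-www  /vservers/guest1/var/www
mount /dev/vg0/guest1-home /vservers/guest1/home
mount -t tmpfs tmpfs       /vservers/guest1/tmp

# Upgrading the base OS is then just building a new image and
# re-mounting the same data volumes onto the new root.
```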
For 100+ VMs I guess I would be looking very strongly at a common
read-only OS partition and container style virtualisation
For 20-80 near identical VMs, but running a wider variety of web
software I would go for the hardlinked option with a straightforward
"emerge" upgrade option across them. Hardlinking keeps the memory usage
sane where possible, without the pain of trying to keep the base install
absolutely identical and read-only to make the common mount option work?
At this point I've got a question for the other folks here:
emerge has a --root option which allows you to (un)merge into a separate
system image. So it should be possible to unmerge a lot of system
packages which are just required for updating/building (even
portage itself), but this will still be manual - what about
dependency handling?
This is correct. In fact this is how you build a stage 1,2,3 etc and
how catalyst works!
The information is a bit spread out over several out of date wiki
articles, but perhaps start with:
http://en.gentoo-wiki.com/wiki/Tiny_Gentoo
Roughly speaking you could "freshen" your current installation with
(from memory):
ROOT="/tmp/new_build" emerge -av world
This has minor gremlins when I test it, probably due to some symlinks
being created differently if you follow the current catalyst build
script through stage 1,2,3 etc, but roughly speaking it does the same
thing only jumping straight to the end result and building a completely
new identical install to your current OS...
Even more special is that you can set an alternative portage source, so
if you want to build your new ROOT with alternative make.conf,
/etc/portage/*, etc then just put your new files somewhere and set
PORTAGE_CONFIGROOT to point to it. Cross compiling is also done through
an extension of this basic method
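Putting those two pieces together, building a fresh ROOT against an alternative set of config files might look like this (the config location is arbitrary - PORTAGE_CONFIGROOT expects to find etc/portage underneath it):

```shell
# Alternative configs live under /srv/vm-config/etc/portage/
# (make.conf, package.use, etc) - the location is your choice.
mkdir -p /tmp/new_build
PORTAGE_CONFIGROOT=/srv/vm-config ROOT=/tmp/new_build emerge -av world
```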
So, following your chain of thought - yes it's not too hard to quickly
generate a customised base OS installation to use for your future VMs.
Further, if you wish you can make those VMs have a reduced or missing
toolchain etc. In fact if you google a bit I think you will find some
recipes for very minimal VMs using this method where the base VM is a
very minimal install...
Is there some way to drop at least parts of the standard system set,
so eg. portage, python, gcc, etc, etc get unmerged by --depclean
if nothing else (in the world set) explicitly requires them?
You are almost thinking about it all wrong. ("There is no spoon...")
This is gentoo, so at this more advanced level, stop thinking about
"standard system set" and instead free your mind to start with
"nothing". Go read the old bootstrap from stage 1 instructions, plus
the TinyGentoo pages and you can quickly see that Catalyst builds your
working installation by starting from a working installation, creating
an empty directory, adding some minimal packages to that directory and
building up from there.
So absolutely nothing stops you from just starting with an empty
directory and just emerging a few basic packages into it (couple MB) and
then chrooting into it and having some fun... There is *no* minimal
package set, you can install whatever you want (as long as it boots).
Largely the portage dependency tracker will help you pull in the minimal
needed dependencies, but beware that system packages aren't generally
explicitly tracked so you may stumble across some deps when you are
going really basic and omitting standard system packages (just use common
sense: it should be fairly obvious if an application requires a compiler
and you didn't install one then you have a conflict of interest...)
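As a minimal sketch of "starting with nothing" - the package picks here are just one plausible choice, not a blessed minimal set:

```shell
# An empty directory becomes the new ROOT:
export ROOT=/tmp/tiny
mkdir -p "$ROOT"

# Pull in just a libc and a minimal userland; portage resolves the
# dependencies it tracks, but expect to add the odd implicit system
# package by hand as you hit missing bits.
emerge -av sys-libs/glibc sys-apps/busybox

# ...then chroot in and have some fun:
chroot "$ROOT" /bin/busybox sh
```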
Have another look at gentoo! I definitely believe that its flexibility
to build highly customised packages, plus strong templating of those
packages, plus decent ability to distribute binaries of the end result
is a very strong combo! Better binary support is really the only thing
missing here, but it's pretty adequate as it stands!
Good luck
Ed W