Hi

But, for me, even a trimmed-down Gentoo is still too large
(it has to contain all the base packages, from portage to the
toolchain, headers, etc). I'd prefer having only the essential
runtime stuff within the containers.

I'm building some embedded devices on the side using Gentoo, and my minimal builds are only a few MB. Curious why you feel you need to move away from Gentoo to get the size down?

It seems your complaint is that your Gentoo installs are full-featured, with a toolchain and portage, and you are comparing them to an installation built with a different tool that doesn't have a toolchain installed? You can do the same using Gentoo if you wish (you just need a lightweight package installer to avoid installing portage).

I think your main options are:

1) Build your base images without a toolchain or portage and use a minimal package installer to install pre-built binary packages. This seems fraught with issues in the long term though...

2) Build your base images without a toolchain, but with portage (and perhaps a very minimal python). This gives you full dependency tracking; obviously bind-mount or NFS-mount the actual portage tree to avoid the space used there. This seems workable and minimal?

3) If we are talking virtual machines, then who cares if your containers are individually quite large, given that the files in them are duplicated across all containers? Simply use an appropriate de-duplication strategy to coalesce the space and most of the disadvantages disappear. E.g. with linux-vserver you can simply hardlink all the common files across your installations and let the COW patch break a hardlink if anyone alters a file in a single instance. Or you could use aufs to mount a writeable layer over your common base VM instance. Or you could use one of the filesystems which de-duplicate files in the background (some caveats apply here to avoid memory still being used multiple times in each VM). Or under KVM there is KSM (Kernel Samepage Merging), which coalesces identical memory pages across guests.
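The hardlink idea in option 3) is easy to demonstrate: `cp -al` clones a directory tree but hardlinks the files, so each file's data (and its page-cache entry) is stored exactly once. A minimal sketch, using throwaway paths under /tmp rather than real VM roots:

```shell
# Clean slate for the demo (paths are illustrative, not real VM roots)
rm -rf /tmp/dedup-demo
mkdir -p /tmp/dedup-demo/base/bin
echo "binary contents" > /tmp/dedup-demo/base/bin/sh

# "Clone" the base install for a second guest: the directory tree is
# copied, but every file is a hardlink to the original, so the data
# exists only once on disk.
cp -al /tmp/dedup-demo/base /tmp/dedup-demo/guest1

# Both paths report the same inode number, i.e. one shared file:
stat -c %i /tmp/dedup-demo/base/bin/sh /tmp/dedup-demo/guest1/bin/sh
```

Note that plain hardlinks alone don't protect you from one guest modifying a shared file in place; that is exactly what the linux-vserver COW patch adds on top.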

Personally I think option 3) is quite interesting for a medium number of virtual machines, i.e. in the tens to hundreds: simply don't worry about it and let the OS do the work. At the hundreds-to-thousands-plus level I guess you have unique challenges, and I would be wrong to suggest a solution from the comfort of a laptop without having that responsibility, but I would have thought there was some advantage in a very rigidly deployed base OS, generated and updated very precisely.


For this we need a different approach (strictly separating the build
and production environments). Binary distros (e.g. Debian) might
be one option, but they lack the configurability and are mostly
still too large. So I'm going a different route using my own
build system - called Briegel - which was originally designed for
embedded/small-device targets.

So far I haven't had the spare time to port all the packages
required for complete server systems (most of the work is making
them all cleanly cross-compilable, as this is a fundamental
concept of Briegel). But maybe you'd like to join in and try it :)

Sounds like an interesting challenge, but I'm unconvinced you can't solve 90% of your problem within the constraints of Gentoo. That saves you a bunch of time, which could then be invested in the last 10% through more traditional means.


It does appear that managing large numbers of virtual machines is one
area where Gentoo could score very well. Interested to see any chatter on
how others solve this problem, or any general advocacy? Probably we
should start a new thread though...

I'm not sure if Gentoo really is the right distro for that purpose,
as it's targeted at very different systems (e.g. Gentoo boxes are
expected to be quite unique, beginning with different per-package
USE flags, even down to CFLAGS, etc). But it might still be a good
basis for building specific system images (let's call them stage5 ;-))

I won't disagree about where it's targeted, but just to re-iterate why I think Gentoo works well: it has a very workable binary package feature!

My way of working is to use (several) shared binary package repos, and the guests largely pull from those shared package directories. In fact what I do is keep a minimal number of shared "/usr/portage/packages" directories and mount the appropriate one into each guest type at boot time. At the moment my two main options are "32bit" and "64bit" for the package mounts, but I recently introduced a new machine type which is held back to Perl 5.8, and that guest gets its own package mount since it's obviously linking a lot of binaries differently.

So, my process is to test an update on a small number of guests, either dedicated test guests or less important live guests. If this looks good, then I run the upgrade against all other VMs of the same type, and they will update quickly from the binary packages.
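A minimal sketch of this shared-binary-package setup, assuming standard portage features (`FEATURES="buildpkg"`, `PKGDIR`) and illustrative mount paths of my own invention:

```shell
# On the build/test guest, save a binary package of everything emerged.
# In /etc/portage/make.conf:
#   FEATURES="buildpkg"
#   PKGDIR="/usr/portage/packages"

# At guest boot, mount the shared package directory appropriate to this
# machine type (the /srv/packages layout is illustrative):
mount --bind /srv/packages/64bit /usr/portage/packages

# On the remaining guests of the same type, update from the prebuilt
# binaries rather than compiling (-k = --usepkg):
emerge -uDk world
```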

Now, the icing is that this works extremely well even once you decide to lightly customise machine types. So for example my binary packages are very high level (e.g. 32/64bit), and my "profiles" would be fairly high level too, e.g. I have www-apache and www-nginx base profiles. However, a specific virtual machine running, say, nginx might itself need a specific PHP application installed, and that application might need some dependencies, which in turn might require a specific set of USE flag and version customisations.

Now, the neat thing is that the binary upgrade options are *either* to use *only* binary packages, *or* to use binary packages *if* they were built with the correct USE flags. So for example I haven't bothered to split out my packages directory to be specific to the nginx/apache machines; this causes the PHP package to be regularly rebuilt depending on whether it was last used to upgrade an nginx or an apache guest (different USE flags are needed for each). I could fix this easily enough, but it's not a problem for me and it's handled automatically through portage's binary package updates.

So the end result is that you can make efficient use of binary updates, but portage will still customise the odd package here or there where a local machine requires something which differs from the norm. To my eye this keeps most of the benefits of an RPM/DEB-style binary updater, with the flexibility of a per-machine, customised-USE-flag Gentoo installation.
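The two binary upgrade modes described above map onto two real emerge flags; a short sketch of the distinction:

```shell
# -K / --usepkgonly: install binary packages unconditionally, even when
# their recorded USE flags differ from this guest's configuration:
emerge -uDK world

# -k / --usepkg: prefer a binary package, but fall back to building from
# source when the saved package was built with different USE flags
# (this is what triggers the PHP rebuilds described above):
emerge -uDk world
```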


A setup for 100 identical webserver VMs could look like this:

* run a normal Gentoo vm (tailored for the webserver appliance),
   where you do regular updates (emerge, revdep-rebuild, etc, etc)
* from time to time take a snapshot, strip off the buildtime-only
   stuff (hmm, could turn out to be a bit tricky ;-o)
* this stripped snapshot now goes into testing vm's
* when approved, the individual production vm's are switched over
   to the new image (maybe using some mount magic, unionfs, etc)
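The snapshot-and-strip steps above can be sketched in shell. This is a toy run on throwaway directories under /tmp standing in for the golden VM's root filesystem; the list of directories stripped is an illustrative guess at the build-time-only content, not a complete recipe:

```shell
# Throwaway "golden" tree for the demo (on a real system you would
# snapshot the maintained guest's root filesystem instead):
rm -rf /tmp/golden /tmp/staging /tmp/web-image.tar.gz
mkdir -p /tmp/golden/usr/portage /tmp/golden/usr/include /tmp/golden/etc
echo "web01" > /tmp/golden/etc/hostname

# 1) Snapshot the maintained guest into a staging tree:
cp -a /tmp/golden /tmp/staging

# 2) Strip build-time-only content (portage tree, headers, build dirs):
rm -rf /tmp/staging/usr/portage /tmp/staging/usr/include \
       /tmp/staging/var/tmp/portage

# 3) Pack the result as the image handed to the testing VMs:
tar -C /tmp/staging -czf /tmp/web-image.tar.gz .
```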

This could work, and perhaps for 100 identical VMs you have enough meat to work on something quite customised anyway?

Personally, for 20-80 identical VMs running a very limited variety of web software I would go for:
- Slightly cut down gentoo VM
- Hardlinked across all instances OR single installation which is read only
- Writeable data areas mounted to their own space (/var/www, /tmp, /home, etc)

By separating the data from the OS you have a lot of flexibility to upgrade the base webserver install and mount the data back onto the new VM. With linux-vserver or other container-style virtualisation, you will find that the OS shares code segments across all virtual machines (since all files share the same inode), so memory usage should be much lower, nearer to firing up one instance of the shared app and then forking (i.e. data is duplicated, but the code segment is shared).
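A rough sketch of the hardlinked-clone-plus-separate-data layout, assuming illustrative paths (/vm, /data) and root privileges for the bind mounts:

```shell
# Create a guest as a hardlinked clone of a common base install:
# the tree is copied but every file shares the base's inode.
cp -al /vm/base /vm/guest42

# Keep the writeable data areas outside the OS tree and bind-mount
# them into the guest:
for dir in var/www tmp home; do
    mkdir -p "/data/guest42/$dir" "/vm/guest42/$dir"
    mount --bind "/data/guest42/$dir" "/vm/guest42/$dir"
done
```

Upgrading then means building a fresh base, re-cloning, and re-mounting the same /data areas onto the new tree.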


For 100+ VMs I guess I would be looking very strongly at a common read-only OS partition and container-style virtualisation.

For 20-80 near-identical VMs running a wider variety of web software, I would go for the hardlinked option with a straightforward "emerge" upgrade across them. Hardlinking keeps memory usage sane where possible, without the pain of keeping the base install absolutely identical and read-only to make the common-mount option work.


At this point I've got a question for the other folks here:

emerge has a --root option which allows (un)merging into a separate
system image. So it should be possible to unmerge a lot of system
packages which are only required for updating/building (even
portage itself), but this will still be manual - what about
dependency handling?

This is correct. In fact this is how you build a stage 1, 2, 3, etc., and how catalyst works!

The information is a bit spread out over several out-of-date wiki articles, but perhaps start with:
    http://en.gentoo-wiki.com/wiki/Tiny_Gentoo

Roughly speaking you could "freshen" your current installation with (from memory):
    ROOT="/tmp/new_build" emerge -av world

This has minor gremlins when I test it, probably due to some symlinks being created differently than if you follow the current catalyst build script through stages 1, 2, 3, etc., but roughly speaking it does the same thing, only jumping straight to the end result and building a completely new install identical to your current OS...

Even more special is that you can set an alternative portage configuration source: if you want to build your new ROOT with an alternative make.conf, /etc/portage/*, etc., then just put your new files somewhere and set PORTAGE_CONFIGROOT to point at them. Cross-compiling is also done through an extension of this basic method.
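Combining ROOT and PORTAGE_CONFIGROOT, a sketch of building a fresh image with its own configuration (the /vm/newconfig tree is an illustrative name; its layout mirrors the usual /etc/portage):

```shell
# Alternative configuration for the new image lives in its own tree:
#   /vm/newconfig/etc/portage/make.conf
#   /vm/newconfig/etc/portage/package.use

# Build a complete fresh install into an empty ROOT using that
# configuration instead of the host's:
PORTAGE_CONFIGROOT=/vm/newconfig ROOT=/vm/new_build emerge -av world
```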

So, following your chain of thought: yes, it's not too hard to quickly generate a customised base OS installation to use for your future VMs. Further, if you wish, you can make those VMs have a reduced or missing toolchain, etc. In fact if you google a bit I think you will find some recipes for very minimal VMs using this method, where the base VM is a very minimal install...

Is there some way to drop at least parts of the standard system set,
so that e.g. portage, python, gcc, etc. get unmerged by --depclean
if nothing else (in the world set) explicitly requires them?

You are almost thinking about it all wrong.  ("There is no spoon...")

This is Gentoo, so at this more advanced level, stop thinking about a "standard system set" and instead free your mind to start with "nothing". Go read the old bootstrap-from-stage-1 instructions, plus the TinyGentoo pages, and you can quickly see that Catalyst builds your working installation by starting from a working installation, creating an empty directory, adding some minimal packages to that directory and building up from there.

So absolutely nothing stops you from just starting with an empty directory, emerging a few basic packages into it (a couple of MB), then chrooting into it and having some fun... There is *no* minimal package set; you can install whatever you want (as long as it boots). Largely the portage dependency tracker will help you pull in the minimal needed dependencies, but beware that system packages aren't generally explicitly tracked, so you may stumble across missing deps when you go really basic and omit standard system packages (just use common sense: it should be fairly obvious that if an application requires a compiler and you didn't install one, then you have a conflict of interest...)
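The empty-directory approach can be sketched in a few commands; the package list here is an illustrative minimal set of my own choosing, not an official one, and on a real system you would refine it (and the chroot) considerably:

```shell
# Start from nothing and emerge only what you want into it
# (-1 = --oneshot, so nothing is added to the new ROOT's world file):
mkdir -p /tmp/tiny
ROOT=/tmp/tiny emerge -av1 sys-libs/glibc sys-apps/baselayout sys-apps/busybox

# ...then chroot in and have a look around your few-MB system:
chroot /tmp/tiny /bin/busybox sh
```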


Have another look at Gentoo! I definitely believe that its flexibility to build highly customised packages, plus strong templating of those packages, plus a decent ability to distribute binaries of the end result, is a very strong combo! Better binary support is really the only thing missing here, but it's pretty adequate as it stands!

Good luck

Ed W

