I'm in the middle of building a patch for 718225, and I'm having to
think carefully about caching at this point. I think the area of caching
could do with a review (or at least a little input on, or agreement
with, my proposals/plans from Daniel).

Obviously it's too late to get much improvement in this area into
jessie, but knowing the intended direction of improvement for a future
version could be very helpful in building this patch (as well as
shaping the next version).

(tl;dr summary at end.)

Brief caching background:
-------------------------------------
The set of cacheable files includes downloaded packages (deb & udeb),
installer files (vmlinuz & initrd), and distribution information files
(release, packages.gz and contents-[arch].gz files). Additionally,
there's possible caching of certain completed build stages, but this is
mainly useful only for development of LB, with the exception of caching
the basic bootstrap chroot which is required by the current design of
the build process.

Caching scenarios include:

  * A single clean build (don't download the same thing multiple
    times).
  * Re-running a build in a directory where a build has previously
    taken place, without cleaning it out, so there is potential for
    many files to be retrieved from the cache.
  * An offline build, where absolutely everything will and must come
    from the cache (a variation of building in a previously used
    directory, and not to be confused with using your own local mirror,
    a completely different thing!).

One further situation complicates things. The description of the
--cache-packages parameter says that disabling it is not recommended,
but that in rare setups it is actually faster to re-download (from a
local mirror) than to hit the disk.

Offline building:
-------------------------------------
The only reference to offline building I've seen in LB documentation is
within the description for the --cache-indices parameter ("would allow
to rebuild an image completely offline, however, you would not get
updates anymore"), which was introduced in the v1.0 live-helper days. If
enabled, a few apt data files are retrieved from a cached copy if they
exist, rather than re-installing a few local keys and key packages (if
you have any in your config), and an install of aptitude is possibly
avoided. To me this param and bit of code are a little puzzling, adding
complexity to avoid very little work, and it's not even entirely clear
to me how it helps "offline" capabilities, unless only in relation to
the possible install of aptitude, which surely could be handled much
more cleanly. I can find nothing through google that suggests this
parameter is actually needed by anyone, just someone happening to notice
it break in an ancient bug report, and a mention of it in an old article.

Furthermore, offline building has actually been broken for at least two
and a half years now, since "support for including firmware packages
automatically" was included in v3.0-a47, unless you disable inclusion of
firmware packages (set --firmware-binary and --firmware-chroot to
false). That code has always lacked caching support, preventing offline
building. Clearly there seems to be no serious use for it.

My partially complete 718225 work does actually fix the lack of caching
in the 'firmware' related code, and thus it's possible that offline
building could be workable once again, but should I bother paying it any
thought? I'm still trying to figure out how I might best use caching in
implementing this patch; if we agreed on offline support being
unnecessary and ditching it, it might possibly make things a little less
complicated (and --cache-indices could perhaps be ripped out later, in
line with that).

Update: I just noticed that downloading of mirror 'trace' files (placed
in the image as .disk/archive_trace) does not use caching, and this has
been in place since v1.0.5-2, so offline building can't have worked
since then (unless perhaps that just silently fails)!

No Caching (of package files)
-------------------------------------
(Downloading being faster than retrieving from disk.) Apparently a
rarely needed capability, which has existed since around v1.0-a22. I
can find nothing on Google about it. I'm sure it does work, so there may
be people using it, but is anyone? Obviously there's going to be a fair
bit of disk activity during the build process anyway, this just reduces
it a little. Can the existence of this functionality really be
justified? It would be nice to remove unnecessary things like this to
keep the code cleaner, so perhaps it could be removed in the next
version. Does anyone seriously require this?

Freshness
-------------------------------------
For the installer stage of the build process, everything is retrieved
from the cache if a copy is already there, except (currently) the
contents-[arch].gz files used to get a list of firmware packages to
download. If you check out the installer images (e.g.
http://ftp.debian.org/debian/dists/sid/main/installer-amd64/), i.e. the
vmlinuz and initrd files used by the install process, you'll see that
they are updated rarely, but occasionally. Obviously if you use the
daily build, updates are more frequent. There's actually a security
risk in using the daily build (inadequate info files to securely
download, which my 718225 patch highlights to the user), so perhaps
that's to be avoided, but there's still a real need for users to be
able to replace their cached installer files.

In terms of distribution information files used in a
secure-wget-download verification process in my 718225 patch, I intend
to use the cached copies, but if a failure occurs, download fresh copies
of any cached files used as a second chance, and only really fail if the
verification check after that fails also. So this will work perfectly
fine in all scenarios.
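
A minimal sketch of that second-chance logic, in Python purely for
illustration (LB itself is shell). Everything named here is
hypothetical: `download` stands in for whatever fetches a dist-info
file, and `verify` for the verification check run against the files in
the cache directory.

```python
from pathlib import Path
from typing import Callable, Iterable, List

# Hypothetical callables, not part of LB:
Download = Callable[[str, Path], None]  # fetch file `name` into the given path
Verify = Callable[[Path], bool]         # run the check against files in cache_dir


def ensure_info_files(names: Iterable[str], cache_dir: Path,
                      download: Download) -> List[str]:
    """Make sure every file exists locally; return the names served from cache."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    from_cache = []
    for name in names:
        target = cache_dir / name
        if target.exists():
            from_cache.append(name)
        else:
            download(name, target)
    return from_cache


def verify_with_second_chance(names: Iterable[str], cache_dir: Path,
                              download: Download, verify: Verify) -> bool:
    """Verify using cached copies first; on failure, refresh them and retry once."""
    from_cache = ensure_info_files(list(names), cache_dir, download)
    if verify(cache_dir):
        return True
    if not from_cache:
        # Every file was freshly downloaded already, so this is a real failure.
        return False
    for name in from_cache:
        download(name, cache_dir / name)  # replace the possibly stale copy
    return verify(cache_dir)
```

Only the files that actually came from the cache are downloaded again,
so a build with an empty cache never downloads anything twice.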

Contents-[arch].gz files used to get a list of firmware packages, and
Packages.gz, used to get a list of udebs, may possibly need to be
updated, and also trace files (if caching were implemented for them).

So certain files discussed here should be used from the cache during a
build, to avoid unnecessary repeat downloads (e.g. contents-[arch].gz is
downloaded twice, once during chroot_firmware, and once during
installer_debian-installer), but the user may need choice of whether to
refresh them at the start of a new build, or whether to allow the cached
copies to be used.
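
The "download once, reuse within the build" part could look roughly
like the following sketch (Python for illustration only; the function
name, the cache layout and the `download` callable are assumptions, not
existing LB code). Two build stages asking for the same URL would share
one cached file.

```python
from pathlib import Path
from typing import Callable
from urllib.parse import quote


def fetch_once(url: str, cache_dir: Path,
               download: Callable[[str], bytes]) -> Path:
    """Return a local copy of `url`, downloading only if it is not cached yet."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Percent-encode the whole URL so it is usable as a single file name.
    target = cache_dir / quote(url, safe="")
    if not target.exists():
        target.write_bytes(download(url))
    return target
```

Whether a copy left over from an earlier build should be trusted is a
separate question; this only avoids the repeat download.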

How should this best be approached? There are a few options:

 1. We could default to using the cached copies if available, but
    provide new flags to force a build to get new ones.
    (--[refresh/flush]-cached-dist-info and
    --[refresh/flush]-cached-[di/installer] ?).
 2. We could default to not using them, flushing them by default at the
    start of the build process, but provide a new flag
    (--use-cached-dist-info and --use-cached-[di/installer] ?) to avoid
    the flush and use them.
 3. We could just always flush them at the start of the build, and rely
    on caching during the build only.
 4. We could add a new option to the clean script, requiring users to
    run that to flush this info if they want it flushed.

Obviously the third is the simplest for users, but least efficient, and
the fourth perhaps requires the most forethought for each build. I'm not
certain which I'd consider best, but I think I'd lean towards one of the
first two.

As for implementing it, rather than having the first block of code that
downloads a dist-info file do a flush if required, I think it would be
cleaner to have a simple 'init' stage in the build process which does
that, or possibly an init script executed at the start of the bootstrap
script.
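
A rough sketch of such an init stage, covering options 1 to 3 above
(option 4 lives in the clean script, so it is not modelled here). This
is Python for illustration only; the policy names, the "dist-info"
subdirectory and the single boolean flag are all assumptions standing
in for whichever flags and cache layout end up being chosen.

```python
from enum import Enum
from pathlib import Path
from typing import List


class Policy(Enum):
    KEEP_UNLESS_REFRESH = 1  # option 1: use cache, a flag forces a refresh
    FLUSH_UNLESS_KEEP = 2    # option 2: flush, a flag keeps the cache
    ALWAYS_FLUSH = 3         # option 3: always flush at the start


def should_flush(policy: Policy, flag_given: bool) -> bool:
    """Decide whether cached dist-info files are dropped at build start."""
    if policy is Policy.ALWAYS_FLUSH:
        return True
    if policy is Policy.KEEP_UNLESS_REFRESH:
        return flag_given       # flag means "refresh"
    return not flag_given       # flag means "use cached"


def run_init_stage(cache_dir: Path, policy: Policy,
                   flag_given: bool) -> List[str]:
    """Flush cached dist-info files if the policy says so; return removed names."""
    removed: List[str] = []
    info_dir = cache_dir / "dist-info"  # hypothetical cache subdirectory
    if should_flush(policy, flag_given) and info_dir.is_dir():
        for entry in sorted(info_dir.iterdir()):
            if entry.is_file():
                entry.unlink()
                removed.append(entry.name)
    return removed
```

Doing the flush in one place means the individual download blocks only
ever need to ask "is it in the cache?", whichever option is picked.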

Summary (tl;dr)
-------------------------------------
Proposal summary:

  * Ditch unnecessary and broken offline build support (disregard
    support in new code now, possibly remove --cache-indices in next
    version).
  * Ditch --cache-packages=false support in next version, if no
    justification for keeping it?
  * Implement one of the first two options under the 'freshness' heading
    above for user control over freshness of dist-info and installer
    (vmlinuz and initrd) files used in a build, in relation to any copies
    available in the cache from a previous build.
