On Fri, Apr 28, 2023 at 08:59:29AM +0200, Florian Schmaus wrote:
> On 27/04/2023 14.54, Michał Górny wrote:
> > On Thu, 2023-04-27 at 09:58 +0200, Florian Schmaus wrote:
> >> Disk space is cheap.
> > 
> > No, it's not.  Gentoo supports more hardware than your average PC with
> > beefy hard drive and/or possibility of installing one.  Let's not forget
> > that you need a ::gentoo checkout even on a system running purely
> > on binary packages.
> You are right. Gentoo supports a broad range of hardware in many 
> dimensions, e.g., architecture, release date, and composition.
> You seem to suggest that are Gentoo systems that can not handle the 
> additional disk space consumption of EGO_SUM Go-packages?
> I can not imagine systems that are able to deal with the ~500 MiB 
> ::gentoo repository, but would break if the same repository would 
> contain 100 additional Go-packages with 200 KiB each.
> Even under a "worst-case" assumption, where we would have 256 
> Go-packages with each having a 1 MiB package-directory size, any system 
> that can handle the current state of ::gentoo should be able to take the 
> additional 256 MiB (+ metadata).
This email ended up more rambling than I intended, but I wanted to get the data
out there, and enable us to look deeper at the problems and potential impacts
of the solutions.

Before the ideas and data, I wanted to note the semi-conceptual ways to package
new things that have many dependency artifacts (packages or distfiles).

Distfile-heavy packages:
A package declares many distfile dependencies, but very few package
dependencies. The Manifest files in this case suffer a lot of
duplication - but the growth is mostly limited to ::gentoo (or

Any change to a package leads to a slightly different Manifest file,
and while delta compression will reduce the growth factor, it's still
large (dropping a version, adding a version, adding a remotely-fetched patch).

Dependency-heavy packages:
A package declares many package dependencies, with the distfile growth
distributed over MANY packages. Major downside here is that
build-depends consume a lot more space & inodes to install all the
depends that are used for the ebuild, esp. when a given distfile might
be used for only one package. Want to build a complex Go-based package?
Debian/Ubuntu use this approach, and it shows: you might have to explicitly
package 70+ dependencies to get something you want packaged.
A quick back-of-napkin calculation on the Debian golang dep packages,
as of 22.04 LTS: ~30% are a dep for only one package; a further 30% are
a dep for only 2 packages.
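
A back-of-napkin count of that kind can be sketched roughly as below. The
dependency map is made-up illustration data (not the Debian archive), and
`usage_shares` is a hypothetical helper name.

```python
from collections import Counter

def usage_shares(deps_by_package):
    """Return {n: share of dep packages that are used by exactly n packages}."""
    # Count how many distinct packages pull in each dependency.
    users = Counter()
    for deps in deps_by_package.values():
        for dep in set(deps):
            users[dep] += 1
    # Histogram over the usage counts, normalised to shares.
    histogram = Counter(users.values())
    total = len(users)
    return {n: count / total for n, count in sorted(histogram.items())}

# Made-up example data (assumption): three applications and their Go deps.
example = {
    "app-a": ["golang-x", "golang-y", "golang-z"],
    "app-b": ["golang-y", "golang-z"],
    "app-c": ["golang-z", "golang-w"],
}
shares = usage_shares(example)
print(shares)
```

A large share under key 1 means many dep packages exist for a single
consumer, which is the space & inode cost described above.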

With the above in mind, we see that it's not just the size of the Manifest, but
the combinatorial problem of Manifest revisions, with the saving role of Git's
delta compression.

I pulled a Git listing of every Manifest blob larger than 64KiB in Git history
(excluding the historical conversion), and worked from those: 2718 blobs in
total, taking up ~516MiB, with 1600056 DIST entries for 166726 distinct
distfiles.
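
One possible way to produce such a listing, as a sketch only and not
necessarily the commands used for the numbers here: feed
`git rev-list --objects --all` into `git cat-file --batch-check` and filter
the result. The helper names are hypothetical, the sample lines are made up,
and the exclusion of the historical conversion is not handled.

```python
import subprocess

THRESHOLD = 64 * 1024  # 64 KiB, the cut-off mentioned above

def large_manifest_blobs(batch_check_lines, threshold=THRESHOLD):
    """Pick Manifest blobs above the threshold from batch-check output.

    Each line is expected as '<sha> <type> <size> <path>'.
    """
    seen = set()
    result = []
    for line in batch_check_lines:
        parts = line.rstrip("\n").split(" ", 3)
        if len(parts) != 4:
            continue  # e.g. '<sha> missing'
        sha, objtype, size, path = parts
        if objtype != "blob" or not path.endswith("/Manifest"):
            continue
        if int(size) > threshold and sha not in seen:
            seen.add(sha)
            result.append((sha, int(size), path))
    return result

def scan_repo(repo_path):
    """Run the two git commands against a checkout (not exercised here)."""
    objects = subprocess.run(
        ["git", "-C", repo_path, "rev-list", "--objects", "--all"],
        check=True, capture_output=True, text=True,
    ).stdout
    listing = subprocess.run(
        ["git", "-C", repo_path, "cat-file",
         "--batch-check=%(objectname) %(objecttype) %(objectsize) %(rest)"],
        input=objects, check=True, capture_output=True, text=True,
    ).stdout
    return large_manifest_blobs(listing.splitlines())

# Made-up sample lines (assumption) in the batch-check format above.
sample = [
    "aaaa blob 70000 cat/big/Manifest",
    "bbbb blob 1200 cat/small/Manifest",
    "cccc tree 500 cat/big",
    "dddd blob 90000 cat/big/big-1.0.ebuild",
]
large = large_manifest_blobs(sample)
print(large)
```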

I tried to break those distfiles down, based on filename patterns, or where
they occurred (sorted by number of distfiles here):
  76075 dist-tex (all in the tex category)
  33949 dist-mozilla (firefox*, thunderbird*)
  19314 dist-office 
  17802 dist-golang (*%2F@v%2F* files; 10160 .mod, 7642 .zip)
  10478 dist-rust (*.crate files)
   3630 dist-other
   1325 dist-jar-pom (*.jar, *.pom)
   1020 dist-tablebase-syzygy (distfiles for a specific package)
    981 dist-kde (kde manifests that met the threshold)
    980 dist-kernel-and-genpatches
    749 dist-tessdata (again specific packages)
    424 dist-bash (specific packages)
 166727 == total

The Rust & Golang counts *are* lower bounds, because it's not trivial to
take into account changes in packaging. However, the upper bound 
E.g. this distfile isn't immediately classifiable as Rust:
To assume a worst case, assign the dist-other bucket to the category of your choice.
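
A classification along those lines could look roughly like the sketch below.
Only the Golang, Rust and jar/pom filename patterns and the firefox/thunderbird
names come from the table; the `dev-tex` prefix test, the bucket order and the
helper names are assumptions, and the remaining buckets are left out.

```python
from collections import Counter
from fnmatch import fnmatchcase

# (bucket, filename patterns); the first match wins.
FILENAME_RULES = [
    ("dist-golang", ["*%2F@v%2F*"]),
    ("dist-rust", ["*.crate"]),
    ("dist-jar-pom", ["*.jar", "*.pom"]),
]

def classify(manifest_path, distfile):
    """Assign one distfile to a bucket, by location first, then by filename."""
    # Assumes paths of the form 'category/package/Manifest'.
    category, package = manifest_path.split("/")[:2]
    if category.startswith("dev-tex"):  # assumption for "the tex category"
        return "dist-tex"
    if package.startswith(("firefox", "thunderbird")):
        return "dist-mozilla"
    for bucket, patterns in FILENAME_RULES:
        if any(fnmatchcase(distfile, pattern) for pattern in patterns):
            return bucket
    return "dist-other"

def count_buckets(manifest_path, manifest_text):
    """Count the distinct distfiles of one Manifest per bucket."""
    names = set()
    for line in manifest_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "DIST":
            names.add(fields[1])
    return Counter(classify(manifest_path, name) for name in names)

# Made-up Manifest content (assumption); the hashes are placeholders.
example = """\
DIST github.com%2Fexample%2Fmod%2F@v%2Fv1.0.0.mod 120 BLAKE2B 00 SHA512 00
DIST github.com%2Fexample%2Fmod%2F@v%2Fv1.0.0.zip 4096 BLAKE2B 00 SHA512 00
DIST example-0.1.0.crate 2048 BLAKE2B 00 SHA512 00
DIST example-1.0.tar.gz 1024 BLAKE2B 00 SHA512 00
"""
counts = count_buckets("app-misc/example/Manifest", example)
print(counts)
```

Anything that matches no rule lands in dist-other, which is why the Rust &
Golang figures are lower bounds.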

Ecosystems that are distfile-heavy, in order of Manifest sizes: TeX, Golang, 
Packages that are distfile-heavy: LibreOffice/OpenOffice, Firefox, Thunderbird

TeX has only a few packages, but the MOST distfiles.
dev-texlive/texlive-latexextra/Manifest peaked over 6MB with 15480 entries. For
all of Gentoo git history however, there have only been 19 revisions of that
Manifest. For all TeX packages, 286 revisions of Manifests over 37 packages.
Those 286 Manifest revisions clock in at ~94MB together before compression.

The Mozilla packages have the next most distfiles:
4 packages, 768 manifest revisions, but the largest single Manifest was only 
285519 bytes.
~88MB for all the manifest revision bytes together.

The office packages (app-office/libreoffice-l10n & app-office/openoffice-bin)
are similar to Mozilla stats overall, and not much to discuss.
~35MB for all Manifest revisions together.

With those big 3 out of the way, we're into Golang & Rust.

Golang: 83 packages, 787 Manifest revisions. Largest Manifest was
sys-cluster/k3s/Manifest in blob f0e4d1761c0fe80a48b45007ad02024676490841,
coming in just under 1MiB. However, the duplication of distfiles between 
Manifests *really* shows up:
~247MB for all Manifest revisions together.

Rust: 48 packages, 543 Manifest revisions, largest Manifest was blob
af989423f436338fb3e1d4193448dada5b9154da of app-shells/nushell/Manifest at
336646 bytes. ~64MB for all Manifest revisions together.
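
To put the per-ecosystem totals side by side, here is a small calculation
using only the (approximate) figures quoted above. The office packages are
left out because no revision count is given for them; the variable names are
just for illustration.

```python
# Figures quoted above: (Manifest revisions, total MB before compression).
stats = {
    "TeX": (286, 94),
    "Mozilla": (768, 88),
    "Golang": (787, 247),
    "Rust": (543, 64),
}

def average_revision_mb(revisions, total_mb):
    """Average size of one Manifest revision, in MB."""
    return total_mb / revisions

averages = {name: average_revision_mb(*v) for name, v in stats.items()}
for name, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:8s} {avg:.2f} MB per Manifest revision")

# Across all Manifest blobs over the 64KiB threshold.
dist_entries = 1600056
distinct_distfiles = 166726
repeat_factor = dist_entries / distinct_distfiles
print(f"each distfile is listed about {repeat_factor:.1f} times on average")
```

The TeX and Golang revisions come out at similar average sizes; what sets
Golang apart is how many of those revisions exist.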

--- End of data-analysis.

The estimates of Manifest compression were fine as a baseline, but Git uses
delta compression, and what tends to matter is the total number of unique
lines in a repo. The expansion *does* matter when the Manifests are checked out
at the same time.
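
As a rough proxy for that point, one can compare the total number of DIST
lines across all revisions of a Manifest with the number of unique lines.
This is only an illustration: Git's delta compression works on bytes inside
packfiles, not on line sets, and the revisions below are made-up data.

```python
def line_stats(revisions):
    """Return (total DIST lines, unique DIST lines) across Manifest revisions."""
    total = 0
    unique = set()
    for text in revisions:
        for line in text.splitlines():
            if line.startswith("DIST "):
                total += 1
                unique.add(line)
    return total, len(unique)

# Made-up revisions (assumption): add a version, then drop the old one.
old = "DIST pkg-1.0.tar.gz 10 BLAKE2B 00 SHA512 00"
dep = "DIST dep-1.0.mod 5 BLAKE2B 01 SHA512 01"
new = "DIST pkg-1.1.tar.gz 11 BLAKE2B 02 SHA512 02"
revisions = [
    "\n".join([old, dep]),
    "\n".join([old, dep, new]),
    "\n".join([dep, new]),
]
total, unique = line_stats(revisions)
print(total, unique)
```

The gap between the two numbers is roughly what history can share; a checkout
still pays for every line of every Manifest that is present at once.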

If we took the Debian approach, we'd minimize the number of times a given
distfile has data repeated in Manifests, because it'd be abstracted into a
single dependency entry. The apparent downside is the significant increase in
build-only dependencies that are rarely used.

Previously I'd sketched an idea for out-of-tree Manifests, that hoisted many
SRC_URI entries into a *versioned* Manifest artifact that wasn't present inside
the tree, but had to be fetched & verified first, and then used to fetch &
verify the actual distfiles.
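
The two-stage fetch & verify could be sketched as below. This is an
illustration under assumptions, not an existing Portage feature: the
out-of-tree Manifest is assumed to reuse the usual
`DIST <name> <size> BLAKE2B <hex> SHA512 <hex>` line format, and the helper
names, the file names and the `fetch` callable are hypothetical.

```python
import hashlib

HASHERS = {"BLAKE2B": hashlib.blake2b, "SHA512": hashlib.sha512}

def parse_dist(manifest_text):
    """Parse DIST lines into {name: (size, {algo: hexdigest})}."""
    entries = {}
    for line in manifest_text.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] != "DIST":
            continue
        hashes = dict(zip(fields[3::2], fields[4::2]))
        entries[fields[1]] = (int(fields[2]), hashes)
    return entries

def verify(data, size, hashes):
    """Check size and every known hash; unknown algorithms are ignored."""
    if len(data) != size:
        return False
    known = [(a, h) for a, h in hashes.items() if a in HASHERS]
    return bool(known) and all(HASHERS[a](data).hexdigest() == h for a, h in known)

def fetch_verified(name, in_tree, fetch):
    """Stage 1: verify the out-of-tree Manifest against the in-tree entry.
    Stage 2: use it to verify every distfile it lists."""
    external = fetch(name)
    if not verify(external, *in_tree[name]):
        raise ValueError(f"out-of-tree Manifest {name} failed verification")
    distfiles = {}
    for dist, (size, hashes) in parse_dist(external.decode()).items():
        data = fetch(dist)
        if not verify(data, size, hashes):
            raise ValueError(f"distfile {dist} failed verification")
        distfiles[dist] = data
    return distfiles

def dist_line(name, data):
    """Build a DIST line for in-memory demo data."""
    return (f"DIST {name} {len(data)} "
            f"BLAKE2B {hashlib.blake2b(data).hexdigest()} "
            f"SHA512 {hashlib.sha512(data).hexdigest()}")

# In-memory stand-in for the network (assumption).
store = {"dep-1.0.zip": b"dependency payload"}
store["pkg-1.0-deps.Manifest"] = (
    dist_line("dep-1.0.zip", store["dep-1.0.zip"]) + "\n"
).encode()
in_tree = parse_dist(dist_line("pkg-1.0-deps.Manifest",
                               store["pkg-1.0-deps.Manifest"]))
result = fetch_verified("pkg-1.0-deps.Manifest", in_tree, store.__getitem__)
print(sorted(result))
```

The in-tree cost is then one DIST line per versioned Manifest artifact,
whatever the number of distfiles behind it.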

That Manifest, while relatively small, would be subject to some of the same
hosting problems as the distfile dep tarballs presently used. However it
*would* mean that the deps are much harder to tamper with (because they'd
still come from the original upstreams).

I do understand, however, that overlays/non-main trees find the dep tarball
concept to significantly impede packaging speed, and the out-of-tree Manifest
will also cause friction (even if it were inside their overlay repo, it's still
more work).

To that end, and I know this will likely require significant PMS work, I think
we need to look deeper at how to solve the underlying issues.

Putting the "what" of entries into the *ebuild*, e.g. with EGO_SUM is still
ideal from an ease-of-development and validation perspective. It's an accurate
representation of the *artifacts* that a package depends on. Those artifacts
might be packages, or external distfiles.

Where it breaks down is the mapping of those artifacts into in-tree data:
Duplication in Manifests, md5-metadata.

How do we avoid that duplication? The most obvious version is moving the
artifacts back to *some* form of package, as much as it pains me.

The crappy part there is we're going to end up packaging 2823 different Golang
things, representing the 17802 distfiles.

A smaller-on-checkout, larger-in-history solution would be moving common DIST
entries to another Manifest, changing the way validation rules work.
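
A minimal sketch of that split, assuming "common" means a DIST line carried
by at least two packages. Where the shared Manifest would live and how the
validation rules would change are not modelled; the helper name and the data
are made up.

```python
from collections import Counter

def split_common(manifests):
    """Split DIST lines into a shared Manifest and per-package remainders.

    `manifests` maps a package name to the list of DIST lines in its Manifest.
    """
    usage = Counter()
    for lines in manifests.values():
        usage.update(set(lines))
    common = sorted(line for line, n in usage.items() if n >= 2)
    shared = set(common)
    remainders = {
        pkg: [line for line in lines if line not in shared]
        for pkg, lines in manifests.items()
    }
    return common, remainders

# Made-up example (assumption): two packages sharing one Go module file.
dep = "DIST dep-1.0.mod 5 BLAKE2B 01 SHA512 01"
manifests = {
    "cat/a": ["DIST a-1.0.tar.gz 10 BLAKE2B 02 SHA512 02", dep],
    "cat/b": ["DIST b-1.0.tar.gz 12 BLAKE2B 03 SHA512 03", dep],
}
common, remainders = split_common(manifests)
print(len(common), {pkg: len(lines) for pkg, lines in remainders.items()})
```

On checkout, each shared line is then present once instead of once per
package, while history gains the revisions of the shared Manifest.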

Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
