On Fri, Apr 28, 2023 at 08:59:29AM +0200, Florian Schmaus wrote:
> On 27/04/2023 14.54, Michał Górny wrote:
> > On Thu, 2023-04-27 at 09:58 +0200, Florian Schmaus wrote:
> >> Disk space is cheap.
> >
> > No, it's not. Gentoo supports more hardware than your average PC with
> > beefy hard drive and/or possibility of installing one. Let's not forget
> > that you need a ::gentoo checkout even on a system running purely
> > on binary packages.
>
> You are right. Gentoo supports a broad range of hardware in many
> dimensions, e.g., architecture, release date, and composition.
>
> You seem to suggest that are Gentoo systems that can not handle the
> additional disk space consumption of EGO_SUM Go-packages?
>
> I can not imagine systems that are able to deal with the ~500 MiB
> ::gentoo repository, but would break if the same repository would
> contain 100 additional Go-packages with 200 KiB each.
>
> Even under a "worst-case" assumption, where we would have 256
> Go-packages with each having a 1 MiB package-directory size, any system
> that can handle the current state of ::gentoo should be able to take the
> additional 256 MiB (+ metadata).

This email ended up more rambling than I intended, but I wanted to get
the data out there, and enable us to look deeper at the problems and
potential impacts of the solutions.

Before the ideas and data, I wanted to note the semi-conceptual ways to
package new things that have many dependency artifacts (package or
distfile).

Distfile-heavy packages:
------------------------
A package declares many distfile dependencies, but very few package
dependencies. The Manifest files in this case suffer a lot of
duplication - but the growth is mostly limited to ::gentoo (or
overlays).

Any change to a package (dropping a version, adding a version, adding a
remotely-fetched patch) leads to a slightly different Manifest file, and
while delta compression will reduce the growth factor, it's still large.

Dependency-heavy packages:
--------------------------
A package declares many package dependencies, with the distfile growth
distributed over MANY packages.

Major downside here is that build-depends consume a lot more space &
inodes to install all the depends that are used for the ebuild, esp.
when a given distfile might be used for only one package.

Want to build a complex Go-based package? Debian/Ubuntu use this
approach, and it shows: you might have to explicitly package 70+
dependencies to get something you want packaged.
https://salsa.debian.org/go-team/packages/consul/-/blob/debian/sid/debian/control#L10-89

A quick back-of-napkin calculation on the Debian golang dep packages,
as of 22.04 LTS, shows: ~30% are a dep for only one package; a further
30% are a dep for only 2 packages.

----

With the above in mind, we see that it's not just the size of the
Manifest, but the combinatorial problem of Manifest revisions, with the
saving role of Git's delta compression.

I pulled a Git listing of every Manifest blob that was larger than 64KiB
in Git history (excluding the historical conversion), and then worked
from those:
2718 blobs in total, taking up ~516MiB, 1600056 DIST entries, for 166726
distinct distfiles.
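
For anyone wanting to reproduce that kind of tally, here is a minimal
Python sketch of the counting step. It is not the script used for the
numbers above; the function name and the sample data are made up for
illustration, and it assumes the blob contents have already been
extracted from Git as plain text. It only relies on Manifest DIST lines
starting with "DIST <filename> <size>". With the figures above, the
ratio 1600056 / 166726 works out to roughly 9.6 DIST entries per
distinct distfile.

```python
from typing import Iterable, Tuple


def count_dist_entries(manifests: Iterable[str]) -> Tuple[int, int]:
    """Return (total DIST entries, distinct distfile names) across Manifest texts."""
    total = 0
    names = set()
    for text in manifests:
        for line in text.splitlines():
            fields = line.split()
            # A DIST line looks like: DIST <filename> <size> <HASH> <digest> ...
            if len(fields) >= 3 and fields[0] == "DIST":
                total += 1
                names.add(fields[1])
    return total, len(names)


# Two hypothetical Manifest revisions sharing one distfile (made-up data).
rev_a = "DIST foo-1.0.tar.gz 100 SHA512 aa\nDIST bar-2.0.crate 200 SHA512 bb\n"
rev_b = "DIST foo-1.0.tar.gz 100 SHA512 aa\nDIST baz-3.0.zip 300 SHA512 cc\n"

total, distinct = count_dist_entries([rev_a, rev_b])
print(total, distinct)
```

The gap between the two numbers is the duplication that delta
compression has to absorb in history, and that a checkout pays for in
full.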

I tried to break those distfiles down, based on filename patterns, or
where they occurred (sorted by number of distfiles here):

 76075 dist-tex (all in the tex category)
 33949 dist-mozilla (firefox*, thunderbird*)
 19314 dist-office
 17802 dist-golang (*%2F@v%2F* files; 10160 .mod, 7642 .zip)
 10478 dist-rust (*.crate files)
  3630 dist-other
  1325 dist-jar-pom (*.jar, *.pom)
  1020 dist-tablebase-syzygy (distfiles for a specific package)
   981 dist-kde (kde manifests that met the threshold)
   980 dist-kernel-and-genpatches
   749 dist-tessdata (again specific packages)
   424 dist-bash (specific packages)
166727 == total

The Rust & Golang counts *are* lower bounds, because it's not trivial to
take into account changes in packaging. However, the upper bound
E.g. this distfile isn't immediately classifiable as Rust:
d3d12-rs-a990c93ec64eeab78f2292763d0715da9dba1d59.gh.tar.gz

To assume a worst case, assign the dist-other to the category of your
choice.

Ecosystems that are distfile-heavy, in order of Manifest sizes:
TeX, Golang, Rust

Packages that are distfile-heavy:
LibreOffice/OpenOffice, Firefox, Thunderbird

TeX has only a few packages, but the MOST distfiles.
dev-texlive/texlive-latexextra/Manifest peaked over 6MB with 15480
entries. For all of Gentoo git history however, there have only been 19
revisions of that Manifest. For all TeX packages, 286 revisions of
Manifests over 37 packages. Those 286 Manifest revisions clock in at
~94MB together before compression.

The Mozilla packages have the next most distfiles: 4 packages, 768
manifest revisions, but the largest single Manifest was only 285519
bytes. ~88MB for all the manifest revision bytes together.

The office packages (app-office/libreoffice-l10n &
app-office/openoffice-bin) are similar to Mozilla stats overall, and not
much to discuss. ~35MB for all Manifest revisions together.

With those big 3 out the way, we're into Golang & Rust.

Golang: 83 packages, 787 Manifest revisions.
Largest manifest was sys-cluster/k3s/Manifest in blob
f0e4d1761c0fe80a48b45007ad02024676490841, coming in just under 1MiB.
However, the duplication of distfiles between Manifests *really* shows
up: ~247MB for all Manifest revisions together.

Rust: 48 packages, 543 Manifest revisions, largest Manifest was blob
af989423f436338fb3e1d4193448dada5b9154da of app-shells/nushell/Manifest
at 336646 bytes. ~64MB for all Manifest revisions together.

---
End of data-analysis.

The estimates of Manifest compression were fine as a baseline, but Git
uses delta compression, and what tends to matter is the total number of
unique lines in a repo. The expansion *does* matter when the Manifests
are checked out at the same time.

If we took the Debian approach, we'd minimize the number of times a
given distfile has data repeated in Manifests, because it'd be
abstracted into a single dependency entry. The apparent downside is the
significant increase in build-only dependencies that are rarely used.

Previously I'd sketched an idea for out-of-tree Manifests, that hoisted
many SRC_URI entries into a *versioned* Manifest artifact that wasn't
present inside the tree, but had to be fetched & verified first, and
then used to fetch & verify the actual distfiles. That Manifest, while
relatively small, would be subject to some of the same hosting problems
as the distfile dep tarballs presently used. However it *would* mean
that the deps are much harder to tamper with (because they'd still come
from the original upstreams).

I do understand that overlays/non-main-trees however find the dep
tarball concept to significantly impede packaging speed, and the
out-of-tree Manifest will also cause friction (even if it were inside
their overlay repo, it's still more work).

To that end, and I know this will likely require significant PMS work, I
think we need to look deeper at how to solve the underlying issues.

Putting the "what" of entries into the *ebuild*, e.g.
with EGO_SUM is still ideal from an ease-of-development and validation
perspective. It's an accurate representation of the *artifacts* that a
package depends on. Those artifacts might be packages, or external
distfiles.

Where it breaks down is the mapping of those artifacts into in-tree
data: duplication in Manifests, md5-metadata.

How do we avoid that duplication?

The most obvious version is moving the artifacts back to *some* form of
package, as much as it pains me. The crappy part there is we're going to
end up packaging 2823 different Golang things, representing the 17802
distfiles.

A smaller-on-checkout, larger-in-history solution would be moving common
DIST entries to another Manifest, changing the way validation rules
work.

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136