Re: [gentoo-dev] Proposal to undeprecate EGO_SUM
On Fri, Sep 30, 2022 at 12:49:02PM -0700, Alec Warner wrote:
> On Fri, Sep 30, 2022 at 7:53 AM Florian Schmaus wrote:
> >
> > On 30/09/2022 02.36, William Hubbs wrote:
> > > On Wed, Sep 28, 2022 at 06:31:39PM +0200, Ulrich Mueller wrote:
> > >>> On Wed, 28 Sep 2022, Florian Schmaus wrote:
> > >>> 2.) the number of EGO_SUM entries exceeds 1000 and a Gentoo developer
> > >>> maintains the package
> > >>> 3.) the number of EGO_SUM entries exceeds 1500 and a proxied
> > >>> maintainer maintains the package
> > >>
> > >> These numbers seem quite large, compared to the mean number of 3.4
> > >> distfiles for packages in the Gentoo repository. (The median and the
> > >> 99-percentile are 1 and 22, respectively.)
> >
> > The numbers may appear large when compared to the whole tree, but I
> > think a fair comparison would be within the related programming language
> > ecosystem, e.g., Golang or Rust.
> >
> > For example, analyzing ::gentoo yields the following histogram for
> > 2022-01-01:
> > https://dev.gentoo.org/~flow/ego_sum_entries_histogram-2020-01-01.png
> >
> > > To stay with your example, restic has a 300k manifest, multiple 30k+
> > > ebuilds and 897 distfiles.
> > >
> > > I'm thinking the limit would have to be much lower. Say, around 256
> > > entries in EGO_SUM_SRC_URI.
> >
> > A limit of 256 appears to be too low to be of any use. It is slightly
> > above the 50th percentile; half of the packages could not use it.
> >
> > We have to realize that programming language ecosystems that only build
> > static binaries tend to produce software projects that have a large
> > number of dependencies. For example, app-misc/broot, a tool written in
> > Rust, currently has 310 entries in its Manifest. Why should we treat
> > one programming language differently from another? Will we see voices
> > that ask for banning Rust packages in ::gentoo in the future? With the
> > rising popularity of Golang and Rust, we will (hopefully) only ever see
> > an increase of such packages in ::gentoo. And most existing packages in
> > this category will at best keep their dependency count constant, but are
> > also likely to accumulate further dependencies over time.
> >
> > And quite frankly, I don't see a problem with "large" Manifests and/or
> > ebuilds. Yes, it means our FTPs are hosting many files, in some cases
> > even many small files. And yes, it means that in some cases ebuild
> > parsing takes a bit longer. But I spoke with a few developers in the
> > past few months and was not presented with any real world issues that
> > EGO_SUM caused. If someone wants to fill in here, then now is a good
> > time to speak up. But my impression is that the arguments against
> > EGO_SUM are mostly of cosmetic nature. Again, please correct me if I am
> > wrong.
>
> I thought the problem was that EGO_SUM ends up in SRC_URI, which ends
> up in A. A ends up in the environment, and then exec() fails with
> E2BIG because there is an imposed limit on environment variables (and
> also command line argument length.)
>
> Did this get fixed?
>
> https://bugs.gentoo.org/719202

You are correct, this was part of the issue as well. I don't know what
the status of this bug is.

William
Re: [gentoo-dev] Proposal to undeprecate EGO_SUM
On Fri, Sep 30, 2022 at 10:07:44PM +0200, Arsen Arsenović wrote:
> Hey,
>
> On Friday, 30 September 2022 02:36:05 CEST William Hubbs wrote:
> > I don't know for certain about a vendor tarball, but I do know there
> > are instances where a vendor tarball wouldn't work.
> > app-containers/containerd is a good example of this. That is why the
> > vendor tarball idea was dropped.
> It is indeed not possible to verify vendor tarballs[1]. The proposed
> solution Go people had would also require network access.
>
> > Upstream doesn't need to provide a tarball, just an up-to-date
> > "vendor" directory at the top level of the project. Two examples that
> > do this are docker and kubernetes.
> Upstreams doing this sounds like a mess, because then they'd have to
> maintain multiple source trees in their repositories, if I understand
> what you mean.

Well, there isn't a lot of work involved in this for upstream; they just
run:

$ go mod vendor

at the top level of their project and keep that directory in sync in
their VCS. The downside is that it can be big, and some upstreams do not
want to do it.

> An alternative to vendor tarballs is modcache tarballs. These are
> absolutely massive (~20 times larger IIRC), though, they are verifiable.

The modcache tarballs are what I'm calling dependency tarballs, and yes,
they are bigger than vendor tarballs and verifiable. Also, the go-module
eclass sets the GOMODCACHE environment variable to point to the directory
where the contents of the dependency tarball end up, which makes it easy
for the Go tooling to just use the information in that directory.

If we can get bug https://bugs.gentoo.org/833567 to happen in EAPI 9,
that would solve all of this. The next step, once that happens, would be
to put a shared Go module cache in, for example, "${DISTDIR}/go-mod", so
that all Go modules from packages would be downloaded there, and they
would be consumed like all distfiles are.

> opinion: I see no way around it. Vendor tarballs are the way to go. For
> trivial cases, this can likely be EGO_SUM, but it scales exceedingly
> poorly, to the point of the trivial case being a very small percentage
> of Go packages. I proposed authenticated automation on Gentoo
> infrastructure as a solution to this, and implemented (a slow and
> unreliable) proof of concept (posted previously). The obvious question
> of "how will proxy maintainers deal with this" is also relatively
> simple: giving them authorization for a subset of packages that they'd
> need to work on. This is an obvious increase in the barrier of entry for
> fresh proxy maintainers, but it's still likely less than needing
> maintainers to rework ebuilds to use vendor tarballs on dev.g.o.

Vendor tarballs are not complete. The best example of this I see in the
tree is app-containers/containerd. If you try to build that with a vendor
tarball instead of a dependency tarball, the build will break, but it
works with a dependency tarball.

William

> [1]: https://github.com/golang/go/issues/27348
> --
> Arsen Arsenović
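To make "verifiable" concrete: as far as I understand it, go.sum records an "h1:" hash for every module, computed over the module's complete file tree. A dependency (modcache) tarball keeps those trees intact, so the hash can be recomputed; a vendor directory keeps only the parts the build needs, so it cannot. The following is a minimal sketch of that hash, assuming it matches the Hash1 function of golang.org/x/mod/sumdb/dirhash (a SHA-256 over a sorted manifest of per-file SHA-256 sums). The function name h1_hash and the example paths are made up for illustration.

```python
import base64
import hashlib
from typing import Mapping


def h1_hash(files: Mapping[str, bytes]) -> str:
    """Compute a Go-style "h1:" hash over a mapping of path to file content.

    Assumption: this mirrors dirhash.Hash1, i.e. a SHA-256 over one
    manifest line per file, with the files visited in sorted order.
    """
    manifest = hashlib.sha256()
    for path in sorted(files):
        if "\n" in path:
            raise ValueError("file names containing newlines are not allowed")
        file_sum = hashlib.sha256(files[path]).hexdigest()
        # Manifest line format: hex digest, two spaces, path, newline.
        manifest.update(f"{file_sum}  {path}\n".encode())
    return "h1:" + base64.b64encode(manifest.digest()).decode()


# Hypothetical module tree; real paths look like "module@version/file".
tree = {
    "example.com/mod@v1.0.0/go.mod": b"module example.com/mod\n",
    "example.com/mod@v1.0.0/main.go": b"package main\n",
}
print(h1_hash(tree))
```

Removing or changing any file changes the result, which is why a pruned vendor tree cannot be matched against upstream's go.sum.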
Re: [gentoo-dev] Proposal to undeprecate EGO_SUM
Hey,

On Friday, 30 September 2022 02:36:05 CEST William Hubbs wrote:
> I don't know for certain about a vendor tarball, but I do know there
> are instances where a vendor tarball wouldn't work.
> app-containers/containerd is a good example of this. That is why the
> vendor tarball idea was dropped.
It is indeed not possible to verify vendor tarballs[1]. The proposed
solution Go people had would also require network access.

> Upstream doesn't need to provide a tarball, just an up-to-date
> "vendor" directory at the top level of the project. Two examples that
> do this are docker and kubernetes.
Upstreams doing this sounds like a mess, because then they'd have to
maintain multiple source trees in their repositories, if I understand
what you mean.

An alternative to vendor tarballs is modcache tarballs. These are
absolutely massive (~20 times larger IIRC), though, they are verifiable.

opinion: I see no way around it. Vendor tarballs are the way to go. For
trivial cases, this can likely be EGO_SUM, but it scales exceedingly
poorly, to the point of the trivial case being a very small percentage
of Go packages. I proposed authenticated automation on Gentoo
infrastructure as a solution to this, and implemented (a slow and
unreliable) proof of concept (posted previously). The obvious question
of "how will proxy maintainers deal with this" is also relatively
simple: giving them authorization for a subset of packages that they'd
need to work on. This is an obvious increase in the barrier of entry for
fresh proxy maintainers, but it's still likely less than needing
maintainers to rework ebuilds to use vendor tarballs on dev.g.o.

[1]: https://github.com/golang/go/issues/27348
--
Arsen Arsenović
Re: [gentoo-dev] Proposal to undeprecate EGO_SUM
On Fri, Sep 30, 2022 at 7:53 AM Florian Schmaus wrote:
>
> On 30/09/2022 02.36, William Hubbs wrote:
> > On Wed, Sep 28, 2022 at 06:31:39PM +0200, Ulrich Mueller wrote:
> >>> On Wed, 28 Sep 2022, Florian Schmaus wrote:
> >>> 2.) the number of EGO_SUM entries exceeds 1000 and a Gentoo developer
> >>> maintains the package
> >>> 3.) the number of EGO_SUM entries exceeds 1500 and a proxied
> >>> maintainer maintains the package
> >>
> >> These numbers seem quite large, compared to the mean number of 3.4
> >> distfiles for packages in the Gentoo repository. (The median and the
> >> 99-percentile are 1 and 22, respectively.)
>
> The numbers may appear large when compared to the whole tree, but I
> think a fair comparison would be within the related programming language
> ecosystem, e.g., Golang or Rust.
>
> For example, analyzing ::gentoo yields the following histogram for
> 2022-01-01:
> https://dev.gentoo.org/~flow/ego_sum_entries_histogram-2020-01-01.png
>
> > To stay with your example, restic has a 300k manifest, multiple 30k+
> > ebuilds and 897 distfiles.
> >
> > I'm thinking the limit would have to be much lower. Say, around 256
> > entries in EGO_SUM_SRC_URI.
>
> A limit of 256 appears to be too low to be of any use. It is slightly
> above the 50th percentile; half of the packages could not use it.
>
> We have to realize that programming language ecosystems that only build
> static binaries tend to produce software projects that have a large
> number of dependencies. For example, app-misc/broot, a tool written in
> Rust, currently has 310 entries in its Manifest. Why should we treat
> one programming language differently from another? Will we see voices
> that ask for banning Rust packages in ::gentoo in the future? With the
> rising popularity of Golang and Rust, we will (hopefully) only ever see
> an increase of such packages in ::gentoo. And most existing packages in
> this category will at best keep their dependency count constant, but are
> also likely to accumulate further dependencies over time.
>
> And quite frankly, I don't see a problem with "large" Manifests and/or
> ebuilds. Yes, it means our FTPs are hosting many files, in some cases
> even many small files. And yes, it means that in some cases ebuild
> parsing takes a bit longer. But I spoke with a few developers in the
> past few months and was not presented with any real world issues that
> EGO_SUM caused. If someone wants to fill in here, then now is a good
> time to speak up. But my impression is that the arguments against
> EGO_SUM are mostly of cosmetic nature. Again, please correct me if I am
> wrong.

I thought the problem was that EGO_SUM ends up in SRC_URI, which ends
up in A. A ends up in the environment, and then exec() fails with
E2BIG because there is an imposed limit on environment variables (and
also command line argument length.)

Did this get fixed?

https://bugs.gentoo.org/719202

> - Flow
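The E2BIG failure described above can be estimated with a little arithmetic. On Linux, each single argument or environment string passed to execve() is capped at 32 pages (MAX_ARG_STRLEN), which is 128 KiB with 4 KiB pages, in addition to the overall limit on arguments plus environment. The sketch below checks whether a space-joined distfile list would still fit into one exported variable named A. The 4 KiB page size, the function names and the synthetic file names are assumptions for illustration; whether bug 719202 hits exactly this per-string limit is not stated in the thread.

```python
from typing import Sequence

PAGE_SIZE = 4096                 # assumption: 4 KiB pages
MAX_ARG_STRLEN = 32 * PAGE_SIZE  # Linux cap on a single argv/envp string


def env_string_size(name: str, value: str) -> int:
    """Bytes occupied by one NAME=value entry, including the trailing NUL."""
    return len(f"{name}={value}".encode()) + 1


def fits_in_one_env_string(distfiles: Sequence[str],
                           limit: int = MAX_ARG_STRLEN) -> bool:
    """True if the space-joined distfile list could be exported as A."""
    return limit >= env_string_size("A", " ".join(distfiles))


# Synthetic 99-character names standing in for escaped Go module distfiles.
names = ["x" * 99] * 2000
print(env_string_size("A", " ".join(names)), fits_in_one_env_string(names))
```

With names of roughly 100 bytes, the per-string cap is reached somewhere above 1300 distfiles, which is the same order of magnitude as the limits debated in this thread.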
Re: [gentoo-dev] Proposal to undeprecate EGO_SUM
> On 30 Sep 2022, at 15:53, Florian Schmaus wrote:
>
> On 30/09/2022 02.36, William Hubbs wrote:
>> On Wed, Sep 28, 2022 at 06:31:39PM +0200, Ulrich Mueller wrote:
>>>> On Wed, 28 Sep 2022, Florian Schmaus wrote:
>>>> 2.) the number of EGO_SUM entries exceeds 1000 and a Gentoo developer
>>>> maintains the package
>>>> 3.) the number of EGO_SUM entries exceeds 1500 and a proxied
>>>> maintainer maintains the package
>>>
>>> These numbers seem quite large, compared to the mean number of 3.4
>>> distfiles for packages in the Gentoo repository. (The median and the
>>> 99-percentile are 1 and 22, respectively.)
>
> The numbers may appear large when compared to the whole tree, but I think a
> fair comparison would be within the related programming language ecosystem,
> e.g., Golang or Rust.
>
> For example, analyzing ::gentoo yields the following histogram for 2022-01-01:
> https://dev.gentoo.org/~flow/ego_sum_entries_histogram-2020-01-01.png
>
>> To stay with your example, restic has a 300k manifest, multiple 30k+
>> ebuilds and 897 distfiles.
>> I'm thinking the limit would have to be much lower. Say, around 256
>> entries in EGO_SUM_SRC_URI.
>
> A limit of 256 appears to be too low to be of any use. It is slightly above
> the 50th percentile; half of the packages could not use it.
>
> We have to realize that programming language ecosystems that only build
> static binaries tend to produce software projects that have a large number of
> dependencies. For example, app-misc/broot, a tool written in Rust, currently
> has 310 entries in its Manifest. Why should we treat one programming
> language differently from another? Will we see voices that ask for banning
> Rust packages in ::gentoo in the future? With the rising popularity of Golang
> and Rust, we will (hopefully) only ever see an increase of such packages in
> ::gentoo. And most existing packages in this category will at best keep their
> dependency count constant, but are also likely to accumulate further
> dependencies over time.
>
> And quite frankly, I don't see a problem with "large" Manifests and/or
> ebuilds. Yes, it means our FTPs are hosting many files, in some cases even
> many small files. And yes, it means that in some cases ebuild parsing takes a
> bit longer. But I spoke with a few developers in the past few months and was
> not presented with any real world issues that EGO_SUM caused. If someone
> wants to fill in here, then now is a good time to speak up. But my impression
> is that the arguments against EGO_SUM are mostly of cosmetic nature. Again,
> please correct me if I am wrong.
>

I need to re-read the whole set of new messages in this thread, but
there's still the issue of xargs/command length limits from huge
variable contents.

Best,
sam
Re: [gentoo-dev] Proposal to undeprecate EGO_SUM
On Wed, 2022-09-28 at 17:28 +0200, Florian Schmaus wrote:
> I would like to continue discussing whether we should entirely deprecate
> EGO_SUM without the desire to offend anyone.
>
> We now have a pending GitHub PR that bumps restic to 0.14 [1]. Restic is
> a very popular backup software written in Go. The PR drops EGO_SUM in
> favor of a vendor tarball created by the proxied maintainer. However, I
> am unaware of any tool that lets you practically audit the 35 MiB source
> contained in the tarball. And even if such a tool exists, this would
> mean another manual step is required, which is, potentially, skipped
> most of the time, weakening our users' security. This is because I
> believe neither our tooling, e.g., go-mod.eclass, nor any Golang
> tooling, does authenticate the contents of the vendor tarball against
> upstream's go.sum. But please correct me if I am wrong.
>
> I wonder if we can reach consensus around un-deprecating EGO_SUM, but
> discouraging its usage in certain situations. That is, provide EGO_SUM
> as option but disallow its use if
> 1.) *upstream* provides a vendor tarball
> 2.) the number of EGO_SUM entries exceeds 1000 and a Gentoo developer
> maintains the package
> 3.) the number of EGO_SUM entries exceeds 1500 and a proxied maintainer
> maintains the package
>
> In case of 3, I would encourage proxy maintainers to create and provide
> the vendor tarball.
>
> The suggested EGO_SUM limits result from a histogram that I created
> analyzing ::gentoo at 2022-01-01, i.e., a few months before EGO_SUM was
> deprecated.

I think those numbers are too large, but overall I think bringing back
EGO_SUM in limited form is a good move, because it allows packaging Go
ebuilds in an easy and auditable way. A vendor tarball is completely
opaque until you unpack it. With EGO_SUM you could parse the ebuilds
that use it and scan for vulnerable Go modules. And of course, hosting
vendored sources is a problem.

From the Rust team's perspective (we use CRATES, which is EGO_SUM's
inspiration, but a _much_ more compact one), I'd say take the largest
Rust ebuild and allow as much as that, or slightly more.

x11-terms/alacritty is one of the largest, and its CRATES list is about
210 lines per ebuild. So I'd say set the maximum EGO_SUM size to 256 for
::gentoo, or maybe 512, remove the limit for overlays completely, and
introduce a hard die() in the eclass if EGO_SUM is larger than that. I'm
not sure whether you can detect the repo name in an eclass. If you
can't, pkgcheck and CI could enforce the limit as fat warnings or
errors.

A 256/512 limit will not limit the Manifest directly, but 5 versions
with at most 256/512 EGO_SUM lines each are more reasonable than 5
versions with at most 1500 EGO_SUM lines each. A Rust/cargo ebuild will
still produce a more compact Manifest for the same number of lines,
though, so it's still not directly comparable.

Currently we have 3 versions of alacritty, which use 407 unique crates
across those 3 versions. The Manifest size is about 120K, which is the
20th largest in ::gentoo. It's nothing compared to the 2.5MB Manifests
we used to have in some of the largest Go packages.

> - Flow
>
> 1: https://github.com/gentoo/gentoo/pull/27050
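The two ideas in this message, parsing EGO_SUM out of ebuilds and enforcing a size limit in CI, could be sketched roughly as follows. This is not how pkgcheck or the eclass work; it is a minimal illustration that assumes EGO_SUM is written as a bash array of quoted "module version" and "module version/go.mod" strings. The function names and the sample ebuild are hypothetical, and the default limit of 256 is the figure suggested above.

```python
import re
from typing import List, Set

# Assumption: the array body contains no ")" other than the closing one.
EGO_SUM_RE = re.compile(r"EGO_SUM=\((.*?)\)", re.DOTALL)


def ego_sum_entries(ebuild_text: str) -> List[str]:
    """Return the quoted entries of the first EGO_SUM array in an ebuild."""
    match = EGO_SUM_RE.search(ebuild_text)
    if match is None:
        return []
    return re.findall(r'"([^"]+)"', match.group(1))


def ego_sum_modules(entries: List[str]) -> Set[str]:
    """Unique module paths; each entry starts with the module path."""
    return {entry.split()[0] for entry in entries}


def within_limit(entries: List[str], limit: int = 256) -> bool:
    """True if the entry count does not exceed the proposed limit."""
    return limit >= len(entries)


sample = '''
EGO_SUM=(
	"github.com/pkg/errors v0.9.1"
	"github.com/pkg/errors v0.9.1/go.mod"
)
'''
entries = ego_sum_entries(sample)
print(len(entries), sorted(ego_sum_modules(entries)), within_limit(entries))
```

The module set is what a vulnerability scan would match against advisories; the entry count is what a CI check would compare against the limit.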
Re: [gentoo-dev] Proposal to undeprecate EGO_SUM
On Fri, Sep 30, 2022 at 04:53:39PM +0200, Florian Schmaus wrote:
> On 30/09/2022 02.36, William Hubbs wrote:
> > On Wed, Sep 28, 2022 at 06:31:39PM +0200, Ulrich Mueller wrote:
> >>> On Wed, 28 Sep 2022, Florian Schmaus wrote:
> >>> 2.) the number of EGO_SUM entries exceeds 1000 and a Gentoo developer
> >>> maintains the package
> >>> 3.) the number of EGO_SUM entries exceeds 1500 and a proxied
> >>> maintainer maintains the package
> >>
> >> These numbers seem quite large, compared to the mean number of 3.4
> >> distfiles for packages in the Gentoo repository. (The median and the
> >> 99-percentile are 1 and 22, respectively.)
>
> The numbers may appear large when compared to the whole tree, but I
> think a fair comparison would be within the related programming language
> ecosystem, e.g., Golang or Rust.
>
> For example, analyzing ::gentoo yields the following histogram for
> 2022-01-01:
> https://dev.gentoo.org/~flow/ego_sum_entries_histogram-2020-01-01.png
>
> > To stay with your example, restic has a 300k manifest, multiple 30k+
> > ebuilds and 897 distfiles.
> >
> > I'm thinking the limit would have to be much lower. Say, around 256
> > entries in EGO_SUM_SRC_URI.
>
> A limit of 256 appears to be too low to be of any use. It is slightly
> above the 50th percentile; half of the packages could not use it.
>
> We have to realize that programming language ecosystems that only build
> static binaries tend to produce software projects that have a large
> number of dependencies. For example, app-misc/broot, a tool written in
> Rust, currently has 310 entries in its Manifest. Why should we treat
> one programming language differently from another? Will we see voices
> that ask for banning Rust packages in ::gentoo in the future? With the
> rising popularity of Golang and Rust, we will (hopefully) only ever see
> an increase of such packages in ::gentoo. And most existing packages in
> this category will at best keep their dependency count constant, but are
> also likely to accumulate further dependencies over time.

I tend to agree with you, honestly. I worked with Zac to come up with a
different proposal which would allow upstream tooling for all languages
that do this to work, but so far it is meeting resistance [1].

I will go back and add more information to that bug, but it will be
later today before I can do that. I want to develop a PoC to answer the
statement that these would be live ebuilds if we allowed that.

> And quite frankly, I don't see a problem with "large" Manifests and/or
> ebuilds. Yes, it means our FTPs are hosting many files, in some cases
> even many small files. And yes, it means that in some cases ebuild
> parsing takes a bit longer. But I spoke with a few developers in the
> past few months and was not presented with any real world issues that
> EGO_SUM caused. If someone wants to fill in here, then now is a good
> time to speak up. But my impression is that the arguments against
> EGO_SUM are mostly of cosmetic nature. Again, please correct me if I am
> wrong.

I can't name any specific examples at the moment, but I have gotten some
complaints about how long it takes to download and build Go packages
with hundreds of dependencies. Other than that, I'm not the one who
voiced the problem originally, so we definitely need others to speak up.

William

[1] https://bugs.gentoo.org/833567
Re: [gentoo-dev] Proposal to undeprecate EGO_SUM
Hi,

When the size of the repo is considered too big, maybe we can revisit
the option of having the portage tree distributed as a compressed
squashfs image.

$ du -hs /var/db/repos/gentoo
536M.
$ gensquashfs -k -q -b 1M -D /var/db/repos/gentoo -c zstd -X level=22 /tmp/gentoo-current.zstd.sqfs
$ du -h /tmp/gentoo-current.zstd.sqfs
47M /tmp/gentoo-current.zstd.sqfs

Though that would probably open another can of worms around incremental
updates to the portage tree, or more precisely the lack of them (i.e.
increased bandwidth requirements).

Regardless, as a proxied maintainer I agree with Flow's point of view
here (I think I have expressed this in detail here in the past, too) and
would prefer undeprecating EGO_SUM.

Zoltan

On Fri, Sep 30, 2022 at 05:10:10PM +0200, Jaco Kroon wrote:
> Hi,
>
> On 2022/09/30 16:53, Florian Schmaus wrote:
> > jkroon@plastiekpoot ~ $ du -sh /var/db/repos/gentoo/
> >> 644M /var/db/repos/gentoo/
> >>
> >> I'm not against exploding this by another 200 or even 300 MB personally,
> >> but I do agree that pointless bloat is bad, and ideally we want to
> >> shrink the size requirements of the portage tree rather than enlarge.
> >
> > What is the problem if it is 400 MB more? What if we double the
> > size? Would something break for you? Does that mean we should not add
> > more packages to ::gentoo? Where do you draw the line? Would you
> > rather have interested persons contribute to Gentoo or drive them away
> > due to the struggle that the EGO_SUM deprecation causes?
> How long is a piece of string?
>
> I agree with you entirely. But if the tree gets to 10GB?
>
> At some point it may be worthwhile to split the tree similar to what
> Debian does (or did, haven't checked in a while) where there is a core,
> non-core repo etc ... except I suspect it may be better to split into
> classes of packages, e.g., x11 (aka desktop) style packages etc, and keep
> ::gentoo primarily to system stuff (which is also getting harder and
> harder to define). And this also makes it harder for maintainers. And
> this is really already what separate overlays do, except they don't (as
> far as I know) have the rigorous QA that ::gentoo has.
>
> But again - at what point do you do this - and this also adds extra
> burden on maintainers and developers alike.
>
> And of course I could set a filter to not even --sync say /x11-* at
> all. For example. Or /dev-go or /dev-php etc ...
>
> So perhaps you're right, this is a moot discussion. Perhaps we should
> just say let's solve the problem when (if?) people complain the tree is
> too big. No, I'm not being sarcastic, just blunt (;
>
> The majority of Gentoo users (in my experience) are probably of the
> developer-oriented mindset either way, or have very specific itches that
> need scratching that's hard to scratch with other distributions. Let's
> face it, Gentoo to begin with should probably not be considered an
> "easy" distribution. But it is a highly flexible, pro-choice, extremely
> customizable, rolling release distribution. Which scratches my itch.
>
> Incidentally, the only categories currently to individually exceed 10MB
> are these:
>
> 11M media-libs
> 11M net-misc
> 12M dev-util
> 13M dev-ruby
> 16M dev-libs
> 30M dev-perl
> 31M dev-python
>
> And by far the biggest consumer of space:
>
> 124M metadata
>
> Kind Regards,
> Jaco
Re: [gentoo-dev] Proposal to undeprecate EGO_SUM
Hi,

On 2022/09/30 16:53, Florian Schmaus wrote:
> jkroon@plastiekpoot ~ $ du -sh /var/db/repos/gentoo/
>> 644M /var/db/repos/gentoo/
>>
>> I'm not against exploding this by another 200 or even 300 MB personally,
>> but I do agree that pointless bloat is bad, and ideally we want to
>> shrink the size requirements of the portage tree rather than enlarge.
>
> What is the problem if it is 400 MB more? What if we double the
> size? Would something break for you? Does that mean we should not add
> more packages to ::gentoo? Where do you draw the line? Would you
> rather have interested persons contribute to Gentoo or drive them away
> due to the struggle that the EGO_SUM deprecation causes?
How long is a piece of string?

I agree with you entirely. But if the tree gets to 10GB?

At some point it may be worthwhile to split the tree similar to what
Debian does (or did, haven't checked in a while) where there is a core,
non-core repo etc ... except I suspect it may be better to split into
classes of packages, e.g., x11 (aka desktop) style packages etc, and keep
::gentoo primarily to system stuff (which is also getting harder and
harder to define). And this also makes it harder for maintainers. And
this is really already what separate overlays do, except they don't (as
far as I know) have the rigorous QA that ::gentoo has.

But again - at what point do you do this - and this also adds extra
burden on maintainers and developers alike.

And of course I could set a filter to not even --sync say /x11-* at
all. For example. Or /dev-go or /dev-php etc ...

So perhaps you're right, this is a moot discussion. Perhaps we should
just say let's solve the problem when (if?) people complain the tree is
too big. No, I'm not being sarcastic, just blunt (;

The majority of Gentoo users (in my experience) are probably of the
developer-oriented mindset either way, or have very specific itches that
need scratching that's hard to scratch with other distributions. Let's
face it, Gentoo to begin with should probably not be considered an
"easy" distribution. But it is a highly flexible, pro-choice, extremely
customizable, rolling release distribution. Which scratches my itch.

Incidentally, the only categories currently to individually exceed 10MB
are these:

11M media-libs
11M net-misc
12M dev-util
13M dev-ruby
16M dev-libs
30M dev-perl
31M dev-python

And by far the biggest consumer of space:

124M metadata

Kind Regards,
Jaco
Re: [gentoo-dev] Proposal to undeprecate EGO_SUM
On 30/09/2022 16.36, Jaco Kroon wrote:
> Hi All,
>
> This doesn't directly affect me. Nor am I familiar with the mechanisms.
>
> Perhaps it's worthwhile to suggest that EGO_SUM itself may be
> externalized. I don't know what goes in here, and this will likely
> require help from portage itself, so may not be directly viable.
>
> What if portage had a feature whereby a SRC_URI list could be downloaded
> as a SRC_URI itself? In other words:
>
> SRC_URI_INDIRECT="https://wherever/lists_for_some_go_package.txt"

That idea pops up every time this is discussed. I don't see something
like that being implemented in portage anytime soon (please correct me
if I'm wrong), and it means that the ebuild development workflow would
require some adjustments to stay as convenient as it currently is
(though nothing that couldn't be abstracted away by good tooling, i.e.,
pkgdev).

> jkroon@plastiekpoot ~ $ du -sh /var/db/repos/gentoo/
> 644M /var/db/repos/gentoo/
>
> I'm not against exploding this by another 200 or even 300 MB personally,
> but I do agree that pointless bloat is bad, and ideally we want to
> shrink the size requirements of the portage tree rather than enlarge.

What is the problem if it is 400 MB more? What if we double the size?
Would something break for you? Does that mean we should not add more
packages to ::gentoo? Where do you draw the line? Would you rather have
interested persons contribute to Gentoo or drive them away due to the
struggle that the EGO_SUM deprecation causes?

- Flow
Re: [gentoo-dev] Proposal to undeprecate EGO_SUM
On 30/09/2022 02.36, William Hubbs wrote:
> On Wed, Sep 28, 2022 at 06:31:39PM +0200, Ulrich Mueller wrote:
>>> On Wed, 28 Sep 2022, Florian Schmaus wrote:
>>> 2.) the number of EGO_SUM entries exceeds 1000 and a Gentoo developer
>>> maintains the package
>>> 3.) the number of EGO_SUM entries exceeds 1500 and a proxied
>>> maintainer maintains the package
>>
>> These numbers seem quite large, compared to the mean number of 3.4
>> distfiles for packages in the Gentoo repository. (The median and the
>> 99-percentile are 1 and 22, respectively.)

The numbers may appear large when compared to the whole tree, but I
think a fair comparison would be within the related programming language
ecosystem, e.g., Golang or Rust.

For example, analyzing ::gentoo yields the following histogram for
2022-01-01:
https://dev.gentoo.org/~flow/ego_sum_entries_histogram-2020-01-01.png

> To stay with your example, restic has a 300k manifest, multiple 30k+
> ebuilds and 897 distfiles.
>
> I'm thinking the limit would have to be much lower. Say, around 256
> entries in EGO_SUM_SRC_URI.

A limit of 256 appears to be too low to be of any use. It is slightly
above the 50th percentile; half of the packages could not use it.

We have to realize that programming language ecosystems that only build
static binaries tend to produce software projects that have a large
number of dependencies. For example, app-misc/broot, a tool written in
Rust, currently has 310 entries in its Manifest. Why should we treat
one programming language differently from another? Will we see voices
that ask for banning Rust packages in ::gentoo in the future? With the
rising popularity of Golang and Rust, we will (hopefully) only ever see
an increase of such packages in ::gentoo. And most existing packages in
this category will at best keep their dependency count constant, but are
also likely to accumulate further dependencies over time.

And quite frankly, I don't see a problem with "large" Manifests and/or
ebuilds. Yes, it means our FTPs are hosting many files, in some cases
even many small files. And yes, it means that in some cases ebuild
parsing takes a bit longer. But I spoke with a few developers in the
past few months and was not presented with any real world issues that
EGO_SUM caused. If someone wants to fill in here, then now is a good
time to speak up. But my impression is that the arguments against
EGO_SUM are mostly of cosmetic nature. Again, please correct me if I am
wrong.

- Flow
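For reference while comparing the limits discussed in this thread, the three rules from the opening proposal (no EGO_SUM if upstream ships a vendor tarball, at most 1000 entries for developer-maintained packages, at most 1500 for proxied maintainers) can be written down as one small predicate. This is only a restatement of the proposal as a sketch; the function name and parameters are hypothetical, and the thresholds are parameters precisely because they are what is being debated.

```python
def ego_sum_allowed(entries: int,
                    proxied_maintainer: bool,
                    upstream_vendor_tarball: bool,
                    dev_limit: int = 1000,
                    proxied_limit: int = 1500) -> bool:
    """Apply the three proposed rules to a single package."""
    if upstream_vendor_tarball:
        # Rule 1: prefer upstream's own vendor tarball when one exists.
        return False
    # Rules 2 and 3: the cap depends on who maintains the package.
    limit = proxied_limit if proxied_maintainer else dev_limit
    return limit >= entries


# 897 is the distfile count quoted for restic, used here as the entry count.
print(ego_sum_allowed(897, proxied_maintainer=True, upstream_vendor_tarball=False))
print(ego_sum_allowed(897, proxied_maintainer=True, upstream_vendor_tarball=False,
                      dev_limit=256, proxied_limit=256))
```

Passing 256 for both limits shows how the counter-proposal changes the outcome for the same package.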
Re: [gentoo-dev] Proposal to undeprecate EGO_SUM
Hi All,

This doesn't directly affect me. Nor am I familiar with the mechanisms.

Perhaps it's worthwhile to suggest that EGO_SUM itself may be
externalized. I don't know what goes in here, and this will likely
require help from portage itself, so may not be directly viable.

What if portage had a feature whereby a SRC_URI list could be downloaded
as a SRC_URI itself? In other words:

SRC_URI_INDIRECT="https://wherever/lists_for_some_go_package.txt"

Where that file itself contains lines for entries that would normally go
into SRC_URI (directly or indirectly via EGO_SUM from what I can
deduce). Something like:

https://www.upstream.com/downloads/package-version.tar.gz => fneh.tar.gz|manifest portion goes here

Where manifest portion would assume DIST and fneh.tar.gz, so would start
with the filesize in bytes, followed by checksum value pairs as per
current Manifest files.

Since users may want to know how big the downloads for a specific ebuild
are, some process to generate these external manifests may be in order,
and to subsequently store the size of these indirect downloads
themselves in the local manifest, so in the local Manifest, something
like:

IDIST lists_for_some_go_package.txt direct_size indirect_size CHECKSUM value CHECKSUM value.

I realise this idea isn't immediately feasible, and perhaps not at all;
it is presented here since perhaps it could spark an idea for someone
else.

It sounds like this is the problem that the vendor tarball tries to
solve, but that introduces a trust issue. I'm not sure this exactly goes
away, but at a minimum we're now verifying download locations again (as
per EGO_SUM or just SRC_URI in general) rather than code tarballs
containing many, many times more code than download locations.
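To illustrate what consuming such an indirect list might involve, here is a minimal sketch of a parser for one line of the format outlined above: a URL, " => ", the local file name, "|", then the file size followed by checksum name/value pairs. Nothing like this exists in portage; the class and function names are hypothetical, and the exact separators are an assumption based on the single example line in this message.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class IndirectEntry:
    url: str
    filename: str
    size: int
    checksums: Dict[str, str]


def parse_indirect_line(line: str) -> IndirectEntry:
    """Parse 'URL => FILENAME|SIZE HASH value HASH value ...'."""
    source, bar, manifest = line.strip().partition("|")
    if not bar:
        raise ValueError("missing '|' before the manifest portion")
    url, arrow, filename = source.partition(" => ")
    if not arrow:
        raise ValueError("missing ' => ' between URL and file name")
    fields = manifest.split()
    # One size field plus an even number of checksum name/value fields.
    if len(fields) % 2 != 1:
        raise ValueError("expected a size followed by name/value pairs")
    checksums = dict(zip(fields[1::2], fields[2::2]))
    return IndirectEntry(url.strip(), filename.strip(), int(fields[0]), checksums)


# Placeholder size and checksum values, for illustration only.
entry = parse_indirect_line(
    "https://www.upstream.com/downloads/package-version.tar.gz"
    " => fneh.tar.gz|1024 BLAKE2B abc SHA512 def"
)
print(entry.filename, entry.size, sorted(entry.checksums))
```

Summing the size fields of all parsed entries would give the indirect_size value mentioned for the IDIST line.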
Given:

jkroon@plastiekpoot ~ $ du -sh /var/db/repos/gentoo/
644M    /var/db/repos/gentoo/

I'm not against exploding this by another 200 or even 300 MB personally, but I do agree that pointless bloat is bad, and ideally we want to shrink the size requirements of the portage tree rather than enlarge them.

Kind Regards,
Jaco

On 2022/09/30 15:57, Florian Schmaus wrote:
> On 28/09/2022 23.23, John Helmert III wrote:
>> On Wed, Sep 28, 2022 at 05:28:00PM +0200, Florian Schmaus wrote:
>>> I would like to continue discussing whether we should entirely
>>> deprecate EGO_SUM without the desire to offend anyone.
>>>
>>> We now have a pending GitHub PR that bumps restic to 0.14 [1].
>>> Restic is a very popular backup software written in Go. The PR
>>> drops EGO_SUM in favor of a vendor tarball created by the proxied
>>> maintainer. However, I am unaware of any tool that lets you
>>> practically audit the 35 MiB source contained in the tarball. And
>>> even if such a tool exists, this would mean another manual step is
>>> required, which is, potentially, skipped most of the time,
>>> weakening our users' security. This is because I believe neither
>>> our tooling, e.g., go-mod.eclass, nor any Golang tooling
>>> authenticates the contents of the vendor tarball against
>>> upstream's go.sum. But please correct me if I am wrong.
>>>
>>> I wonder if we can reach consensus around un-deprecating EGO_SUM,
>>> but discouraging its usage in certain situations. That is, provide
>>> EGO_SUM as an option but disallow its use if
>>> 1.) *upstream* provides a vendor tarball
>>> 2.) the number of EGO_SUM entries exceeds 1000 and a Gentoo
>>>     developer maintains the package
>>> 3.) the number of EGO_SUM entries exceeds 1500 and a proxied
>>>     maintainer maintains the package
>>
>> I'm not sure I agree on these limits, given the authenticity problem
>> exists regardless of how many dependencies there are.
>
> It's not really about authentication, you always have to trust
> upstream to some degree (unless you audit every line of code). But I
> believe that code distributed via official channels is viewed by more
> eyes and is significantly more secure.
>
> EGO_SUM entries are directly fetched from the official distribution
> channels of Golang. Hence, there is a higher chance that malicious
> code in one of those is detected faster, simply because they are
> consumed by more entities. Compare that to the dependency tarball,
> which is used only by Gentoo. In contrast to the official sources,
> "nobody" is looking at the code inside the tarball.
>
> For proxied packages, where the dependency tarball is published by the
> proxied maintainer, the tarball also allows another entity to inject
> code into the final result of the package. And compared to a few small
> patches in FILESDIR, such a dependency tarball requires more effort to
> review. This further weakens security in comparison to EGO_SUM.
>
> - Flow
Re: [gentoo-dev] Proposal to undeprecate EGO_SUM
On 28/09/2022 23.23, John Helmert III wrote:
> On Wed, Sep 28, 2022 at 05:28:00PM +0200, Florian Schmaus wrote:
>> I would like to continue discussing whether we should entirely
>> deprecate EGO_SUM without the desire to offend anyone.
>>
>> We now have a pending GitHub PR that bumps restic to 0.14 [1].
>> Restic is a very popular backup software written in Go. The PR
>> drops EGO_SUM in favor of a vendor tarball created by the proxied
>> maintainer. However, I am unaware of any tool that lets you
>> practically audit the 35 MiB source contained in the tarball. And
>> even if such a tool exists, this would mean another manual step is
>> required, which is, potentially, skipped most of the time,
>> weakening our users' security. This is because I believe neither
>> our tooling, e.g., go-mod.eclass, nor any Golang tooling
>> authenticates the contents of the vendor tarball against
>> upstream's go.sum. But please correct me if I am wrong.
>>
>> I wonder if we can reach consensus around un-deprecating EGO_SUM,
>> but discouraging its usage in certain situations. That is, provide
>> EGO_SUM as an option but disallow its use if
>> 1.) *upstream* provides a vendor tarball
>> 2.) the number of EGO_SUM entries exceeds 1000 and a Gentoo
>>     developer maintains the package
>> 3.) the number of EGO_SUM entries exceeds 1500 and a proxied
>>     maintainer maintains the package
>
> I'm not sure I agree on these limits, given the authenticity problem
> exists regardless of how many dependencies there are.

It's not really about authentication, you always have to trust upstream to some degree (unless you audit every line of code). But I believe that code distributed via official channels is viewed by more eyes and is significantly more secure.

EGO_SUM entries are directly fetched from the official distribution channels of Golang. Hence, there is a higher chance that malicious code in one of those is detected faster, simply because they are consumed by more entities. Compare that to the dependency tarball, which is used only by Gentoo. In contrast to the official sources, "nobody" is looking at the code inside the tarball.

For proxied packages, where the dependency tarball is published by the proxied maintainer, the tarball also allows another entity to inject code into the final result of the package. And compared to a few small patches in FILESDIR, such a dependency tarball requires more effort to review. This further weakens security in comparison to EGO_SUM.

- Flow
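
The three conditions from the proposal quoted above can be written down as a small predicate, which also makes the boundary cases explicit ("exceeds 1000" means that exactly 1000 entries would still be allowed). This is a minimal Python sketch under that reading; the function and constant names are hypothetical and are not taken from go-mod.eclass or any existing QA check.

```python
# Thresholds as stated in the proposal; the names are hypothetical.
DEVELOPER_LIMIT = 1000  # package maintained by a Gentoo developer
PROXIED_LIMIT = 1500    # package maintained by a proxied maintainer


def ego_sum_allowed(num_entries: int,
                    proxied_maintainer: bool,
                    upstream_vendor_tarball: bool) -> bool:
    """Return True if EGO_SUM may be used under the proposed rules."""
    # Rule 1: if upstream itself ships a vendor tarball, use that instead.
    if upstream_vendor_tarball:
        return False
    # Rules 2 and 3: the limit depends on who maintains the package.
    limit = PROXIED_LIMIT if proxied_maintainer else DEVELOPER_LIMIT
    # "Exceeds" is read as strictly greater than the limit.
    return num_entries <= limit
```

Under this reading, the much lower limit of 256 entries suggested elsewhere in the thread would only require changing the two constants.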