Re: git repositories vs. tarballs
Bruno Haible writes:

> Hi Simon,
>
> In the other thread [1][2][2a], but see also [3] and [4], you are asking

Hi Bruno -- thanks for attempting to bring some order to this complicated
matter! I agree with most of what you say, though I have some comments
below.

>> Has this changed, so we should recommend maintainers
>> to 'EXTRA_DIST = bootstrap bootstrap-funclib.sh bootstrap.conf' so this
>> is even possible?
>
> 1) I think changing the contents of a tarball ad-hoc, like this, will not
> lead to satisfying results, because too many packages will do things
> differently.

Right, and people tend to jump to the incorrect conclusion that running
'autoreconf -fvi' or running './bootstrap' from a tarball is a good idea.
Rather than trying to fix that approach, I think we should guide these
people towards using 'git-archive' style tarballs instead. Then they will
need to do all the work that is actually required to bootstrap a project,
including getting all the dependencies in place. Some will succeed in
that. Some will give up and realize they wanted the traditional curated
tarball after all, and go back to it, this time hopefully without the
harmful 'autoreconf -fi' dance.

In both situations, I think we are better off than with the current
situation, where people take the 'make dist' tarballs, try to reverse
engineer all the dependencies required to regenerate the generated
artifacts, and do a half-baked job of it, with an end result that is even
harder to audit than what we started with.

> (Y) Some distros want to be able to verify the tarballs.[9] (I don't agree
> with this. If you can't trust the release manager who produced the
> tarballs (C), you cannot trust (A) either. If there is a mechanism
> for verifying (C) from (A), criminals will commit their malware
> entirely into (A).)

I have another perspective here. I don't think people necessarily want to
blindly trust either the git repository source code (A) or the tarball
with generated code and source code (C).
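As a reference point, a 'git-archive' style tarball of the kind mentioned
above contains exactly what is tagged in git and nothing generated. A
minimal sketch, using a throwaway repository (all names here -- 'demo',
'v1.0' -- are made up for illustration):

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q demo
cd demo
git config user.email demo@example.org
git config user.name "Demo User"
echo 'int main(void) { return 0; }' > main.c
git add main.c
git -c commit.gpgsign=false commit -qm 'initial'
git tag v1.0
# The archive reflects the tagged tree only: no configure, no Makefile.in,
# no gnulib sources -- bootstrapping is the downloader's job.
git archive --format=tar.gz --prefix=demo-1.0/ -o ../demo-1.0-src.tar.gz v1.0
tar tzf ../demo-1.0-src.tar.gz
```

This is the same operation that produces the "snapshot" / "Download ZIP"
artifacts on the forges, so the result is bit-for-bit auditable against
the repository.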
So people will want the ability to audit and verify everything. Once
people start to work on auditing, they realize that there is no way
around auditing (A). You need to audit the XZ Utils source code to gain
trust in XZ Utils. So people work on doing that. Then someone realizes
that people aren't actually using the git source code (A) to build the
XZ Utils binaries -- they are using (A) plus generated content, that is,
the full tarball (C). However, auditing (C) is just a waste of human time
if there is a way to avoid using (C) completely and have people use (A)
directly. This isn't all that complicated; I just did it for Libntlm and
will try to do the same for other packages.

I think you are right that if we succeed with this, criminals will put
their malware directly into git source code repositories. However, that
is addressed by the people working on reviewing the core code of
projects. There is no longer any need for people to spend time auditing
tarballs with a lot of other stuff in them; this time can be redirected
towards auditing the code, which over the years saves a lot of human
cycles. Most code audits I've seen focus on what's in git, not on what's
in the tarball nor in the binary packages that people use. Which is how
it should be -- the build environment is better audited on its own rather
than as part of the upstream code audit.

> 6) How could (X) be implemented?
>
> The main differences between (A) and (C) are [10]:
>  - Tarballs contain source code from other packages.
>  - Tarballs contain generated files.
>  - Tarballs contain localizations.
>
> I could imagine an intermediate step between (A) and (C):
>
>  (B) is for users with many packages installed and for distros, to apply
>      modifications (even to the set of gnulib modules) and then build
>      binaries of the package for one or more architectures, without
>      needing to fetch anything (other than build prerequisites) from the
>      network.
> This is a different stage than (A), because most developers don't want
> to commit source code from other packages into (A) — due to size — nor
> to commit generated files into (A) — due to hassles with branches.
>
> Going from (A) to (B) means pulling additional sources from the network.
> It could be implemented
>  - by "git submodule update --init", or
>  - by 'npm' for JavaScript packages, or
>  - by 'cargo' for Rust packages [11]
> and, for the localizations:
>  - essentially by a 'wget' command that fetches the *.po files.
>
> The proposed name of a script that does this is 'autopull.sh'.
> But I am equally open to a declarative YAML file instead of a shell
> script.

Another point of view is to give up on forcing the autopull part on users
-- instead we can mention the required dependencies in README and let the
user/packager worry about having them available. At least as an option.
The reason
git repositories vs. tarballs
Hi Simon,

In the other thread [1][2][2a], but see also [3] and [4], you are asking

> Has this changed, so we should recommend maintainers
> to 'EXTRA_DIST = bootstrap bootstrap-funclib.sh bootstrap.conf' so this
> is even possible?

1) I think changing the contents of a tarball ad-hoc, like this, will not
lead to satisfying results, because too many packages will do things
differently. Instead, we should ask the question "for which purposes is
the artifact going to be used?" or "which operations are supported on the
artifact?". Once there is agreement on this question, the contents of the
artifact will necessarily follow.

2) When considering

  (A) git repositories (or tar.gz files containing their contents, e.g.
      the "snapshot" on https://git.savannah.gnu.org/gitweb/?p=PACKAGE.git
      or the "Download ZIP" on https://github.com/TEAM/PACKAGE),
  (C) a tarball as published on ftp.gnu.org,

it is also useful to consider

  (E) a binary package .tar.gz / .rpm / .deb

because there is already a lot of experience for doing "reproducible
builds" from (C) to (E) [5][6].

3) So, what are the purposes of (A), (C), (E)? So far, it has been

  (A) is for users with developer skills, the preferred way to work with
      the source code, including branching and merging of branches.
  (C) is for users and distros, to apply relatively small modifications
      and then build binaries of the package for one or more
      architectures, without needing to fetch anything (other than build
      prerequisites) from the network.
  (E) is for users, to install the package on a specific machine, without
      needing development tools.

4) What do the reproducible builds from (C) to (E) mean? The purpose of
(E) changes to

  (E+) Like (E), plus: A user _with_ development tools can determine
       whether (E) was built with a published build recipe, without
       tampering.

Note that this requires
  - formalizing the notion of a build environment [7],
  - adding this build environment into (E) (not yet complete for Debian
    [8]).
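[Editor's note: the essence of the (C) to (E+) check above is that two
independent rebuilds from the same tarball and the same recorded build
environment must yield bit-identical packages. A minimal sketch with
stand-in files; in practice the two inputs would be .deb/.rpm files
produced by two independent rebuilders:]

```shell
#!/bin/sh
set -e
# Stand-ins for two independently rebuilt binary packages.
printf 'pretend binary package contents\n' > rebuild-1.pkg
printf 'pretend binary package contents\n' > rebuild-2.pkg
# The verification step is a bit-for-bit comparison.
if cmp -s rebuild-1.pkg rebuild-2.pkg; then
  echo 'reproducible: rebuilds are bit-identical'
else
  echo 'NOT reproducible: rebuilds differ'
fi
```

The hard part, as noted above, is not the comparison but pinning down the
build environment precisely enough that the rebuilds can match at all.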
5) There are two wishes that are not yet satisfied by (A) and (C):

  (X) Many users without developer skills are turning to the git
      repository and trying to build from there.
  (Y) Some distros want to be able to verify the tarballs.[9] (I don't
      agree with this. If you can't trust the release manager who
      produced the tarballs (C), you cannot trust (A) either. If there
      is a mechanism for verifying (C) from (A), criminals will commit
      their malware entirely into (A).)

6) How could (X) be implemented?

The main differences between (A) and (C) are [10]:
  - Tarballs contain source code from other packages.
  - Tarballs contain generated files.
  - Tarballs contain localizations.

I could imagine an intermediate step between (A) and (C):

  (B) is for users with many packages installed and for distros, to
      apply modifications (even to the set of gnulib modules) and then
      build binaries of the package for one or more architectures,
      without needing to fetch anything (other than build prerequisites)
      from the network.

This is a different stage than (A), because most developers don't want
to commit source code from other packages into (A) — due to size — nor
to commit generated files into (A) — due to hassles with branches.

Going from (A) to (B) means pulling additional sources from the network.
It could be implemented
  - by "git submodule update --init", or
  - by 'npm' for JavaScript packages, or
  - by 'cargo' for Rust packages [11]
and, for the localizations:
  - essentially by a 'wget' command that fetches the *.po files.

The proposed name of a script that does this is 'autopull.sh'.
But I am equally open to a declarative YAML file instead of a shell
script.

Going from (B) to (C) means generating files, through invocations of
gnulib-tool, bison, flex, ... for the code and groff, texinfo, doxygen,
... for the documentation. The proposed name of a script that does this
is 'autogen.sh'.

7) How could (Y) be implemented?
Like in (E+), we would define:

  (C+) Like (C), plus: A user with all kinds of special tools can
       determine whether (C) was built with a published build recipe,
       without tampering.

Again, this requires
  - formalizing the notion of a build environment,
  - adding this build environment into (C).

For example, we would need a way to specify a build dependency on a
particular version of groff or texinfo or doxygen (for the
documentation), or a particular version of m4, autoconf, automake (for
the configure script and Makefile.ins).

So far, some people have published their build environment in the form
of ad-hoc plain text ("This release was bootstrapped with the following
tools") inside release announcements. [12] Of course, that's