Re: git repositories vs. tarballs

2024-04-15 Thread Simon Josefsson via Gnulib discussion list
Bruno Haible writes:

> Hi Simon,
>
> In the other thread [1][2][2a], but see also [3] and [4], you are asking

Hi Bruno -- thanks for attempting to bring some order to this complicated
matter!  I agree with most of what you say, but have some comments below.

>> Has this changed, so we should recommend maintainers
>> to 'EXTRA_DIST = bootstrap bootstrap-funclib.sh bootstrap.conf' so this
>> is even possible?
>
> 1) I think changing the contents of a tarball ad-hoc, like this, will not
>lead to satisfying results, because too many packages will do things
>differently.

Right, and people tend to jump to the incorrect conclusion that running
'autoreconf -fvi' or './bootstrap' from a tarball is a good idea.

Rather than trying to fix that solution, I think we should guide these
people towards using 'git-archive' style tarballs instead.  Then they
will need to do all the work that is actually required to bootstrap a
project, including getting all the dependencies in place.
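
For concreteness, such a tarball can be produced straight from a signed
tag, e.g. (the package name and tag here are only illustrative):

    git archive --format=tar.gz --prefix=libfoo-1.2.3/ \
        -o libfoo-1.2.3-src.tar.gz v1.2.3

It contains exactly the tagged tree and nothing more, so bootstrapping
it requires the same tools and dependencies as building from a git
checkout.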

Some will succeed in that.

Some will give up and realize that they wanted the traditional curated
tarball after all, and will go back to it, this time hopefully without
the harmful 'autoreconf -fi' dance.

In both cases, I think we are better off than today.  Right now, people
take the 'make dist' tarballs, try to reverse-engineer all the
dependencies required to regenerate the generated artifacts, do a
half-baked job of it, and end up with a result that is even harder to
audit than what we started with.

>    (Y) Some distros want to be able to verify the tarballs. [9] (I don't
>        agree with this. If you can't trust the release manager who produced
>        the tarballs (C), you cannot trust (A) either. If there is a
>        mechanism for verifying (C) from (A), criminals will commit their
>        malware entirely into (A).)

I have another perspective here.  I don't think people necessarily want
to blindly trust either the git repository source code (A) or the
tarball with generated code and source code (C).  So people will want
the ability to audit and verify everything.  Once people start to work
on auditing, they realize that there is no way around auditing (A): you
need to audit the XZUtils source code to gain trust in XZUtils.  So
people work on doing that.  Then someone realizes that people aren't
actually using the git source code (A) to build the XZUtils binaries --
they are using (A) plus generated content, that is, the full tarball
(C).  But auditing (C) is just a waste of human time if there is a way
to avoid using (C) completely and have people use (A) directly.  This
isn't all that complicated: I just did it for Libntlm and will try to
do the same for other packages.
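
As a sketch of what this enables -- assuming the release tarball was
produced by 'git archive' with a known prefix (the version and file
names below are only illustrative) -- anyone can check that the tarball
matches the signed git tag:

    # Compare uncompressed tar output; gzip output can differ
    # between tools and versions.
    git archive --format=tar --prefix=libntlm-1.8/ v1.8 | sha256sum
    zcat libntlm-1.8-src.tar.gz | sha256sum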

I think you are right that if we succeed with this, criminals will put
their malware directly into the git source code repositories.  However,
that is addressed by the people who review the core code of projects.
There is no longer any need for anyone to spend time auditing tarballs
with a lot of other stuff in them; that time can be redirected towards
auditing the code, which over the years saves a lot of human cycles.

Most code audits I've seen focus on what's in git, not on what's in the
tarball or in the binary packages that people use.  That is how it
should be: the build environment is better audited on its own rather
than as part of the upstream code audit.

> 6) How could (X) be implemented?
>
>    The main differences between (A) and (C) are [10]:
>      - Tarballs contain source code from other packages.
>      - Tarballs contain generated files.
>      - Tarballs contain localizations.
>
>    I could imagine an intermediate step between (A) and (C):
>
>      (B) is for users with many packages installed and for distros, to apply
>          modifications (even to the set of gnulib modules) and then build
>          binaries of the package for one or more architectures, without
>          needing to fetch anything (other than build prerequisites) from the
>          network.
>
>    This is a different stage than (A), because most developers don't want
>    to commit source code from other packages into (A) — due to size — nor
>    to commit generated files into (A) — due to hassles with branches.
>
>    Going from (A) to (B) means pulling additional sources from the network.
>    It could be implemented
>      - by "git submodule update --init", or
>      - by 'npm' for JavaScript packages, or
>      - by 'cargo' for Rust packages [11]
>    and, for the localizations:
>      - essentially by a 'wget' command that fetches the *.po files.
>
>    The proposed name of a script that does this is 'autopull.sh'.
>    But I am equally open to a declarative YAML file instead of a shell
>    script.

Another point of view is to give up on forcing the autopull part on
users: instead, we can mention the required dependencies in README and
let the user/packager worry about having them available, at least as an
option.

The reason 

git repositories vs. tarballs

2024-04-14 Thread Bruno Haible
Hi Simon,

In the other thread [1][2][2a], but see also [3] and [4], you are asking

> Has this changed, so we should recommend maintainers
> to 'EXTRA_DIST = bootstrap bootstrap-funclib.sh bootstrap.conf' so this
> is even possible?

1) I think changing the contents of a tarball ad-hoc, like this, will not
   lead to satisfying results, because too many packages will do things
   differently.

   Instead, we should ask the question "for which purposes is the artifact
   going to be used?" or "which operations are supported on the artifact?".
   Once there is agreement on this question, the contents of the artifact
   will necessarily follow.

2) When considering
     (A) git repositories (or tar.gz files containing their contents,
         e.g. the "snapshot" on https://git.savannah.gnu.org/gitweb/?p=PACKAGE.git
         or the "Download ZIP" on https://github.com/TEAM/PACKAGE),
     (C) a tarball as published on ftp.gnu.org,
   it is also useful to consider
     (E) a binary package .tar.gz / .rpm / .deb
   because there is already a lot of experience for doing "reproducible
   builds" from (C) to (E) [5][6].

3) So, what are the purposes of (A), (C), (E)?

   So far, it has been:
     (A) is for users with developer skills, the preferred way to work
         with the source code, including branching and merging of branches.
     (C) is for users and distros, to apply relatively small modifications
         and then build binaries of the package for one or more architectures,
         without needing to fetch anything (other than build prerequisites)
         from the network.
     (E) is for users, to install the package on a specific machine, without
         needing development tools.

4) What do the reproducible builds from (C) to (E) mean?  The purpose of (E)
   changes to:
     (E+) Like (E), plus:
          A user _with_ development tools can determine whether (E) was
          built with a published build recipe, without tampering.
   Note that this requires
     - formalizing the notion of a build environment [7],
     - adding this build environment into (E) (not yet complete for Debian [8]).
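
   For illustration, once the build environment is recorded, the (E+) check
   amounts to rebuilding (C) -> (E) in that environment and comparing the
   result bit by bit (the file names here are made up):

     # After rebuilding in the recorded build environment:
     sha256sum download/foo_1.0-1_amd64.deb rebuild/foo_1.0-1_amd64.deb
     # diffoscope gives a readable delta when the checksums differ:
     diffoscope download/foo_1.0-1_amd64.deb rebuild/foo_1.0-1_amd64.deb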

5) There are two wishes that are not yet satisfied by (A) and (C):
     (X) Many users without developer skills are turning to the git repository
         and trying to build from there.
     (Y) Some distros want to be able to verify the tarballs. [9] (I don't
         agree with this. If you can't trust the release manager who produced
         the tarballs (C), you cannot trust (A) either. If there is a
         mechanism for verifying (C) from (A), criminals will commit their
         malware entirely into (A).)

6) How could (X) be implemented?

   The main differences between (A) and (C) are [10]:
     - Tarballs contain source code from other packages.
     - Tarballs contain generated files.
     - Tarballs contain localizations.

   I could imagine an intermediate step between (A) and (C):

     (B) is for users with many packages installed and for distros, to apply
         modifications (even to the set of gnulib modules) and then build
         binaries of the package for one or more architectures, without
         needing to fetch anything (other than build prerequisites) from the
         network.

   This is a different stage than (A), because most developers don't want
   to commit source code from other packages into (A) — due to size — nor
   to commit generated files into (A) — due to hassles with branches.

   Going from (A) to (B) means pulling additional sources from the network.
   It could be implemented
     - by "git submodule update --init", or
     - by 'npm' for JavaScript packages, or
     - by 'cargo' for Rust packages [11]
   and, for the localizations:
     - essentially by a 'wget' command that fetches the *.po files.

   The proposed name of a script that does this is 'autopull.sh'.
   But I am equally open to a declarative YAML file instead of a shell script.
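
   A minimal sketch of what such an autopull.sh could look like, for a
   package that tracks gnulib as a git submodule and has translations at
   translationproject.org (the package name and URL are only illustrative):

     #!/bin/sh
     # Pull everything that (A) needs in order to become (B).
     set -e
     # Source code from other packages, tracked as git submodules:
     git submodule update --init
     # Localizations: fetch the current *.po files into po/.
     wget --no-verbose --recursive --level=1 --no-directories \
          --accept='*.po' --directory-prefix=po \
          https://translationproject.org/latest/foo/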

   Going from (B) to (C) means generating files, through invocations of
   gnulib-tool, bison, flex, ... for the code and groff, texinfo, doxygen, ...
   for the documentation.

   The proposed name of a script that does this is 'autogen.sh'.
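
   Again only as a sketch, since the exact invocations are package-specific:

     #!/bin/sh
     # Generate everything that (B) needs in order to become (C).
     set -e
     # Refresh the imported gnulib modules (uses the configuration
     # recorded by a previous gnulib-tool --import):
     gnulib-tool --update
     # Regenerate configure, Makefile.in, config.h.in, ...:
     autoreconf --install
     # Regenerate the documentation, e.g.:
     makeinfo doc/foo.texi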

7) How could (Y) be implemented?
   Like in (E+), we would define:

     (C+) Like (C), plus:
          A user with all kinds of special tools can determine whether (C)
          was built with a published build recipe, without tampering.

   Again, this requires
     - formalizing the notion of a build environment,
     - adding this build environment into (C).

   For example, we would need a way to specify a build dependency on a
   particular version of groff or texinfo or doxygen (for the documentation),
   and on a particular version of m4, autoconf, and automake (for the
   configure script and the Makefile.ins).
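
   As a rough sketch, the release process could record those versions in a
   file that is shipped inside (C) (the file name is just a suggestion):

     # Run at 'make dist' time; each tool prints its version on line 1.
     for tool in m4 autoconf automake makeinfo groff doxygen; do
       $tool --version 2>/dev/null | sed 1q
     done > BUILD-ENVIRONMENT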

   So far, some people have published their build environment in the form of
   ad-hoc plain text ("This release was bootstrapped with the following
   tools") inside release announcements. [12] Of course, that's