Re: Proposal for a blog contribution on reproducible computations

zimoun Thu, 09 Jan 2020 12:41:21 -0800

Hi Konrad,

Thank you! It is very interesting!!


Below questions.
And suggestions which I can Pull-Request with Github. :-)

Hope it is readable: indented text is your text; non-indented one is question.


Cheers,
simon

--


     #+TITLE: Reproducible computations with Guix
     #+STARTUP: inlineimages

     * Dependencies: what it takes to run a program


Move this section title below.


     This  post is  about reproducible  computations, so  let's start  with a
     computation.  A short, though rather  uninteresting, C program is a good
     starting point. It computes π in three different ways:
     #+begin_src c :tangle pi.c :eval no
     #include <math.h>
     #include <stdio.h>

     int main()
     {
         printf( "M_PI                         : %.10lf\n", M_PI);
         printf( "4 * atan(1.)                 : %.10lf\n", 4.*atan(1.));
         printf( "Leibniz' formula (four terms): %.10lf\n",
4.*(1.-1./3.+1./5.-1./7.));
         return 0;
     }
     #+end_src


Align ':' for easier looking.


     This program uses  no random element, such as a  random number generator
     or parallelism.  It's strictly deterministic. It is reasonable to expect
     it to produce exactly the same output,  on any computer and at any point
     in time.   And yet,  many programs whose  results /should/  be perfectly
     reproducible are in fact not.  Programs using floating-point arithmetic,
     such  as  this  short  example,  are  particularly  prone  to  seemingly
     inexplicable variations.

     My  goal is  to  explain why  deterministic programs  often  fail to  be
     reproducible, and  what it takes to  fix this. The short  answer to that
     question is "use Guix", but  even though Guix provides excellent support
     for  reproducibility, you  still  have  to use  it  correctly, and  that
     requires some understanding of what's  going on.  The explanation I will
     give is rather  detailed, to the point of discussing  parts of the Guile
     API of Guix. You should be  able to follow the reasoning without knowing
     Guile though, you will  just have to believe me that  the scripts I will
     show  do what  I  claim  they do.  And  in the  end,  I  will provide  a
     ready-to-run Guile script that will let you explore package dependencies
     right from the shell.


* Dependencies: what it takes to run a program


     One keyword in discussions of reproducibility is "dependencies".  I will
     revisit the exact meaning of this term later, but to get started, I will
     define it loosely  as "any software package required to  run a program".
     Running the π  computation shown above is normally  done using something
     like
     #+begin_src sh :exports code :eval no
     gcc pi.c -o pi && ./pi
     #+end_src


Missing '&&'. It does not work without on my machine.


     C programmers  know that =gcc=  is a C  compiler, so that's  one obvious
     dependency for running  our little program. But is a  C compiler enough?
     That  question is  surprisingly difficult  to answer  in practice.  Your
     computer is loaded with tons of  software (otherwise it wouldn't be very
     useful), and you  don't really know what happens behind  the scenes when
     you run =gcc= or =pi=.


** Container is good


     A major element of reproducibility support in Guix is the possibility to
     run  programs  in well-defined  environments  that  contain exactly  the
     software packages you request, and no  more.  So if your program runs in
     an environment that contains  only a C compiler, you can  be sure it has
     no other dependencies. Let's create such an environment:
     #+begin_src sh :session C-compiler :results output :exports both
     guix environment --container --ad-hoc gcc-toolchain
     #+end_src


     #+RESULTS:

     The option  =--container= ensures the  best possible isolation  from the
     standard  environment that  your  system installation  and user  account
     provide for day-to-day  work. This environment contains nothing  but a C
     compiler  and a  shell (which  you need  to type  in commands),  and has
     access to no other files than those in the current directory.

     Side  note: the  option =--container=  requires support  from the  Linux
     kernel that is not available on all systems. If it doesn't work for you,
     use =--pure= instead. It provides a less isolated environment, but it is
     usually more than good enough.


By default, I get:

--8<---------------cut here---------------start------------->8---
guix environment: error: cannot create container: unprivileged user
cannot create user namespaces
guix environment: error: please set
/proc/sys/kernel/unprivileged_userns_clone to "1"
--8<---------------cut here---------------end--------------->8---


Or a sentence explaining what to do. For example, "The =--container= option
requires allowing the kernel to clone for the unprivileged user, i.e., as
=root= just run the command
=echo 1 > /proc/sys/kernel/unprivileged_userns_clone=."


     The above  command leaves me in  a shell inside my  environment, where I
     can now compile and run my little program:
     #+begin_src sh :session C-compiler :results output :exports both
     gcc pi.c -o pi && ./pi
     #+end_src


Missing again '&&'. Sorry if it is me.


     #+RESULTS:
     : M_PI                         : 3.14159265358979311600
     : 4 * atan(1.)                 : 3.14159265358979311600
     : Leibniz' formula (four terms): 2.89523809523809561028

     It works! So now I can be  sure that my program has a single dependency:
     the Guix  package =gcc-toolchain=.   Perfectionists who want  to exclude
     the possibility that my program requires  a shell could run each step in
     a separate container:
     #+begin_src sh :results output :exports both
     guix environment --container --ad-hoc gcc-toolchain -- gcc pi.c -o pi
     guix environment --container --ad-hoc gcc-toolchain -- ./pi
     #+end_src

     #+RESULTS:
     : M_PI                         : 3.14159265358979311600
     : 4 * atan(1.)                 : 3.14159265358979311600
     : Leibniz' formula (four terms): 2.89523809523809561028


** Let open the dependencies hell


     Now that we know that our only dependency is =gcc-toolchain=, let's look
     at it in more detail:

     #+begin_src sh :results output :exports both
     guix show gcc-toolchain
     #+end_src

     #+RESULTS:
     #+begin_example
     name: gcc-toolchain
     version: 9.2.0
     outputs: out debug static
     systems: x86_64-linux i686-linux
     dependencies: binutils@2.32 gcc@9.2.0 glibc@2.29 ld-wrapper@0
     location: gnu/packages/commencement.scm:2532:4
     homepage: https://gcc.gnu.org/
     license: GPL 3+
     synopsis: Complete GCC tool chain for C/C++ development
     description: This package provides a complete GCC tool chain for C/C++
     + development to be installed in user profiles.  This includes
GCC, as well as
     + libc (headers an d binaries, plus debugging symbols in the
`debug' output),
     + and Binutils.

     name: gcc-toolchain
     version: 8.3.0
     outputs: out debug static
     systems: x86_64-linux i686-linux
     dependencies: binutils@2.32 gcc@8.3.0 glibc@2.29 ld-wrapper@0
     location: gnu/packages/commencement.scm:2532:4
     homepage: https://gcc.gnu.org/
     license: GPL 3+
     synopsis: Complete GCC tool chain for C/C++ development
     description: This package provides a complete GCC tool chain for C/C++
     + development to be installed in user profiles.  This includes
GCC, as well as
     + libc (headers an d binaries, plus debugging symbols in the
`debug' output),
     + and Binutils.

[...]
     #+end_example

     Guix actually knows about several  versions of this toolchain. We didn't
     ask for a  specific one, so what we  got is the first one  in this list,
     which is the one with the  highest version number. Let's check that this
     is true:
     #+begin_src sh :results output :exports both
     guix environment --container --ad-hoc gcc-toolchain -- gcc --version
     #+end_src

     #+RESULTS:
     : gcc (GCC) 9.2.0
     : Copyright (C) 2019 Free Software Foundation, Inc.
     : This is free software; see the source for copying conditions.
There is NO
     : warranty; not even for MERCHANTABILITY or FITNESS FOR A
PAR1TICULAR PURPOSE.
     :

     The output of =guix show= contains  a line about dependencies. These are
     the dependencies  of our  dependency, and you  may already  have guessed
     that they will have dependencies as well.  That's why reproducibility is
     such   a    difficult   job    in   practice!   The    dependencies   of
     =gcc-toolchain@9.2.0= are:


Let use =recsel= and teach also how to filter the package output. :-)

#+begin_src sh :results output :exports both
guix show gcc-toolchain@9.2.0 | recsel -P dependencies
#+end_src

#+RESULTS:
: binutils@2.32 gcc@9.2.0 glibc@2.29 ld-wrapper@0


     #+begin_example
     binutils@2.32 gcc@9.2.0 glibc@2.29 ld-wrapper@0
     #+end_example

     To dig deeper, we can try feeding these dependencies to =guix show=, one
     by one, in order to learn more about them:
     #+begin_src sh :results output :exports both
     guix show binutils@2.32
     #+end_src

     #+RESULTS:
     #+begin_example
     name: binutils
     version: 2.32
     outputs: out
     systems: x86_64-linux i686-linux
     dependencies:
     location: gnu/packages/base.scm:415:2
     homepage: https://www.gnu.org/software/binutils/
     license: GPL 3+
     synopsis: Binary utilities: bfd gas gprof ld
     description: GNU Binutils is a collection of tools for working with binary
     + files.  Perhaps the most notable are "ld", a linker, and "as",
an assembler.
     + Other tools include programs to display binary profiling
information, list the
     + strings in a binary file, and utilities for working with
archives.  The "bfd"
     + library for working with executable and object formats is also included.

     #+end_example

     #+begin_src sh :results output :exports both
     exec 2>&1 guix show gcc@9.2.0 :
     #+end_src

     #+RESULTS:
     : guix show: error: gcc@9.2.0: package not found

     This looks  a bit  surprising. What's  happening here  is that  =gcc= is
     defined as a /hidden  package/ in Guix. The package is  there, but it is
     hidden from package queries.  There is  a good reason for this: =gcc= on
     its own is rather useless, you  need =gcc-toolchain= to actually use the
     compiler. But if  both =gcc= and =gcc-toolchain= showed up  in a search,
     that would  be more confusing  than helpful  for most users.  Hiding the
     package is a way of saying "for experts only".

     Let's take this as a sign that it's time to move on to the next level of
     Guix hacking:  Guile scripts.   Guile, an  implementation of  the Scheme
     language,  is Guix'  native language,  so using  Guile scripts,  you get
     access to everything there is to know about Guix and its packages.

     A note in passing: the
[[https://emacs-guix.gitlab.io/website/][emacs-guix]] package provides
an intermediate level
     of  Guix  exploration for  Emacs  users.  It  lets  you look  at  hidden
     packages, for  example. But much  of what I  will show in  the following
     really requires Guile scripts.

     * Anatomy of a Guix package

     From the user's point  of view, a package is a piece  of software with a
     name and  a version number that  can be installed using  =guix install=.
     The packager's  point of view is  quite a bit different.   In fact, what
     users consider a package is more precisely called the package's /output/
     in Guix jargon. The package is a recipe for creating this output.

     To see how all these concepts fit  together, let's look at an example of
     a package definition: =xmag=.  I have  chosen this package not because I
     care much about it, but because its definition is short while showcasing
     all the  features I want  to explain. You can  access it most  easily by
     typing =guix edit xmag=. Here is what you will see:
     #+begin_src scheme :eval no
     (package
       (name "xmag")
       (version "1.0.6")
       (source
        (origin
          (method url-fetch)
          (uri (string-append
                "mirror://xorg/individual/app/" name "-" version ".tar.gz"))
          (sha256
           (base32
            "19bsg5ykal458d52v0rvdx49v54vwxwqg8q36fdcsv9p2j8yri87"))))
       (build-system gnu-build-system)
       (arguments
        `(#:configure-flags
          (list (string-append "--with-appdefaultdir="
                               %output ,%app-defaults-dir))))
       (inputs
        `(("libxaw" ,libxaw)))
       (native-inputs
        `(("pkg-config" ,pkg-config)))
       (home-page "https://www.x.org/wiki/";)
       (synopsis "Display or capture a magnified part of a X11 screen")
       (description "Xmag displays and captures a magnified snapshot
of a portion
     of an X11 screen.")
       (license license:x11))
     #+end_src


After, a package (=glibc=) is used to show that the same package can produce
different outputs and this above example does not own the =outputs= fields.


     The package definition starts with  the name and version information you
     expected. Next comes =source=, which says  how to obtain the source code
     and  from where.  It  also provides  a  hash that  allows  to check  the
     integrity of the downloaded files.  The next four items, =build-system=,
     =arguments=,  =inputs=,  and   =native-inputs=  supply  the  information
     required for /building/ the package,  which is what creates its outputs.
     The remaining  items are documentation for  human consumption, important
     for other reasons  but not for reproducibility, so I  won't say any more
     about them.


Link to the documentation and/or the cookbook entry about Packaging.

http://guix.gnu.org/manual/devel/en/html_node/Defining-Packages.html#Defining-Packages
http://guix.gnu.org/cookbook/en/html_node/Packaging.html#Packaging


     The  example  package  definition  has =native-inputs=  in  addition  to
     "plain"  =inputs=. There's  a  third  variant, =propagated-inputs=,  but
     =xmag= doesn't  have any. The  differences between these  variants don't
     matter  for  my  topic, so  I  will  just  refer  to "inputs"  from  now
     on. Another  omission I will make  is the possibility to  define several
     outputs for a  package.  This is done for particularly  big packages, in
     order to reduce the footprint of  installations, but for the purposes of
     reproducibility, it's  OK to  treat all  outputs of  a package  a single
     unit.

     The following figure  illustrates how the various  pieces of information
     from a package  are used in the build process  (done explicitly by =guix
     build=, or  implicitly when  installing or  otherwise using  a package):

     [[file:guix-package.svg]]

     It  may  help to  translate  the  Guix jargon  to  the  vocabulary of  C
     programming:
     | Guix package | C program        |
     |--------------+------------------|
     | source code  | source code      |
     | inputs       | libraries        |
     | arguments    | compiler options |
     | build system | compiler         |
     | output       | executable       |

     Building a  package can  be considered a  generalization of  compiling a
     program. We  could in  fact create  a "GCC build  system" for  Guix that
     would simply run =gcc=. However, such  a build system would be of little
     practical use, since most real-life  software consists of more than just
     one C source code file,  and requires additional pre- or post-processing
     steps.  The  =gnu-build-system= used  in the example  is based  on tools
     such as =make= and =autoconf=, in addition to =gcc=.


     * Package exploration in Guile

     Guile uses a record type called =<package>= to represent packages, which
     is

[[https://git.savannah.gnu.org/cgit/guix.git/tree/guix/packages.scm#n249][=<package>=]]
(hyperlink). Let spread Scheme. :-)

Is the syntax highlighting available for Savannah?

     defined  in module  =(guix packages)=.   There  is also  a module  =(gnu
     packages)=, which contains  the actual package definitions  - be careful
     not

[[https://git.savannah.gnu.org/cgit/guix.git/tree/gnu/packages][=(gnu
packages)=]] (hyperlink).

     to confuse the two (as I always  do). Here is a simple Guile script that
     shows some package information, much like the =guix show= command that I
     used earlier:
     #+begin_src scheme :results output
     (use-modules (guix packages)
                  (gnu packages))

     (define gcc-toolchain
       (specification->package "gcc-toolchain"))

     (format #t "Name   : ~a\n" (package-name gcc-toolchain))
     (format #t "Version: ~a\n" (package-version gcc-toolchain))
     (format #t "Inputs : ~a\n" (package-direct-inputs gcc-toolchain))
     #+end_src

     #+RESULTS:
     : Name   : gcc-toolchain
     : Version: 8.3.0
     : Inputs : ((gcc #<package gcc@8.3.0 gnu/packages/gcc.scm:509
3b969a0>) (ld-wrapper #<package ld-wrapper@0 gnu/packages/base.scm:551
43e6bb0>) (binutils #<package binutils@2.31.1
gnu/packages/bootstrap.scm:150 43df6e0>) (libc #<package glibc@2.28
gnu/packages/commencement.scm:681 43df8f0>) (libc-debug #<package
glibc@2.28 gnu/packages/commencement.scm:681 43df8f0> debug)
(libc-static #<package glibc@2.28 gnu/packages/commencement.scm:681
43df8f0> static))


I would add something about =guix repl=. For example, "You can launch an
interactive REPL with =guix repl= and type directly these lines inside."

Adding also a footnote saying to add
#+begin_src scheme
(use-modules
 (ice-9 readline)
 (ice-9 format)
 (ice-9 pretty-print))
(activate-readline)
#+end_src
in =~/.guile= to ease the REPL experience.


     This script first calls =specification->package=  to look up the package
     using the  same rules  as the  =guix= command  line interface:  pick the
     latest  available  version if  none  is  explicitly requested.  Then  it
     extracts   various   information   about   the   package.    Note   that
     =package-direct-inputs=  returns  the combination  of  =package-inputs=,
     =package-native-inputs=,  and  =package-propagated-inputs=.  As  I  said
     above, I don't care about the distinction here.

     The inputs are not shown in a particularly nice form, so let's write two
     Guile functions to improve it:
     #+begin_src scheme :results output
     (use-modules (guix packages)
                  (gnu packages)
                  (ice-9 match))

     (define (package->specification package)
       (format #f "~a@~a"
               (package-name package)
               (package-version package)))

     (define (input->specification input)
       (match input
         ((label (? package? package) . _)
          (package->specification package))
         (other-item
          (format #f "~a" other-item))))

     (define gcc-toolchain
       (specification->package "gcc-toolchain"))

     (format #t "Package: ~a\n"
             (package->specification gcc-toolchain))
     (format #t "Inputs : ~a\n"
             (map input->specification (package-direct-inputs gcc-toolchain)))
     #+end_src

     #+RESULTS:
     : Package: gcc-toolchain@8.3.0
     : Inputs : (gcc@8.3.0 ld-wrapper@0 binutils@2.31.1 glibc@2.28
glibc@2.28 glibc@2.28)

     That looks much better.  As you can see from the code,  a list of inputs
     is a bit more than a list of  packages. It is in fact a list of labelled
     /package outputs/. That also explains why  we see =glibc= three times in
     the input list: =glibc= defines three distinct outputs, all of which are
     used in =gcc-toolchain=.

It is not clear to me why there is 3 times =glibc=. Instead, I propose this.

#+begin_src scheme :results output
(use-modules (guix packages)
             (gnu packages)
             (ice-9 match))

(define (package->specification package)
  (format #f "~a@~a"
          (package-name package)
          (package-version package)))

(define (input->specification input)
  (match input
    ((label (? package? package) . _)
     (package->specification package))
    (other-item
     (format #f "~a" other-item))))

(define gcc-toolchain
  (specification->package "gcc-toolchain"))

(format #t "Package  : ~a\n"
        (package->specification gcc-toolchain))
(format #t "Inputs   : ~a\n"
        (map input->specification (package-direct-inputs gcc-toolchain)))
(format #t "Internals: ~a\n"
        (map car (package-direct-inputs gcc-toolchain)))

(display "\n")

(define glibc
  (specification->package "glibc"))

(format #t "Name     : ~a\n"
        (package-name glibc))
(format #t "Outputs  : ~a\n"
        (package-outputs glibc))
#+end_src

#+RESULTS:
: Package  : gcc-toolchain@8.3.0
: Inputs   : (gcc@8.3.0 ld-wrapper@0 binutils@2.31.1 glibc@2.28
glibc@2.28 glibc@2.28)
: Internals: (gcc ld-wrapper binutils libc libc-debug libc-static)
:
: Name     : glibc
: Outputs  : (out debug static)

The =car= is not so nice but the =Internals= mitigates, IMHO.

The addition does not add complexity and I hope it clarifies, at least to
me. ;-)


     For reproducibility, all we care  about is the package references. Later
     on, we  will deal with  much longer input lists,  so as a  final cleanup
     step, let's show only unique package references from the list of inputs:
     #+begin_src scheme :results output
     (use-modules (guix packages)
                  (gnu packages)
                  (srfi srfi-1)
                  (ice-9 match))

     (define (package->specification package)
       (format #f "~a@~a"
               (package-name package)
               (package-version package)))

     (define (input->specification input)
       (match input
         ((label (? package? package) . _)
          (package->specification package))
         (other-item
          (format #f "~a" other-item))))

     (define (unique-inputs inputs)
       (delete-duplicates
        (map input->specification inputs)))

     (define gcc-toolchain
       (specification->package "gcc-toolchain"))

     (format #t "Package: ~a\n"
             (package->specification gcc-toolchain))
     (format #t "Inputs : ~a\n"
             (unique-inputs (package-direct-inputs gcc-toolchain)))
     #+end_src

     #+RESULTS:
     : Package: gcc-toolchain@8.3.0
     : Inputs : (gcc@8.3.0 ld-wrapper@0 binutils@2.31.1 glibc@2.28)

     * Dependencies

     You may have noticed the absence  of the term "dependency" from the last
     two sections.  There is  a good  reason for  that: the  term is  used in
     somewhat different meanings, and that  can create confusion. Guix jargon
     therefore avoids it.

     The figure above shows three kinds of input to the build system: source,
     inputs, and arguments. These categories  reflect the packagers' point of
     view: =source= is what the authors  of the software supply, =inputs= are
     other packages, and =arguments= is  what the packagers themselves add to
     the build  procedure. It is important  to understand that from  a purely
     technical point of view, there  is no fundamental difference between the
     three categories. You could, for example, define a package that contains
     C  source code  in the  build  system =arguments=,  but leaves  =source=
     empty. This would be inconvenient, and  confusing for others, so I don't
     recommend you actually do this.  The three categories are important, but
     for humans,  not for computers.  In  fact, even the build  system is not
     fundamentally   distinct  from   its   inputs.   You   could  define   a
     special-purpose build  system for  one package, and  put all  the source
     code in  there. At  the level of  the CPU and  the computer's  memory, a
     build   process   (as   in    fact   /any/   computation)   looks   like

     [[file:computation.png]]

     It is human interpretation that decomposes this into

     [[file:data-code.png]]

     and in a next step into

     [[file:data-program-environment.png]]

     We  can  go  on  and  divide  the  environment  into  operating  system,
     development  tools,  and  application  software, for  example,  but  the
     further  we go  in  decomposing the  input to  a  computation, the  more
     arbitrary it gets.

     From this point of view, a software's dependencies consist of everything
     required to run it  in addition to its source code.  For a Guix package,
     the dependencies are thus,


Adding ',' after 'thus'.


     - its inputs
     - the build system arguments
     - the build system itself
     - Guix (commit)
     - the GNU/Linux operating system (kernel).


Adding (commit) and (kernel).


     In  the following,  I will  not  mention the  last two  items any  more,
     because they  are a  common dependency  of all  Guix packages,  but it's
     important not to forget about them. A change in Guix or in GNU/Linux can
     actually make a computation  non-reproducible, although in practice that
     happens very rarely.   Moreover, Guix is actually designed  to run older
     versions of itself, as we will see later.


Hum? the assumption is the "GNU/Linux operating system" on which Guix
(package manager) is running does not change the reproducibility of the
computations. Right?

In practise, the results should be the same using the same Guix (commit) on
different GNU/Linux operating systems and from my understanding we are
missing data (experience) to report if it happens or not.

However, a change in Guix can lead to completely different packages, so
non-reproducible computations. And in practise it happens often, e.g., see how
many grafts Guix is doing. :-)

Well, I am not sure if I understand correctly the meaning of this paragraph.


     * Build systems are packages as well

     I hope that by  now you have a good idea of what  a package is: a recipe
     for  building outputs  from source  and  inputs, with  inputs being  the
     outputs  of other  packages.  The  recipe involves  a  build system  and
     arguments supplied to it.  So... what  exactly is a build system? I have
     introduced it  as a  generalization of a  compiler, which  describes its
     role. But where does a build system come from in Guix?

     The ultimate  answer is  of course  the
     
[[https://git.savannah.gnu.org/cgit/guix.git/tree/guix/build-system][sourcecode]].
     Build  systems are
     pieces of Guile code that are part of Guix.  But this Guile code is only
     a shallow  layer orchestrating  invocations of  other software,  such as
     =gcc= or  =make=. And that  software is defined  by packages. So  in the
     end, from  a reproducibility point  of view,  we can replace  the "build
     system" item  in our list of  dependenies by "a bundle  of packages". In
     other words: more inputs.

     Before  Guix can  build  a  package, it  must  gather  all the  required
     ingredients,  and  that  includes  replacing the  build  system  by  the
     packages it  represents. The resulting  list of ingredients is  called a
     =bag=, and we can access it using a Guile script:

     #+begin_src scheme :results output
     (use-modules (guix packages)
                  (gnu packages)
                  (srfi srfi-1)
                  (ice-9 match))

     (define (package->specification package)
       (format #f "~a@~a"
               (package-name package)
               (package-version package)))

     (define (input->specification input)
       (match input
         ((label (? package? package) . _)
          (package->specification package))
         ((label (? origin? origin))
          (format #f "[source code from ~a]"
                  (origin-uri origin)))
         (other-input
          (format #f "~a" other-input))))

     (define (unique-inputs inputs)
       (delete-duplicates
        (map input->specification inputs)))

     (define hello
       (specification->package "hello"))

     (format #t "Package       : ~a\n"
             (package->specification hello))
     (format #t "Package inputs: ~a\n"
             (unique-inputs (package-direct-inputs hello)))
     (format #t "Build inputs  : ~a\n"
             (unique-inputs
              (bag-direct-inputs
               (package->bag hello))))
     #+end_src

     #+RESULTS:
     : Package       : hello@2.10
     : Package inputs: ()
     : Build inputs  : ([source code from
mirror://gnu/hello/hello-2.10.tar.gz] tar@1.30 gzip@1.9 bzip2@1.0.6
xz@5.2.4 file@5.33 diffutils@3.6 patch@2.7.6 findutils@4.6.0
gawk@4.2.1 sed@4.5 grep@3.1 coreutils@8.30 make@4.2.1
bash-minimal@4.4.23 ld-wrapper@0 binutils@2.31.1 gcc@5.5.0 glibc@2.28
glibc-utf8-locales@2.28)

     I have used  a different example, =hello=,


[[https://git.savannah.gnu.org/cgit/guix.git/tree/gnu/packages/base.scm#n72][=hello=]]
(link to package definition)


     because for =gcc-toolchain=,
     there is  no difference between  package inputs and build  inputs (check
     for  yourself if  you  want!)  My  new example,  =hello=  (a short  demo


[[https://hpc.guix.info/package/hello][=hello=]] (link to browser)


     program  printing  "Hello,   world"  in  the  language   of  the  system
     installation), is interesting  because it has no package  inputs at all.
     All  the  build  inputs  except  for the  source  code  have  thus  been
     contributed by the build system.

     If you  compare this script  to the previous  one that printed  only the
     package  inputs,   you  will   notice  two   major  new   features.   In
     =input->specification=, there is an additional  case for the source code
     reference. And  in the last  statement, =package->bag= constructs  a bag
     from the package, before =bag-direct-inputs= is called to get that bag's
     input list.

     * Inputs are outputs

     I have  mentioned before that  one package's inputs are  other packages'
     outputs, but  that fact deserves  a more in-depth discussion  because of
     its crucial  importance for reproducibility.  A package is a  recipe for
     building outputs from source and inputs. Since these inputs are outputs,
     they  must have  been built  as well.  Package building  is therefore  a
     process consisting of  multiple steps. An immediate  consequence is that
     any  computation  making  use  of  packaged  software  is  a  multi-step
     computation as well.

     Remember the  short C  program computing  π from  the beginning  of this
     post?  Running that  program is only the  last step in a  long series of
     computations.  Before you  can run =pi=, you must  compile =pi.c=.  That
     requires the  package =gcc-toolchain=,  which must  first be  built. And
     before it can  be built, its inputs  must be built.  And so  on.  If you
     want  the  output of  =pi=  to  be  reproducible,  *the whole  chain  of
     computations must be reproducible*, because each step can have an impact
     on the results produced by =pi=.

     So... where does this chain start?   Few people write machine code these
     days, so almost all software  requires some compiler or interpreter. And
     that means that for every package, there are other packages that must be
     built first. The question  of how to get this chain  started is known as
     the bootstrapping problem.  A rough summary  of the solution is that the
     chain  starts on  somebody else's  computer, which  creates a  bootstrap
     seed, an ideally  small package that is downloaded  in precompiled form.
     See 
[[https://guix.gnu.org/blog/2019/guix-reduces-bootstrap-seed-by-50/][this
post  by Jan Nieuwenhuizen]] for details of  this procedure.  The
     bootstrap seed is not the real start of the chain, but as long as we can
     retrieve  an identical  copy at  a later  time, that's  good enough  for
     reproducibility. In fact, the reason for requiring the bootstrap seed to
     be  small  is not  reproducibility,  but  inspectability: it  should  be
     possible to audit the seed for bugs  and malware, even in the absence of
     source code.


** Closure of bag

     Now we are  finally ready for the ultimate step  in dependency analysis:
     identifying all packages on which a computation depends, right up to the
     bootstrap seed. The  starting point is the list of  direct inputs of the
     bag  derived  from  a  package,  which we  looked  at  in  the  previous
     script.  For  each  package  in  that list,  we  must  apply  this  same
     procedure,  recursively. We  don't have  to write  this code  ourselves,
     because the  function =package-closure=  in Guix does  that job.  If you
     have a basic  knowledge of Scheme, you should be  able to understand its
     
[[https://git.savannah.gnu.org/cgit/guix.git/tree/guix/packages.scm#n817][implementation]]
now. Let's add it to our dependency analysis code:

     #+begin_src scheme :results output
     (use-modules (guix packages)
                  (gnu packages)
                  (srfi srfi-1)
                  (ice-9 match))

     (define (package->specification package)
       (format #f "~a@~a"
               (package-name package)
               (package-version package)))

     (define (input->specification input)
       (match input
         ((label (? package? package) . _)
          (package->specification package))
         ((label (? origin? origin))
          (format #f "[source code from ~a]"
                  (origin-uri origin)))
         (other-input
          (format #f "~a" other-input))))

     (define (unique-inputs inputs)
       (delete-duplicates
        (map input->specification inputs)))

     (define (length-and-list lists)
       (list (length lists) lists))

     (define hello
       (specification->package "hello"))

     (format #t "Package        : ~a\n"
             (package->specification hello))
     (format #t "Package inputs : ~a\n"
             (length-and-list (unique-inputs (package-direct-inputs hello))))
     (format #t "Build inputs   : ~a\n"
             (length-and-list
              (unique-inputs
               (bag-direct-inputs
                (package->bag hello)))))
     (format #t "Package closure: ~a\n"
             (length-and-list
              (delete-duplicates
               (map package->specification
                    (package-closure (list hello))))))
     #+end_src

     #+RESULTS:
     : Package        : hello@2.10
     : Package inputs : (0 ())
     : Build inputs   : (20 ([source code from
mirror://gnu/hello/hello-2.10.tar.gz] tar@1.30 gzip@1.9 bzip2@1.0.6
xz@5.2.4 file@5.33 diffutils@3.6 patch@2.7.6 findutils@4.6.0
gawk@4.2.1 sed@4.5 grep@3.1 coreutils@8.30 make@4.2.1
bash-minimal@4.4.23 ld-wrapper@0 binutils@2.31.1 gcc@5.5.0 glibc@2.28
glibc-utf8-locales@2.28))
     : Package closure: (62 (gzip@1.9 libstdc++-boot0@4.9.4
gcc-cross-boot0@5.5.0 m4@1.4.18 linux-libre-headers@4.14.67
gettext-boot0@0.19.8.1 bison@3.0.5 guile-bootstrap@2.0
glibc-intermediate@2.28 gcc-cross-boot0-wrapped@5.5.0
perl-boot0@5.28.0 bootstrap-binaries@0 file-boot0@5.33
findutils-boot0@4.6.0 diffutils-boot0@3.6 make-boot0@4.2.1
binutils-cross-boot0@2.31.1 ld-wrapper-boot0@0 zlib@1.2.11
libstdc++@5.5.0 ld-wrapper-boot3@0 bash-static@4.4.23 texinfo@6.5
libatomic-ops@7.6.6 pkg-config@0.29.2 gmp@6.1.2 libgc@7.6.6
libltdl@2.4.6 libunistring@0.9.10 libffi@3.2.1 guile@2.2.4 expat@2.2.6
perl@5.28.0 gettext-minimal@0.19.8.1 attr@2.4.47 libcap@2.25
acl@2.2.52 binutils-bootstrap@0 gcc-bootstrap@0 glibc-bootstrap@0
libsigsegv@2.12 lzip@1.20 ed@1.14.2 binutils@2.31.1 glibc@2.28
gcc@5.5.0 bash-minimal@4.4.23 glibc-utf8-locales@2.28 grep@3.1
coreutils@8.30 ld-wrapper@0 make@4.2.1 sed@4.5 gawk@4.2.1
findutils@4.6.0 patch@2.7.6 diffutils@3.6 file@5.33 xz@5.2.4
bzip2@1.0.6 tar@1.30 hello@2.10))


     That's 84 packages,  just for printing "Hello, world!".  As promised, it


How do you obtain this 84 packages?


     includes the boostrap seed, called =bootstrap-binaries=.  It may be more
     surprising to see  Perl and Python in  the dependency list of  what is a
     pure C program.  The explanation is that the build  process of =gcc= and
     =glibc= contains  Perl and Python  code. Considering that both  Perl and
     Python are written in C and use =glibc=, this hints at why bootstrapping
     is a hard problem!


** Ready to analyse yourself


     As promised, here is  a [[file:show-dependencies.scm][Guile
script]] that you can  download and run from
     the command  line to do  dependency analyses much  like the ones  I have
     shown. Just  give the packages  whose combined list of  dependencies you
     want to analyze. For example:
     #+begin_src sh :results output :exports both
     ./show-dependencies.scm hello
     #+end_src

     #+RESULTS:
     : Packages: 1
     :   hello@2.10
     : Package inputs: 0 packages
     :
     : Build inputs: 20 packages
     :   [source code from mirror://gnu/hello/hello-2.10.tar.gz]
bash-minimal@5.0.7 binutils@2.32 bzip2@1.0.6 coreutils@8.31
diffutils@3.7 file@5.33 findutils@4.6.0 gawk@5.0.1 gcc@7.4.0
glibc-utf8-locales@2.29 glibc@2.29 grep@3.3 gzip@1.10 ld-wrapper@0
make@4.2.1 patch@2.7.6 sed@4.7 tar@1.32 xz@5.2.4
     : Package closure: 84 packages
     :   acl@2.2.53 attr@2.4.48 bash-minimal@5.0.7 bash-static@5.0.7
binutils-cross-boot0@2.32 binutils-mesboot0@2.20.1a
binutils-mesboot@2.20.1a binutils@2.32 bison@3.4.1
bootstrap-binaries@0 bootstrap-mes@0 bootstrap-mescc-tools@0.5.2
bzip2@1.0.6 coreutils@8.31 diffutils-boot0@3.7 diffutils-mesboot@2.7
diffutils@3.7 ed@1.15 expat@2.2.7 file-boot0@5.33 file@5.33
findutils-boot0@4.6.0 findutils@4.6.0 flex@2.6.4 gawk@5.0.1
gcc-core-mesboot@2.95.3 gcc-cross-boot0-wrapped@7.4.0
gcc-cross-boot0@7.4.0 gcc-mesboot-wrapper@4.9.4 gcc-mesboot0@2.95.3
gcc-mesboot1-wrapper@4.7.4 gcc-mesboot1@4.7.4 gcc-mesboot@4.9.4
gcc@7.4.0 gettext-boot0@0.19.8.1 gettext-minimal@0.20.1
glibc-headers-mesboot@2.16.0 glibc-intermediate@2.29
glibc-mesboot0@2.2.5 glibc-mesboot@2.16.0 glibc-utf8-locales@2.29
glibc@2.29 gmp@6.1.2 grep@3.3 guile-bootstrap@2.0 guile@2.2.6
gzip@1.10 hello@2.10 ld-wrapper-boot0@0 ld-wrapper-boot3@0
ld-wrapper@0 libatomic-ops@7.6.10 libcap@2.27 libffi@3.2.1
libgc@7.6.12 libltdl@2.4.6 libsigsegv@2.12 libstdc++-boot0@4.9.4
libstdc++@7.4.0 libunistring@0.9.10 libxml2@2.9.9
linux-libre-headers-bootstrap@0 linux-libre-headers@4.19.56 lzip@1.21
m4@1.4.18 make-boot0@4.2.1 make-mesboot0@3.80 make-mesboot@3.82
make@4.2.1 mes-boot@0.19 mesboot-headers@0.19 ncurses@6.1-20190609
patch@2.7.6 perl-boot0@5.30.0 perl@5.30.0 pkg-config@0.29.2
python-minimal@3.5.7 sed@4.7 tar@1.32 tcc-boot0@0.9.26-6.c004e9a
tcc-boot@0.9.27 texinfo@6.6 xz@5.2.4 zlib@1.2.11

     You can now easily experiment yourself, even if you are not at ease with
     Guile. For  example, suppose you have  a small Python script  that plots
     some data using matplotlib. What  are its dependencies? First you should
     check that it runs in a minimal environment:
     #+begin_src sh :results output :exports both :eval no
     guix  environment --container --ad-hoc python python-matplotlib
-- python my-script.py
     #+end_src
     Next, find its dependencies:
     #+begin_src sh :results output :exports both :eval no
     ./show-dependencies.scm python python-matplotlib
     #+end_src
     I won't  show the output  here because it is  rather long -  the package
     closure contains 499 packages!

     * OK, but... what are the /real/ dependencies?

     I   have   explained  dependencies   along   these   lines  in   a   few
     seminars. There's one question that someone  in the audience is bound to
     ask:  What do  the results  of a  computation /really/  depend on?   The
     output of =hello= is ="Hello, world!"=, no matter which version of =gcc=
     I use to compile it, and no matter which version of =python= was used in
     building  =glibc=. The  package  closure is  a  worst-case estimate:  it
     contains everything that can /potentially/ influence the results, though
     most  of it  doesn't  in practice.  Unfortunately, there  is  no way  to
     identify the  dependencies that matter automatically,  because answering
     that question in general (i.e.  for arbitrary software) is equivalent to
     solving the
[[https://en.wikipedia.org/wiki/Halting_problem][halting problem]].

     Most  package managers,  such as  Debian's =apt=  or the  multi-platform
     =conda=, take a different point of view. They define the dependencies of
     a program as all packages that need to be loaded into memory in order to
     run it. They  thus exclude the software that is  required to /build/ the
     program  and  its run-time  dependencies,  but  can then  be  discarded.
     Whereas Guix' definition  errs on the safe side (its  dependency list is
     often  longer than  necessary but  never too  short), the  run-time-only
     definition  is  both  too  vast   and  too  restrictive.  Many  run-time
     dependencies don't  have an impact  on most programs' results,  but some
     build-time dependencies do.


>From my point of view, an essential point of this "worst-case estimate" is:
time travelling. Because the closure is well-defined, it is possible to
restore the complete set of the dependencies. And it is not possible with the
other point of view, if I understand correctly.


     One   important   case   where   build-time   dependencies   matter   is
     floating-point computations. For historical reasons, they are surrounded
     by an  aura of vagueness and  imprecision, which goes back  to its early
     days,  when  many details  were  poorly  understood and  implementations
     varied a lot. Today, all computers used for scientific computing respect
     the [[https://en.wikipedia.org/wiki/IEEE_754][IEEE 754 standard]]
that  precisely defines how floating-point numbers
     are  represented  in memory  and  what  the  result of  each  arithmetic
     operation  must   be.   Floating-point  arithmetic  is   thus  perfectly
     deterministic and even perfectly portable between machines, if expressed
     in terms of the operations  defined by the standard. However, high-level
     languages such as C or Fortran do  not allow programmers to do that. Its
     designers assume (probably correctly) that  most programmers do not want
     to deal with the intricate  details of rounding.  Therefore they provide
     only a  simplified interface  to the arithmetic  operations of  IEE 754,


Missing E at IEEE.


     which incidentally also  provides more liberty for  code optimization to
     compiler writers. The net result is that the complete specification of a
     program's  results  is  its  source  code /plus  the  compiler  and  the
     compilation  options/. You  thus /can/  get reproducible  floating-point
     results if you include all compilation  steps into the perimeter of your
     computation, at least  for code running on a  single processor. Parallel
     computing  is  a different  story:  it  involves voluntarily  giving  up
     reproducibility in  exchange for  speed. Reproducibility then  becomes a
     best-effort  approach   of  limiting  the  collateral   damage  done  by
     optimization through the clever design of algorithms.


It is out of scope and I have never read the IEEE 754 standard, so I do not
know if this simple propagation of errors depends on the compiler suite
and/or the machine.

#+begin_src C
#include <stdio.h>

int main() {
    double x = 0.;

    for (int i = 1; i < 10; i++) {
        x = x + 0.1;
        printf("(%d) x=%0.20f\n", i, x);
    }
    return 0;
}
#+end_src

And I do not know neither if the standard fixes associativity rules when no
parenthesis is provided or if it is up to the compiler.

#+begin_src C
#include <stdio.h>

int main() {
    float x;
    float r1, r2, r3, r4;

    x = 1.0e21;
    r1 = x + 1 - x + 1;
    r2 = (x + 1) - (x - 1);
    r3 = x + (1 - x) + 1;
    r4 = x + (1 - (x - 1));

    printf(" x + 1  -  x + 1 =%f\n", r1);
    printf("(x + 1) - (x - 1)=%f\n", r2);
    printf(" x +(1  -  x)+ 1 =%f\n", r3);
    printf(" x +(1 -  (x+ 1))=%f\n", r4);
    return 0;
}
#+end_src


     * Reproducing a reproducible computation

     So   far,   I   have    explained   the   theory   behind   reproducible
     computations. The  take-home message is that  to be sure to  get exactly
     the same results in the future, you  have to use the exact same versions
     of all packages in the package closure of your immediate dependencies. I
     have also  shown you how you  can access that package  closure. There is
     one missing piece:  how do you actually run your  program in the future,
     using the same environment?

     The good news is that doing this  is a lot simpler than understanding my
     lengthy  explanations (which  is why  I leave  this for  the end!).  The
     complex dependency graphs that I have analyzed up to here are encoded in
     the Guix source  code, so all you need to  re-create your environment is
     the exact same version of Guix!  You get that version using
     #+begin_src sh :results output :exports both
     guix describe
     #+end_src

     #+RESULTS:
     : Generation 15    Jan 06 2020 13:30:45    (current)
     :   guix 769b96b
     :     repository URL: https://git.savannah.gnu.org/git/guix.git
     :     branch: master
     :     commit: 769b96b62e8c09b078f73adc09fb860505920f8f

     The  critical information  here is  the unpleasantly  looking string  of
     hexadecimal digits  after "commit".   This is all  it takes  to uniquely
     identify a version of Guix. And to re-use it in the future, all you need
     is Guix' time machine:

     #+begin_src sh :session reproduce-C-compiler :results output :exports both
     guix time-machine
--commit=769b96b62e8c09b078f73adc09fb860505920f8f -- environment
--ad-hoc gcc-toolchain
     #+end_src

     #+RESULTS:
     :
     : Updating channel 'guix' from Git repository at
'https://git.savannah.gnu.org/git/guix.git'...

     #+begin_src sh :session reproduce-C-compiler :results output :exports both
     gcc pi.c -o pi ./pi
     #+end_src

     #+RESULTS:
     :
     : pi = 3.1415926536
     : 4 * atan(1.): 3.1415926536
     : Leibniz' formula (four terms): 2.8952380952

     The time  machine actually downloads  the specified version of  Guix and
     passes it the rest  of the command line.  You are  running the same code
     again. Even bugs in Guix will be reproduced faithfully!

     For many practical  use cases, this technique is  sufficient.  But there
     are two variants you should know about for more complicated situations:

      - If  you need  an environment  with many  packages, you  should use  a
        manifest rather than  list the packages on the command  line.
See 
[[https://guix.gnu.org/manual/en/html_node/Invoking-guix-environment.html][the
        manual]] for details.

      - If you need packages from additional channels, i.e. packages that are
        not  part of  the  official  Guix distribution,  you  should store  a
        complete channel description in a file using
     #+begin_src sh :results none :exports code
     guix describe -f channels > guix-version-for-reproduction.txt
     #+end_src

     and feed that file to the time machine:
     #+begin_src sh :session reproduce-C-compiler-2 :results output
:exports both
     guix time-machine --channels=guix-version-for-reproduction.txt --
environment --ad-hoc gcc-toolchain
     #+end_src

     #+RESULTS:
     :
     : Updating channel 'guix' from Git repository at
'https://git.savannah.gnu.org/git/guix.git'...

     #+begin_src sh :session reproduce-C-compiler-2 :results output
:exports both
     gcc pi.c -o pi ./pi
     #+end_src

     #+RESULTS:
     :
     : pi = 3.1415926536
     : 4 * atan(1.): 3.1415926536
     : Leibniz' formula (four terms): 2.8952380952


Last, if your colleague does not use yet Guix, then let pack (plain tarball,
Docker or Singularity containers) and provide the image. For example,

#+begin_src sh :results none :exports code
guix pack            \
     -f docker       \
     -C none         \
     -S /bin=bin     \
     -S /lib=lib     \
     -S /share=share \
     -S /etc=etc     \
     gcc-toolchain
#+end_src

and knowing the Guix commit (channel), you will be able in the future to
reproduce bit-to-bit this container using =guix time-machine=.


     And now...  congratulations for having survived  to the end of this long
     journey!  May all your computations be reproducible, with Guix.

Re: Proposal for a blog contribution on reproducible computations

Reply via email to