Re: GNU Coding Standards, automake, and the recent xz-utils backdoor

2024-04-01 Thread Jacob Bachmeyer

Russ Allbery wrote:

[...]

There is extensive ongoing discussion of this on debian-devel.  There's no
real consensus in that discussion, but I think one useful principle that's
emerged that doesn't disrupt the world *too* much is that the release
tarball should differ from the Git tag only in the form of added files.
  


From what I understand, the xz backdoor would have passed this check.  
The backdoor dropper was hidden in test data files that /were/ in the 
repository, and required code in the modified build-to-host.m4 to 
activate it.  The m4 files were not checked into the repository, instead 
being added (presumably by running autogen.sh with a rigged local m4 
file collection) while preparing the release.


Someone with a copy of a crocked release tarball should check if 
configure even had the backdoor "as released" or if the attacker was 
/depending/ on distributions to regenerate configure before packaging xz.



-- Jacob




Re: GNU Coding Standards, automake, and the recent xz-utils backdoor

2024-04-01 Thread Jacob Bachmeyer

Jose E. Marchesi wrote:

Jose E. Marchesi wrote:


[...]



I agree that distcheck is good but not a cure all.  Any static
system can be attacked when there is motive, and unit tests are
easily gamed.
  
  

The issue seems to be releases containing binary data for unit tests,
instead of source or scripts to generate that data.  In this case,
that binary data was used to smuggle in heavily obfuscated object
code.



As a side note, GNU poke (https://jemarch.net/poke) is good for
generating arbitrarily complex binary data from clear textual
descriptions.
  

While it is suitable for that use, at last check poke is itself very
complex, complete with its own JIT-capable VM.  This is good for
interactive use, but I get nervous about complexity in testsuites,
where simplicity can greatly aid debugging, and it /might/ be possible
to hide a backdoor similarly in a poke pickle.  (This seems to be a
general problem with powerful interactive editors.)



Yes, I agree simplicity is very desirable, in testsuites and actually
everywhere else.  I also am not fond of dragging in dependencies.
  


Exactly---I am sure that poke is great for interactive use, but a 
self-contained solution is probably better for a testsuite.



But I suppose we also agree that it is not possible to assemble
non-trivial binary data structures in a simple way, without somehow
moving the complexity of the encoding into some sort of generator, which
will not be simple.  The GDB testsuite, for example, ships with a DWARF
assembler written in around 3000 lines of Tcl.  Sure, it is simpler than
poke and doesn't drag in additional dependencies.  But it has to be
carefully maintained and kept up to date, and the complexity is there.
  


The problem for a compression tool testsuite is that compression formats 
are (I believe) defined as byte-streams or bit-streams.  Further, the 
generator(s) must be able to produce /incorrect/ output as well, in 
order to test error handling.
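
For instance, a generator along these lines (an untested sketch, using a 
made-up length-prefixed record format rather than any real compression 
format) could emit either a valid record or one deliberately cut short, 
so that the decoder's error paths get exercised:

   #include <stdio.h>
   #include <stdlib.h>

   static void fatal (const char *msg)
   {
     fprintf (stderr, "%s\n", msg);
     exit (1);
   }

   static void emit_byte (FILE *out, int byte)
   {
     if (fputc (byte & 0xff, out) == EOF)
       fatal ("error writing test data");
   }

   /* Write a record: a 4-byte little-endian length followed by that many
      0xAA payload bytes.  If TRUNCATE is nonzero, cut the payload short
      so a reader sees an unexpected end of stream. */
   void emit_record (FILE *out, unsigned int length, int truncate)
   {
     unsigned int i, n = truncate ? length / 2 : length;

     for (i = 0; i < 4; i++)
       emit_byte (out, (length >> (8 * i)) & 0xff);
     for (i = 0; i < n; i++)
       emit_byte (out, 0xAA);
   }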



Further, GNU poke defines its own specialized programming language for
manipulating binary data.  Supplying generator programs in C (or C++)
for binary test data in a package that itself uses C (or C++) ensures
that every developer with the skills to improve or debug the package
can also understand the testcase generators.



Here we will have to disagree.

IMO it is precisely the many tricky details of properly marshaling
binary data in general-purpose programming languages that would have
greater odds of leading to difficult-to-understand, difficult-to-maintain,
and possibly buggy or malicious encoders.  The domain-specific language
is here an advantage, not a liability.

This is what you need to do in C to encode and generate test data for a
single signed 32-bit NUMBER in an output file in a _more or less_ portable way:

  void generate_testdata (off_t offset, int endian, int number)
  {
int bin_flag = 0, fd;

  #ifdef _WIN32
int bin_flag = O_BINARY;
  #endif
fd = open ("testdata.bin", bin_flag, S_IWUSR);
if (fd == -1)
  fatal ("error generating data.");

if (endian == BIG)

  {
b[0] = (number >> 24) & 0xff;
b[1] = (number >> 16) & 0xff;
b[2] = (number >> 8) & 0xff;
b[3] = number & 0xff;
  }
else
  {
b[3] = (number >> 24) & 0xff;
b[2] = (number >> 16) & 0xff;
b[1] = (number >> 8) & 0xff;
b[0] = number & 0xff;
  }

lseek (fd, offset, SEEK_SET);
for (i = 0; i < 4; ++i)
  write (fd, &b[i], 1);
close (fd);
  }
  


While that is a nice general solution (aside from neglecting the 
declaration "uint8_t b[4];"; with "int b[4];", the code would only work 
on a little-endian processor; with no declaration, the compiler will 
reject it), a compression format would be expected to define the 
endianness of stored values, so the major branch in that function would 
collapse to just one of its alternatives.  Compression formats are 
generally defined as streams, so a different decomposition of the 
problem would likely make more sense:  (example untested)


   void emit_int32le (FILE * out, int value)
   {
 unsigned int R, i;

 for (R = (unsigned int)value, i = 0; i < 4; R = R >> 8, i++)
   if (fputc(R & 0xff, out) == EOF)
 fatal("error writing int32le");
   }
 

Other code handles opening OUT, or OUT is actually stdout and we are 
writing down a pipe or the shell handled opening the file.  (The main 
function can easily check that stdout is not a terminal and bail out if 
it is.)  Remember that I am suggesting test generator programs, which do 
not need to be as general as ordinary code, nor do they need the same 
level of user-friendliness, since they are expected to be run from 
scripts that encode the precise knowledge of how to call them.  (That 
this version is also probably more efficient by avoiding a syscall for 
every byte written is irrelevant for its intended use.)
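
To be concrete, the surrounding driver could be as small as this (again 
untested; the values written are arbitrary examples):

   #include <stdio.h>
   #include <stdlib.h>
   #include <unistd.h>

   extern void emit_int32le (FILE *out, int value);

   int main (void)
   {
     /* The calling script picks the output file by redirecting stdout;
        refuse to spew binary data at a terminal. */
     if (isatty (STDOUT_FILENO))
       {
         fprintf (stderr, "refusing to write binary test data to a terminal\n");
         return EXIT_FAILURE;
       }

     emit_int32le (stdout, 42);
     emit_int32le (stdout, -1);
     return EXIT_SUCCESS;
   }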




Re: GNU Coding Standards, automake, and the recent xz-utils backdoor

2024-04-01 Thread Jacob Bachmeyer

Zack Weinberg wrote:

[...] but I do think there's a valid point here: the malicious xz
maintainer *might* have been caught earlier if they had committed the
build-to-host.m4 modification to xz's VCS.


That would require someone to notice that xz.git has a build-to-host.m4 
that does not exist anywhere in the history of gnulib.git.  That is a 
fairly complex scan, although it does look straightforward to 
implement.  That said, the m4 files in Gnulib *are* Free Software, so 
having a modified version cannot itself raise too many concerns.



  (Or they might not have!
Witness the three (and counting) malicious patches that they barefacedly
submitted to *other* software and got accepted because the malice was
subtle enough to pass through code review.)
  


Exactly.  :-/

That said, the whole thing looks to me like the attackers were trying to 
/not/ hit the more (what is the best word?) "advanced" users---the 
backdoor would only be inserted if building distribution packages, and 
then only under dpkg or rpm, not other systems like Gentoo's Portage or 
in an unpackaged "./configure && make && sudo make install" build.  This 
would, of course, hit the most widely used systems, including (reports 
are that the sock farm tried very hard to get Ubuntu to ship the crocked 
version in their upcoming release, but the freeze was upheld) the 
systems most commonly used by less technically-skilled users, but 
pointedly exclude systems that require greater skill to use---and whose 
users would be more likely to notice anything amiss and start tearing 
the system apart with the debugger.  Unfortunately for Mr. Sockmaster, 
it turns out that some highly-skilled users *do* use the widely-used 
systems and the backdoor caused sshd to misbehave enough to draw 
suspicion.  (Profiling reports that sshd is spending most of its time in 
liblzma---a library it has no reason to use---will tend to raise a few 
eyebrows.  :-)  )



[...]
  
Maybe the best revision to the GNU Coding Standards would be that 
releases should, if at all possible, contain only text?  Any binary 
files needed for testing can be generated during "make check" if 
necessary



I don't think this is a good idea.  It's only a speed bump for someone
trying to smuggle malicious data into a package (think "base64 -d") and
it makes life substantially harder for honest authors of programs that
work with binary data, and authors of material whose "source code"
(as GPLv3 uses that term) *is* binary data.  Consider pngsuite, for
instance (http://www.schaik.com/pngsuite/) -- it would be a *ton* of
work to convert each of these test PNG files into GNU Poke scripts,
and probably the result would be *less* ergonomic for purposes of
improving the test suite.
  


That is a bad example because SNG (https://sng.sourceforge.net/) 
exists precisely to provide a text representation of PNG binary 
structures.  (Admittedly, if I recall correctly, the contents of IDAT 
are simply a hexdump.)


While we are on the topic, this leaves the other obvious place to hide 
binary data:  images used as part of the manual.  There is a reason that 
I added the "if at all possible" caveat, and I am not certain that it is 
always possible.



I would like to suggest that a more useful policy would be "files
written to $prefix by 'make install' should not have any data
dependency on files labeled as part of the package's testsuite".
That doesn't constrain honest authors and it seems within the
scope of what the reproducible builds people could test for.
(Build the package, install to nonce prefix 1, unpack the tarball
again, delete the test suite, build again, install to prefix 2, compare.)
Of course a sufficiently determined malicious coder could detect
the reproducible-build test environment, but unlike "no binary data"
this is a substantial difficulty increment.


This could be a good idea.  Another way to check this even without 
reproducible builds would be to ensure that the access timestamps on 
testsuite files do not change while "make" is processing the main 
sources.  Checking this is slightly more invasive, since you would need 
to run a hook between processing top-level directories during the main 
build, but for packages using recursive Automake, you could simply run 
"make -C src" (or wherever the main sources are) and make sure that the 
testsuite files still have the same atime afterwards.  I admit that this 
is harder to automate in general, but distribution packaging processes 
already have other metadata that is manually maintained, so identifying 
the source subtrees that yield the installable artifacts should not be 
difficult.
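
The helper needed for such a check is tiny; here is an untested sketch 
(the packaging script would run it over the testsuite files before and 
after "make -C src" and diff the two outputs):

   #include <stdio.h>
   #include <sys/stat.h>

   int main (int argc, char **argv)
   {
     int i, status = 0;

     for (i = 1; i < argc; i++)
       {
         struct stat st;

         /* Print the access time of each file named on the command line. */
         if (stat (argv[i], &st) != 0)
           {
             perror (argv[i]);
             status = 1;
             continue;
           }
         printf ("%ld %s\n", (long) st.st_atime, argv[i]);
       }
     return status;
   }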


Now that I think about it, I suggest tightening that policy a bit 
further:  "files produced by make in the source subtree (typically src/) 
shall have no data dependency on files outside of that tree"


I doubt anyone ever thought that recursive make could end up as 
security/verifiability feature.  8-|



-- Jacob



Re: GNU Coding Standards, automake, and the recent xz-utils backdoor

2024-04-01 Thread Russ Allbery
Jacob Bachmeyer  writes:

> The m4 files were not checked into the repository, instead being added
> (presumably by running autogen.sh with a rigged local m4 file
> collection) while preparing the release.

Ah, yes, I think you are correct.  For some reason I thought the
legitimate build-to-host.m4 had been checked into the repository, but this
is indeed not the case.

-- 
Russ Allbery (ea...@eyrie.org) 



Re: GNU Coding Standards, automake, and the recent xz-utils backdoor

2024-04-01 Thread Richard Stallman
[[[ To any NSA and FBI agents reading my email: please consider]]]
[[[ whether defending the US Constitution against all enemies, ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > `distcheck` target's prominence to recommend it in the "Standard
  > Targets for All Users" section of the GCS? 

  > Replying as an Automake developer, I have nothing against it in
  > principle, but it's clearly up to the GNU coding standards
  > maintainers. As far as I know, that's still rms (for anything
  > substantive)

To make a change in the coding standards calls for a clear and
specific proposal.  If people think a change is desirable, I suggest
making one or more such proposals.

Now for a bit of speculation.  I speculate that a cracker was careless
and failed to adjust certain details of a bogus tar ball to be fully
consistent, and that `make distcheck' enabled someone to notice those
errors.

I don't have any real info about whether that is so.  If my
speculation is mistaken, please say so.  But supposing it is correct:

If we had publicized `make distcheck' more, would that have been
likely to help people detect the bogus tar ball sooner?  Or would it
have been likely to help the cracker be more careful about avoiding
such signs?  Would they balance out?


-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)





role of GNU build system in recent xz-utils backdoor

2024-04-01 Thread Richard Stallman
[[[ To any NSA and FBI agents reading my email: please consider]]]
[[[ whether defending the US Constitution against all enemies, ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > I was recently reading about the backdoor announced in xz-utils the
  > other day, and one of the things that caught my attention was how
  > (ab)use of the GNU build system played a role in allowing the backdoor
  > to go unnoticed: https://openwall.com/lists/oss-security/2024/03/29/4

You've brought up an idea for catching cracks by certain kinds of
mistakes.  Thank you.

Your message seems to say that there is or was some other problem in
one of the GNU autotools.  Without details, I can't be sure you mean
that.

Is that really so?  If so, which programs are involved?
Has it been reported to their maintainers?

I don't want to get involved in fixing the bug, but I want to
make sure the GNU Project is working on it.

-- 
Dr Richard Stallman (https://stallman.org)
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)





Re: libsystemd dependencies

2024-04-01 Thread Jacob Bachmeyer

Bruno Haible wrote:

Jacob Bachmeyer wrote:
  
some of the blame for this needs to fall on the 
systemd maintainers and their "katamari" architecture.  There is no good 
reason for notifications of daemon startup to pull in liblzma, but using 
libsystemd for that purpose does exactly that, and ended up getting 
xz-utils targeted as a means of getting to sshd without the OpenSSH 
maintainers noticing.



The systemd people are working on reducing the libsystemd dependencies:
https://github.com/systemd/systemd/issues/32028

However, the question remains unanswered why it needs 3 different
compression libraries (liblzma, libzstd, and liblz4). Why would one
not suffice?
  


From reading other discussions, the only reason libsystemd pulls in 
compression libraries at all is its "katamari" architecture:  the 
systemd journal can be optionally compressed with any of those 
algorithms, and the support for reading the journal (which libsystemd 
also provides) therefore requires support for all of them.  No, sshd 
(even with the distribution patches at issue) does /not/ use that 
support whatsoever.


Better design would split libsystemd into separate libraries:  
libsystemd-notify, libsystemd-journal, etc.  I suspect that there are 
more logically distinct modules that have been "katamaried" into one 
excessively large library.  The C runtime library has an excuse for 
being such an agglomeration, but also note that libc has *zero* hard 
external dependencies.  You can ridicule NSS if you like, but NSS 
modules are only loaded if NSS is used.  (To be fair, sshd almost 
certainly /does/ use functions provided by NSS.)  The systemd developers 
do not have that excuse, and their library *has* external dependencies.


I believe the systemd developers cite convenience as justification for 
the practice, because apparently figuring out which libraries (out of a 
set partitioned based on functionality) you need to link is "too hard" 
for developers these days.  (Perhaps that is the real reason they want 
to replace X11?)  That "convenience" nearly got all servers on the 
Internet running the major distributions backdoored with critical 
severity and we do not yet know exactly what the backdoor blob did.  The 
preliminary reports that it was an RCE backdoor that would pass commands 
smuggled in public key material in SSH certificates to system(3) (as 
root of course, since that is sshd's context at that stage) are 
inconsistent with the slowdown that caused the backdoor to be 
discovered.  I doubt that SSH logins were using that code path, and the 
SSH scanning botnets almost certainly are not presenting certificates, 
yet it apparently (reports have been unclear on this point) was the 
botnet scanning traffic that led to the discovery of sshd wasting 
considerable CPU time in liblzma...


I am waiting for the proverbial other shoe to drop on that one.


-- Jacob



Re: GNU Coding Standards, automake, and the recent xz-utils backdoor

2024-04-01 Thread Jacob Bachmeyer

Zack Weinberg wrote:

On Mon, Apr 1, 2024, at 2:04 PM, Russ Allbery wrote:
  

"Zack Weinberg"  writes:


It might indeed be worth thinking about ways to minimize the
difference between the tarball "make dist" produces and the tarball
"git archive" produces, starting from the same clean git checkout,
and also ways to identify and audit those differences.
  

There is extensive ongoing discussion of this on debian-devel. There's
no real consensus in that discussion, but I think one useful principle
that's emerged that doesn't disrupt the world *too* much is that the
release tarball should differ from the Git tag only in the form of
added files. Any files that are present in both Git and in the release
tarball should be byte-for-byte identical.



That dovetails nicely with something I was thinking about myself.
Obviously the result of "make dist" should be reproducible except for
signatures; to the extent it isn't already, those are bugs in automake.
But also, what if "make dist" produced *two* disjoint tarballs? One of
which is guaranteed to be byte-for-byte identical to an archive of the
VCS at the release tag (in some clearly documented fashion; AIUI, "git
archive" does *not* do what we want).  The other contains all the files
that "autoreconf -i" or "./bootstrap.sh" or whatever would create, but
nothing else.  Diffs could be provided for both tarballs, or only for
the VCS-archive tarball, whichever turns out to be more compact (I can
imagine the diff for the generated-files tarball turning out to be
comparable in size to the generated-files tarball itself).


The way to do that is to detect that "make dist" is being run in a VCS 
checkout, ask the VCS which files are in version control, and assume the 
others were somehow "brought in" by autogen.sh or whatever.  The problem 
is that now Automake needs to start growing support for varying version 
control systems, unless we /really/ want to say that this feature only 
works with Git.


The problem is that now the disjoint tarballs both need to be unpacked 
in the same directory to build the package and once that is done, how 
does "make dist" rebuild the distribution it was run from?  The file 
lists would need to be stored in the generated-files tarball.


The other problem is that this really needs to be an option.  DejaGnu, 
for example, stores the Autotools-generated files in Git and releases 
are just snapshots of the working tree.  (DejaGnu can also now *run* 
from a Git checkout without actually installing it, but that is a 
convenience limited to interpreted languages.)


Lastly, publishing a modified (third-party) distribution derived from a 
release instead of VCS *is* permitted.  (I believe this is a case of 
freedom 3.)  How would this feature interact with that?



-- Jacob




Re: GNU Coding Standards, automake, and the recent xz-utils backdoor

2024-04-01 Thread Eric Gallager
On Tue, Apr 2, 2024 at 12:04 AM Jacob Bachmeyer  wrote:
>
> Russ Allbery wrote:
> > [...]
> >
> > There is extensive ongoing discussion of this on debian-devel.  There's no
> > real consensus in that discussion, but I think one useful principle that's
> > emerged that doesn't disrupt the world *too* much is that the release
> > tarball should differ from the Git tag only in the form of added files.
> >
>
>  From what I understand, the xz backdoor would have passed this check.
> The backdoor dropper was hidden in test data files that /were/ in the
> repository, and required code in the modified build-to-host.m4 to
> activate it.  The m4 files were not checked into the repository, instead
> being added (presumably by running autogen.sh with a rigged local m4
> file collection) while preparing the release.
>
> Someone with a copy of a crocked release tarball should check if
> configure even had the backdoor "as released" or if the attacker was
> /depending/ on distributions to regenerate configure before packaging xz.
>
>
> -- Jacob
>

I would like to clarify that my purpose in starting this thread wasn't
so much to ask, "How could the xz backdoor specifically have been
prevented?" (which seems pretty clearly impossible) but rather, "How
can we use this incident as inspiration for general-purpose
improvements to the GNU Coding Standards and related tools?" In other
words, even if a proposal wouldn't have stopped this particular
attack, I don't think that's a reason not to try it.



Re: automated release building service

2024-04-01 Thread Jacob Bachmeyer

Bruno Haible wrote:

Jacob Bachmeyer wrote:
  

Essentially, this would be an automated release building service:  upon
request, make a Git checkout, run autogen.sh or equivalent, make dist,
and publish or hash the result.  The problem is that an attacker who
manages to gain commit access to a repository may be able to launch
attacks on the release building service, since "make dist" can run
scripts.  The service could probably mount the working filesystem noexec
since preparing source releases should not require running (non-system)
binaries and scripts can be run by directly feeding them into their
interpreters even if the filesystem is mounted noexec, but this still
leaves all available interpreters and system tools potentially available.



Well, it'd at least make things more difficult for the attacker, even
if it wouldn't stop them completely.
  
  
Actually, no, it would open a *new* target for attackers---the release 
building service itself.  Mounting the scratchpad noexec would help to 
complicate attacks on that service, but right now there is *no* central 
point for an attacker to hit to compromise releases.  If a central 
release building service were set up, it would be a target, and an 
attacker able to arrange a persistent compromise of the service could 
then tamper with later releases as they are built.  This should be 
fairly easy to catch, if an honest maintainer has a secure environment, 
("Why the  does the central release service tarball not match mine?  
And what the  is the extra code in this diff between its tarball 
and mine!?") but there is a risk that, especially for large projects, 
maintainers start relying on the central release service instead of 
building their own tarballs.


The problem here was not a maintainer with a compromised system---it 
seems that "Jia Tan" was a malefactor's sock puppet from the start.



There are several problems that such an automated release building service
would create. Here are a couple of them:

* First of all, it's a statement of mistrust towards the developer/maintainer,
  if developers get pressured into using an automated release building
  service rather than producing the tarballs on their own.
  This demotivates and turns off developers, and it does not fix the
  original problem: If a developer is in fact a malefactor, they can
  also commit malicious code; they don't need to own the release process
  in order to do evil things.
  


Limiting trust also limits the value of an attack, thus protecting the 
developers/maintainers from at least sane attackers in some ways.  I 
also think that this point misunderstands the original proposal (or I 
have misunderstood it).  To some extent, projects using Automake already 
have that automated release building service; we call it "make dist" and 
it is a distributed service running on each maintainer's machine, 
including the machines of distribution package maintainers who 
regenerate the Autotools files.  A compromise of a developer's machine 
is thus valuable because it allows tampering with releases, but the risk 
is managed somewhat by each developer building only their own releases.


A central service as a "second opinion" would be a risk, but would also 
make those compromises even more difficult---now the attacker must hit 
both the central service *and* the dev box *and* coordinate to ensure 
that only packages prepared at the central service for which the 
maintainer's own machine is cracked are tampered, lest the whole thing 
be detected.  This is even harder on the attacker, which is a good 
thing, of course.


The more dangerous risk is that the central service becomes overly 
trusted and ceases to be merely the "second opinion" on a release.  If 
that occurs, not only would we be right back to no real check on the 
process, but now *all* the releases come from one place.  A compromise 
of the central release service would then allow *all* releases to be 
tampered with, which is considerably more valuable to an attacker.



* Such an automated release building service is a piece of SaaSS. I can
  hardly imagine how we at GNU tell people "SaaSS is as bad as, or worse
  than, proprietary software" and at the same time advocate the use of
  such a service.
  


As long as it runs on published Free Software and anyone is free to set 
up their own instance, I would disagree here.  I think we need to work 
out where the line between "hosting" and "SaaSS" actually is, and I am 
not sure that it has a clear technical description, since SaaSS is 
ultimately an ethical issue.



* Like Jacob mentioned, such a service quickly becomes a target for
  attackers. So, instead of trusting a developer, you now need to trust
  the technical architecture and the maintainers of such a service.
  


I think I may know an example of something similar:  if I recall 
correctly, F-Droid originally would only distribute apps built on their 
own compile farm, to guard against malicious 

Re: GNU Coding Standards, automake, and the recent xz-utils backdoor

2024-04-01 Thread Jose E. Marchesi


> Jose E. Marchesi wrote:
>>> [...]
>>> 
 I agree that distcheck is good but not a cure all.  Any static
 system can be attacked when there is motive, and unit tests are
 easily gamed.
   
>>> The issue seems to be releases containing binary data for unit tests,
>>> instead of source or scripts to generate that data.  In this case,
>>> that binary data was used to smuggle in heavily obfuscated object
>>> code.
>>> 
>>
>> As a side note, GNU poke (https://jemarch.net/poke) is good for
>> generating arbitrarily complex binary data from clear textual
>> descriptions.
>
> While it is suitable for that use, at last check poke is itself very
> complex, complete with its own JIT-capable VM.  This is good for
> interactive use, but I get nervous about complexity in testsuites,
> where simplicity can greatly aid debugging, and it /might/ be possible
> to hide a backdoor similarly in a poke pickle.  (This seems to be a
> general problem with powerful interactive editors.)

Yes, I agree simplicity is very desirable, in testsuites and actually
everywhere else.  I also am not fond of dragging in dependencies.

But I suppose we also agree that it is not possible to assemble
non-trivial binary data structures in a simple way, without somehow
moving the complexity of the encoding into some sort of generator, which
will not be simple.  The GDB testsuite, for example, ships with a DWARF
assembler written in around 3000 lines of Tcl.  Sure, it is simpler than
poke and doesn't drag in additional dependencies.  But it has to be
carefully maintained and kept up to date, and the complexity is there.

> Further, GNU poke defines its own specialized programming language for
> manipulating binary data.  Supplying generator programs in C (or C++)
> for binary test data in a package that itself uses C (or C++) ensures
> that every developer with the skills to improve or debug the package
> can also understand the testcase generators.

Here we will have to disagree.

IMO it is precisely the many tricky details of properly marshaling
binary data in general-purpose programming languages that would have
greater odds of leading to difficult-to-understand, difficult-to-maintain,
and possibly buggy or malicious encoders.  The domain-specific language
is here an advantage, not a liability.

This is what you need to do in C to encode and generate test data for a
single signed 32-bit NUMBER in an output file in a _more or less_ portable way:

  void generate_testdata (off_t offset, int endian, int number)
  {
int bin_flag = 0, fd;

  #ifdef _WIN32
int bin_flag = O_BINARY;
  #endif
fd = open ("testdata.bin", bin_flag, S_IWUSR);
if (fd == -1)
  fatal ("error generating data.");

if (endian == BIG)
  {
b[0] = (number >> 24) & 0xff;
b[1] = (number >> 16) & 0xff;
b[2] = (number >> 8) & 0xff;
b[3] = number & 0xff;
  }
else
  {
b[3] = (number >> 24) & 0xff;
b[2] = (number >> 16) & 0xff;
b[1] = (number >> 8) & 0xff;
b[0] = number & 0xff;
  }

lseek (fd, offset, SEEK_SET);
for (i = 0; i < 4; ++i)
  write (fd, &b[i], 1);
close (fd);
  }

This is the Poke equivalent:

  fun generate_testdata = (offset,B> off, int<32> endian, int<32> 
number) void:
  {
var fd = open ("testdata.bin");
set_endian (endian);
int<32> @ fd : off = number;
close (fd);
  }

And thanks to the DSL, this scales nicely to more complex structures,
such as an ELF64 relocation instead of a signed 32-bit integer:

  fun generate_testdata = (offset,B> off, int<32> endian, int<32> 
number) void:
  {
type Elf64_RelInfo =
  struct Elf64_Xword
  {
uint<32> r_sym;
uint<32> r_type;
  };

type Elf64_Rela =
  struct
  {
offset,B> r_offset;
Elf64_RelInfo r_info;
offset,B> r_addend;
  };

var fd = open ("got32reloc.bin");
set_endian (endian);
Elf64_Rela @ 0#B
  = Elf64_Rela { r_info = Elf64_RelInfo { r_sym = 0xff00, r_type = 3 } }
close (fd);
  }



Re: GNU Coding Standards, automake, and the recent xz-utils backdoor

2024-04-01 Thread Russ Allbery
"Zack Weinberg"  writes:

> I have been thinking about this incident and this thread all weekend and
> have seen a lot of people saying things like "this is more proof that
> tarballs are a thing of the past and everyone should just build straight
> from git".  There are a bunch of reasons why one might disagree with
> this as a blanket statement, but I do think there's a valid point here:
> the malicious xz maintainer *might* have been caught earlier if they had
> committed the build-to-host.m4 modification to xz's VCS.  (Or they might
> not have!  Witness the three (and counting) malicious patches that they
> barefacedly submitted to *other* software and got accepted because the
> malice was subtle enough to pass through code review.)

> It might indeed be worth thinking about ways to minimize the difference
> between the tarball "make dist" produces and the tarball "git archive"
> produces, starting from the same clean git checkout, and also ways to
> identify and audit those differences.

There is extensive ongoing discussion of this on debian-devel.  There's no
real consensus in that discussion, but I think one useful principle that's
emerged that doesn't disrupt the world *too* much is that the release
tarball should differ from the Git tag only in the form of added files.
Any files that are present in both Git and in the release tarball should
be byte-for-byte identical.  That, in turn, allows distro tooling to
either use the Git tag and regenerate all the generated files, or start
from the release tarball, remove all the added files, and do the same.
But it still preserves an augmented release tarball for people building
from scratch who may not have all of the necessary tools available.

It's not a panacea (there are no panaceas), but it's less aggressive and
disruptive than some other ideas that have been proposed, and I think it's
mostly best practice already.

-- 
Russ Allbery (ea...@eyrie.org) 



Should the GNU Coding Standards make a recommendation about aclocal's `--install` flag? (was: "Re: GNU Coding Standards, automake, and the recent xz-utils backdoor")

2024-04-01 Thread Eric Gallager
On Sun, Mar 31, 2024 at 6:19 PM Peter Johansson  wrote:
>
>
> On 1/4/24 06:00, Eric Gallager wrote:
>
> So, `aclocal` has a flag to control this behavior: specifically, its
> `--install` flag. Right now I don't see `aclocal` mentioned in the GNU
> Coding Standards at all. Should they be updated to include a
> recommendation as to whether it's better to put `--install` in
> `ACLOCAL_AMFLAGS` or not? Or would such a recommendation be a better
> fit for the `automake` manual (since that's where `aclocal` comes
> from)?
>
> A common scenario is that the embedded M4 files are not the latest version 
> and that the code in configure.ac is not compatible with newer versions that 
> might be installed. Setting the --install flag and making every developer 
> bootstrap with 'aclocal --install', or having anyone try to bootstrap an old 
> version of the project, would be very fragile. Also, 'aclocal --install' only 
> overwrites the embedded copy if the serial numbers in the files suggest the 
> installed file is a newer version than the embedded M4 file.

Note that there's some discussion ongoing on the bug-autoconf and
bug-gnulib mailing lists (which I'm not subscribed to, but will read
via the archives occasionally) regarding whether aclocal's current
handling of serial numbers is the correct way to behave or not, see
for example starting here:
https://lists.gnu.org/archive/html/bug-autoconf/2024-04/msg3.html

>
> Peter



Re: GNU Coding Standards, automake, and the recent xz-utils backdoor

2024-04-01 Thread Eric Gallager
On Mon, Apr 1, 2024 at 2:26 PM Zack Weinberg  wrote:
>
> On Mon, Apr 1, 2024, at 2:04 PM, Russ Allbery wrote:
> > "Zack Weinberg"  writes:
> >> It might indeed be worth thinking about ways to minimize the
> >> difference between the tarball "make dist" produces and the tarball
> >> "git archive" produces, starting from the same clean git checkout,
> >> and also ways to identify and audit those differences.
> >
> > There is extensive ongoing discussion of this on debian-devel. There's
> > no real consensus in that discussion, but I think one useful principle
> > that's emerged that doesn't disrupt the world *too* much is that the
> > release tarball should differ from the Git tag only in the form of
> > added files. Any files that are present in both Git and in the release
> > tarball should be byte-for-byte identical.
>
> That dovetails nicely with something I was thinking about myself.
> Obviously the result of "make dist" should be reproducible except for
> signatures; to the extent it isn't already, those are bugs in automake.
> But also, what if "make dist" produced *two* disjoint tarballs? One of
> which is guaranteed to be byte-for-byte identical to an archive of the
> VCS at the release tag (in some clearly documented fashion; AIUI, "git
> archive" does *not* do what we want).

Thinking about how to implement this: so, currently automake variables
have (at least) 2 special prefixes (that I can think of at the moment)
that control various automake behaviors: "dist" or "nodist" to control
inclusion in the distribution, and "noinst" to prevent installation.
What about a 3rd one of these prefixes: "novcs", to teach automake
about which files belong in VCS or not? i.e. then you might have a
variable name like:
dist_novcs_DATA = foo bar baz
...which would indicate that foo, bar, and baz are data files that
ought to be distributed in the release tarball, but not in the
VCS-based one? Or would it be easier to just teach automake to read
.gitignore files and the like so that it can get that information from
there?

> The other contains all the files that "autoreconf -i" or "./bootstrap.sh"
> or whatever would create, but nothing else.  Diffs could be provided
> for both tarballs, or only for the VCS-archive tarball, whichever turns
> out to be more compact (I can imagine the diff for the generated-files
> tarball turning out to be comparable in size to the generated-files
> tarball itself).
>
> This should make it much easier to find, and therefore audit, the pre-
> generated files, and to validate that there's no overlap. It would add
> an extra step for people who want to build from tarball, without having
> to install autoconf (or whatever) first -- but an easier extra step
> than, y'know, installing autoconf. :)  Conversely, people who want to
> build from tarballs but *not* use the pre-generated configure, etc,
> could now download the 'bare' tarball only.
>
> ("Couldn't those people just build from a git checkout?"  Not if they
> don't have the tooling for it, not during early stages of a distribution
> bootstrap, etc.  Also, the act of publishing a tarball that's a golden
> copy of the VCS at the release tag is valuable for archival purposes.)
>

Agreed on these points.

> zw



Re: GNU Coding Standards, automake, and the recent xz-utils backdoor

2024-04-01 Thread Bruno Haible
Eric Gallager wrote:
> What about a 3rd one of these prefixes: "novcs", to teach automake
> about which files belong in VCS or not? i.e. then you might have a
> variable name like:
> dist_novcs_DATA = foo bar baz
> ...which would indicate that foo, bar, and baz are data files that
> ought to be distributed in the release tarball, but not in the
> VCS-based one?

The maintainer already decides which files to put under version control,
on a per-file basis ('git add' vs. 'git rm'). Why should a maintainer
duplicate this information in a Makefile.am? The lists can then diverge,
leading to hassle.

> Or would it be easier to just teach automake to read
> .gitignore files and the like so that it can get that information from
> there?

Of course, if you want to have a Makefile target that needs the
information whether some file is in VCS, it should use 'git' commands
(such as 'git status') to determine this information. Whether it
additionally should read .gitignore files, can be debated on a case-by-case
basis.

Bruno






Re: GNU Coding Standards, automake, and the recent xz-utils backdoor

2024-04-01 Thread Zack Weinberg
On Mon, Apr 1, 2024, at 2:04 PM, Russ Allbery wrote:
> "Zack Weinberg"  writes:
>> It might indeed be worth thinking about ways to minimize the
>> difference between the tarball "make dist" produces and the tarball
>> "git archive" produces, starting from the same clean git checkout,
>> and also ways to identify and audit those differences.
>
> There is extensive ongoing discussion of this on debian-devel. There's
> no real consensus in that discussion, but I think one useful principle
> that's emerged that doesn't disrupt the world *too* much is that the
> release tarball should differ from the Git tag only in the form of
> added files. Any files that are present in both Git and in the release
> tarball should be byte-for-byte identical.

That dovetails nicely with something I was thinking about myself.
Obviously the result of "make dist" should be reproducible except for
signatures; to the extent it isn't already, those are bugs in automake.
But also, what if "make dist" produced *two* disjoint tarballs? One of
which is guaranteed to be byte-for-byte identical to an archive of the
VCS at the release tag (in some clearly documented fashion; AIUI, "git
archive" does *not* do what we want).  The other contains all the files
that "autoreconf -i" or "./bootstrap.sh" or whatever would create, but
nothing else.  Diffs could be provided for both tarballs, or only for
the VCS-archive tarball, whichever turns out to be more compact (I can
imagine the diff for the generated-files tarball turning out to be
comparable in size to the generated-files tarball itself).

This should make it much easier to find, and therefore audit, the pre-
generated files, and to validate that there's no overlap. It would add
an extra step for people who want to build from tarball, without having
to install autoconf (or whatever) first -- but an easier extra step
than, y'know, installing autoconf. :)  Conversely, people who want to
build from tarballs but *not* use the pre-generated configure, etc,
could now download the 'bare' tarball only.

("Couldn't those people just build from a git checkout?"  Not if they
don't have the tooling for it, not during early stages of a distribution
bootstrap, etc.  Also, the act of publishing a tarball that's a golden
copy of the VCS at the release tag is valuable for archival purposes.)

zw



Re: GNU Coding Standards, automake, and the recent xz-utils backdoor

2024-04-01 Thread Zack Weinberg
On Sun, Mar 31, 2024, at 3:17 AM, Jacob Bachmeyer wrote:
> Eric Gallager wrote:
>> Specifically, what caught my attention was how the release tarball
>> containing the backdoor didn't match the history of the project in its
>> git repository. That made me think about automake's `distcheck`
>> target, whose entire purpose is to make it easier to verify that a
>> distribution tarball can be rebuilt from itself and contains all the
>> things it ought to contain.
>
> The problem is that a release tarball is a freestanding object, with no 
> dependency on the repository from which it was produced.  In this case, 
> the attacker added a bogus "update" of build-to-host.m4 from gnulib to 
> the release tarball, but that file is not stored in the Git repository.  
> This would not have tripped "make distcheck" because the crocked tarball 
> can indeed be used to rebuild another crocked tarball.
>
> As Alexandre Oliva mentioned in his reply, there is not really any good 
> way to prevent this, since the attacker could also patch the generated 
> configure script more directly.

I have been thinking about this incident and this thread all weekend and
have seen a lot of people saying things like "this is more proof that tarballs
are a thing of the past and everyone should just build straight from git".
There are a bunch of reasons why one might disagree with this as a blanket
statement, but I do think there's a valid point here: the malicious xz
maintainer *might* have been caught earlier if they had committed the
build-to-host.m4 modification to xz's VCS.  (Or they might not have!
Witness the three (and counting) malicious patches that they barefacedly
submitted to *other* software and got accepted because the malice was
subtle enough to pass through code review.)

It might indeed be worth thinking about ways to minimize the difference
between the tarball "make dist" produces and the tarball "git archive"
produces, starting from the same clean git checkout, and also ways to
identify and audit those differences.

...
> Maybe the best revision to the GNU Coding Standards would be that 
> releases should, if at all possible, contain only text?  Any binary 
> files needed for testing can be generated during "make check" if 
> necessary

I don't think this is a good idea.  It's only a speed bump for someone
trying to smuggle malicious data into a package (think "base64 -d") and
it makes life substantially harder for honest authors of programs that
work with binary data, and authors of material whose "source code"
(as GPLv3 uses that term) *is* binary data.  Consider pngsuite, for
instance (http://www.schaik.com/pngsuite/) -- it would be a *ton* of
work to convert each of these test PNG files into GNU Poke scripts,
and probably the result would be *less* ergonomic for purposes of
improving the test suite.

I would like to suggest that a more useful policy would be "files
written to $prefix by 'make install' should not have any data
dependency on files labeled as part of the package's testsuite".
That doesn't constrain honest authors and it seems within the
scope of what the reproducible builds people could test for.
(Build the package, install to nonce prefix 1, unpack the tarball
again, delete the test suite, build again, install to prefix 2, compare.)
Of course a sufficiently determined malicious coder could detect
the reproducible-build test environment, but unlike "no binary data"
this is a substantial difficulty increment.

zw



Re: automated release building service

2024-04-01 Thread Alfred M. Szmidt
   * Such an automated release building service is a piece of SaaSS.

CI is not SaaSS, how is it different?

 I can
 hardly imagine how we at GNU tell people "SaaSS is as bad as, or worse
 than, proprietary software" and at the same time advocate the use of
 such a service.

Unnecessary hyperbole and FUD; nobody is claiming anything of the
sort.



Re: automated release building service

2024-04-01 Thread Bruno Haible
Jacob Bachmeyer wrote:
> >> Essentially, this would be an automated release building service:  upon
> >> request, make a Git checkout, run autogen.sh or equivalent, make dist,
> >> and publish or hash the result.  The problem is that an attacker who
> >> manages to gain commit access to a repository may be able to launch
> >> attacks on the release building service, since "make dist" can run
> >> scripts.  The service could probably mount the working filesystem noexec
> >> since preparing source releases should not require running (non-system)
> >> binaries and scripts can be run by directly feeding them into their
> >> interpreters even if the filesystem is mounted noexec, but this still
> >> leaves all available interpreters and system tools potentially available.
> >> 
> >
> > Well, it'd at least make things more difficult for the attacker, even
> > if it wouldn't stop them completely.
> >   
> 
> Actually, no, it would open a *new* target for attackers---the release 
> building service itself.  Mounting the scratchpad noexec would help to 
> complicate attacks on that service, but right now there is *no* central 
> point for an attacker to hit to compromise releases.  If a central 
> release building service were set up, it would be a target, and an 
> attacker able to arrange a persistent compromise of the service could 
> then tamper with later releases as they are built.  This should be 
> fairly easy to catch, if an honest maintainer has a secure environment, 
> ("Why the  does the central release service tarball not match mine?  
> And what the  is the extra code in this diff between its tarball 
> and mine!?") but there is a risk that, especially for large projects, 
> maintainers start relying on the central release service instead of 
> building their own tarballs.
> 
> The problem here was not a maintainer with a compromised system---it 
> seems that "Jia Tan" was a malefactor's sock puppet from the start.

There are several problems that such an automated release building service
would create. Here are a couple of them:

* First of all, it's a statement of mistrust towards the developer/maintainer,
  if developers get pressured into using an automated release building
  service rather than producing the tarballs on their own.
  This demotivates and turns off developers, and it does not fix the
  original problem: If a developer is in fact a malefactor, they can
  also commit malicious code; they don't need to own the release process
  in order to do evil things.

* Such an automated release building service is a piece of SaaSS. I can
  hardly imagine how we at GNU tell people "SaaSS is as bad as, or worse
  than, proprietary software" and at the same time advocate the use of
  such a service.

* Like Jacob mentioned, such a service quickly becomes a target for
  attackers. So, instead of trusting a developer, you now need to trust
  the technical architecture and the maintainers of such a service.

* If this automated release building service is effective in the sense
  that it eliminates evil actions from the developer, it must have extra
  complexity, to allow testing of the tarballs before they get published.
  Think about it: who will set the release tag on the git repository
  and publish that ("git push --tags")?
- If the developer does it, then the developer has the power to
  move the git tag, which implies that the published tarballs
  (from the build service) will not match the contents of the git
  repository at that tag.
- So, it has to be the build service which sets and pushes the
  git tag. But it needs to allow for the possibility to do release
  tarball testing and thus canceling/withdrawing the release before
  it gets published.
  It does get complicated...

* The OpenSSF is already pushing for such a release build service,
  through "OpenSSF scorecards". Summary [1]:
"Scorecard is an automated tool from the OpenSSF that assesses
 19 different vectors with heuristics ("checks") associated with
 important software security aspects and assigns each check
 a score of 0-10.…"
  - They are pretending that their criteria guard against "malicious
maintainers" [2]. However, in the xz case [3] they failed: they
assigned a good score, despite binary blobs in the repository.
  - Their tool pushes the developers to using GitHub. [2].
  - Their tool makes it clear that such a release build service requires
consideration of "token permissions" and "branch protections" [3].

Bruno

[1] https://openssf.org/
[2] https://securityscorecards.dev/
[3] https://securityscorecards.dev/viewer/?uri=github.com/tukaani-project/xz






Re: libsystemd dependencies

2024-04-01 Thread Bruno Haible
Jacob Bachmeyer wrote:
> some of the blame for this needs to fall on the 
> systemd maintainers and their "katamari" architecture.  There is no good 
> reason for notifications of daemon startup to pull in liblzma, but using 
> libsystemd for that purpose does exactly that, and ended up getting 
> xz-utils targeted as a means of getting to sshd without the OpenSSH 
> maintainers noticing.

The systemd people are working on reducing the libsystemd dependencies:
https://github.com/systemd/systemd/issues/32028

However, the question remains unanswered why it needs 3 different
compression libraries (liblzma, libzstd, and liblz4). Why would one
not suffice?






Re: automated release building service

2024-04-01 Thread Tomas Volf
I am not arguing for the building service, but:

On 2024-04-01 14:40:20 +0200, Bruno Haible wrote:
> * Such an automated release building service is a piece of SaaSS. I can
>   hardly imagine how we at GNU tell people "SaaSS is as bad as, or worse
>   than, proprietary software" and at the same time advocate the use of
>   such a service.

Would it still be SaaSS if the full source code to such a platform was published
under GPL and the only "secret sauce" would be the signing key to certify that
the archive in question was produced by the GNU's (FSF's?) instance of such a
platform?

Have a nice day,
Tomas Volf

--
There are only two hard things in Computer Science:
cache invalidation, naming things and off-by-one errors.

