bug#80658: questions about "shuf" utility

2026-03-23 Thread Terence Kelly




Hi Paul & Collin,

Thanks for your quick and detailed replies!

Regarding true-random sources versus PRNGs, I recommend that the shuf 
documentation should educate users about the qualitative difference 
between the two.  Any PRNG that accepts a fixed-length seed inevitably 
throttles the entropy brought to bear on whatever problem the PRNG's 
output is trying to solve (equiprobable selection of a random permutation 
in the case of shuf).  As my "Zero Tolerance" paper points out, a 
20,000-bit seed isn't enough entropy to equiprobably shuffle 2,087 items.


My latest paper presents another argument ("balls into bins") showing why 
*any* PRNG inevitably introduces enormous biases into a related 
combinatorial selection problem:


https://spawn-queue.acm.org/doi/pdf/10.1145/3778029

Cryptographic security is irrelevant to the simple arguments that 
establish that the use of *any* PRNG precludes equiprobability for both 
shuffling and random subset selection (a.k.a. sampling).


One way to present this fact to users would be to say that by using a 
"good" PRNG, you're giving up / relaxing the equiprobability requirement 
--- hopefully in ways that won't harm your application.  Let users make 
fully informed decisions regarding the tradeoffs of true-RNG versus PRNG.


Thanks!

-- Terence


On Mon, 23 Mar 2026, Paul Eggert wrote:


On 2026-03-22 23:30, Collin Funk wrote:


the RDSEED and RDRAND instructions have had some quite severe
issues. E.g. random numbers being stored in a buffer that could be read
from other cores [1].


Yes, we'd need assurance of reliability for the underlying RNG. As I 
understand it, the hardware folks are getting there, in the sense that the 
hardware RNGs are good enough for Coreutils and there are known mitigations 
for the bugs we know about. Likewise for what Coreutils currently uses: by 
default it relies on GNU/Linux getrandom calls despite the presence of 
CVE-2025-0577 .




Recently, AMD has had a
severe issue with RDSEED consistently returning zero [2].


That would be an issue if we used the broken RDSEED implementations. Luckily, 
there's a simple workaround: use 64-bit RDSEED, which is what we should use 
anyway.




Also, I was curious about the use of ISAAC. As far as I can tell it was
chosen for it's speed and being cryptographically secure. I guess I
would have to benchmark it, but I wonder how it compares to ChaCha20 as
used by OpenBSD's and Linux's (among others) /dev/random files. It also
has the benefit of more recent cryptanalysis, and more cryptanalysis in
general since it is used in many protocols, e.g., TLS.


The code uses ISAAC because it was written before ChaCha20 was invented. If 
we had to do it over again now, we'd likely use ChaCha, though perhaps not 
ChaCha20 as Jean-Philippe Aumasson, who looked into ISAAC (see 
coreutils/gl/lib/randread.c) without finding problems that would affect 
current Coreutils, has said that ChaCha8 is both 2.5× faster than ChaCha20 
and good enough for Coreutils-like applications. See:


Aumasson J-P. 6 years after too much crypto. bfSwA. 2025-11-16. 





bug#80658: questions about "shuf" utility

2026-03-23 Thread Paul Eggert

On 2026-03-22 23:30, Collin Funk wrote:


the RDSEED and RDRAND instructions have had some quite severe
issues. E.g. random numbers being stored in a buffer that could be read
from other cores [1].


Yes, we'd need assurance of reliability for the underlying RNG. As I 
understand it, the hardware folks are getting there, in the sense that 
the hardware RNGs are good enough for Coreutils and there are known 
mitigations for the bugs we know about. Likewise for what Coreutils 
currently uses: by default it relies on GNU/Linux getrandom calls 
despite the presence of CVE-2025-0577 
.




Recently, AMD has had a
severe issue with RDSEED consistently returning zero [2].


That would be an issue if we used the broken RDSEED implementations. 
Luckily, there's a simple workaround: use 64-bit RDSEED, which is what 
we should use anyway.




Also, I was curious about the use of ISAAC. As far as I can tell it was
chosen for it's speed and being cryptographically secure. I guess I
would have to benchmark it, but I wonder how it compares to ChaCha20 as
used by OpenBSD's and Linux's (among others) /dev/random files. It also
has the benefit of more recent cryptanalysis, and more cryptanalysis in
general since it is used in many protocols, e.g., TLS.


The code uses ISAAC because it was written before ChaCha20 was invented. 
If we had to do it over again now, we'd likely use ChaCha, though 
perhaps not ChaCha20 as Jean-Philippe Aumasson, who looked into ISAAC 
(see coreutils/gl/lib/randread.c) without finding problems that would 
affect current Coreutils, has said that ChaCha8 is both 2.5× faster than 
ChaCha20 and good enough for Coreutils-like applications. See:


Aumasson J-P. 6 years after too much crypto. bfSwA. 2025-11-16. 







bug#80658: questions about "shuf" utility

2026-03-22 Thread Collin Funk
Hi Paul,

Paul Eggert  writes:

> On 2026-03-22 16:52, Terence Kelly wrote:
>
>> 1. Can you confirm that shuf uses the Fisher-Yates/Durstenfeld
>> unbiased shuffle algorithm?
>
> Never heard of that name for that algorithm, which I consider obvious.
> As I recall, I independently invented a superset of it in 2006 and
> shuf uses this superset algorithm, which is equivalent to
> Fisher-Yates/Durstenfeld in the special case you're surely thinking
> of. See coreutils/gl/lib/randperm.c's randperm_new.

I assume many different people independently "discovered" the algorithm,
and did not know about the previous works. For example, Knuth's first
edition of "The Art of Computer Programming, Volume 2" credits L. E.
Moses and R. V. Oakford for first publishing it in 1964. The second
edition corrects it to credit R. A. Fisher and F. Yates for publishing
it in 1938.

>> Before I can recommend shuf, however, I must confirm that it is free of 
>> several defects that are often found in other random permutation software.
>
> I suggest recommending coreutils 9.6 (2025-01-17) or later, due to the
> bug fixed here (a bug that's not on your list...):
>
> https://cgit.git.savannah.gnu.org/cgit/coreutils.git/commit/?id=bfbb3ec7f798b179d7fa7b42673e068b18048899
>
> The bug doesn't matter if you use --random-source=FILE, as that option
> bypasses the bug.

Not directly related to Terence's questions, but I had a few
questions/thoughts while looking into 'shuf' recently that his mail
reminded me of. I will just share them here because he may also be
interested.

First, I was curious about this comment from lib/randread.c you wrote in
2011:

/* FIXME: Improve performance by adding support for the RDRAND machine
   instruction if available (e.g., Ivy Bridge processors).  */

I wonder if this is still the case nowadays? As Terence's article
mentions the RDSEED and RDRAND instructions have had some quite severe
issues. E.g. random numbers being stored in a buffer that could be read
from other cores [1]. This was supposedly fixed by Intel with microcode,
however it also adds latency to the instruction. Recently, AMD has had a
severe issue with RDSEED consistently returning zero [2].

Also, I was curious about the use of ISAAC. As far as I can tell it was
chosen for it's speed and being cryptographically secure. I guess I
would have to benchmark it, but I wonder how it compares to ChaCha20 as
used by OpenBSD's and Linux's (among others) /dev/random files. It also
has the benefit of more recent cryptanalysis, and more cryptanalysis in
general since it is used in many protocols, e.g., TLS.

Collin

[1] https://download.vusec.net/papers/crosstalk_sp21.pdf
[2] https://www.amd.com/en/resources/product-security/bulletin/amd-sb-7055.html





bug#80658: questions about "shuf" utility

2026-03-22 Thread Paul Eggert

On 2026-03-22 16:52, Terence Kelly wrote:

1. Can you confirm that shuf uses the Fisher-Yates/Durstenfeld unbiased 
shuffle algorithm?


Never heard of that name for that algorithm, which I consider obvious. 
As I recall, I independently invented a superset of it in 2006 and shuf 
uses this superset algorithm, which is equivalent to 
Fisher-Yates/Durstenfeld in the special case you're surely thinking of. 
See coreutils/gl/lib/randperm.c's randperm_new.



2. Can you confirm that shuf avoids modulo bias --- the infamous and 
widespread bug, "random_number%N" --- when it makes equiprobable 
selections in its implementation of the shuffle algorithm?


Yes. See coreutils/gl/lib/randint.c's randint_genmax.


3. Can you confirm that when shuf's "--random-source=FILE" option is 
used, the specified file is the sole source of random bits for all of 
shuf's behavior, and that if the same random source file is used again, 
holding all other shuf options & inputs constant, then shuf will emit 
the same output?


Yes. See coreutils/gl/lib/randread.c's randread_new and its callers.



Before I can recommend shuf, however, I must confirm that it is free of several 
defects that are often found in other random permutation software.


I suggest recommending coreutils 9.6 (2025-01-17) or later, due to the 
bug fixed here (a bug that's not on your list...):


https://cgit.git.savannah.gnu.org/cgit/coreutils.git/commit/?id=bfbb3ec7f798b179d7fa7b42673e068b18048899

The bug doesn't matter if you use --random-source=FILE, as that option 
bypasses the bug.