Hello Pádraig and all,

I have some concerns about supporting different algorithms in b2sum.
[long email ahead...]
First, usability of multiple algorithms.
----

From a usability POV, it is rarely (never?) the case that making a program multi-threaded or parallel, or optimizing it for a different CPU type, changes the output of the program in a user-noticeable way. That is, "gzip" vs "pigz", "bzip2" vs "pbzip2", "make" vs "make -j", "sort" vs "sort --parallel" - all produce the same output from the user's POV (I'll note that gzip and pigz do produce binary-different output, but from the user's POV the content is equivalent). Similarly, optimizing for 32 bits vs 64 bits should never alter the result from the user's POV.

I'm not saying these optimizations are not worthwhile, but blake2bp and blake2sp are not mere optimizations - in practice each constitutes a completely different checksum. Saying in the manual:

  "blake2bp - This is like ‘blake2b’ but takes advantage of multiple CPU cores."

gives the impression that it is equivalent, and it is not. The warning mentioned in the preceding paragraph will not be noticed by most users, and it's not in the "--help" output or the man page (and even if it were, users would still ignore it).

It's one thing if there's a "gotcha" about program behavior which has existed for decades, preceding standardization, and we have to live with it for backward compatibility. But this is a new feature, and we should avoid making it confusing.
In a previous message you wrote:
> I might also drop the blake2s and blake2sp implementations as 32 bit can get equivalent but somewhat slower functionality with `b2sum -l 256 -a blake2b`
But it is not equivalent - it's a different result:

$ echo a | ./src/b2sum -l256 -ablake2b
be29a54b934581ab434fde713c16db07c3e0124a371daca7c33588be7526630e  -
$ echo a | ./src/b2sum -ablake2s
1e61fdb001508ebd3c559545024e7a58a67aeb124308a24bbd13f7ed09d9f2c7  -

If by "equivalent" you mean "happens to be a digest of the same length, but with a different value", then I fear many non-tech-savvy users would not be aware of this distinction. Coreutils has a long history of dealing with the full spectrum of users, from computer experts to complete novices. I'm sure you have more experience than me with users' confusion, but to me this sounds like inviting trouble.

This also means that if someone with a slow 32-bit computer generates blake2 checksums with "-ablake2s" (the optimal option for him), then he forces any future user of his checksums to use the single-core 32-bit algorithm "blake2s", regardless of any newer hardware they have.

Also, up until now the name of the utility (e.g. "sha1sum") was enough to identify the program. So if someone named their checksum file "data.sha1sum", that was a good hint to use "sha1sum -c" - and the results should be valid. But calling the file "data.b2sum" is no longer sufficient - there are at least 4 variants of "b2sum", and more if we consider "--length"...

Second, the "--length" parameter.
----

This is something that will also be relevant for the future SHA3 and perhaps others, so I think we should take a second look and think ahead.

Up until now, with the existing utilities md5sum, sha1sum, sha224sum, sha384sum, sha512sum, the length of the digest in the checksum file was a usability hint about which algorithm produced it. The length was fixed: a checksum file with 40 hex characters was sha1, 32 was md5, 128 was sha512. Whether we like it or not, it was a good hint. With sha3 and blake2 the digest also defaults to 512 bits, so a 128-character digest no longer uniquely points to sha512 - but that's unavoidable. The bigger problem is that with variable-length digests in the same utility, it becomes much harder to know what the correct parameters are.

I think automatic length detection should be turned on by default, even without "--tag". And since I also believe that machines should work harder than people, it would be nice to have an "--autodetect" kind of parameter that automatically tries multiple algorithms based on the given digest length - it just takes more CPU time, but can save some annoyance for users. (A rough sketch of this idea appears right after the next section.)

Third, multi-threaded scalability.
----

The multi-threaded version of 'blake2bp' uses a fixed number of threads with OpenMP:

  #pragma omp parallel shared(S), num_threads(PARALLELISM_DEGREE)

with PARALLELISM_DEGREE being 4 for 'blake2bp' and 8 for 'blake2sp'. But the implementation is not really scalable (in the sense that adding more threads would make things faster while producing an identical result). On one hand, reducing PARALLELISM_DEGREE (e.g. to 2) results in a different digest, and increasing it (e.g. to 8) results in a segfault (at least on my machine with 4 cores). On the other hand, the fixed value of 4 means 'blake2bp' ALWAYS creates 3 extra threads (calls "clone(2)" three times), regardless of how many cores the machine actually has. This seems like bad form IMHO.

It also means, if I understand the code correctly, that we can never increase the number of threads - it is fixed at four forever, otherwise it will give different checksums. So regardless of future hardware improvements, we're stuck at 4 threads because it was a reasonable value in 2016.
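(Coming back to the "--autodetect" idea from the previous section: below is a rough, purely illustrative sketch of how the hex digest length already narrows the candidates. The table and program are hypothetical - nothing like this exists in coreutils today - and with "--length" in the picture the length can only ever be a first filter, after which the verifier would have to actually try each candidate.)

#include <stdio.h>
#include <string.h>

/* Map the hex-digest length found in a checksum file to the
   algorithms that could have produced it. */
struct candidate { size_t hex_len; const char *algorithms; };

static const struct candidate table[] = {
  { 32,  "md5" },
  { 40,  "sha1" },
  { 56,  "sha224" },
  { 64,  "sha256, blake2s, or blake2b -l256" },
  { 96,  "sha384" },
  { 128, "sha512 or blake2b" },
};

int
main (int argc, char **argv)
{
  size_t n = argc > 1 ? strlen (argv[1]) : 0;
  for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
    if (table[i].hex_len == n)
      {
        printf ("candidate algorithms: %s\n", table[i].algorithms);
        return 0;
      }
  printf ("no fixed-length match; possibly a --length variant\n");
  return 1;
}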
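(And to make the scalability point concrete, here is a toy sketch - the per-lane hash below is a placeholder FNV-style mixer, not blake2b. The lane count is part of the algorithm definition, because it fixes how input blocks are interleaved across lanes; the thread count should be a mere scheduling choice. An "omp for" over lanes, as below, is scheduled onto however many threads the runtime provides and always yields the same digest; the reference blake2bp instead pins lane i to thread i via omp_get_thread_num(), which is what ties the result to the thread count.)

#include <stdint.h>
#include <stdio.h>

#define LANES 4        /* algorithm parameter: changing it changes the digest */
#define BLOCKBYTES 128 /* blake2b block size */

/* Placeholder per-lane hash: lane i consumes input blocks
   i, i+LANES, i+2*LANES, ...  The real code would run a full
   blake2b instance per lane. */
static uint64_t
toy_lane_hash (const uint8_t *in, size_t len, size_t lane)
{
  uint64_t h = 0xcbf29ce484222325ULL ^ lane;
  for (size_t off = lane * BLOCKBYTES; off < len; off += LANES * BLOCKBYTES)
    {
      size_t n = len - off < BLOCKBYTES ? len - off : BLOCKBYTES;
      for (size_t j = 0; j < n; j++)
        h = (h ^ in[off + j]) * 0x100000001b3ULL;
    }
  return h;
}

int
main (void)
{
  static uint8_t buf[100000];
  for (size_t i = 0; i < sizeof buf; i++)
    buf[i] = (uint8_t) i;

  uint64_t lane_digest[LANES];

  /* Any number of threads may execute this loop; the digest
     depends only on LANES, never on the thread count. */
#pragma omp parallel for
  for (int i = 0; i < LANES; i++)
    lane_digest[i] = toy_lane_hash (buf, sizeof buf, (size_t) i);

  uint64_t root = 0xcbf29ce484222325ULL; /* stand-in for the root-node hash */
  for (int i = 0; i < LANES; i++)
    root = (root ^ lane_digest[i]) * 0x100000001b3ULL;

  printf ("%016llx\n", (unsigned long long) root);
  return 0;
}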
Last but not least, OMP_DYNAMIC.
----

Using a fixed number in "#pragma omp parallel num_threads()" causes OpenMP to ignore OMP_NUM_THREADS, but it still honors OMP_DYNAMIC. If that is set to TRUE, the runtime does not have to provide 4 threads, and the resulting digest is different. On a machine with 2 cores:

$ nproc
2
$ seq 10000 | OMP_DYNAMIC=FALSE ./src/b2sum -ablake2bp
824b2157073309456d6c515b062b96da5098f37561a462024931eaea5af524142cb811eb22f7dc2ceeca36dfae108bb71d7aca0d2a3072e8b65f877fe9008a4e  -
$ seq 10000 | OMP_DYNAMIC=TRUE ./src/b2sum -ablake2bp
6f032b13b235b541c091b4afd9e01159eb5417fbc9194daf90edb21da9170f82eb8dd62a85a14d3884b5d630a318e63f547e9309582967a6c058b1339843d57b  -

The same happens on machines with 4 or more cores if the machine is overloaded. I also suspect, though I'm not sure about this, that OpenMP allows the implementation to terminate the program if the requested number of threads isn't available and OMP_DYNAMIC=FALSE.

To conclude, I recommend implementing one, and only one, blake2b hash variant. Choose the official reference one (if there is one), and stick to it regardless of thread count or bit size. In the future, I'm sure there will be both hardware and software improvements that can speed up the implementation while keeping the exact same algorithm intact for backward compatibility.

Thanks for reading so far,
regards,
 - assaf
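P.S. Here is a minimal sketch of the OMP_DYNAMIC failure mode (compile with -fopenmp), assuming - as I read the reference blake2bp code - that each lane is bound to a thread id via omp_get_thread_num(). If the runtime delivers fewer threads than requested, which OMP_DYNAMIC=TRUE explicitly permits, some lanes are simply never processed:

#include <omp.h>
#include <stdio.h>

#define LANES 4

int
main (void)
{
  int done[LANES] = { 0 };

  /* num_threads() requests LANES threads, but with dynamic
     adjustment enabled the runtime may create fewer. */
  omp_set_dynamic (1);  /* same effect as OMP_DYNAMIC=TRUE */
#pragma omp parallel num_threads (LANES)
  {
    int id = omp_get_thread_num ();
    if (id < LANES)
      done[id] = 1;     /* "hash lane id" */
  }

  for (int i = 0; i < LANES; i++)
    if (!done[i])
      printf ("lane %d was never hashed\n", i);
  return 0;
}

(On a lightly loaded 4-core machine the runtime may still grant all four threads, so the message only appears when it actually delivers fewer - e.g. on the 2-core box from the transcript above.)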