Hello Pádraig and all,

I have some concerns about supporting different algorithms in b2sum.
[long email ahead...]
First, usability of multiple algorithms.
----

From a usability POV, it is rarely (never?) the case that making a program multi-threaded or parallel, or optimizing it for a different CPU type, changes the output of the program in a user-noticeable way. That is, "gzip" vs "pigz", "bzip2" vs "pbzip2", "make" vs "make -j", "sort" vs "sort --parallel" - all produce the same output from the user's POV (I'll note that gzip and pigz do produce binary-different output, but from the user's POV the content is equivalent). Similarly, optimizing for 32 bits vs 64 bits should never alter the result from the user's POV.

I'm not saying these optimizations are not worthwhile, but blake2bp and blake2sp are not mere optimizations - in practice each constitutes a completely different checksum. Saying in the manual:

  "blake2bp - This is like ‘blake2b’ but takes advantage of multiple CPU cores."

gives the impression that it is equivalent, and it is not. The warning mentioned in the preceding paragraph will not be noticed by most users, and it's not in the "--help" output or the man page (and even if it were, users would still ignore it).

It's one thing if there's a "gotcha" about program behavior which has existed for decades, preceding standardization, and we have to live with it for backward compatibility. But this is a new feature, and we should avoid making it confusing.
In a previous message you wrote:
> I might also drop the blake2s and blake2sp implementations as 32 bit can get equivalent but somewhat slower functionality with `b2sum -l 256 -a blake2b`
But it is not equivalent - it's a different result:

$ echo a | ./src/b2sum -l256 -ablake2b
be29a54b934581ab434fde713c16db07c3e0124a371daca7c33588be7526630e  -
$ echo a | ./src/b2sum -ablake2s
1e61fdb001508ebd3c559545024e7a58a67aeb124308a24bbd13f7ed09d9f2c7  -

If by "equivalent" you mean "happens to be a digest of the same length, but with a different value", then I fear many non-tech-savvy users would not be aware of this distinction. Coreutils has a long history of dealing with the full spectrum of users, from computer experts to complete novices. I'm sure you have more experience than me with users' confusion, but to me this sounds like inviting trouble.

This also means that if someone with a slow 32-bit computer generates blake2 checksums with "-ablake2s" (the optimal option for him), then he forces any future user of his checksums to use the single-core 32-bit algorithm "blake2s", regardless of any newer hardware they have.

Also, up until now the name of the utility (e.g. "sha1sum") was enough to identify the program. So if someone named their checksum file "data.sha1sum", that was a good hint to use "sha1sum -c" - and the results should be valid. But calling the file "data.b2sum" is no longer sufficient - there are at least 4 variants of "b2sum", and more if we consider "--length"...

Second, the "--length" parameter.
----

This is something that will also be relevant for the future SHA3 and perhaps others, so I think we should take a second look and think ahead.

Up until now, with the existing utilities md5sum, sha1sum, sha224sum, sha384sum, sha512sum, the length of the digest in the checksum file was a usability hint about which algorithm produced it. The length was fixed: a checksum file with 40 hex characters was sha1, 32 was md5, 128 was sha512. Whether we like it or not, it was a good hint. With sha3 and blake2 the digest also defaults to 512 bits, so a 128-character digest no longer uniquely points to sha512 - but that's unavoidable. The bigger problem is that with variable-length digests in the same utility, it becomes much harder to know what the correct parameters are.

I think automatic length detection should be turned on by default, even without "--tag". And since I also believe that machines should work harder than people, it would be nice to have an "--autodetect" kind of parameter that automatically tries multiple algorithms based on the given digest length - it just takes more CPU time, but can save some annoyance for users. (A rough sketch of this idea appears right after the next section.)

Third, multi-threaded scalability.
----

The multi-threaded version of 'blake2bp' uses a fixed number of threads with OpenMP:

  #pragma omp parallel shared(S), num_threads(PARALLELISM_DEGREE)

with PARALLELISM_DEGREE being 4 for 'blake2bp' and 8 for 'blake2sp'. But the implementation is not really scalable (in the sense that adding more threads would make things faster while producing an identical result). On one hand, reducing PARALLELISM_DEGREE (e.g. to 2) results in a different digest, and increasing it (e.g. to 8) results in a segfault (at least on my machine with 4 cores). On the other hand, the fixed value of 4 means 'blake2bp' ALWAYS creates 3 extra threads (calls "clone(2)" three times), regardless of how many cores the machine actually has. This seems like bad form IMHO.

It also means, if I understand the code correctly, that we can never increase the number of threads - it is fixed at four forever, otherwise it will give different checksums. So regardless of future hardware improvements, we're stuck at 4 threads because it was a reasonable value in 2016.
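(Coming back to the "--autodetect" idea from the previous section: below is a rough, purely illustrative sketch of how the hex digest length already narrows the candidates. The table and program are hypothetical - nothing like this exists in coreutils today - and with "--length" in the picture the length can only ever be a first filter, after which the verifier would have to actually try each candidate.)

#include <stdio.h>
#include <string.h>

/* Map the hex-digest length found in a checksum file to the
   algorithms that could have produced it. */
struct candidate { size_t hex_len; const char *algorithms; };

static const struct candidate table[] = {
  { 32,  "md5" },
  { 40,  "sha1" },
  { 56,  "sha224" },
  { 64,  "sha256, blake2s, or blake2b -l256" },
  { 96,  "sha384" },
  { 128, "sha512 or blake2b" },
};

int
main (int argc, char **argv)
{
  size_t n = argc > 1 ? strlen (argv[1]) : 0;
  for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
    if (table[i].hex_len == n)
      {
        printf ("candidate algorithms: %s\n", table[i].algorithms);
        return 0;
      }
  printf ("no fixed-length match; possibly a --length variant\n");
  return 1;
}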
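(And to make the scalability point concrete, here is a toy sketch - the per-lane hash below is a placeholder FNV-style mixer, not blake2b. The lane count is part of the algorithm definition, because it fixes how input blocks are interleaved across lanes; the thread count should be a mere scheduling choice. An "omp for" over lanes, as below, is scheduled onto however many threads the runtime provides and always yields the same digest; the reference blake2bp instead pins lane i to thread i via omp_get_thread_num(), which is what ties the result to the thread count.)

#include <stdint.h>
#include <stdio.h>

#define LANES 4        /* algorithm parameter: changing it changes the digest */
#define BLOCKBYTES 128 /* blake2b block size */

/* Placeholder per-lane hash: lane i consumes input blocks
   i, i+LANES, i+2*LANES, ...  The real code would run a full
   blake2b instance per lane. */
static uint64_t
toy_lane_hash (const uint8_t *in, size_t len, size_t lane)
{
  uint64_t h = 0xcbf29ce484222325ULL ^ lane;
  for (size_t off = lane * BLOCKBYTES; off < len; off += LANES * BLOCKBYTES)
    {
      size_t n = len - off < BLOCKBYTES ? len - off : BLOCKBYTES;
      for (size_t j = 0; j < n; j++)
        h = (h ^ in[off + j]) * 0x100000001b3ULL;
    }
  return h;
}

int
main (void)
{
  static uint8_t buf[100000];
  for (size_t i = 0; i < sizeof buf; i++)
    buf[i] = (uint8_t) i;

  uint64_t lane_digest[LANES];

  /* Any number of threads may execute this loop; the digest
     depends only on LANES, never on the thread count. */
#pragma omp parallel for
  for (int i = 0; i < LANES; i++)
    lane_digest[i] = toy_lane_hash (buf, sizeof buf, (size_t) i);

  uint64_t root = 0xcbf29ce484222325ULL; /* stand-in for the root-node hash */
  for (int i = 0; i < LANES; i++)
    root = (root ^ lane_digest[i]) * 0x100000001b3ULL;

  printf ("%016llx\n", (unsigned long long) root);
  return 0;
}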
Last but not least, OMP_DYNAMIC.
----

Using a fixed number in "#pragma omp parallel num_threads()" causes OpenMP to ignore OMP_NUM_THREADS, but it still honors OMP_DYNAMIC. If that is set to TRUE, the runtime does not have to provide 4 threads, and the resulting digest is different. On a machine with 2 cores:

$ nproc
2
$ seq 10000 | OMP_DYNAMIC=FALSE ./src/b2sum -ablake2bp
824b2157073309456d6c515b062b96da5098f37561a462024931eaea5af524142cb811eb22f7dc2ceeca36dfae108bb71d7aca0d2a3072e8b65f877fe9008a4e  -
$ seq 10000 | OMP_DYNAMIC=TRUE ./src/b2sum -ablake2bp
6f032b13b235b541c091b4afd9e01159eb5417fbc9194daf90edb21da9170f82eb8dd62a85a14d3884b5d630a318e63f547e9309582967a6c058b1339843d57b  -

The same happens on machines with 4 or more cores if the machine is overloaded. I also suspect, though I'm not sure about this, that OpenMP allows the implementation to terminate the program if the requested number of threads isn't available and OMP_DYNAMIC=FALSE.

To conclude, I recommend implementing one, and only one, blake2b hash variant. Choose the official reference one (if there is one), and stick to it regardless of thread count or bit size. In the future, I'm sure there will be both hardware and software improvements that can speed up the implementation while keeping the exact same algorithm intact for backward compatibility.

Thanks for reading so far,
regards,
 - assaf
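P.S. Here is a minimal sketch of the OMP_DYNAMIC failure mode (compile with -fopenmp), assuming - as I read the reference blake2bp code - that each lane is bound to a thread id via omp_get_thread_num(). If the runtime delivers fewer threads than requested, which OMP_DYNAMIC=TRUE explicitly permits, some lanes are simply never processed:

#include <omp.h>
#include <stdio.h>

#define LANES 4

int
main (void)
{
  int done[LANES] = { 0 };

  /* num_threads() requests LANES threads, but with dynamic
     adjustment enabled the runtime may create fewer. */
  omp_set_dynamic (1);  /* same effect as OMP_DYNAMIC=TRUE */
#pragma omp parallel num_threads (LANES)
  {
    int id = omp_get_thread_num ();
    if (id < LANES)
      done[id] = 1;     /* "hash lane id" */
  }

  for (int i = 0; i < LANES; i++)
    if (!done[i])
      printf ("lane %d was never hashed\n", i);
  return 0;
}

(On a lightly loaded 4-core machine the runtime may still grant all four threads, so the message only appears when it actually delivers fewer - e.g. on the 2-core box from the transcript above.)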