On 02/03/2026 12:32, Leonid Evdokimov wrote:
Hello,

Here is the patch to implement CDC in `split --bytes`. I'm submitting
it for review before proceeding with adding CDC to --line-bytes.

I've tested the patch on x86_64, ppc64be and Apple M1, with gcc and clang.

Texinfo documentation is currently missing.

I'm mostly unsure about the following:

1) autoconf and l10n logic being right, as I'm not familiar with AC/AM
and gettext.

2) embedding PCG RNG into make-buz-table. Is there a better way to
accomplish the goal, and is a better way needed?

3) licensing/authorship headers. There might be guidelines I'm missing.

4) right place for getcachelinesize(). Should it be a separate file
and/or part of gnulib?

5) busy-loop of randperm_new() on random-source being stream of 0xFF.
On one hand, that's a "bug" in randint_choose() and randperm_bound(),
on the other hand - one may say that it's just a foot-shooting case.

6) moving the `+1` byte allocation to be specific to lines_split(). I have
not run an ASan build to test correctness. The +1 has been there for 35
years and lines_split() seems to be the only user of that extra byte, but
I might have missed something.

7) 40 MiB limit for 32-bit CDC hashes, it's tempting to say "42 MB".
Should we? :-)

I've tried to add enough comments to make the code easy to understand,
but I can add more if that's helpful as the memory is still fresh.

The patch is also available at github:
https://github.com/coreutils/coreutils/compare/master...darkk:coreutils:cdc


Following on from ...
https://lists.gnu.org/archive/html/coreutils/2025-01/msg00028.html
https://lists.gnu.org/archive/html/coreutils/2026-02/msg00106.html

Some comments from cursory glance:

The INTEL_JCC_ERRATUM stuff may be more generally applicable,
and more appropriate for a separate patch.

same_bytes_() should use `openssl version || skip_`
at least for documentation reasons.

It's better to use `cksum -a blake2b` than the deprecated `b2sum` in tests.

Was the issue with errnos in getlimits that the logs were too noisy?
We could disable tracing in getlimits_ if that was the case.

It would be good to augment the "invalid rolling hash window"
error with a valid range for the selected hash.
Oh right I see there are other validations later on.
Anyway it would be good to augment this initial error if possible.
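
As a rough sketch of what I mean by including the valid range (the struct,
the function name and the numbers below are all made up for illustration
and are not taken from the patch; the real code would presumably go
through error() and quote() rather than snprintf):

```c
#include <stdio.h>

/* Hypothetical per-hash limits.  The field names and any values passed
   in are placeholders, not the limits used by the patch.  */
struct hash_limits
{
  const char *name;
  unsigned int min_window;
  unsigned int max_window;
};

/* Format the diagnostic for an invalid window argument ARG into BUF,
   appending the valid range for hash H so the user does not have to
   guess it.  Returns the snprintf result.  */
static int
format_window_error (char *buf, size_t size, const char *arg,
                     const struct hash_limits *h)
{
  return snprintf (buf, size,
                   "invalid rolling hash window: '%s' "
                   "(valid range for %s is %u..%u)",
                   arg, h->name, h->min_window, h->max_window);
}
```

The point is only that the message names the selected hash and its
limits at the place where the argument is first rejected.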

You will also need to assign copyright for a change of this size.
The process is described under "Copyright Assignment" in the HACKING file.

thanks for working on this!
Padraig
