[PATCH] split: add content-defined chunking to --bytes

Leonid Evdokimov Mon, 02 Mar 2026 04:33:59 -0800

Hello,

Here is the patch to implement CDC in `split --bytes`. I'm submitting
it for review before proceeding with adding CDC to --line-bytes.


I've tested the patch on x86_64, ppc64be and Apple M1, with gcc and clang.

Texinfo documentation is currently missing.

I'm mostly unsure about the following:

1) autoconf and l10n logic being right, as I'm not familiar with AC/AM
and gettext.

2) embedding PCG RNG into make-buz-table. Is there a better way to
accomplish the goal, and is a better way needed at all?

3) licensing/authorship headers. There might be guidelines I'm missing.

4) right place for getcachelinesize(). Should it be a separate file
and/or part of gnulib?

5) busy-loop of randperm_new() when random-source is a stream of 0xFF.
On one hand, that's a "bug" in randint_choose() and randperm_bound();
on the other hand, one may say that it's just a foot-shooting case.

6) moving `+1` byte allocation to be specific for lines_split(). I've
not run an ASan build to test correctness. The +1 has been there for 35
years and lines_split() seems to be the only user of that extra byte,
but I might have missed something.

7) 40 MiB limit for 32-bit CDC hashes, it's tempting to say "42 MB".
Should we? :-)
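
To make item 5 concrete, here is a minimal sketch of how rejection sampling
stalls on a constant 0xFF stream. It is a simplified byte-at-a-time sampler
written for illustration only: it does not reproduce randint_choose(), and
the names byte_source, next_byte and choose_below are hypothetical. The
try limit exists only so that the stall can be observed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical entropy source: a fixed byte string that is read cyclically.  */
typedef struct
{
  unsigned char const *data;
  size_t len;
  size_t pos;
} byte_source;

static unsigned
next_byte (byte_source *s)
{
  unsigned b = s->data[s->pos];
  s->pos = (s->pos + 1) % s->len;
  return b;
}

/* Draw a uniform value in [0, BOUND) by rejection, one byte per attempt.
   Bytes at or above the largest multiple of BOUND that fits into 256 are
   rejected to avoid modulo bias.  Give up after MAX_TRIES attempts so that
   a degenerate source is reported instead of looping forever.  */
static bool
choose_below (byte_source *s, unsigned bound, unsigned max_tries,
              unsigned *out)
{
  assert (1 <= bound && bound <= 256);
  unsigned const limit = 256 - 256 % bound;
  for (unsigned i = 0; i < max_tries; i++)
    {
      unsigned r = next_byte (s);
      if (r < limit)
        {
          *out = r % bound;
          return true;
        }
    }
  return false; /* every byte was rejected */
}
```

With a source of nothing but 0xFF, every draw is rejected whenever BOUND does
not divide 256, so a sampler without a try limit never returns. Bounds that
divide 256 accept every byte, which is why the stall depends on the bound.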

I've tried to add enough comments to make the code easy to understand,
but I can add more if that's helpful as the memory is still fresh.
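
As a reading aid for the BUZHash parts, the sliding-window update can be
sketched as follows. This is an illustration under assumptions, not code from
the patch: rotl32_sketch, buz_fresh and buz_roll are hypothetical names, and
the sketch assumes a zero initial state. The rotated table entry used to
remove the outgoing byte is the kind of value a precomputed window-dependent
table could hold; whether unbuz_table stores exactly that is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rotate left by N bits; N is reduced modulo 32 to keep shifts defined.  */
static uint32_t
rotl32_sketch (uint32_t x, unsigned n)
{
  n &= 31;
  return n ? (x << n) | (x >> (32 - n)) : x;
}

/* Hash the W bytes at BUF from a zero initial state.  */
static uint32_t
buz_fresh (uint32_t const table[256], unsigned char const *buf, size_t w)
{
  assert (w >= 1);
  uint32_t h = 0;
  for (size_t i = 0; i < w; i++)
    h = rotl32_sketch (h, 1) ^ table[buf[i]];
  return h;
}

/* Slide a window of W bytes forward by one byte: rotate the old hash,
   cancel the contribution of the byte leaving the window (OUT), which has
   been rotated W times in total by now, and mix in the new byte (IN).  */
static uint32_t
buz_roll (uint32_t const table[256], uint32_t h, unsigned char out,
          unsigned char in, size_t w)
{
  return rotl32_sketch (h, 1)
         ^ rotl32_sketch (table[out], (unsigned) (w % 32))
         ^ table[in];
}
```

Because the rotation count is taken modulo 32, the window size W is not tied
to the hash width, which is the property that lets BUZHash accept an
arbitrary window.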

The patch is also available on GitHub:
https://github.com/coreutils/coreutils/compare/master...darkk:coreutils:cdc

-- 
WBRBW, Leonid Evdokimov, https://darkk.net.ru tel:+79816800702
PGP: 6691 DE6B 4CCD C1C1 76A0  0D4A E1F2 A980 7F50 FAB2

From 1a5b5fd200e243c4724f116a386314e887e64546 Mon Sep 17 00:00:00 2001
From: Leonid Evdokimov <[email protected]>
Date: Sun, 1 Mar 2026 23:26:42 +0300
Subject: [PATCH 1/3] split: add content-defined chunking to --bytes=SIZE
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Content-defined chunking is selected when SIZE specifies a HASH
function together with the average and, optionally, maximum chunk size:

 HASH/N       N bytes on average using HASH for content-defined chunking
 HASH[W]/N    N bytes on average while hashing sliding window of W bytes
 HASH/N/K     N bytes on average but K bytes at most
 HASH[W]/N/K  like HASH[W]/N and HASH/N/K combined

Supported HASH functions are gear32, gear64, buz32 and buz64.  BUZHash
supports an arbitrary sliding window size; the GearHash window is fixed.
An alternative seed value for HASH functions may be specified with
‘--random-source=FILE’.  The size of FILE needed depends on HASH function:

 gear32  1KiB, unless entropy is completely absent
 gear64  2KiB
 buz32   5KiB, with a little bit of luck
 buz64   10KiB

Truly bad luck comes from an entropy source streaming nothing but 0xFF.
---
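
The HASH/N/K semantics above can be sketched with a gear32-style chunker.
This is an illustration, not the code from this patch: fill_table and
next_chunk are hypothetical names, and the table is filled by a
splitmix64-style mixer purely as a stand-in for the built-in seed. The hash
step, the threshold UINT32_MAX / (N - (WINDOW - 1)) and the rule that no cut
is tried before WINDOW bytes have been hashed follow bytes_cdc_split().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { WINDOW = 32 }; /* a 32-bit gear hash forgets bytes older than this */

static uint32_t table[256];

/* Fill the substitution table from SEED (stand-in generator, assumption).  */
static void
fill_table (uint64_t seed)
{
  for (int i = 0; i < 256; i++)
    {
      seed += UINT64_C (0x9E3779B97F4A7C15);
      uint64_t z = seed;
      z = (z ^ (z >> 30)) * UINT64_C (0xBF58476D1CE4E5B9);
      z = (z ^ (z >> 27)) * UINT64_C (0x94D049BB133111EB);
      table[i] = (uint32_t) (z ^ (z >> 31));
    }
}

/* Return the length of the first chunk of BUF[0..LEN): cut where the hash
   drops to the threshold or below, but never later than MAX bytes.  */
static size_t
next_chunk (unsigned char const *buf, size_t len, size_t avg, size_t max)
{
  assert ((size_t) WINDOW <= avg && avg < max);
  /* A uniformly distributed hash is <= LE with probability of about
     1 / (AVG - (WINDOW - 1)), so each position is one Bernoulli trial.  */
  uint32_t const le = (uint32_t) (UINT32_MAX / (avg - (WINDOW - 1)));
  uint32_t hash = 0;
  size_t const limit = len < max ? len : max;
  for (size_t i = 0; i < limit; i++)
    {
      hash = (hash << 1) + table[buf[i]];
      if (i + 1 >= (size_t) WINDOW && hash <= le)
        return i + 1;
    }
  return limit;
}
```

Calling next_chunk() repeatedly on the remainder of the input yields chunks
whose boundaries depend on the last WINDOW bytes before each cut rather than
on absolute offsets, with K acting as a hard cap.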
 .gitignore                        |   1 +
 cfg.mk                            |   4 +-
 configure.ac                      |  34 ++
 init.cfg                          |  19 +
 src/local.mk                      |  35 +-
 src/make-buz-table.c              | 220 ++++++++++
 src/split.c                       | 657 +++++++++++++++++++++++++++++-
 src/split_cdc.c                   | 150 +++++++
 src/split_cdc.h                   |  75 ++++
 tests/local.mk                    |   3 +
 tests/split/bytes-cdc-offbyone.sh |  73 ++++
 tests/split/bytes-cdc.sh          | 103 +++++
 tests/split/exp-extseed-buz32     | 112 +++++
 tests/split/exp-extseed-buz32-42  | 122 ++++++
 tests/split/exp-extseed-buz32-512 | 127 ++++++
 tests/split/exp-extseed-buz64     | 122 ++++++
 tests/split/exp-extseed-buz64-42  | 139 +++++++
 tests/split/exp-extseed-buz64-512 | 131 ++++++
 tests/split/exp-extseed-gear32    | 129 ++++++
 tests/split/exp-extseed-gear64    | 138 +++++++
 tests/split/exp-intseed-buz32     | 123 ++++++
 tests/split/exp-intseed-buz32-42  | 132 ++++++
 tests/split/exp-intseed-buz32-512 | 130 ++++++
 tests/split/exp-intseed-buz64     | 135 ++++++
 tests/split/exp-intseed-buz64-42  | 129 ++++++
 tests/split/exp-intseed-buz64-512 | 139 +++++++
 tests/split/exp-intseed-gear32    | 124 ++++++
 tests/split/exp-intseed-gear64    | 142 +++++++
 tests/split/random-source.sh      |  71 ++++
 29 files changed, 3498 insertions(+), 21 deletions(-)
 create mode 100644 src/make-buz-table.c
 create mode 100644 src/split_cdc.c
 create mode 100644 src/split_cdc.h
 create mode 100755 tests/split/bytes-cdc-offbyone.sh
 create mode 100755 tests/split/bytes-cdc.sh
 create mode 100644 tests/split/exp-extseed-buz32
 create mode 100644 tests/split/exp-extseed-buz32-42
 create mode 100644 tests/split/exp-extseed-buz32-512
 create mode 100644 tests/split/exp-extseed-buz64
 create mode 100644 tests/split/exp-extseed-buz64-42
 create mode 100644 tests/split/exp-extseed-buz64-512
 create mode 100644 tests/split/exp-extseed-gear32
 create mode 100644 tests/split/exp-extseed-gear64
 create mode 100644 tests/split/exp-intseed-buz32
 create mode 100644 tests/split/exp-intseed-buz32-42
 create mode 100644 tests/split/exp-intseed-buz32-512
 create mode 100644 tests/split/exp-intseed-buz64
 create mode 100644 tests/split/exp-intseed-buz64-42
 create mode 100644 tests/split/exp-intseed-buz64-512
 create mode 100644 tests/split/exp-intseed-gear32
 create mode 100644 tests/split/exp-intseed-gear64
 create mode 100755 tests/split/random-source.sh

diff --git .gitignore .gitignore
index b6a576d56..6d48ac05c 100644
--- .gitignore
+++ .gitignore
@@ -194,6 +194,7 @@
 /po/remove-potcdate.sed
 /po/remove-potcdate.sin
 /po/stamp-po
+/src/buz-seed.c
 /src/coreutils.h
 /src/coreutils_shebangs
 /src/coreutils_symlinks
diff --git cfg.mk cfg.mk
index 27b63f93b..03ddf4056 100644
--- cfg.mk
+++ cfg.mk
@@ -907,11 +907,11 @@ update-copyright-env = \
 exclude_file_name_regexp--sc_space_tab = \
   ^(tests/pr/|tests/nl/nl\.sh$$|gl/.*\.diff$$|man/help2man$$)
 exclude_file_name_regexp--sc_bindtextdomain = \
-  ^(gl/.*|lib/euidaccess-stat|src/make-prime-list|src/cksum_crc)\.c$$
+  ^(gl/.*|lib/euidaccess-stat|src/make-prime-list|src/make-buz-table|src/cksum_crc)\.c$$
 exclude_file_name_regexp--sc_trailing_blank = \
   ^(tests/pr/|gl/.*\.diff$$|man/help2man)
 _x_system_h := (system|copy|chown-core|find-mount-point)\.h
-_x_system_c := (libstdbuf|make-prime-list)\.c
+_x_system_c := (libstdbuf|make-prime-list|make-buz-table)\.c
 exclude_file_name_regexp--sc_system_h_headers = \
   ^src/($(_x_system_h)|$(_x_system_c))$$
 
diff --git configure.ac configure.ac
index fdf8d067f..b14ed4c38 100644
--- configure.ac
+++ configure.ac
@@ -213,6 +213,13 @@ if test $gl_gcc_warnings != no; then
     [# -fanalyzer and related options slow GCC considerably.
      ew="$ew -fanalyzer -Wno-analyzer-malloc-leak"])
 
+  if test $gl_gcc_warnings = expensive \
+      && test -d "$srcdir"/.git && \
+      ! test -f "$srcdir"/.tarball-version
+  then
+    AC_DEFINE([DEBUG_EXPENSIVE], [1], [enable expensive run-time tests])
+  fi
+
   # This, $nw, is the list of warnings we disable.
   nw=$ew
   nw="$nw -Wstack-protector"        # not worth working around for pre GCC 15
@@ -613,6 +620,33 @@ if test $utils_cv_brain_16_bit_supported = yes; then
   AC_DEFINE([BF16_SUPPORTED], [1], [Brain 16 bit float supported])
 fi
 
+AC_DEFUN([coreutils_DUMMY_2],
+[
+  AC_REQUIRE([AC_CANONICAL_HOST])
+  INTEL_JCC_ERRATUM=
+  case "$host_cpu" in
+    i386|i486|i586|i686|i786|x86_64) # subset from gl_HOST_CPU_C_ABI
+      gl_COMPILER_CLANG
+      if test "$gl_cv_compiler_clang" = yes; then
+        gl_COMPILER_OPTION_IF([-mbranches-within-32B-boundaries],
+          [INTEL_JCC_ERRATUM='-mbranches-within-32B-boundaries'],
+          [INTEL_JCC_ERRATUM=])
+      elif "$is_msvc"; then
+        gl_COMPILER_OPTION_IF([/QIntel-jcc-erratum],
+          [INTEL_JCC_ERRATUM='/QIntel-jcc-erratum'],
+          [INTEL_JCC_ERRATUM=])
+      else
+        # gcc
+        gl_COMPILER_OPTION_IF([[-Wa,-mbranches-within-32B-boundaries]],
+          [INTEL_JCC_ERRATUM='-Wa,-mbranches-within-32B-boundaries'],
+          [INTEL_JCC_ERRATUM=])
+      fi
+      ;;
+  esac
+  AC_SUBST([INTEL_JCC_ERRATUM])
+])
+coreutils_DUMMY_2
+
 ac_save_CFLAGS=$CFLAGS
 CFLAGS="-march=armv8-a+crypto $CFLAGS"
 AC_MSG_CHECKING([if vmull intrinsic exists])
diff --git init.cfg init.cfg
index ae02adcfd..8f0e7b301 100644
--- init.cfg
+++ init.cfg
@@ -858,4 +858,23 @@ sanitizer_build_()
   grep '[Ss]anitizer' >/dev/null
 }
 
+# These magic 128-bit numbers come from the following sleeve:
+#   git cat-file commit v9.10 | b2sum \
+#     | split -b32 --filter 'echo =$FILE=; cat; echo'
+seed_a=113f40bb9dbbcef422de2da58a9fa44c # AES Key for same_*
+seed_b=72680675caba1a8bdd037bbfda4a841a # Default IV for same_*
+seed_c=e863ae0075498eacfe1e9ee3e23d6ee1 # IV for random-source data
+seed_d=3d3459cccbb97aa6628d76669b0626f6
+
+zero_bytes_ ()
+{
+  head --bytes "$1" /dev/zero
+}
+
+# Print the same ${1} pseudo-random bytes with optional 128-bit hex IV in ${2}.
+same_bytes_ ()
+{
+  zero_bytes_ "$1" | openssl enc -e -AES-128-CTR -K $seed_a -iv ${2:-$seed_b}
+}
+
 sanitize_path_
diff --git src/local.mk src/local.mk
index bf88f7d0e..2c84840df 100644
--- src/local.mk
+++ src/local.mk
@@ -73,6 +73,8 @@ EXTRA_DIST +=		\
   src/dircolors.hin	\
   src/make-prime-list.c	\
   src/primes.h		\
+  src/make-buz-table.c	\
+  src/buz-seed.c	\
   src/crctab.c		\
   src/tac-pipe.c	\
   src/extract-magic	\
@@ -424,7 +426,22 @@ src_arch_SOURCES = src/uname.c src/uname-arch.c
 src_cut_SOURCES = src/cut.c src/set-fields.c
 src_numfmt_SOURCES = src/numfmt.c src/set-fields.c
 
-src_split_SOURCES = src/split.c src/temp-stream.c
+# split_cdc.c demands INTEL_JCC_ERRATUM fixes on x86_64 as it _might(!)_
+# get a 40% performance penalty otherwise... depending on final code layout!
+# The loops are very tight there and degradation from DSB to MITE instruction
+# decoder feels very real at least on Skylake. One can benchmark the issue
+# on Linux with perf(1), however the issue is likely cross-platform:
+#   $ perf stat -e task-clock,cycles,instructions,idq.mite_uops,idq.dsb_uops \
+#       split ...
+# See https://www.intel.com/content/www/us/en/content-details/841076/intel-mitigations-for-jump-conditional-code-erratum.html
+# The "lucky" code layout is very possible even without INTEL_JCC_ERRATUM
+# flags, so A/B benchmarking is not terrible, but it's not trivial either.
+src_split_SOURCES = src/split.c src/temp-stream.c src/buz-seed.c
+src_split_LDADD += src/libsplit_cdc.a
+noinst_LIBRARIES += src/libsplit_cdc.a
+src_libsplit_cdc_a_SOURCES = src/split_cdc.c
+src_libsplit_cdc_a_CFLAGS = $(INTEL_JCC_ERRATUM) $(AM_CFLAGS)
+
 src_tac_SOURCES = src/tac.c src/temp-stream.c
 
 src_tail_SOURCES = src/tail.c src/iopoll.c
@@ -632,6 +649,22 @@ $(top_srcdir)/src/crctab.c: $(top_srcdir)/src/cksum_crc.c
 	  && rm -rf $(top_srcdir)/src/crctab-tmp; \
 	fi
 
+# Default buz-seed.c is also built like primes.h and crctab.
+BUILT_SOURCES += $(top_srcdir)/src/buz-seed.c
+$(top_srcdir)/src/buz-seed.c: $(top_srcdir)/src/make-buz-table.c
+	$(AM_V_GEN)if test -n '$(BUILD_CC)'; then \
+	  $(MKDIR_P) $(top_srcdir)/src/buz-tmp \
+	  && (cd $(top_srcdir)/src/buz-tmp \
+	      && $(BUILD_CC) $(BUILD_CPPFLAGS) $(BUILD_CFLAGS) \
+		$(BUILD_LDFLAGS) -o make-buz-table$(EXEEXT) \
+		$(abs_top_srcdir)/src/make-buz-table.c) \
+	  && rm -f $@ $@-t \
+	  && $(top_srcdir)/src/buz-tmp/make-buz-table$(EXEEXT) > $@-t \
+	  && chmod a-w $@-t \
+	  && mv $@-t $@ \
+	  && rm -rf $(top_srcdir)/src/buz-tmp; \
+	fi
+
 # false exits nonzero even with --help or --version.
 # test doesn't support --help or --version.
 # Tell automake to exempt then from that installcheck test.
diff --git src/make-buz-table.c src/make-buz-table.c
new file mode 100644
index 000000000..478851f50
--- /dev/null
+++ src/make-buz-table.c
@@ -0,0 +1,220 @@
+/* Generating BUZHash and GearHash S-boxes from a random seed.
+
+   Contributed to the GNU project by Leonid Evdokimov.
+
+   Copyright (C) 2026 Free Software Foundation, Inc.
+
+   This program is free software: you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation, either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <https://www.gnu.org/licenses/>.  */
+
+/* BUZHash substitution table should be balanced according to Robert Uzgalis.
+   Each bit should have exactly 128 zeros and ones.  So, each bit of BUZTable
+   needs log2(256! / 128! / 128!) ~251.67 bits of entropy.  Whole 64-bit
+   BUZHash S-box needs at least ~16107 entropy bits.  However, this code
+   depends on sampling with rejection and uses permutation instead of selection
+   to construct the table, so it needs much higher number of bytes as an input.
+
+   GearHash paper places no explicit demand on the S-box being balanced.
+   Reusing the same table for BUZHash and GearHash seems to be okay: the PRNG
+   seed used to generate the built-in table has far fewer bits than either
+   16107 or 16384, so _some_ bias is practically unavoidable.
+
+   The file embeds stripped-down PCG PRNG as the output should be stable
+   across releases.  Seeded ISAAC is unsuitable as it works differently
+   on 32-bit and 64-bit platforms.  Bundled BLAKE2 code does not include
+   BLAKE2X XOF.  Depending on `openssl` binary to provide a canonical random
+   stream at build time looks like overkill.
+
+   The generated table is 64-bit assuming uintmax_t to be uint64_t.  There might
+   be machines having 32-bit uintmax_t, but it's unclear if those are still
+   operational and are capable of using modern coreutils.  Please, be kind
+   to specify a test platform if you find a machine & compiler like that.
+
+   Bits are generated from the LSB to MSB.  So, unsigned __int128 support might
+   extend GearHash window to 128 bytes in future releases while maintaining
+   compatibility.  */
+
+#include <inttypes.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+#include <error.h>
+#include <limits.h>
+
+enum { N_CHARS = UCHAR_MAX + 1 };
+
+typedef unsigned char randchar;
+
+typedef struct
+{
+  uint64_t state;
+  uint64_t inc;
+} pcg32_random_t;
+
+static void pcg32_srandom_r (pcg32_random_t *rng, uint64_t initstate,
+                             uint64_t initseq);
+static uint32_t pcg32_random_r (pcg32_random_t *rng);
+static uint32_t pcg32_boundedrand_r (pcg32_random_t *rng, uint32_t bound);
+
+static void
+randpermchar (randchar *v, pcg32_random_t *src, size_t h, size_t n)
+{
+  for (size_t i = 0; i < n; i++)
+    v[i] = i;
+  for (size_t i = 0; i < h; i++)
+    {
+      size_t const j = i + pcg32_boundedrand_r (src, n - i);
+      randchar const tmp = v[i];
+      v[i] = v[j];
+      v[j] = tmp;
+    }
+}
+
+int
+main (int argc, char **argv)
+{
+  if (argc != 1)
+    {
+      fprintf (stderr,
+               "Usage: %s\n"
+               "Produces BUZHash and GearHash substitution tables\n",
+               argv[0]);
+      return EXIT_FAILURE;
+    }
+
+  /* Seed comes from commit 89b2cd58ac895e3fc0d24d8f10e7e4ba132e7fb6 (v9.10) */
+  pcg32_random_t rng;
+  pcg32_srandom_r (&rng, UINT64_C (0x89b2cd58ac895e3f),
+                   UINT64_C (0xc0d24d8f10e7e4ba));
+
+  uint64_t buz_table64[N_CHARS];
+  memset (buz_table64, 0, sizeof (buz_table64));
+  for (unsigned bit = 0; bit < sizeof (buz_table64[0]) * CHAR_BIT; bit++)
+    {
+      randchar perm[N_CHARS];
+      randpermchar (perm, &rng, N_CHARS / 2, N_CHARS);
+      uint64_t const buzbit = UINT64_C (1) << bit;
+      for (unsigned c = 0; c < N_CHARS / 2; c++)
+        buz_table64[perm[c]] |= buzbit;
+    }
+
+  puts ("/* Generated file -- DO NOT EDIT */\n"
+        "#include <config.h>\n"
+        "#include <stdint.h>\n"
+        "#include \"split_cdc.h\"\n");
+  printf (
+      "alignas (CDC_TABLE_DEFAULT_ALIGNAS) uint64_t const buz_seed[%zu] = {\n",
+      (size_t)N_CHARS);
+  for (unsigned c = 0; c < N_CHARS; c++)
+    printf ("  UINT64_C (0x%016" PRIx64 "),\n", buz_table64[c]);
+  puts ("};");
+
+  if (ferror (stdout) || fclose (stdout))
+    error (EXIT_FAILURE, errno, "write error");
+
+  return EXIT_SUCCESS;
+}
+
+/*
+ * PCG Random Number Generation for C.
+ *
+ * Copyright 2014 Melissa O'Neill <[email protected]>
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ *
+ * For additional information about the PCG random number generation scheme,
+ * including its license and other licensing options, visit
+ *
+ *       http://www.pcg-random.org
+ */
+
+/*
+ * This code is derived from the full C implementation, which is in turn
+ * derived from the canonical C++ PCG implementation. The C++ version
+ * has many additional features and is preferable if you can use C++ in
+ * your project.
+ */
+
+// pcg32_srandom_r(rng, initstate, initseq):
+//     Seed the rng.  Specified in two parts, state initializer and a
+//     sequence selection constant (a.k.a. stream id)
+
+static
+void pcg32_srandom_r (pcg32_random_t* rng, uint64_t initstate, uint64_t initseq)
+{
+    rng->state = 0U;
+    rng->inc = (initseq << 1u) | 1u;
+    pcg32_random_r (rng);
+    rng->state += initstate;
+    pcg32_random_r (rng);
+}
+
+// pcg32_random_r(rng)
+//     Generate a uniformly distributed 32-bit random number
+
+static
+uint32_t pcg32_random_r (pcg32_random_t* rng)
+{
+    uint64_t oldstate = rng->state;
+    rng->state = oldstate * UINT64_C (6364136223846793005) + rng->inc;
+    uint32_t xorshifted = ((oldstate >> 18u) ^ oldstate) >> 27u;
+    uint32_t rot = oldstate >> 59u;
+    return (xorshifted >> rot) | (xorshifted << ((-rot) & 31));
+}
+
+// pcg32_boundedrand_r(rng, bound):
+//     Generate a uniformly distributed number, r, where 0 <= r < bound
+
+static
+uint32_t pcg32_boundedrand_r (pcg32_random_t* rng, uint32_t bound)
+{
+    // To avoid bias, we need to make the range of the RNG a multiple of
+    // bound, which we do by dropping output less than a threshold.
+    // A naive scheme to calculate the threshold would be to do
+    //
+    //     uint32_t threshold = 0x100000000ull % bound;
+    //
+    // but 64-bit div/mod is slower than 32-bit div/mod (especially on
+    // 32-bit platforms).  In essence, we do
+    //
+    //     uint32_t threshold = (0x100000000ull-bound) % bound;
+    //
+    // because this version will calculate the same modulus, but the LHS
+    // value is less than 2^32.
+
+    uint32_t threshold = -bound % bound;
+
+    // Uniformity guarantees that this loop will terminate.  In practice, it
+    // should usually terminate quickly; on average (assuming all bounds are
+    // equally likely), 82.25% of the time, we can expect it to require just
+    // one iteration.  In the worst case, someone passes a bound of 2^31 + 1
+    // (i.e., 2147483649), which invalidates almost 50% of the range.  In
+    // practice, bounds are typically small and only a tiny amount of the range
+    // is eliminated.
+    for (;;) {
+        uint32_t r = pcg32_random_r (rng);
+        if (r >= threshold)
+            return r % bound;
+    }
+}
diff --git src/split.c src/split.c
index e5fd0ae2e..474141c30 100644
--- src/split.c
+++ src/split.c
@@ -27,6 +27,7 @@
 #include <sys/types.h>
 #include <sys/wait.h>
 #include <spawn.h>
+#include <endian.h>
 
 #include "system.h"
 #include "alignalloc.h"
@@ -37,12 +38,14 @@
 #include "full-write.h"
 #include "ioblksize.h"
 #include "quote.h"
+#include "randperm.h"
 #include "sig2str.h"
 #include "sys-limits.h"
 #include "temp-stream.h"
 #include "xbinary-io.h"
 #include "xdectoint.h"
 #include "xstrtol.h"
+#include "split_cdc.h"
 
 /* The official name of this program (e.g., no 'g' prefix).  */
 #define PROGRAM_NAME "split"
@@ -51,6 +54,10 @@
   proper_name_lite ("Torbjorn Granlund", "Torbj\303\266rn Granlund"), \
   proper_name ("Richard M. Stallman")
 
+#define ARRAY_SIZE(a) (sizeof (a) / sizeof ((a)[0]))
+
+enum { N_CHARS = UCHAR_MAX + 1 };
+
 /* Shell command to filter through, instead of creating files.  */
 static char const *filter_command;
 
@@ -113,13 +120,77 @@ static bool unbuffered;
 /* The character marking end of line.  Defaults to \n below.  */
 static int eolchar = -1;
 
+/* Lookup table mapping u8 to u32|u64 for Cdc_type functions.  */
+void const *cdc_table;
+
+/* Precomputing window-dependent unbuz table speeds BUZHash up measurably.
+   buz32: -35% cycles, buz64: -26% cycles @ Intel Core i7-6600U (Skylake) */
+void const *unbuz_table;
+
+/* Some assertions are measurably expensive to check.  NDEBUG is a nice idea,
+   but some downstreams keep NDEBUG unset; users should not pay for tests.  */
+static const bool TEST =
+#if defined DEBUG_EXPENSIVE
+    true
+#else
+    false
+#endif
+    ;
+
 /* The split mode to use.  */
 enum Split_type
 {
   type_undef, type_bytes, type_byteslines, type_lines, type_digits,
+  type_bytes_cdc,
   type_chunk_bytes, type_chunk_lines, type_rr
 };
 
+enum Cdc_type
+{
+  cdc_buz32, cdc_buz64, cdc_gear32, cdc_gear64,
+  cdc_undef = -1 /* undef is the last one */
+};
+
+static char const *cdc_names[] = { "buz32", "buz64", "gear32", "gear64" };
+
+static bool
+cdc_isbuz (enum Cdc_type hash)
+{
+  return hash == cdc_buz32 || hash == cdc_buz64;
+}
+
+static bool
+cdc_isgear (enum Cdc_type hash)
+{
+  return hash == cdc_gear32 || hash == cdc_gear64;
+}
+
+static bool
+cdc_is64 (enum Cdc_type hash)
+{
+  return hash == cdc_gear64 || hash == cdc_buz64;
+}
+
+static bool
+cdc_is32 (enum Cdc_type hash)
+{
+  return hash == cdc_gear32 || hash == cdc_buz32;
+}
+
+/* TODO: should getcachelinesize() be moved next to getpagesize() ?  */
+static idx_t
+getcachelinesize (void)
+{
+#ifdef _SC_LEVEL1_DCACHE_LINESIZE
+  idx_t cacheline_size = sysconf (_SC_LEVEL1_DCACHE_LINESIZE);
+#else
+  idx_t cacheline_size = -1;
+#endif
+  if (cacheline_size < 0)
+    cacheline_size = CDC_TABLE_DEFAULT_ALIGNAS;
+  return cacheline_size;
+}
+
 /* For long options that have no equivalent short option, use a
    non-character as a pseudo short option, starting with CHAR_MAX + 1.  */
 enum
@@ -127,7 +198,8 @@ enum
   VERBOSE_OPTION = CHAR_MAX + 1,
   FILTER_OPTION,
   IO_BLKSIZE_OPTION,
-  ADDITIONAL_SUFFIX_OPTION
+  ADDITIONAL_SUFFIX_OPTION,
+  RANDOM_SOURCE_OPTION
 };
 
 static struct option const longopts[] =
@@ -144,6 +216,8 @@ static struct option const longopts[] =
   {"numeric-suffixes", optional_argument, NULL, 'd'},
   {"hex-suffixes", optional_argument, NULL, 'x'},
   {"filter", required_argument, NULL, FILTER_OPTION},
+  {"random-source", required_argument, NULL,
+   RANDOM_SOURCE_OPTION},
   {"verbose", no_argument, NULL, VERBOSE_OPTION},
   {"separator", required_argument, NULL, 't'},
   {"-io-blksize", required_argument, NULL,
@@ -244,7 +318,7 @@ default size is 1000 lines, and default PREFIX is 'x'.\n\
 "));
       oputs (_("\
   -b, --bytes=SIZE\n\
-         put SIZE bytes per output file\n\
+         put SIZE bytes per output file; see explanation below\n\
 "));
       oputs (_("\
   -C, --line-bytes=SIZE\n\
@@ -290,6 +364,10 @@ default size is 1000 lines, and default PREFIX is 'x'.\n\
       oputs (_("\
   -u, --unbuffered\n\
          immediately copy input to output with '-n r/...'\n\
+"));
+      oputs (_("\
+      --random-source=FILE\n\
+         get random seed for content-defined chunking from FILE\n\
 "));
       oputs (_("\
       --verbose\n\
@@ -299,6 +377,15 @@ default size is 1000 lines, and default PREFIX is 'x'.\n\
       oputs (VERSION_OPTION_DESCRIPTION);
       emit_size_note ();
       fputs (_("\n\
+SIZE may also be:\n\
+  N            N bytes with optional unit\n\
+  HASH/N       N bytes on average using HASH for content-defined chunking,\n\
+               HASH may be gear32, gear64, buz32 or buz64\n\
+  HASH[W]/N    N bytes on average while hashing sliding window of W bytes\n\
+  HASH/N/K     N bytes on average but K bytes at most\n\
+  HASH[W]/N/K  like HASH[W]/N and HASH/N/K combined\n\
+"), stdout);
+      fputs (_("\n\
 CHUNKS may be:\n\
   N       split into N files based on size of input\n\
   K/N     output Kth of N to standard output\n\
@@ -781,6 +868,268 @@ bytes_split (intmax_t n_bytes, intmax_t rem_bytes,
     cwrite (true, NULL, 0);
 }
 
+extern void
+buz32 (void *phash_, unsigned char const *p, idx_t count)
+{
+  uint32_t *const phash = phash_;
+  uint32_t const *const buz = cdc_table;
+  uint32_t hash = *phash;
+  affirm (count >= 1);
+  for (unsigned char const *end = p + count; p != end; p++)
+    hash = rotl32 (hash, 1) ^ buz[*p];
+  *phash = hash;
+}
+
+extern void
+buz64 (void *phash_, unsigned char const *p, idx_t count)
+{
+  affirm (count >= 1);
+  uint64_t *const phash = phash_;
+  uint64_t const *const buz = cdc_table;
+  uint64_t hash = *phash;
+  for (unsigned char const *end = p + count; p != end; p++)
+    hash = rotl64 (hash, 1) ^ buz[*p];
+  *phash = hash;
+}
+
+extern void
+gear32 (void *phash_, unsigned char const *p, idx_t count)
+{
+  affirm (count >= 1);
+  uint32_t *const phash = phash_;
+  uint32_t const *const cdc = cdc_table;
+  uint32_t hash = *phash;
+  for (unsigned char const *end = p + count; p != end; p++)
+    hash = (hash << 1) + cdc[*p];
+  *phash = hash;
+}
+
+extern void
+gear64 (void *phash_, unsigned char const *p, idx_t count)
+{
+  affirm (count >= 1);
+  uint64_t *const phash = phash_;
+  uint64_t const *const cdc = cdc_table;
+  uint64_t hash = *phash;
+  for (unsigned char const *end = p + count; p != end; p++)
+    hash = (hash << 1) + cdc[*p];
+  *phash = hash;
+}
+
+static char const*
+gear32_terminator_alloc (void)
+{
+  uint32_t const *const cdc = cdc_table;
+
+  /* Find characters in the CDC table whose LSB is 0 and 1, respectively.  */
+  unsigned char zero, one;
+  idx_t i;
+  for (i = 0; i < N_CHARS && (cdc[i] & 1) != 0; i++)
+    ;
+  zero = i;
+  for (i = 0; i < N_CHARS && (cdc[i] & 1) != 1; i++)
+    ;
+  one = i;
+  /* Chance of the LSB being equal across 256 random values is 2^-255.  */
+  if ((cdc[zero] & 1) != 0 || (cdc[one] & 1) != 1)
+    error (EXIT_FAILURE, 0, _("low-entropy --random-source"));
+
+  uint32_t hash = 0; /* Zero is the target value */
+  idx_t const window = sizeof (hash) * CHAR_WIDTH;
+  char *t = xmalloc (window);
+  for (i = window - 1; i >= 0; i--)
+    {
+      unsigned char c = (hash & 1) ? one : zero;
+      t[i] = (char)c;
+      hash = (hash - cdc[c]) >> 1;
+    }
+  if (TEST)
+    assure ((hash = 0, gear32 (&hash, (void *)t, window), hash == 0));
+  return t;
+}
+
+static char const*
+gear64_terminator_alloc (void)
+{
+  uint64_t const *const cdc = cdc_table;
+
+  /* Find characters in the CDC table whose LSB is 0 and 1, respectively.  */
+  unsigned char zero, one;
+  idx_t i;
+  for (i = 0; i < N_CHARS && (cdc[i] & 1) != 0; i++)
+    ;
+  zero = i;
+  for (i = 0; i < N_CHARS && (cdc[i] & 1) != 1; i++)
+    ;
+  one = i;
+  /* Chance of the LSB being equal across 256 random values is 2^-255.  */
+  if ((cdc[zero] & 1) != 0 || (cdc[one] & 1) != 1)
+    error (EXIT_FAILURE, 0, _("low-entropy --random-source"));
+
+  uint64_t hash = 0; /* Zero is the target value */
+  idx_t const window = sizeof (hash) * CHAR_WIDTH;
+  char *t = xmalloc (window);
+  for (i = window - 1; i >= 0; i--)
+    {
+      unsigned char c = (hash & 1) ? one : zero;
+      t[i] = (char)c;
+      hash = (hash - cdc[c]) >> 1;
+    }
+  if (TEST)
+    assure ((hash = 0, gear64 (&hash, (void *)t, window), hash == 0));
+  return t;
+}
+
+/* Split into pieces of approximately AVGSZ bytes, not larger than MAXSZ bytes,
+   using content-defined chunking with rolling HASH running over sliding window
+   of WINDOW bytes.  Use buffer BUF of IOSZ bytes for I/O.  The buffer has
+   WINDOW bytes prepended and WINDOW trailing bytes for GearHash terminator.  */
+static void
+bytes_cdc_split (const enum Cdc_type hash, intmax_t const avgsz,
+                 intmax_t const maxsz, idx_t const window, char *const buf,
+                 idx_t const iosz)
+{
+  _Static_assert (UINT64_MAX <= UINTMAX_MAX);
+  affirm (1 <= window && window <= avgsz && avgsz < maxsz && window <= iosz);
+  /* The adjustment matters when AVGSZ is close to WINDOW as the code does not
+     run Bernoulli trials against the hashes of the first WINDOW-1 bytes.  */
+  intmax_t const avgadj = avgsz - (window - 1);
+  /* Following Daniel Lemire's paper "Fast Random Integer Generation in an
+     Interval", we approximate chunk size with Bernoulli probability of
+     1/AVGSZ moved from the floating-point [0.0…1.0) domain to u32|u64.  */
+  uint32_t const le32 = cdc_is32 (hash) ? UINT32_MAX / avgadj : 0;
+  uint64_t const le64 = cdc_is64 (hash) ? UINT64_MAX / avgadj : 0;
+  void const *const ple = cdc_is32 (hash)   ? &le32
+                          : cdc_is64 (hash) ? &le64
+                                            : (affirm (false), NULL);
+  /* BUZHash and GearHash handle initial state differently.  GearHash
+     completely forgets initial state as the WINDOW bytes pass by.  BUZHash
+     continues to rotate bits of the initial state forever as that state is
+     never "shifted out".  Initial BUZHash value should be symmetric against
+     all barrel shifts: it should be either 0 or ~0, BUZHash becomes dependent
+     on offset of the WINDOW modulo UINT_WIDTH otherwise.  */
+  uint32_t hash32 = 0;
+  uint64_t hash64 = 0;
+  void *const phash = cdc_is32 (hash)   ? &hash32
+                      : cdc_is64 (hash) ? &hash64
+                                        : (affirm (false), NULL);
+  /* We can't rely on terminator to be intact in the trailer area right after
+     I/O buffer as short read() might return just a few bytes less than IOSZ.
+     That's why possible positions of multi-byte terminators might overlap
+     during different loop iterations and may overwrite each other.  Also,
+     reading file from disk usually returns full IOSZ buffer, but pipe and TCP
+     socket behave differently.  E.g. pipe has 64 KiB capacity limit on Linux
+     and socket buffer is scaled dynamically.  */
+  ssize_t terminator_at = 0;
+  char const *const terminator
+      = (hash == cdc_gear32)   ? gear32_terminator_alloc ()
+        : (hash == cdc_gear64) ? gear64_terminator_alloc ()
+                               : NULL;
+  cdchash_fn hashcall = (hash == cdc_buz32)    ? buz32
+                        : (hash == cdc_buz64)  ? buz64
+                        : (hash == cdc_gear32) ? gear32
+                        : (hash == cdc_gear64) ? gear64
+                                               : (affirm (false), NULL);
+  cdcfind_fn findcall = (hash == cdc_buz32)    ? buz32_find
+                        : (hash == cdc_buz64)  ? buz64_find
+                        : (hash == cdc_gear32) ? gear32_rawfind
+                        : (hash == cdc_gear64) ? gear64_rawfind
+                                               : (affirm (false), NULL);
+
+  bool new_file_flag = true;
+  bool filter_ok = true;
+  intmax_t write_at_most = maxsz;
+  idx_t to_gulp = window;
+  ssize_t n_read;
+
+  while ((n_read = read (STDIN_FILENO, buf, iosz)) > 0)
+    {
+      char *const eob = buf + n_read;
+      if (n_read != terminator_at && terminator)
+        {
+          memcpy (eob, terminator, window);
+          terminator_at = n_read;
+        }
+
+      /* So, we have a buffer with at least WINDOW bytes of data prepended
+         to it.  We need to find a few points in the buffer, namely where:
+         1) HASH of the initial WINDOW is ready, i.e. all bytes are gulped,
+         2) MAXSZ is reached, 3) HASH <= LE, 4) the buffer ends.  */
+      for (char const *start = buf; start != eob;)
+        {
+          typedef unsigned char const cuchar_t;
+          idx_t const startsz = eob - start;
+          char const *const max_end
+              = write_at_most <= startsz ? start + write_at_most : NULL;
+          char const *hash_end = NULL;
+          char const *unread = start;
+
+          if (to_gulp)
+            {
+              idx_t const gulpable = MIN (startsz, to_gulp);
+              hashcall (phash, (cuchar_t*)start, gulpable);
+              to_gulp -= gulpable;
+              unread += gulpable;
+              if (!to_gulp && (hash32 <= le32 && hash64 <= le64))
+                hash_end = unread;
+            }
+
+          if (!hash_end && unread != eob)
+            {
+              char const *const le_at = (char const *)findcall (
+                  phash, ple, (cuchar_t *)unread, (cuchar_t *)eob, window);
+              if (le_at < eob)
+                unread = hash_end = (le_at + 1);
+              else
+                unread = eob;
+            }
+
+          if (TEST && hash_end)
+            {
+              void const *const last = hash_end - window;
+              uint64_t h64 = 0;
+              uint32_t h32 = 0;
+              if (phash == &hash64)
+                assure ((hashcall (&h64, last, window), h64 == hash64));
+              else if (phash == &hash32)
+                assure ((hashcall (&h32, last, window), h32 == hash32));
+              else
+                affirm (false);
+            }
+
+          char const *const wrend = (hash_end && max_end)
+                                        ? MIN (hash_end, max_end)
+                                    : hash_end ? hash_end
+                                    : max_end  ? max_end
+                                               : eob;
+          ssize_t const to_write = wrend - start;
+          if (filter_ok || new_file_flag)
+            filter_ok = cwrite (new_file_flag, start, to_write);
+          start = wrend;
+          new_file_flag = (hash_end || max_end);
+          if (new_file_flag)
+            {
+              write_at_most = maxsz;
+              hash32 = 0;
+              hash64 = 0;
+              to_gulp = window;
+            }
+          else
+            {
+              write_at_most -= to_write;
+            }
+        }
+      /* BUZHash depends on this window to shift old values out.  GearHash
+         needs it to re-feed the last WINDOW bytes when a *_rawfind function
+         overruns, combined with either a short read or a cut point close to
+         the beginning of the buffer.  A short read may also make the source
+         and destination overlap when N_READ is less than WINDOW.  */
+      memmove (buf - window, eob - window, window);
+    }
+  if (n_read < 0)
+    error (EXIT_FAILURE, errno, "%s", quotef (infile));
+}
+
 /* Split into pieces of exactly N_LINES lines.
    Use buffer BUF, whose size is BUFSIZE.  */
 
@@ -1354,19 +1703,31 @@ no_filters:
     }								\
   while (0)
 
+static int
+einval_ok (int e) { return e == EINVAL ? 0 : e; }
+
 /* Report a string-to-integer conversion failure MSGID with ARG.  */
 
 static _Noreturn void
 strtoint_die (char const *msgid, char const *arg)
 {
-  error (EXIT_FAILURE, errno == EINVAL ? 0 : errno, "%s: %s",
+  error (EXIT_FAILURE, einval_ok (errno), "%s: %s",
          gettext (msgid), quote (arg));
 }
 
+static _Noreturn void
+strtoint_die2 (char const *msgid, char const *arg, char const *end)
+{
+  error (EXIT_FAILURE, einval_ok (errno), "%s: %s",
+         gettext (msgid), quote_mem (arg, end - arg));
+}
+
 /* Use OVERFLOW_OK when it is OK to ignore LONGINT_OVERFLOW errors, since the
    extreme value will do the right thing anyway on any practical platform.  */
 #define OVERFLOW_OK LONGINT_OVERFLOW
 
+static char const byte_multipliers[] = "bEGKkMmPQRTYZ0";
+
 /* Parse ARG for number of bytes or lines.  The number can be followed
    by MULTIPLIERS, and the resulting value must be positive.
    If the number cannot be parsed, diagnose with MSG.
@@ -1394,24 +1755,231 @@ parse_chunk (intmax_t *k_units, intmax_t *n_units, char const *arg)
       *n_units = parse_n_units (argend + 1, "",
                                 N_("invalid number of chunks"));
       if (! (0 < *k_units && *k_units <= *n_units))
-        error (EXIT_FAILURE, 0, "%s: %s", _("invalid chunk number"),
-               quote_mem (arg, argend - arg));
+        strtoint_die2 (N_("invalid chunk number"), arg, argend);
     }
   else if (! (e <= OVERFLOW_OK && 0 < *n_units))
     strtoint_die (N_("invalid number of chunks"), arg);
 }
 
+/* Parse HASH[WINDOW]/AVG/MAX syntax of content-defined chunking SIZE.  */
+
+static enum Cdc_type
+parse_cdc (intmax_t *window, intmax_t *avgsz, intmax_t *maxsz, char const *arg)
+{
+  enum Cdc_type hash = cdc_undef;
+  for (int i = 0; i < ARRAY_SIZE (cdc_names); ++i)
+    if (STRPREFIX (arg, cdc_names[i]))
+      {
+        arg += strlen (cdc_names[i]);
+        hash = (enum Cdc_type)i;
+        break;
+      }
+  if (hash == cdc_undef)
+    strtoint_die (N_("unknown rolling hash"), arg);
+
+  if (*arg == '[')
+    {
+      char *next = NULL;
+      arg++; /* skip '[' */
+      strtol_error e = xstrtoimax (arg, &next, 10, window, byte_multipliers);
+      /* A window below hash width surely makes a bad PRF out of BUZHash.  A
+         longer window does not guarantee a good PRF though.  It's possible to
+         implement GearHash over a shortened window, but that makes terminator
+         calculation trickier and its overall utility is unclear.  */
+      if (e == LONGINT_INVALID_SUFFIX_CHAR && STRNCMP_LIT (next, "]/") == 0
+          && (hash != cdc_buz32 || *window >= 4)
+          && (hash != cdc_buz64 || *window >= 8)
+          && (hash != cdc_gear32 || *window == 32)
+          && (hash != cdc_gear64 || *window == 64))
+        arg = next + 1;
+      else
+        strtoint_die2 (N_ ("invalid rolling hash window"), arg, next);
+    }
+  else if (*arg == '/')
+    switch (hash)
+      {
+      case cdc_buz32:
+      case cdc_buz64:
+        *window = 4095; /* following BorgBackup default */
+        break;
+      case cdc_gear32:
+        *window = 32;
+        break;
+      case cdc_gear64:
+        *window = 64;
+        break;
+      case cdc_undef:
+        affirm (false);
+      }
+  else
+    error (EXIT_FAILURE, 0, _("can't parse %s"), quote (arg));
+
+  arg++; /* skip '/' */
+  char *next = NULL;
+  strtol_error e = xstrtoimax (arg, &next, 10, avgsz, byte_multipliers);
+  if (!(e == LONGINT_OK || (e == LONGINT_INVALID_SUFFIX_CHAR && *next == '/'))
+      || *avgsz < 1)
+    strtoint_die2 (N_ ("invalid average chunk size"), arg, next);
+
+  /* AVGSZ > WINDOW is not a hard requirement for rolling hash, but it's way
+     easier to reason about chunks having at least WINDOW bytes each.  */
+  if (*avgsz <= *window)
+    strtoint_die2 (N_ ("average chunk must be larger than window"), arg, next);
+
+  /* Let's set 40 MiB as the largest chunk size that is supported by decision
+     function over 32-bit hash value with 1% error tolerance.  The smallest
+     value to exceed 1% err is 43821726 bytes.
+
+     Other options are to make exception for power-of-two values or to compute
+     error margin for specific AVGSZ value.  However, several discontinuous
+     ranges of accepted values are kinda confusing from UX standpoint.
+
+     High power-of-two values like 2G or 4G bring another issue to the table.
+     It's not _proven_ that BUZHash can actually produce every possible N-bit
+     hash value for every possible WINDOW.  So it's not proven that 1 or 0 will
+     ever be emitted as a hash value.  It's trivially false for small windows:
+     e.g. 3-byte window has no way to produce more than 2^24 hashes.  */
+  intmax_t const forty_mib = INTMAX_C (41943040);
+  if (cdc_is32 (hash) && *avgsz > forty_mib)
+    strtoint_die2 (N_ ("average chunk that large needs 64-bit hash"), arg,
+                   next);
+
+  /* There is no explicit "signaling" value to skip MAXSZ code altogether.
+     First, 2^63 is large enough.  Second, CDC is probabilistic anyway :-P  */
+  _Static_assert (INT64_MAX <= INTMAX_MAX);
+  *maxsz = INTMAX_MAX;
+  if (*next == '/'
+      && (xstrtoimax (next + 1, NULL, 10, maxsz, byte_multipliers)
+              != LONGINT_OK
+          || *maxsz <= *avgsz))
+    strtoint_die (N_ ("invalid maximum chunk size"), next + 1);
+
+  return hash;
+}
+
+static void
+cdc_table_init (enum Cdc_type hash, char const *random_source, idx_t window)
+{
+  /* Alignment is not vital for CDC lookup tables, but it saves one cache-line
+     and it might save us from confusing fall from the D-cache cliff.  */
+  idx_t const cacheline_size = getcachelinesize ();
+
+  /* The code does not support vectorised rolling hash _implementations_.
+     Naive vectorisation hits memory wall as each input byte is processed
+     through lookup table at least once.  Replacing lookup table with
+     pseudo-random function from u8 to u32|u64 is possible but it makes
+     its interface different as --random-source would use entropy differently.
+     So, potential SIMD implementation is effectively a _different_ rolling
+     hash function with different name.  And it still has to be quite fast
+     to beat SISD implementation running at 1.33 cpb :-)  */
+  if (!random_source && cdc_is64 (hash))
+    cdc_table = buz_seed;
+  else if (!random_source && cdc_is32 (hash))
+    {
+      uint32_t *t = xalignalloc (cacheline_size, N_CHARS * sizeof (*t));
+      for (idx_t i = 0; i < N_CHARS; i++)
+        t[i] = buz_seed[i];
+      cdc_table = t;
+    }
+  else if (random_source && cdc_isgear (hash))
+    {
+      /* random-source for GearHash is N_CHARS little-endian integers */
+      size_t const sizeof_hash
+          = cdc_is64 (hash) ? sizeof (uint64_t) : sizeof (uint32_t);
+      size_t const sizeof_table = N_CHARS * sizeof_hash;
+      void *t = xalignalloc (cacheline_size, sizeof_table);
+      FILE *fd = fopen (random_source, "rb");
+      if (!fd)
+        error (EXIT_FAILURE, errno, "%s", quotef (random_source));
+      size_t const n_read = fread (t, 1, sizeof_table, fd);
+      if (n_read != sizeof_table)
+        error (EXIT_FAILURE, 0, _("%s: got only %zu of %zu bytes"),
+               quotef (random_source), n_read, sizeof_table);
+      fclose (fd);
+      if (hash == cdc_gear64)
+        for (uint64_t *p = t, *const end = p + N_CHARS; p != end; p++)
+          *p = le64toh (*p);
+      else
+        for (uint32_t *p = t, *const end = p + N_CHARS; p != end; p++)
+          *p = le32toh (*p);
+      cdc_table = t;
+    }
+  else if (random_source && cdc_isbuz (hash))
+    {
+      /* random-source for BUZHash is more complex, make-buz-table.c describes
+         the reasons.  GearHash random-source reader is constant-time! It's
+         also 10 times faster, but that's 0.1M CPU cycles vs. 1M.  Extra 0.4ms
+         to init is irrelevant: it's BUZHash runtime over ~400 KiB of data.  */
+      size_t const sizeof_hash
+          = cdc_is64 (hash) ? sizeof (uint64_t) : sizeof (uint32_t);
+      size_t const hash_width = sizeof_hash * UCHAR_WIDTH;
+      /* randperm_bound() is not really a strict bound, it's just a hint.
+         Infinite stream of 0xFF bytes makes the sampling RNG loop!  */
+      size_t const seed_size
+          = randperm_bound (N_CHARS / 2, N_CHARS) * hash_width;
+      struct randint_source *r = randint_all_new (random_source, seed_size);
+      if (!r)
+        error (EXIT_FAILURE, errno, "%s", quotef (random_source));
+      size_t const sizeof_table = sizeof_hash * N_CHARS;
+      void *const t = xalignalloc (cacheline_size, sizeof_table);
+      memset (t, 0, sizeof_table);
+      for (unsigned bit = 0; bit < hash_width; bit++)
+        {
+          uint32_t *const t32 = t;
+          uint64_t *const t64 = t;
+          size_t *const perm = randperm_new (r, N_CHARS / 2, N_CHARS);
+          if (hash == cdc_buz64)
+            for (unsigned c = 0; c < N_CHARS / 2; c++)
+              t64[perm[c]] |= UINT64_C (1) << bit;
+          else
+            for (unsigned c = 0; c < N_CHARS / 2; c++)
+              t32[perm[c]] |= UINT32_C (1) << bit;
+          free (perm);
+        }
+      if (randint_all_free (r))
+        error (EXIT_FAILURE, errno, "%s", quotef (random_source));
+      cdc_table = t;
+    }
+  else
+    affirm (false);
+
+  affirm (window >= 1);
+  if ((hash == cdc_buz32 && window % 32 == 0)
+      || (hash == cdc_buz64 && window % 64 == 0))
+    unbuz_table = cdc_table;
+  else if (hash == cdc_buz32)
+    {
+      uint32_t const *const buz = cdc_table;
+      uint32_t *t = xalignalloc (cacheline_size, N_CHARS * sizeof *t);
+      for (int i = 0; i < N_CHARS; i++)
+        t[i] = rotl32 (buz[i], window % 32);
+      unbuz_table = t;
+    }
+  else if (hash == cdc_buz64)
+    {
+      uint64_t const *const buz = cdc_table;
+      uint64_t *t = xalignalloc (cacheline_size, N_CHARS * sizeof *t);
+      for (int i = 0; i < N_CHARS; i++)
+        t[i] = rotl64 (buz[i], window % 64);
+      unbuz_table = t;
+    }
+  else
+    affirm ((hash == cdc_gear32 && window == 32)
+            || (hash == cdc_gear64 && window == 64));
+}
 
 int
 main (int argc, char **argv)
 {
   enum Split_type split_type = type_undef;
+  enum Cdc_type cdc_type = cdc_undef;
   idx_t in_blk_size = 0;	/* optimal block size of input file device */
   idx_t page_size = getpagesize ();
   intmax_t k_units = 0;
+  intmax_t w_units = 0;
   intmax_t n_units = 0;
+  char const *random_source = NULL;
 
-  static char const multipliers[] = "bEGKkMmPQRTYZ0";
   int c;
   int digits_optind = 0;
   off_t file_size = OFF_T_MAX;
@@ -1461,9 +2029,20 @@ main (int argc, char **argv)
         case 'b':
           if (split_type != type_undef)
             FAIL_ONLY_ONE_WAY ();
-          split_type = type_bytes;
-          n_units = parse_n_units (optarg, multipliers,
-                                   N_("invalid number of bytes"));
+          /* skip any whitespace */
+          while (isspace (to_uchar (*optarg)))
+            optarg++;
+          if (isdigit (to_uchar (*optarg)))
+            {
+              split_type = type_bytes;
+              n_units = parse_n_units (optarg, byte_multipliers,
+                                       N_("invalid number of bytes"));
+            }
+          else
+            {
+              split_type = type_bytes_cdc;
+              cdc_type = parse_cdc (&w_units, &n_units, &k_units, optarg);
+            }
           break;
 
         case 'l':
@@ -1477,7 +2056,7 @@ main (int argc, char **argv)
           if (split_type != type_undef)
             FAIL_ONLY_ONE_WAY ();
           split_type = type_byteslines;
-          n_units = parse_n_units (optarg, multipliers,
+          n_units = parse_n_units (optarg, byte_multipliers,
                                    N_("invalid number of lines"));
           break;
 
@@ -1596,12 +2175,18 @@ main (int argc, char **argv)
           filter_command = optarg;
           break;
 
+        case RANDOM_SOURCE_OPTION:
+          if (random_source && !streq (random_source, optarg))
+            error (EXIT_FAILURE, 0, _("multiple random sources specified"));
+          random_source = optarg;
+          break;
+
         case IO_BLKSIZE_OPTION:
-          in_blk_size = xnumtoumax (optarg, 10, 1,
-                                    MIN (SYS_BUFSIZE_MAX,
-                                         MIN (IDX_MAX, SIZE_MAX) - 1),
-                                    multipliers, _("invalid IO block size"),
-                                    0, XTOINT_MIN_RANGE);
+          in_blk_size
+              = xnumtoumax (optarg, 10, 1,
+                            MIN (SYS_BUFSIZE_MAX, MIN (IDX_MAX, SIZE_MAX) - 1),
+                            byte_multipliers, _ ("invalid IO block size"), 0,
+                            XTOINT_MIN_RANGE);
           break;
 
         case VERBOSE_OPTION:
@@ -1617,7 +2202,7 @@ main (int argc, char **argv)
         }
     }
 
-  if (k_units != 0 && filter_command)
+  if (split_type != type_bytes_cdc && k_units != 0 && filter_command)
     {
       error (0, 0, _("--filter does not process a chunk extracted to "
                      "standard output"));
@@ -1688,10 +2273,40 @@ main (int argc, char **argv)
       if (SYS_BUFSIZE_MAX < in_blk_size)
         in_blk_size = SYS_BUFSIZE_MAX;
     }
+  int const buz_window_max = MIN (in_blk_size, IO_BUFSIZE);
+  if (split_type == type_bytes_cdc && cdc_isbuz (cdc_type)
+      && buz_window_max < w_units)
+    error (
+        EXIT_FAILURE, 0,
+        _ ("%" PRIdMAX " exceeds the largest supported BUZHash window (%d)"),
+        w_units, buz_window_max);
+
+  /* The I/O buffer is IN_BLK_SIZE bytes and is aligned to the PAGE_SIZE.
+     lines_split() uses one more byte to avoid boundary checks with rawmemchr.
+     The same idea needs W_UNITS bytes to terminate GearHash computation.  */
+  idx_t buf_size = in_blk_size;
+  if (split_type == type_digits || split_type == type_lines)
+    buf_size += 1;
+  else if (split_type == type_bytes_cdc && cdc_isgear (cdc_type))
+    buf_size += w_units;
+
+  /* BUZHash needs prepended WINDOW for computation, GearHash needs it
+     for backtracking on short read.  */
+  idx_t prepend = 0;
+  if (split_type == type_bytes_cdc)
+    if (ckd_add (&prepend, 1, (w_units - 1) | (page_size - 1)))
+      xalloc_die ();
+  if (ckd_add (&buf_size, buf_size, prepend))
+    xalloc_die ();
+
+  char *buf = xalignalloc (page_size, buf_size);
+  /* memset() is here to suppress warning about reading uninitialized memory
+     with memmove() in case of short read.  The uninitialized value is not used
+     in computation as it's still hash gulping stage of bytes_cdc_split.  */
+  memset (buf, 0, prepend);
+  buf += prepend;
 
-  char *buf = xalignalloc (page_size, in_blk_size + 1);
   ssize_t initial_read = -1;
-
   if (split_type == type_chunk_bytes || split_type == type_chunk_lines)
     {
       file_size = input_file_size (STDIN_FILENO, &in_stat_buf,
@@ -1701,6 +2316,8 @@ main (int argc, char **argv)
                quotef (infile));
       initial_read = MIN (file_size, in_blk_size);
     }
+  else if (split_type == type_bytes_cdc)
+    cdc_table_init (cdc_type, random_source, w_units);
 
   /* When filtering, closure of one pipe must not terminate the process,
      as there may still be other streams expecting input from us.  */
@@ -1718,6 +2335,10 @@ main (int argc, char **argv)
       bytes_split (n_units, 0, buf, in_blk_size, -1, 0);
       break;
 
+    case type_bytes_cdc:
+      bytes_cdc_split (cdc_type, n_units, k_units, w_units, buf, in_blk_size);
+      break;
+
     case type_byteslines:
       line_bytes_split (n_units, buf, in_blk_size);
       break;
diff --git src/split_cdc.c src/split_cdc.c
new file mode 100644
index 000000000..96424387b
--- /dev/null
+++ src/split_cdc.c
@@ -0,0 +1,150 @@
+/* Hot lookup functions for CDC in split(1).
+
+   Contributed to the GNU project by Leonid Evdokimov.
+
+   Copyright (C) 2026 Free Software Foundation, Inc.
+
+   This program is free software: you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation, either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <https://www.gnu.org/licenses/>.  */
+
+#include <config.h>
+#include "split_cdc.h"
+#include "assure.h"
+
+/* GCC default align-loops is 8, but 16 and 32 were producing interesting
+   results during benchmarking of various versions of this code.  However,
+   the current code doesn't seem to get clear benefits from loop alignment.
+
+   Comparing performance of --bytes gear64/1M, buz64/1M and 1M on i7-6600U
+   suggests that GearHash runs at 1.33 cpb and BUZHash runs at 2.75 cpb.
+
+   It is tempting to remove bounds checks from buz64_find() and buz32_find()
+   caching WINDOW bytes of the first match, but BUZHash is already heavy
+   on well-cached memory reads, so buz64_rawfind() relying on terminator and
+   re-computation performs 16% worse than buz64_find() relying on bounds check:
+
+   Intel Core i7-6600U: gets -25% instructions and -46% branches,
+   but +16% cycles and +16% cycle_activity.cycles_mem_any.
+
+   That's why BUZHash works like memchr and GearHash like rawmemchr.  */
+
+extern unsigned char const *
+buz32_find (void *phash_, void const *ple_, unsigned char const *p,
+            unsigned char const *const end, idx_t window)
+{
+  uint32_t *const phash = phash_;
+  uint32_t const *const ple = ple_;
+  uint32_t const *const buz = cdc_table;
+  uint32_t const *const unbuz = unbuz_table;
+  assume (p < end);
+  uint32_t hash = *phash;
+  uint32_t const le = *ple;
+  for (; p != end; p++)
+    {
+      hash = unbuz[p[-window]] ^ rotl32 (hash, 1) ^ buz[*p];
+      if (hash <= le)
+        break;
+    }
+  *phash = hash;
+  return p;
+}
+
+extern unsigned char const *
+buz64_find (void *phash_, void const *ple_, unsigned char const *p,
+            unsigned char const *const end, idx_t window)
+{
+  uint64_t *const phash = phash_;
+  uint64_t const *const ple = ple_;
+  uint64_t const *const buz = cdc_table;
+  uint64_t const *const unbuz = unbuz_table;
+  assume (p < end);
+  uint64_t hash = *phash;
+  uint64_t const le = *ple;
+  for (; p != end; p++)
+    {
+      hash = unbuz[p[-window]] ^ rotl64 (hash, 1) ^ buz[*p];
+      if (hash <= le)
+        break;
+    }
+  *phash = hash;
+  return p;
+}
+
+extern unsigned char const *
+gear32_rawfind (void *phash_, void const *ple_, unsigned char const *p,
+                unsigned char const *const end, idx_t)
+{
+  uint32_t *const phash = phash_;
+  uint32_t const *const ple = ple_;
+  uint32_t const *const cdc = cdc_table;
+  assume (p < end);
+  uint32_t hash = *phash;
+  uint32_t const le = *ple;
+  for (;; p++)
+    {
+      hash = (hash << 1) + cdc[*p];
+      if (hash <= le)
+        break;
+    }
+  if (p < end)
+    {
+      *phash = hash;
+      return p;
+    }
+  else
+    {
+      idx_t const window = 32;
+      gear32 (phash, end - window, window);
+      return end;
+    }
+}
+
+/* It's trivial to compute reverse of GearHash for any hash value.
+   Let's drop one branch out of two: put the terminator value hashing to zero
+   at the end of the buffer, just like lines_split() does calling rawmemchr().
+
+   Performance gain over gear64_find() is low but noticeable on tested CPUs:
+
+   Intel Core i7-6600U: -22% instructions, -9% cycles.
+   Apple Icestorm-M1:                      -8% task-clock.
+   Apple Firestorm-M1:  -33% instructions, -1.6% cycles (±0.04%).
+
+   Performance gain for gear32_rawfind() is similar on these CPUs.  */
+extern unsigned char const *
+gear64_rawfind (void *phash_, void const *ple_, unsigned char const *p,
+                unsigned char const *const end, idx_t)
+{
+  uint64_t *const phash = phash_;
+  uint64_t const *const ple = ple_;
+  uint64_t const *const cdc = cdc_table;
+  assume (p < end);
+  uint64_t hash = *phash;
+  uint64_t const le = *ple;
+  for (;; p++)
+    {
+      hash = (hash << 1) + cdc[*p];
+      if (hash <= le)
+        break;
+    }
+  if (p < end)
+    {
+      *phash = hash;
+      return p;
+    }
+  else
+    {
+      idx_t const window = 64;
+      gear64 (phash, end - window, window);
+      return end;
+    }
+}
diff --git src/split_cdc.h src/split_cdc.h
new file mode 100644
index 000000000..c6250f97b
--- /dev/null
+++ src/split_cdc.h
@@ -0,0 +1,75 @@
+/* Header for hot lookup functions for CDC in split(1).
+
+   Contributed to the GNU project by Leonid Evdokimov.
+
+   Copyright (C) 2026 Free Software Foundation, Inc.
+
+   This program is free software: you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation, either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <https://www.gnu.org/licenses/>.  */
+
+#ifndef UUID_49BA2172_7262_4B8D_939C_3701442E7FC2
+#define UUID_49BA2172_7262_4B8D_939C_3701442E7FC2
+
+#include <idx.h>
+#include <stdint.h>
+
+/* 64 is the cache-line size for x86-64, Apple M-series chips use 128 bytes.
+   128 works as a good default. It wastes 64 bytes in the worst-case.  */
+enum { CDC_TABLE_DEFAULT_ALIGNAS = 128 };
+
+/* HASH and LE values are passed via pointers due to u32|u64 difference.  */
+typedef void (*cdchash_fn) (void *phash, unsigned char const *p, idx_t count);
+typedef unsigned char const *(*cdcfind_fn) (void *phash, void const *ple,
+                                            unsigned char const *p,
+                                            unsigned char const *const end,
+                                            idx_t window);
+
+extern uint64_t const buz_seed[256];
+
+extern void const *cdc_table;
+
+extern void const *unbuz_table;
+
+void buz32 (void *phash, unsigned char const *p, idx_t count);
+void buz64 (void *phash, unsigned char const *p, idx_t count);
+void gear32 (void *phash, unsigned char const *p, idx_t count);
+void gear64 (void *phash, unsigned char const *p, idx_t count);
+
+unsigned char const *gear32_rawfind (void *phash, void const *ple,
+                                     unsigned char const *p,
+                                     unsigned char const *const end,
+                                     idx_t window);
+unsigned char const *gear64_rawfind (void *phash, void const *ple,
+                                     unsigned char const *p,
+                                     unsigned char const *const end,
+                                     idx_t window);
+unsigned char const *buz32_find (void *phash, void const *ple,
+                                 unsigned char const *p,
+                                 unsigned char const *const end, idx_t window);
+unsigned char const *buz64_find (void *phash, void const *ple,
+                                 unsigned char const *p,
+                                 unsigned char const *const end, idx_t window);
+
+static inline uint32_t
+rotl32 (uint32_t x, unsigned int n)
+{
+  return (x << n) | (x >> ((-n) % 32));
+}
+
+static inline uint64_t
+rotl64 (uint64_t x, unsigned int n)
+{
+  return (x << n) | (x >> ((-n) % 64));
+}
+
+#endif
diff --git tests/local.mk tests/local.mk
index b9f5b897a..38672c88f 100644
--- tests/local.mk
+++ tests/local.mk
@@ -446,6 +446,9 @@ all_tests =					\
   tests/split/suffix-length.sh			\
   tests/split/additional-suffix.sh		\
   tests/split/b-chunk.sh			\
+  tests/split/bytes-cdc.sh			\
+  tests/split/bytes-cdc-offbyone.sh		\
+  tests/split/random-source.sh			\
   tests/split/fail.sh				\
   tests/split/lines.sh				\
   tests/split/line-bytes.sh			\
diff --git tests/split/bytes-cdc-offbyone.sh tests/split/bytes-cdc-offbyone.sh
new file mode 100755
index 000000000..46d0dbc20
--- /dev/null
+++ tests/split/bytes-cdc-offbyone.sh
@@ -0,0 +1,73 @@
+#!/bin/sh
+# check content-defined chunking in 'split' for off-by-one errors.
+
+# Copyright (C) 2026 Free Software Foundation, Inc.
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <https://www.gnu.org/licenses/>.
+
+. "${srcdir=.}/tests/init.sh"; path_prepend_ ./src
+print_ver_ split
+very_expensive_ # 2.5 minutes
+openssl version || skip_ 'openssl required'
+sp="$srcdir/tests/split"
+
+IO_BLKSIZE=$(( 256 * 1024 ))
+
+# Let's prepend a few extra bytes to shift the 32-byte window through
+# the IO_BLKSIZE boundary slowly and check for possible off-by-one errors.
+exp="$sp/exp-intseed-gear32"
+off=1216420
+test $(head -n 1 "$exp") -eq "$off" || framework_failure_
+
+base=$(( $IO_BLKSIZE - ($off % $IO_BLKSIZE) - 35 ))
+for extra in $(seq $base $(( $base + 70 ))); do
+  # Use a file, not a pipe, to ensure that the full IO_BLKSIZE is utilized
+  { same_bytes_ $extra $seed_d && same_bytes_ 128M; } > input \
+    || framework_failure_
+
+  split ---io-blksize=$IO_BLKSIZE --bytes gear32/1M \
+    --filter 'wc --bytes' input > out || fail=1
+
+  out1=$(head -n 1 out)
+  tail -n +2 out > out2 \
+    && tail -n +2 "$exp" > exp2 \
+    || framework_failure_
+  test $out1 -eq $(( $off + $extra )) || fail=1
+  compare exp2 out2 || fail=1
+  rm -f input
+done
+
+# Do the same for BUZHash as it behaves a bit differently at the boundary:
+exp="$sp/exp-intseed-buz32-42"
+off=786013
+test $(head -n 1 "$exp") -eq "$off" || framework_failure_
+
+base=$(( $IO_BLKSIZE - ($off % $IO_BLKSIZE) - 45 ))
+for extra in $(seq $base $(( $base + 90 ))); do
+  { same_bytes_ $extra $seed_d && same_bytes_ 128M; } > input \
+    || framework_failure_
+
+  split ---io-blksize=$IO_BLKSIZE --bytes buz32[42]/1M \
+    --filter 'wc --bytes' input > out || fail=1
+
+  out1=$(head -n 1 out)
+  tail -n +2 out > out2 \
+    && tail -n +2 "$exp" > exp2 \
+    || framework_failure_
+  test $out1 -eq $(( $off + $extra )) || fail=1
+  compare exp2 out2 || fail=1
+  rm -f input
+done
+
+Exit $fail
diff --git tests/split/bytes-cdc.sh tests/split/bytes-cdc.sh
new file mode 100755
index 000000000..fc689a3eb
--- /dev/null
+++ tests/split/bytes-cdc.sh
@@ -0,0 +1,103 @@
+#!/bin/sh
+# show that content-defined chunking works in 'split'.
+
+# Copyright (C) 2026 Free Software Foundation, Inc.
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <https://www.gnu.org/licenses/>.
+
+. "${srcdir=.}/tests/init.sh"; path_prepend_ ./src
+print_ver_ split
+openssl version || skip_ 'openssl required'
+sp="$srcdir/tests/split"
+
+# Ensure the same behavior across little-/big-endian and 64-/32-bit platforms
+# with canonical input.  First, using the PRNG'ed input with the built-in seed.
+# Second, with the same input and PRNG'ed seed as a --random-source.
+for seed in intseed extseed; do
+  if [ $seed = intseed ]; then
+    rs=
+  else
+    same_bytes_ 10K "$seed_c" > "$seed" || framework_failure_
+    rs="--random-source $seed"
+  fi
+
+  for fn in buz32 buz64 gear32 gear64; do
+    same_bytes_ 128M | split --bytes "$fn/1M" $rs \
+      --filter 'wc --bytes' > out || fail=1
+    compare "$sp/exp-${seed}-${fn}" out || fail=1
+  done
+  # Test non-default WINDOW and one that is a multiple of the register width.
+  for window in 42 512; do
+    for fn in buz32 buz64; do
+      same_bytes_ 128M | split --bytes "${fn}[${window}]/1M" $rs \
+        --filter 'wc --bytes' > out || fail=1
+      compare "$sp/exp-${seed}-${fn}-${window}" out || fail=1
+    done
+  done
+done
+rm -f out
+
+# Ensure that <(same_bytes_ 128M) is still the same after split.
+printf '3be8bea6b02ec4e6e85af9c3bfda278f480e49fd390b89d55668535ec4e53259  -' \
+  > 128M.b2sum
+same_bytes_ 128M | b2sum --check 128M.b2sum || framework_failure_
+for fn in buz32 buz64 gear32 gear64; do
+  same_bytes_ 128M | split --bytes "$fn/1M" - out.x || fail=1
+  cat out.x?? | b2sum --check 128M.b2sum || fail=1
+  rm -f out.x??
+done
+
+# Ensure that --filter failing with EPIPE works as expected...
+for fn in buz32 buz64 gear32 gear64; do
+  same_bytes_ 128M | split --bytes $fn/1M - out.x || fail=1
+
+  # both with a buffer larger than Linux pipe capacity and with a small one.
+  head --silent --bytes 8 out.x?? | b2sum -l 256 > exp-8.b2sum
+  head --silent --bytes 72K out.x?? | b2sum -l 256 > exp-72K.b2sum
+
+  for bs in 72K 8; do
+    rm -f out.x??
+    same_bytes_ 128M | \
+      split --bytes $fn/1M --filter "head --bytes $bs"' > $FILE' - out.x
+    cat out.x?? | b2sum --check exp-$bs.b2sum || fail=1
+  done
+
+  # Double-check using number of bytes for a small 8-byte buffer.
+  # That's not true for a larger buffer as some chunks are smaller than 72K.
+  exp=$(wc -l < "$sp/exp-intseed-$fn")
+  test $(cat out.x?? | wc --bytes) -eq $(( exp * 8 )) || fail=1
+  rm -f out.x??
+done
+
+# Ensure that chunk max-size limit works.
+for fn in buz32 buz64 gear32 gear64; do
+  maxsz=2850325
+  same_bytes_ 128M | split --bytes "$fn/1M/$maxsz" \
+    --filter 'wc --bytes' > out || fail=1
+  # This AWK code doesn't work for arbitrary MAXSZ, but it works for test data.
+  awk -v M=$maxsz '
+    ($1 > M) {
+      for (i = 0; i < int($1 / M); i++)
+        print M;
+      print ($1 % M);
+    }
+    ($1 <= M)' <"$sp/exp-intseed-${fn}" >exp-$fn-EM
+  compare exp-${fn}-EM out || fail=1
+done
+
+# Max chunk size must be greater than the average chunk size.
+returns_ 1 split --bytes gear32/1M/1M   /dev/null || fail=1
+returns_ 1 split --bytes gear32/1M/512K /dev/null || fail=1
+
+Exit $fail
diff --git tests/split/exp-extseed-buz32 tests/split/exp-extseed-buz32
new file mode 100644
index 000000000..3baa3575b
--- /dev/null
+++ tests/split/exp-extseed-buz32
@@ -0,0 +1,112 @@
+1261438
+1613408
+1840980
+1055636
+210919
+261637
+484353
+76211
+2275378
+1022568
+769290
+879006
+208522
+1453117
+6846142
+2389038
+276323
+2744550
+5209545
+226421
+85231
+561929
+1527164
+98608
+133253
+446215
+4979880
+765788
+610225
+490840
+1693061
+730334
+982088
+741243
+14128
+27490
+81669
+5551681
+286471
+2535437
+389343
+2019731
+2586826
+2875499
+335209
+1184095
+287066
+1128159
+189694
+577642
+200700
+1597974
+892882
+1401259
+1001320
+74916
+61094
+2363081
+2005184
+145643
+776161
+78001
+80020
+705246
+451156
+554640
+106714
+2466967
+2965436
+1710490
+635953
+688524
+925620
+617055
+995477
+463072
+398772
+3340923
+969170
+564442
+232413
+474599
+2644580
+163295
+126331
+1445114
+654503
+312796
+4124388
+1220296
+1082982
+561827
+2587448
+1028169
+90695
+1913709
+456174
+2794953
+1554860
+733860
+679605
+694433
+1208822
+1321442
+2586290
+3229138
+856010
+309163
+91117
+602593
+1658663
+2519087
diff --git tests/split/exp-extseed-buz32-42 tests/split/exp-extseed-buz32-42
new file mode 100644
index 000000000..cc7346091
--- /dev/null
+++ tests/split/exp-extseed-buz32-42
@@ -0,0 +1,122 @@
+3067675
+1049700
+497700
+834400
+1345047
+373660
+3812603
+567182
+4210901
+615877
+1615753
+188525
+417961
+561649
+959798
+56989
+121949
+36068
+1544649
+69163
+58470
+333655
+1121254
+953549
+852668
+389015
+419596
+207977
+337436
+6909
+1113612
+718574
+170659
+525083
+2606309
+7976508
+75690
+1293
+3139393
+4337162
+1828218
+2618573
+2615762
+1114736
+2650731
+1666253
+31209
+434315
+2631333
+302883
+689451
+2058218
+219771
+320969
+2001518
+28281
+2108257
+710705
+498645
+1695353
+2890621
+1730297
+1093501
+4369785
+1187198
+153812
+441355
+427112
+558118
+453520
+901479
+2174965
+292321
+877761
+417585
+1046826
+23096
+1983481
+373476
+1829099
+1393415
+2017429
+879126
+855129
+387688
+97158
+1405790
+156206
+504564
+942551
+609696
+306100
+476493
+197918
+281934
+2259622
+2017125
+1491135
+1521299
+280512
+694978
+2010645
+515296
+1058818
+529541
+534338
+539237
+594327
+1164475
+1942629
+3182933
+163926
+365017
+927792
+1187478
+261545
+412725
+51332
+181742
+2479405
+383843
+212166
diff --git tests/split/exp-extseed-buz32-512 tests/split/exp-extseed-buz32-512
new file mode 100644
index 000000000..b5bea9e23
--- /dev/null
+++ tests/split/exp-extseed-buz32-512
@@ -0,0 +1,127 @@
+2326491
+1392837
+2782183
+2839123
+493728
+166627
+34612
+1198296
+811804
+155195
+446605
+164776
+62444
+595795
+51412
+40514
+181572
+1197540
+16469
+1572868
+1331775
+2550289
+265640
+356447
+464319
+352703
+1822336
+1194338
+485000
+1732981
+1421700
+927016
+391995
+142683
+1159532
+420721
+62461
+900867
+561320
+1502311
+1790656
+1603067
+1726728
+547269
+122734
+1892715
+831238
+818023
+275343
+226134
+33695
+264899
+1285626
+2553194
+1193942
+816743
+380293
+460999
+2165674
+4241992
+1894678
+187925
+719645
+1415794
+2121976
+71427
+99962
+1681047
+1544096
+5207854
+490362
+2805130
+141595
+324765
+1973694
+304242
+333427
+272777
+3341175
+1197955
+1995719
+1976838
+48427
+490241
+116714
+767818
+1647768
+1033725
+3059334
+86370
+39889
+3340386
+2236159
+972749
+1246871
+2343841
+249684
+1233630
+142049
+593462
+3482110
+656847
+592899
+600276
+190424
+1775099
+440486
+775115
+76024
+714017
+225449
+127045
+213276
+1019804
+1764251
+24497
+123036
+461473
+872860
+547840
+28766
+4828970
+2463368
+1680264
+743719
+495958
+1734366
diff --git tests/split/exp-extseed-buz64 tests/split/exp-extseed-buz64
new file mode 100644
index 000000000..2fbc60f16
--- /dev/null
+++ tests/split/exp-extseed-buz64
@@ -0,0 +1,122 @@
+714188
+420144
+455615
+292586
+2166914
+15044
+81887
+685326
+89508
+374098
+1080966
+234417
+369074
+899991
+1389612
+843479
+4371885
+552436
+623634
+459972
+3741631
+8295536
+334020
+225191
+453678
+1270234
+94242
+1314669
+677420
+578854
+98607
+3214288
+2161151
+1216503
+1229251
+986416
+654813
+529503
+3193252
+1087139
+719227
+1666091
+621348
+1103743
+2410361
+155865
+15550
+199491
+1491828
+720568
+62879
+575690
+245119
+81882
+2356484
+3823827
+550251
+3403379
+3542019
+836407
+1052041
+209517
+54433
+2747809
+1111904
+8390188
+397968
+741423
+1128524
+97108
+2907415
+546821
+886890
+1718094
+894855
+348223
+231130
+390141
+1745297
+44710
+623433
+598615
+730673
+380717
+863523
+1258176
+812728
+716975
+926095
+2089604
+238552
+1055561
+1062009
+1056102
+983309
+322367
+1970601
+1306990
+1657115
+877432
+908189
+127589
+936077
+504517
+236642
+470415
+221948
+735426
+305461
+62351
+82689
+343111
+1033230
+1895629
+461173
+606656
+83606
+3783435
+743748
+335619
+1703469
+1402497
diff --git tests/split/exp-extseed-buz64-42 tests/split/exp-extseed-buz64-42
new file mode 100644
index 000000000..7f05caa33
--- /dev/null
+++ tests/split/exp-extseed-buz64-42
@@ -0,0 +1,139 @@
+2143865
+30055
+23107
+251137
+68283
+385402
+505244
+1030046
+2082283
+673007
+482043
+1548176
+510209
+170630
+299772
+4260983
+1220047
+334536
+412304
+1108516
+1873831
+742585
+2018886
+818464
+772588
+1669555
+486421
+1156070
+3973664
+225558
+1617048
+211530
+63550
+161266
+444358
+382042
+981010
+1725541
+421565
+444731
+2099431
+1230633
+4227
+96658
+148435
+602854
+810863
+343805
+1349979
+122152
+1232192
+527354
+419761
+528877
+33053
+2905
+28291
+2749103
+274895
+1122055
+692590
+3376548
+101786
+855385
+120251
+510054
+1031099
+560485
+494723
+1606567
+131557
+519824
+2145574
+7514
+387330
+1943200
+2858587
+89156
+136412
+1030293
+630633
+1222687
+705650
+135647
+677090
+1448923
+2115091
+323629
+358757
+51530
+1766140
+1662001
+456441
+948949
+963008
+1187537
+6507716
+708935
+331339
+1604072
+973508
+53367
+3304723
+3429482
+1318946
+614059
+815283
+673978
+132062
+517423
+5695
+453140
+914496
+2970427
+2451695
+2531183
+538317
+706404
+2094960
+524849
+239136
+660195
+93345
+1653607
+128383
+116002
+507440
+1978391
+974895
+931313
+91318
+466300
+21239
+4436478
+516633
+211925
+24347
+98216
+2182427
diff --git tests/split/exp-extseed-buz64-512 tests/split/exp-extseed-buz64-512
new file mode 100644
index 000000000..398b11c37
--- /dev/null
+++ tests/split/exp-extseed-buz64-512
@@ -0,0 +1,131 @@
+349208
+106431
+127323
+36202
+228238
+510751
+704440
+210495
+1550466
+2860869
+1086173
+2501806
+592013
+1944124
+784218
+89550
+537700
+94412
+3389667
+989462
+255346
+1710711
+1029826
+1468757
+1180397
+902052
+1845487
+2003530
+419457
+561474
+135770
+968855
+1087269
+194234
+1570832
+747838
+112609
+1522675
+22153
+1089631
+2694
+2087358
+900935
+794944
+1134301
+4218131
+1532239
+655240
+1135936
+576669
+269642
+736598
+2530316
+241919
+852233
+1890424
+1355076
+150711
+1444104
+1487979
+110637
+1143693
+611423
+124664
+179179
+391579
+1382176
+1521025
+1310529
+124705
+1156767
+1874670
+952172
+1968414
+146973
+885773
+414384
+40499
+499524
+119577
+538912
+239784
+580998
+4491196
+245782
+1614860
+1888690
+1211912
+503481
+126459
+638869
+58619
+1390092
+1080850
+614041
+698078
+233458
+91884
+3005565
+1237575
+379148
+3413717
+2004177
+519371
+243788
+4132305
+253811
+1073266
+1358858
+219584
+337019
+508666
+63730
+10141
+194406
+897536
+522380
+1295806
+1191709
+985145
+2864762
+21102
+243795
+2090091
+562661
+2733948
+1325702
+824495
+1428226
+4727504
+821511
diff --git tests/split/exp-extseed-gear32 tests/split/exp-extseed-gear32
new file mode 100644
index 000000000..877ff09da
--- /dev/null
+++ tests/split/exp-extseed-gear32
@@ -0,0 +1,129 @@
+458704
+1233727
+1951930
+1050043
+1287616
+41083
+167500
+1651359
+701576
+35109
+122755
+163879
+1324063
+95412
+825114
+159751
+334316
+707576
+567946
+802452
+1720440
+240853
+1572732
+629661
+1397905
+731040
+670517
+2828856
+186434
+844385
+140894
+604879
+365517
+253974
+330572
+2360192
+31283
+1270216
+105436
+1570952
+576709
+200015
+475844
+741986
+1248860
+636660
+75022
+97497
+2345404
+1361694
+2401324
+420157
+996165
+424913
+2042468
+214309
+1261794
+1186817
+126465
+871509
+2675774
+1795639
+564320
+905802
+1150401
+1070664
+1142283
+608117
+1950938
+2589687
+122618
+2357367
+100326
+869528
+1574363
+337133
+244553
+962507
+2140611
+2050992
+1162993
+2425709
+1141189
+1001050
+4049285
+343107
+2135800
+13601
+992664
+800731
+1159118
+1253739
+884939
+282603
+1258722
+2165328
+2456907
+119059
+3069428
+485583
+521662
+987513
+552949
+1371989
+986358
+716560
+2183313
+1160920
+899688
+1883236
+1375197
+311496
+2189572
+1191274
+185401
+2804995
+1258127
+481647
+903896
+888382
+361986
+45562
+1035576
+1876766
+222527
+2035028
+968167
+142317
+3016185
diff --git tests/split/exp-extseed-gear64 tests/split/exp-extseed-gear64
new file mode 100644
index 000000000..f5c8517ab
--- /dev/null
+++ tests/split/exp-extseed-gear64
@@ -0,0 +1,138 @@
+2313060
+1099982
+1510663
+306331
+927071
+747377
+1026265
+3592764
+382648
+2048152
+716035
+1189115
+1278277
+951374
+1125995
+929166
+558763
+402371
+3745
+275154
+305652
+1078085
+1760570
+63387
+352296
+211640
+4078628
+1219944
+1651050
+75373
+962222
+367285
+1024037
+1825233
+95439
+1037792
+273660
+13408
+163464
+455421
+156097
+534704
+323627
+2786179
+3868728
+199165
+2879685
+1222473
+1942194
+144530
+1086959
+340918
+862676
+471636
+981690
+3007324
+254908
+664082
+937607
+448435
+658771
+152144
+2299571
+469452
+396059
+477919
+35892
+723818
+850436
+125732
+283570
+35137
+2773224
+1521298
+254429
+2347375
+618483
+317153
+1330567
+894199
+667682
+1365911
+2906867
+8316
+3179190
+1453935
+2225432
+950093
+231109
+413390
+231443
+929667
+90169
+198546
+649806
+977085
+394904
+1289552
+34723
+24962
+1247519
+1116428
+207221
+529612
+1251170
+428308
+907330
+455746
+1849738
+505481
+1033801
+685086
+2850384
+1038170
+883961
+991371
+361781
+429521
+956456
+78077
+1053089
+304397
+20244
+29569
+161603
+2573763
+2030852
+815720
+760113
+1452018
+575171
+2523875
+1140755
+3926774
+1289223
+982012
+283507
+187360
diff --git tests/split/exp-intseed-buz32 tests/split/exp-intseed-buz32
new file mode 100644
index 000000000..d32f70728
--- /dev/null
+++ tests/split/exp-intseed-buz32
@@ -0,0 +1,123 @@
+332216
+251470
+3095088
+1725672
+68521
+2573232
+3252818
+1664001
+2034146
+925543
+330182
+877303
+48678
+2130078
+1529923
+1086291
+1263597
+1138420
+1535349
+745669
+831830
+1305874
+326940
+1369547
+505254
+217831
+660676
+666341
+969528
+1558201
+1160661
+23558
+591890
+2790925
+954897
+329926
+2364220
+2179763
+1030458
+322009
+555304
+1255713
+442997
+514319
+155024
+5603269
+520482
+218500
+696890
+1122708
+72307
+77569
+1501455
+31754
+851158
+889157
+1217174
+693235
+381482
+492139
+966710
+623754
+1734652
+2248464
+1147844
+624051
+735115
+38654
+470661
+435436
+2876378
+2586628
+412228
+474844
+908122
+763665
+157081
+844661
+558726
+563589
+2150023
+2151393
+31619
+77069
+3105759
+3094512
+142469
+623804
+945624
+1146178
+1551649
+94253
+76500
+1836396
+990366
+878910
+831478
+207849
+87752
+2844187
+945619
+2798903
+617705
+1938590
+1512549
+121137
+377177
+1610928
+2439078
+381046
+1662248
+1327011
+696555
+489074
+1332991
+402766
+458485
+302769
+89729
+4758416
+1450403
+1776206
+628128
diff --git tests/split/exp-intseed-buz32-42 tests/split/exp-intseed-buz32-42
new file mode 100644
index 000000000..6d1bea497
--- /dev/null
+++ tests/split/exp-intseed-buz32-42
@@ -0,0 +1,132 @@
+786013
+7725759
+849906
+1071767
+1305252
+126238
+33984
+1750354
+2749944
+77576
+394105
+2342621
+237467
+1937138
+1839853
+1971452
+912780
+1205163
+80462
+270020
+240460
+787713
+373366
+2090746
+745931
+2727757
+1828087
+707677
+677085
+513739
+221246
+316214
+159613
+1865645
+3956
+1757732
+62662
+813497
+521433
+395012
+1082463
+509437
+1244925
+1139284
+403037
+119759
+854088
+67270
+573758
+4113965
+586548
+303161
+148140
+672696
+386546
+1122797
+476954
+582995
+450075
+610381
+19431
+36128
+268680
+15512
+951642
+139120
+614238
+576630
+2517294
+1016914
+143626
+2834722
+571996
+304500
+913936
+1642
+826432
+128205
+409462
+1521475
+2802522
+3239488
+2156153
+1276167
+635094
+1073647
+17045
+153916
+1054129
+283368
+785419
+54467
+69547
+800926
+261352
+384536
+2825648
+1343652
+343941
+739357
+4131346
+1214670
+557912
+1222011
+549678
+34509
+1773293
+505102
+561015
+981910
+4390884
+65974
+2337513
+715952
+1240446
+1662306
+3441233
+950437
+1094564
+2921776
+319528
+30056
+950443
+1867855
+1675725
+278659
+200718
+3306228
+234812
+235830
+448461
+279219
diff --git tests/split/exp-intseed-buz32-512 tests/split/exp-intseed-buz32-512
new file mode 100644
index 000000000..8f61a31c2
--- /dev/null
+++ tests/split/exp-intseed-buz32-512
@@ -0,0 +1,130 @@
+901571
+365482
+72628
+821042
+910820
+332384
+1032465
+813565
+4020273
+1573720
+1491645
+333015
+4618054
+1901839
+923096
+354617
+2397487
+37980
+215446
+77070
+280103
+582089
+3637850
+636324
+251402
+511092
+368466
+1058795
+5553955
+466898
+927082
+21309
+843828
+2330786
+2271093
+563284
+2432018
+741549
+459347
+2815254
+155453
+945198
+79690
+492250
+1824999
+161030
+1598423
+270144
+622398
+171008
+1693823
+716491
+353914
+505260
+815146
+501107
+1957425
+976367
+283807
+296236
+180746
+688132
+3013980
+234047
+341472
+317683
+1733333
+2932063
+1543090
+1017108
+1400382
+297925
+138912
+1704314
+539898
+43418
+317913
+2443060
+1578666
+101319
+1149284
+1450393
+1049257
+582090
+2932002
+1749174
+669164
+178327
+304003
+1761305
+419613
+103364
+1192514
+899910
+1475376
+376087
+96774
+1647879
+828148
+2643623
+1117150
+828109
+532587
+382147
+276810
+98777
+1179708
+404884
+747790
+240162
+2107758
+46937
+759261
+1714639
+947181
+643796
+744813
+493898
+674027
+5680272
+953603
+70125
+1030350
+658131
+1337744
+275700
+527162
+1937342
+490049
+874176
diff --git tests/split/exp-intseed-buz64 tests/split/exp-intseed-buz64
new file mode 100644
index 000000000..94b340780
--- /dev/null
+++ tests/split/exp-intseed-buz64
@@ -0,0 +1,135 @@
+2720163
+43382
+1323758
+821176
+25348
+3693097
+4299279
+45784
+653767
+304443
+1797028
+90350
+1147106
+2361035
+599184
+1339004
+2420219
+354821
+67171
+888025
+836055
+2199392
+1042436
+48897
+34196
+774314
+12225
+1486331
+1026678
+2027891
+810574
+1156576
+975391
+508640
+2509297
+2643304
+3926535
+1030416
+269323
+626259
+26835
+1409420
+393781
+329520
+329700
+7961
+1353438
+1860472
+22314
+64758
+1075831
+2649055
+92406
+163232
+3755259
+1245826
+58396
+1207419
+342701
+212675
+66168
+3229366
+200299
+777286
+155404
+2349355
+425295
+302483
+2944148
+1920144
+86206
+1699044
+489625
+844524
+2418623
+672774
+13402
+729836
+184526
+946001
+374196
+938260
+1409131
+476084
+176498
+2076799
+1188681
+1804128
+490418
+378184
+52130
+357863
+1329287
+2297693
+15585
+202988
+143706
+413143
+159955
+2172289
+429130
+265846
+180490
+661587
+399289
+3950124
+755268
+1056635
+652127
+1949565
+1421794
+87798
+604064
+1235431
+919732
+120592
+189503
+1235988
+156256
+1381452
+1476313
+480482
+1240167
+2031273
+125885
+1968109
+744753
+1594843
+742663
+1518320
+129479
+1416736
+241990
+203519
+199129
diff --git tests/split/exp-intseed-buz64-42 tests/split/exp-intseed-buz64-42
new file mode 100644
index 000000000..3b53bbc55
--- /dev/null
+++ tests/split/exp-intseed-buz64-42
@@ -0,0 +1,129 @@
+672223
+1549924
+2006310
+748193
+1456122
+271088
+1180572
+804104
+316054
+125793
+582654
+1632469
+250245
+413576
+157746
+10147784
+10951
+291563
+177014
+861450
+869437
+371961
+943720
+616158
+2042856
+265640
+42210
+223127
+35387
+2041045
+1253318
+214163
+4189
+3385506
+202072
+632838
+5553
+2473867
+326354
+1219454
+368762
+1274219
+276408
+646804
+1009324
+1899966
+1168442
+1612969
+71076
+3402718
+3782405
+386902
+201576
+296497
+1436364
+959563
+120468
+887242
+139469
+838603
+995032
+1593871
+347783
+1477377
+677745
+603355
+1353784
+875549
+265493
+1995140
+468832
+858300
+25700
+232508
+704178
+27229
+213666
+2170442
+870825
+1407167
+1192455
+59080
+2210580
+572622
+205832
+334063
+399422
+516899
+1922032
+1146047
+1188061
+1906655
+264164
+531413
+3815666
+5752223
+854828
+829995
+1145103
+2389844
+863776
+2824734
+284955
+1731311
+88315
+110979
+381292
+580509
+1316797
+65103
+325088
+493286
+562203
+4194732
+307975
+954032
+1670882
+1058139
+741102
+1147362
+1675805
+2920350
+150853
+1474104
+277245
+128347
+1408429
+10951
+463645
diff --git tests/split/exp-intseed-buz64-512 tests/split/exp-intseed-buz64-512
new file mode 100644
index 000000000..13f84a960
--- /dev/null
+++ tests/split/exp-intseed-buz64-512
@@ -0,0 +1,139 @@
+1992499
+622133
+552085
+101258
+516069
+903132
+189812
+432899
+966667
+628377
+222642
+1167660
+556415
+111045
+104865
+923674
+193820
+532016
+185352
+28669
+1316744
+3500009
+392901
+3232059
+1362210
+1022944
+685778
+1224496
+101728
+1887811
+776775
+64249
+34341
+114006
+151315
+1685464
+333374
+1930149
+2602002
+13006
+751866
+844119
+263442
+1329058
+678600
+3294086
+443510
+2848795
+611964
+1452857
+44760
+11303
+925465
+236917
+315390
+1878553
+1071407
+906291
+646177
+1129160
+48758
+549358
+770876
+62085
+1614552
+726405
+958519
+54416
+93538
+216237
+1220373
+21563
+1364863
+245783
+3245591
+2542469
+1504713
+435757
+999887
+937245
+166822
+646878
+2921453
+2657633
+56264
+182270
+793679
+477719
+176762
+1647932
+745398
+514353
+2110283
+2719147
+1455667
+16750
+1882117
+2747604
+3644347
+712160
+1862543
+729224
+16171
+55397
+430013
+1945899
+400435
+1076381
+1928244
+1878693
+1904793
+2081067
+1476048
+327384
+1363748
+330097
+855646
+632484
+241223
+1843169
+102974
+372261
+5512
+119336
+185167
+2011511
+1925278
+232497
+442086
+1834914
+36071
+298145
+2083252
+1079269
+296062
+1537846
+1754540
+32046
+1555906
diff --git tests/split/exp-intseed-gear32 tests/split/exp-intseed-gear32
new file mode 100644
index 000000000..c38b0d759
--- /dev/null
+++ tests/split/exp-intseed-gear32
@@ -0,0 +1,124 @@
+1216420
+41216
+27644
+512443
+4275334
+4830
+2219048
+1191820
+2469443
+1184975
+817364
+204007
+691476
+391617
+34618
+79007
+725743
+3901076
+178657
+921770
+3717303
+1224090
+515141
+602648
+46219
+695094
+789338
+1696420
+181226
+426566
+1327687
+2614578
+1551796
+992835
+315486
+2022434
+668418
+2036896
+1246371
+1024293
+78221
+1457759
+1050792
+1607617
+2243860
+763606
+3763493
+515652
+1923041
+392903
+924515
+1104146
+729807
+839105
+132872
+513022
+1079546
+281425
+384869
+894598
+2568902
+634120
+554888
+2025566
+1317559
+973185
+289633
+369722
+300568
+1752096
+125855
+835574
+2148124
+207301
+773420
+2093424
+2787775
+2428862
+635706
+246625
+760765
+1018362
+1968538
+2078049
+548664
+291312
+1145194
+1790666
+875350
+1202076
+584396
+599910
+307957
+785980
+592883
+61746
+1017136
+2773443
+1276681
+288746
+1497837
+208037
+1111340
+691164
+3079389
+373052
+1662763
+1026106
+1070793
+1100606
+846477
+215568
+2871355
+29840
+35252
+538686
+483429
+3960294
+404235
+984805
+141680
+538832
+2825124
+20109
diff --git tests/split/exp-intseed-gear64 tests/split/exp-intseed-gear64
new file mode 100644
index 000000000..92f2977e0
--- /dev/null
+++ tests/split/exp-intseed-gear64
@@ -0,0 +1,142 @@
+169938
+742758
+19515
+230293
+1013564
+823013
+372554
+276579
+239829
+1291642
+124844
+487538
+44802
+676729
+819803
+891492
+1296896
+770258
+93832
+1428363
+916185
+650974
+249618
+1590145
+199971
+211741
+5315618
+510438
+5390166
+1499820
+1592814
+470256
+627447
+1995742
+434397
+826543
+4174758
+722692
+1275671
+1094242
+106197
+206508
+707236
+635171
+417031
+1172504
+237496
+2455875
+2165730
+437273
+2184378
+1582606
+2254616
+397695
+1211988
+275084
+83580
+86777
+1971887
+1997204
+704217
+1434199
+118769
+897956
+109370
+3586815
+2813915
+9443
+1698315
+522770
+163960
+111243
+527270
+1384456
+126444
+799945
+60190
+385961
+1926145
+643059
+1593799
+1024004
+103203
+45453
+897293
+359827
+603891
+1071705
+502008
+2252290
+270564
+109652
+471641
+551023
+470950
+1793162
+626209
+764051
+765795
+756601
+906351
+1501127
+523073
+753857
+105295
+425382
+88843
+1609149
+2070498
+860741
+747923
+1476488
+1398529
+1615217
+1402091
+639738
+401855
+1360586
+288319
+197643
+883347
+640564
+5513
+2401755
+982953
+165243
+611415
+930868
+662904
+644576
+2166169
+803756
+1022398
+47238
+1789815
+373560
+1335425
+2158366
+398758
+127917
+915859
+1568745
diff --git tests/split/random-source.sh tests/split/random-source.sh
new file mode 100755
index 000000000..0739f8e16
--- /dev/null
+++ tests/split/random-source.sh
@@ -0,0 +1,71 @@
+#!/bin/sh
+# show that --random-source for BUZHash is quite funky in 'split'.
+
+# Copyright (C) 2026 Free Software Foundation, Inc.
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <https://www.gnu.org/licenses/>.
+
+. "${srcdir=.}/tests/init.sh"; path_prepend_ ./src
+print_ver_ split
+openssl version || skip_ 'openssl required'
+
+zero_bytes_ 1M | tr '\0' '\377' > ffff-1M \
+  && zero_bytes_ 4096 > zero-4K \
+  && zero_bytes_ 4095 > zero-4K-1 \
+  && zero_bytes_ 8192 > zero-8K \
+  && zero_bytes_ 8191 > zero-8K-1 \
+  && same_bytes_ 1024 "$seed_c" > rand-1K \
+  && same_bytes_ 1023 "$seed_c" > rand-1K-1 \
+  && same_bytes_ 2048 "$seed_c" > rand-2K \
+  && same_bytes_ 2047 "$seed_c" > rand-2K-1 \
+  && same_bytes_ 4K   "$seed_c" > rand-4K \
+  && same_bytes_ 5K   "$seed_c" > rand-5K \
+  && same_bytes_ 8K   "$seed_c" > rand-8K \
+  && same_bytes_ 10K  "$seed_c" > rand-10K \
+  && printf '' > input \
+  || framework_failure_
+
+# The absolute minimum size for buz32 is 4K...
+returns_ 1 split -b buz32/1M --random-source zero-4K-1 input || fail=1
+returns_ 0 split -b buz32/1M --random-source zero-4K input || fail=1
+
+# and it's 8K for buz64...
+returns_ 1 split -b buz64/1M --random-source zero-8K-1 input || fail=1
+returns_ 0 split -b buz64/1M --random-source zero-8K input || fail=1
+
+# but a high-entropy source fails as the sampling RNG discards a few bytes:
+returns_ 1 split -b buz32/1M --random-source rand-4K input || fail=1
+returns_ 1 split -b buz64/1M --random-source rand-8K input || fail=1
+
+# and BUZHash has no upper bound if random-source degrades to a stream of \xFF:
+returns_ 1 split -b buz32/1M --random-source ffff-1M input || fail=1
+returns_ 1 split -b buz64/1M --random-source ffff-1M input || fail=1
+
+# However, high-entropy 5K and 10K (values from randperm_bound) are okayish.
+# They're tested because those are the values given in the doc as well.
+returns_ 0 split -b buz32/1M --random-source rand-5K input || fail=1
+returns_ 0 split -b buz64/1M --random-source rand-10K input || fail=1
+
+# GearHash needs 1K and 2K, but it explicitly demands non-zero entropy:
+returns_ 1 split -b gear32/1M --random-source rand-1K-1 input || fail=1
+returns_ 0 split -b gear32/1M --random-source rand-1K input || fail=1
+returns_ 1 split -b gear32/1M --random-source zero-8K input || fail=1
+returns_ 1 split -b gear32/1M --random-source ffff-1M input || fail=1
+
+returns_ 1 split -b gear64/1M --random-source rand-2K-1 input || fail=1
+returns_ 0 split -b gear64/1M --random-source rand-2K input || fail=1
+returns_ 1 split -b gear64/1M --random-source zero-8K input || fail=1
+returns_ 1 split -b gear64/1M --random-source ffff-1M input || fail=1
+
+Exit $fail
-- 
2.34.1

From 02b42ae4aeffe8c4737317779570f47a7acb0ca4 Mon Sep 17 00:00:00 2001
From: Leonid Evdokimov <[email protected]>
Date: Mon, 2 Mar 2026 00:44:53 +0300
Subject: [PATCH 2/3] maint: split: drop duplicated code

The SYS_BUFSIZE_MAX check already exists in io_blksize ().
---
 src/split.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git src/split.c src/split.c
index 474141c30..d8a213eed 100644
--- src/split.c
+++ src/split.c
@@ -2268,11 +2268,7 @@ main (int argc, char **argv)
     error (EXIT_FAILURE, errno, "%s", quotef (infile));
 
   if (in_blk_size == 0)
-    {
-      in_blk_size = io_blksize (&in_stat_buf);
-      if (SYS_BUFSIZE_MAX < in_blk_size)
-        in_blk_size = SYS_BUFSIZE_MAX;
-    }
+    in_blk_size = io_blksize (&in_stat_buf);
   int const buz_window_max = MIN (in_blk_size, IO_BUFSIZE);
   if (split_type == type_bytes_cdc && cdc_isbuz (cdc_type)
       && buz_window_max < w_units)
-- 
2.34.1
