Hello. I'd like to ask the split maintainers whether a patch implementing content-defined chunking (CDC) for split would be of interest.
My use-case is simple: I need a content-defined chunker to put 100'000 versions of a ~500 MiB text file into a git repository, to benefit from the excellent xdelta implementation in the git toolkit. But I'd like the file to be split at the very same places across versions to keep xdelta efficient. I've looked around and found a few projects doing that, but most of them have been unmaintained for a while.

I'm thinking of implementing a patch for split(1) that extends the current CLI interface in the following way:

  --hash-seed '...'       defines the seed for the rolling hash
  --bytes h/i/N[W]        split if hash(window) % N == i, producing chunks of SIZE=N on average
  --line-bytes h/i/N[W]   similar, but preserving line boundaries

If `i` is omitted (e.g. h/N[W]), it defaults to 0. `W` is the width of the rolling hash window; it defaults to 0xFFF if omitted (e.g. h/N), following the default value in the Borg backup tool.

The CDC algorithm I have in mind is a BUZhash-based one, as it allows an arbitrary window width. And, maybe, SipHash for --line-bytes, if it shows a measurable performance gain over BUZhash given the limited number of candidate cut points there.

Would such a patch be considered for inclusion into split(1), or is this not generic enough for coreutils?

-- WBRBW, Leonid Evdokimov, https://darkk.net.ru tel:+79816800702 PGP: 6691 DE6B 4CCD C1C1 76A0 0D4A E1F2 A980 7F50 FAB2