Hello. I'd like to ask the split maintainers whether a patch implementing content-defined chunking (CDC) for split would be of interest.
My use-case is simple: I need a content-defined chunker to put 100'000 versions of a ~500 MiB text file into a git repository, to benefit from the excellent xdelta implementation in the git toolkit. But I'd like the file to be split at the very same places across versions to keep xdelta efficient. I've looked around and found a few projects doing that, but most of them have been unmaintained for a while.

I'm thinking of implementing a patch for split(1) that extends the current CLI interface in the following way:

  --hash-seed '...'       defines the seed for the rolling hash
  --bytes h/i/N[W]        split if hash(window) % N == i, producing chunks of SIZE=N on average
  --line-bytes h/i/N[W]   similar, but preserving line boundaries

If `i` is omitted (e.g. h/N[W]), it defaults to 0. `W` is the width of the rolling hash window; it defaults to 0xFFF if omitted (e.g. h/N), following the default value in the Borg backup tool.

The CDC algorithm I have in mind is a BUZhash-based one, as it allows an arbitrary window width. And, maybe, SipHash for --line-bytes, if it shows a measurable performance gain over BUZhash given the limited number of candidate cut points there.

Would such a patch be considered for inclusion into split(1), or is this not generic enough for coreutils?

-- WBRBW, Leonid Evdokimov, https://darkk.net.ru tel:+79816800702 PGP: 6691 DE6B 4CCD C1C1 76A0 0D4A E1F2 A980 7F50 FAB2