Re: split behavior

Pádraig Brady Fri, 11 Sep 2009 20:53:34 -0700

Roger McNichols wrote:
> 
> Currently using version 5.2.1 of coreutils 'split' command produces files
> with 'intelligent' suffixes.  That is, the number of letters (or digits)
> required is based on the known number of output files that will be required.


Actually coreutils does not employ 'intelligent' suffixes: the size of
the input is not taken into account, and the suffix length defaults to 2.
One could set it 'intelligently' outside of split using something like
the following, though this should really be done within split:

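size=$(du -b "$file" | cut -f1)   # apparent size in bytes (GNU du)
chunk=4096                        # bytes per output file
suffix_len=$(
  python3 -c "
n = -(-$size // $chunk)           # number of output files, rounded up
length = 1
while 26 ** length < n:           # integer test avoids float log rounding
    length += 1
print(length)
"
)
split -b "$chunk" -a "$suffix_len" "$file"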
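
To make the arithmetic concrete, here is a rough standalone sketch of the
same calculation. The function name suffix_len and its alphabet parameter
are made up for illustration and are not part of split; it assumes
fixed-size byte chunks and a 26-letter alphabet.

```python
def suffix_len(size, chunk, alphabet=26):
    """Smallest suffix length whose name space covers every output file."""
    # Ceiling division gives the number of chunks; treat empty input as one file.
    nfiles = max(1, -(-size // chunk))
    length = 1
    # Grow the suffix until alphabet**length names are enough.
    while alphabet ** length < nfiles:
        length += 1
    return length


# 1,000,000 bytes in 4096-byte chunks is 245 files, which fits in 26**2 = 676.
print(suffix_len(1_000_000, 4096))        # 2
# One byte past 676 full chunks needs a 677th file, so a third letter.
print(suffix_len(676 * 4096 + 1, 4096))   # 3
```

The integer loop is used instead of a logarithm so that exact powers of 26
are not pushed up a length by floating point rounding.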

> An OLD version of split (and I don't know which one because I don't have it
> anymore) used 'dumb' suffixes.  That is, it would start with aa, ab, ac, ...,
> ba, bb, bc, ... until it got to zz and then would jump to zzaa, zzab, zzac,
> ... etc and then on to zzaaaa, zzaaab, zzaaac, etc...

I think I've seen this method before, but it's not in Solaris,
FreeBSD or alexautils? Grr, that's bugging me now.
Whatever implementation of split that was, it seems like a
good way to split arbitrarily sized input while keeping the
file names in lexical sort order.

Also if the file size _is_ known but a suffix length that's too short
is specified, one could use this algorithm to ensure that you don't
get the "suffixes exhausted" error.
In fact, for consistency it would probably be better to always default
to 2 as the suffix len, and fall back to this zzaa suffix scheme rather
than "intelligently" select the suffix length as described above.
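
As a minimal sketch of how such a fallback could generate names (this is
not split's code, and the function name suffixes is invented here), the
generator below assumes one reading of the scheme: every time a tier of
two-letter names runs out, another "zz" is prepended, giving aa..zz, then
zzaa..zzzz, then zzzzaa and so on. Under that assumption the names never
run out, and they stay unique and in lexical order.

```python
from itertools import count, islice, product
from string import ascii_lowercase


def suffixes(width=2):
    """Yield aa..zz, then zzaa..zzzz, then zzzzaa.., without end."""
    for level in count():
        # Each exhausted tier prepends another run of 'z' of the same width.
        prefix = "z" * (width * level)
        for letters in product(ascii_lowercase, repeat=width):
            yield prefix + "".join(letters)


# Take two full tiers plus the first three names of the third tier.
names = list(islice(suffixes(), 2 * 26**2 + 3))
print(names[0], names[675], names[676], names[1351], names[1352])
# aa zz zzaa zzzz zzzzaa
print(names == sorted(names))   # True
```

The last name of each tier is a prefix of every name in the next tier,
which is what keeps a plain sort of the output files in creation order.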

I'll look at doing this soon.

thanks,
Pádraig.
