Re: split behavior

Roger McNichols Mon, 14 Sep 2009 09:27:36 -0700


I found a machine with the old version of split.


home:~> uname -a
Linux home 2.2.13 #4 Thu May 8 23:11:31 CDT 2003 i686 unknown
home:~>
home:~> split --version
split (GNU textutils) 1.22
home:~>


Here's the result of 
home:~> cat /var/log/messages | split -2 - /tmp/x.

Not exactly as I recalled: instead of adding zz the first time, it adds za
(the two-letter names end with yz), then starts adding zz...  Anyway:

x.aa
x.ab
x.ac
x.ad
x.ae
x.af
x.ag
x.ah
x.ai
x.aj
...
x.yv
x.yw
x.yx
x.yy
x.yz
x.zaaa
x.zaab
x.zaac
x.zaad
x.zaae
x.zaaf
...
x.zyzt
x.zyzu
x.zyzv
x.zyzw
x.zyzx
x.zyzy
x.zyzz
x.zzaaaa
x.zzaaab
x.zzaaac
x.zzaaad
...
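
For reference, the pattern in the listing above can be reproduced with a few lines of Python. This is only a sketch inferred from the output shown here, not taken from the textutils source; the function name old_style_suffixes is made up for illustration. It assumes that stage k uses a prefix of k 'z' characters, then one letter from a..y, then k+1 letters from a..z.

```python
from itertools import count, islice, product
from string import ascii_lowercase


def old_style_suffixes():
    """Yield suffixes in the order seen above: aa..yz, zaaa..zyzz, zzaaaa.."""
    for k in count():
        prefix = "z" * k  # stage k is marked by k leading 'z' characters
        for first in ascii_lowercase[:-1]:  # a..y only; 'z' is kept for the prefix
            for rest in product(ascii_lowercase, repeat=k + 1):
                yield prefix + first + "".join(rest)


# Names around the first rollover (yz is the 650th name, index 649).
print(list(islice(old_style_suffixes(), 648, 652)))
# ['yy', 'yz', 'zaaa', 'zaab']
```

Under this assumption the names stay in plain lexical order, so a shell glob such as x.* lists the pieces in the order they were written.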


___________________________
Roger J. McNichols, Ph.D.
Chief Scientist
BioTex, Inc.
8058 El Rio St.
Houston, TX  77054
713.741.0111 (o)
713.741.0122 (f)
832.338.4371 (m)

----- Pádraig Brady <[email protected]> wrote:
> Roger McNichols wrote:
> > 
> > Thanks for the feedback.
> > 
> > 
> >> Do you mean select the appropriate suffix length based on size,
> >> or do you mean the zzaa, zzab scheme? The former wouldn't
> >> help when processing a pipe for example so I'd probably
> >> stick with the latter method for consistency.
> > 
> > Currently, split (at least 5.2.1) DOES pick the suffix size based on the
> > file size when used as "split -<#> file" and the file size is known.
> 
> I checked the repo and can't see code supporting that.
> Perhaps you've got a locally modified `split` ?
> 
> > But as you point out, if the file is a pipe you may still run out of
> > suffixes if the file size changes after invocation of split, or if split
> > is used in the "split -<#> -" (reads stdin) mode, a 2-letter suffix is all
> > you get unless you specify a length.
> > Now I suppose that maybe the discussion went something like:
> >   >> what if an unknown-sized input stream is the input?
> >   >> well then just use -a 100  and you will never* run out...
> >      (*note 26^100 is pretty big)
> > 
> > Anyway, I propose to develop a new commandline option that would invoke the
> > 'old' suffix formation behavior.  And even though aa ... zaa ... zzaa ...
> > instead of aa .. zzaa ... zzzzaa (as well as many other schemes) would work
> > just as well,
> 
> Bzzt. zaa would sort before zb.
> In general one needs to append 'z'*suffix_len, which would default to 2 if
> not specified.
> One would need to consider this behaviour with digit suffixes also.
> 
> > I propose to utilize the 'old' one for the added advantage of backward
> > compatibility.
> 
> OK. While I like the scheme it would be really nice to see what we're being
> compatible with. I.e. it would be great if you found where the old split you
> used came from.
> 
> > That way any code that relied on the old scheme for counting would be able
> > to be re-functionalized with a simple addition of a commandline argument.
> > 
> >> if the suffix len is specified and is too small.
> >> Otherwise we use the zzaa, zzab method as described before.
> > 
> > This is also a good idea, but it might override the user's intention, which
> > could be to use split to detect a file that was more than 676*N lines long
> > or to use it with the -1 option and only write out the first 676 lines of
> > the input
> 
> That's exceedingly unlikely. It would be great to have the "unlimited"
> behaviour by default I think. As mentioned before we could have the
> "limited" behaviour if POSIXLY_CORRECT is set.
> 
> > (who knows why, but we're fixing a fix that broke something else, right?)
> 
> I can't see the code for the old behaviour so I wouldn't assume that.
> 
> cheers,
> Pádraig.
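
The sorting objection in the quoted reply ("zaa would sort before zb") is easy to check. The snippet below is just an illustration with hand-picked sample names; the helper keeps_creation_order is hypothetical and is not part of split. It also confirms the 676 figure used above as the number of distinct 2-letter suffixes.

```python
def keeps_creation_order(names):
    """True if lexical sorting leaves the names in the order they were created."""
    return list(names) == sorted(names)


# Naive scheme: after zz, simply continue with three letters (zaa, zab, ...).
naive = ["zy", "zz", "zaa", "zab"]
# Scheme observed on textutils 1.22: stop at yz, then continue with zaaa.
observed = ["yy", "yz", "zaaa", "zaab"]

print(keeps_creation_order(naive))     # False: zaa sorts before zy and zz
print(keeps_creation_order(observed))  # True
print(26 ** 2)                         # 676 two-letter suffixes
```

Keeping 'z' out of the first position of each stage is what makes the longer names sort after all the shorter ones.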
