On 06/05/15 11:53, Pádraig Brady wrote: > On 06/05/15 05:29, Ben Rusholme wrote: >> As you say, this can always be fixed by the "--suffix-length" argument, but >> it’s only required for certain combinations of FROM and CHUNK, (and “split” >> already has all the information it needs). >> >>> Now you could bump the suffix length based on the start number, >>> though I don't think we should as that would impact on future >>> processing (ordering) of the resultant files. I.E. specifying >>> a FROM value to --numeric-suffixes should only impact the >>> start value, rather than the width. >> >> Could you clarify this for me? Doesn’t the zero-padding ensure correct >> processing order? > > There are two use cases supported by specifying FROM. > 1. Setting the start for a single run (FROM is usually 1 in this case) > 2. Setting the offset for multiple independent split runs. > In the second case we can't infer the size of the total set > in any particular run, and thus require that --suffix-length is specified > appropriately. > I.E. for multiple independent runs, the suffix length needs to be > fixed width across the entire set for the total ordering to be correct. > > > Things we could change are... > > 1. Special case FROM=1 to assume a single run and thus > enable auto suffix expansion or appropriately sized suffix with CHUNK. > This would be a backwards incompat change and also not > guaranteed a single run, so I'm reluctant to do that. > > 2. Give an early error with specified FROM and CHUNK > that would overflow the suffix size for CHUNK. > This would save some processing, though doesn't add > any protections against latent issues. I.E. you still get > the error which is dependent on the parameters rather than the input data > size. > Therefore it's probably not worth the complication. > > 3. Leave suffix length at 2 when both FROM and CHUNK are specified. > In retrospect, this would probably have been the best option > to avoid ambiguities like this. However now we'd be breaking > compat with scripts with FROM=1 and CHUNK=200 etc. > While CHUNK values > 100 would be unusual > > 4. Auto set the suffix len based on FROM + CHUNK. > That would support use case 1 (single run), > but _silently_ break subsequent processing order > of outputs from multiple split runs > (as FROM is increased in multiples of CHUNK size). > We could mitigate the _silent_ breakage though > by limiting this change to when FROM < CHUNK. > > 5. Document in man page and with more detail in info docs > that -a is recommended when specifying FROM > > So I'll do 4 and 5 I think.
Attached. cheers, Pádraig
>From 4d5e6c4f4a2ba8407420e56282c0d4e37b2691ee Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?P=C3=A1draig=20Brady?= <p...@draigbrady.com> Date: Wed, 6 May 2015 01:48:40 +0100 Subject: [PATCH] split: auto set suffix len for --numeric-suffixes=<N --number=N Supporting `split --numeric-suffixes=1 -n100` for example. * doc/coreutils.texi (split invocation): Mention the two use cases for the FROM parameter, and the consequences on the suffix length determination. * src/split.c (set_suffix_length): Use the --numeric-suffixes FROM parameter in the suffix width calculation, when it's less than the number of files specified in --number. * tests/split/suffix-auto-length.sh: Add test cases. Fixes http://bugs.gnu.org/20511 --- doc/coreutils.texi | 11 ++++++++--- src/split.c | 22 ++++++++++++++++++++-- tests/split/suffix-auto-length.sh | 21 ++++++++++++++++----- 3 files changed, 44 insertions(+), 10 deletions(-) diff --git a/doc/coreutils.texi b/doc/coreutils.texi index 51d96b4..f887e04 100644 --- a/doc/coreutils.texi +++ b/doc/coreutils.texi @@ -3181,9 +3181,14 @@ specified, will auto increase the length by 2 as required. @opindex --numeric-suffixes Use digits in suffixes rather than lower-case letters. The numerical suffix counts from @var{from} if specified, 0 otherwise. -Note specifying a @var{from} value also disables the default -auto suffix length expansion described above, and so you may also -want to specify @option{-a} to allow suffixes beyond @samp{99}. + +@var{from} is used to either set the initial suffix for a single run, +or to set the suffix offset for independently split inputs, and consequently +the auto suffix length expansion described above is disabled. Therefore +you may also want to use option @option{-a} to allow suffixes beyond @samp{99}. +Note if option @option{--number} is specified and the number of files is less +than @var{from}, a single run is assumed and the minimum suffix length +required is automatically determined. @item --additional-suffix=@var{suffix} @opindex --additional-suffix diff --git a/src/split.c b/src/split.c index 5d6043f..b6fe2dd 100644 --- a/src/split.c +++ b/src/split.c @@ -39,6 +39,7 @@ #include "sig2str.h" #include "xfreopen.h" #include "xdectoint.h" +#include "xstrtol.h" /* The official name of this program (e.g., no 'g' prefix). */ #define PROGRAM_NAME "split" @@ -173,9 +174,26 @@ set_suffix_length (uintmax_t n_units, enum Split_type split_type) if (split_type == type_chunk_bytes || split_type == type_chunk_lines || split_type == type_rr) { + uintmax_t n_units_end = n_units; + if (numeric_suffix_start) + { + uintmax_t n_start; + strtol_error e = xstrtoumax (numeric_suffix_start, NULL, 10, + &n_start, ""); + if (e == LONGINT_OK && n_start <= UINTMAX_MAX - n_units) + { + /* Restrict auto adjustment so we don't keep + incrementing a suffix size arbitrarily, + as that would break sort order for files + generated from multiple split runs. */ + if (n_start < n_units) + n_units_end += n_start; + } + + } size_t alphabet_len = strlen (suffix_alphabet); - bool alphabet_slop = (n_units % alphabet_len) != 0; - while (n_units /= alphabet_len) + bool alphabet_slop = (n_units_end % alphabet_len) != 0; + while (n_units_end /= alphabet_len) suffix_needed++; suffix_needed += alphabet_slop; suffix_auto = false; diff --git a/tests/split/suffix-auto-length.sh b/tests/split/suffix-auto-length.sh index 6ee2200..6ef221a 100755 --- a/tests/split/suffix-auto-length.sh +++ b/tests/split/suffix-auto-length.sh @@ -1,5 +1,5 @@ #!/bin/sh -# Test the suffix auto widening functionality +# Test the suffix auto width functionality # Copyright (C) 2012-2015 Free Software Foundation, Inc. @@ -20,14 +20,14 @@ print_ver_ split -# ensure this feature is off when start number specified -truncate -s12 file.in +# ensure auto widening is off when start number specified +truncate -s12 file.in || framework_failure_ returns_ 1 split file.in -b1 --numeric=89 || fail=1 test "$(ls -1 x* | wc -l)" = 11 || fail=1 rm -f x* -# ensure this feature works when no start num specified -truncate -s91 file.in +# ensure auto widening works when no start num specified +truncate -s91 file.in || framework_failure_ for prefix in 'x' 'xx' ''; do for add_suffix in '.txt' ''; do split file.in "$prefix" -b1 --numeric --additional-suffix="$add_suffix" \ @@ -39,4 +39,15 @@ for prefix in 'x' 'xx' ''; do done done +# ensure auto width with --number and start num < number of files +# That's the single run use case which is valid to adjust suffix len +truncate -s100 file.in || framework_failure_ +split --numeric-suffixes=1 --number=r/100 file.in || fail=1 +rm -f x* + +# ensure no auto width with --number and start num >= number of files +# That's the multi run use case which is invalid to adjust suffix len +# as that would result in an incorrect order for the total output file set +returns_ 1 split --numeric-suffixes=100 --number=r/100 file.in || fail=1 + Exit $fail -- 2.3.4