bug#20511: split : does not account for --numeric-suffixes=FROM in calculation of suffix length?

Pádraig Brady Tue, 12 May 2015 18:22:33 -0700

On 06/05/15 11:53, Pádraig Brady wrote:
> On 06/05/15 05:29, Ben Rusholme wrote:
>> As you say, this can always be fixed by the "--suffix-length" argument, but 
>> it’s only required for certain combinations of FROM and CHUNK, (and “split” 
>> already has all the information it needs).
>>
>>> Now you could bump the suffix length based on the start number,
>>> though I don't think we should as that would impact on future
>>> processing (ordering) of the resultant files.  I.E. specifying
>>> a FROM value to --numeric-suffixes should only impact the
>>> start value, rather than the width.
>>
>> Could you clarify this for me? Doesn’t the zero-padding ensure correct 
>> processing order?
> 
> There are two use cases supported by specifying FROM.
> 1. Setting the start for a single run (FROM is usually 1 in this case)
> 2. Setting the offset for multiple independent split runs.
> In the second case we can't infer the size of the total set
> in any particular run, and thus require that --suffix-length is specified 
> appropriately.
> I.E. for multiple independent runs, the suffix length needs to be
> fixed width across the entire set for the total ordering to be correct.
> 
> 
> Things we could change are...
> 
> 1. Special case FROM=1 to assume a single run and thus
> enable auto suffix expansion or appropriately sized suffix with CHUNK.
> This would be a backwards incompat change and also not
> guaranteed a single run, so I'm reluctant to do that.
> 
> 2. Give an early error with specified FROM and CHUNK
> that would overflow the suffix size for CHUNK.
> This would save some processing, though doesn't add
> any protections against latent issues. I.E. you still get
> the error which is dependent on the parameters rather than the input data 
> size.
> Therefore it's probably not worth the complication.
> 
> 3. Leave suffix length at 2 when both FROM and CHUNK are specified.
> In retrospect, this would probably have been the best option
> to avoid ambiguities like this. However now we'd be breaking
> compat with scripts with FROM=1 and CHUNK=200 etc.
> While CHUNK values > 100 would be unusual
> 
> 4. Auto set the suffix len based on FROM + CHUNK.
> That would support use case 1 (single run),
> but _silently_ break subsequent processing order
> of outputs from multiple split runs
> (as FROM is increased in multiples of CHUNK size).
> We could mitigate the _silent_ breakage though
> by limiting this change to when FROM < CHUNK.
> 
> 5. Document in man page and with more detail in info docs
> that -a is recommended when specifying FROM
> 
> So I'll do 4 and 5 I think.


Attached.

cheers,
Pádraig

>From 4d5e6c4f4a2ba8407420e56282c0d4e37b2691ee Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?= <p...@draigbrady.com>
Date: Wed, 6 May 2015 01:48:40 +0100
Subject: [PATCH] split: auto set suffix len for --numeric-suffixes=<N
 --number=N

Supporting `split --numeric-suffixes=1 -n100` for example.

* doc/coreutils.texi (split invocation): Mention the two
use cases for the FROM parameter, and the consequences on
the suffix length determination.
* src/split.c (set_suffix_length): Use the --numeric-suffixes
FROM parameter in the suffix width calculation, when it's
less than the number of files specified in --number.
* tests/split/suffix-auto-length.sh: Add test cases.
Fixes http://bugs.gnu.org/20511
---
 doc/coreutils.texi                | 11 ++++++++---
 src/split.c                       | 22 ++++++++++++++++++++--
 tests/split/suffix-auto-length.sh | 21 ++++++++++++++++-----
 3 files changed, 44 insertions(+), 10 deletions(-)

diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index 51d96b4..f887e04 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -3181,9 +3181,14 @@ specified, will auto increase the length by 2 as required.
 @opindex --numeric-suffixes
 Use digits in suffixes rather than lower-case letters.  The numerical
 suffix counts from @var{from} if specified, 0 otherwise.
-Note specifying a @var{from} value also disables the default
-auto suffix length expansion described above, and so you may also
-want to specify @option{-a} to allow suffixes beyond @samp{99}.
+
+@var{from} is used to either set the initial suffix for a single run,
+or to set the suffix offset for independently split inputs, and consequently
+the auto suffix length expansion described above is disabled.  Therefore
+you may also want to use option @option{-a} to allow suffixes beyond @samp{99}.
+Note if option @option{--number} is specified and the number of files is less
+than @var{from}, a single run is assumed and the minimum suffix length
+required is automatically determined.
 
 @item --additional-suffix=@var{suffix}
 @opindex --additional-suffix
diff --git a/src/split.c b/src/split.c
index 5d6043f..b6fe2dd 100644
--- a/src/split.c
+++ b/src/split.c
@@ -39,6 +39,7 @@
 #include "sig2str.h"
 #include "xfreopen.h"
 #include "xdectoint.h"
+#include "xstrtol.h"
 
 /* The official name of this program (e.g., no 'g' prefix).  */
 #define PROGRAM_NAME "split"
@@ -173,9 +174,26 @@ set_suffix_length (uintmax_t n_units, enum Split_type split_type)
   if (split_type == type_chunk_bytes || split_type == type_chunk_lines
       || split_type == type_rr)
     {
+      uintmax_t n_units_end = n_units;
+      if (numeric_suffix_start)
+        {
+          uintmax_t n_start;
+          strtol_error e = xstrtoumax (numeric_suffix_start, NULL, 10,
+                                       &n_start, "");
+          if (e == LONGINT_OK && n_start <= UINTMAX_MAX - n_units)
+            {
+              /* Restrict auto adjustment so we don't keep
+                 incrementing a suffix size arbitrarily,
+                 as that would break sort order for files
+                 generated from multiple split runs.  */
+              if (n_start < n_units)
+                n_units_end += n_start;
+            }
+
+        }
       size_t alphabet_len = strlen (suffix_alphabet);
-      bool alphabet_slop = (n_units % alphabet_len) != 0;
-      while (n_units /= alphabet_len)
+      bool alphabet_slop = (n_units_end % alphabet_len) != 0;
+      while (n_units_end /= alphabet_len)
         suffix_needed++;
       suffix_needed += alphabet_slop;
       suffix_auto = false;
diff --git a/tests/split/suffix-auto-length.sh b/tests/split/suffix-auto-length.sh
index 6ee2200..6ef221a 100755
--- a/tests/split/suffix-auto-length.sh
+++ b/tests/split/suffix-auto-length.sh
@@ -1,5 +1,5 @@
 #!/bin/sh
-# Test the suffix auto widening functionality
+# Test the suffix auto width functionality
 
 # Copyright (C) 2012-2015 Free Software Foundation, Inc.
 
@@ -20,14 +20,14 @@
 print_ver_ split
 
 
-# ensure this feature is off when start number specified
-truncate -s12 file.in
+# ensure auto widening is off when start number specified
+truncate -s12 file.in || framework_failure_
 returns_ 1 split file.in -b1 --numeric=89 || fail=1
 test "$(ls -1 x* | wc -l)" = 11 || fail=1
 rm -f x*
 
-# ensure this feature works when no start num specified
-truncate -s91 file.in
+# ensure auto widening works when no start num specified
+truncate -s91 file.in || framework_failure_
 for prefix in 'x' 'xx' ''; do
     for add_suffix in '.txt' ''; do
       split file.in "$prefix" -b1 --numeric --additional-suffix="$add_suffix" \
@@ -39,4 +39,15 @@ for prefix in 'x' 'xx' ''; do
     done
 done
 
+# ensure auto width with --number and start num < number of files
+# That's the single run use case which is valid to adjust suffix len
+truncate -s100 file.in || framework_failure_
+split --numeric-suffixes=1 --number=r/100 file.in || fail=1
+rm -f x*
+
+# ensure no auto width with --number and start num >= number of files
+# That's the multi run use case which is invalid to adjust suffix len
+# as that would result in an incorrect order for the total output file set
+returns_ 1 split --numeric-suffixes=100 --number=r/100 file.in || fail=1
+
 Exit $fail
-- 
2.3.4

bug#20511: split : does not account for --numeric-suffixes=FROM in calculation of suffix length?

Reply via email to