bug#20511: split : does not account for --numeric-suffixes=FROM in calculation of suffix length?

2015-05-12 Thread Pádraig Brady
On 06/05/15 11:53, Pádraig Brady wrote:
 On 06/05/15 05:29, Ben Rusholme wrote:
 As you say, this can always be fixed by the --suffix-length argument, but 
 it’s only required for certain combinations of FROM and CHUNK, (and “split” 
 already has all the information it needs).

 Now you could bump the suffix length based on the start number,
 though I don't think we should as that would impact on future
 processing (ordering) of the resultant files.  I.E. specifying
 a FROM value to --numeric-suffixes should only impact the
 start value, rather than the width.

 Could you clarify this for me? Doesn’t the zero-padding ensure correct 
 processing order?
 
 There are two use cases supported by specifying FROM.
 1. Setting the start for a single run (FROM is usually 1 in this case)
 2. Setting the offset for multiple independent split runs.
 In the second case we can't infer the size of the total set
 in any particular run, and thus require that --suffix-length is specified 
 appropriately.
 I.E. for multiple independent runs, the suffix length needs to be
 fixed width across the entire set for the total ordering to be correct.
 
 
 Things we could change are...
 
 1. Special case FROM=1 to assume a single run and thus
 enable auto suffix expansion or appropriately sized suffix with CHUNK.
 This would be a backwards incompat change and also not
 guaranteed a single run, so I'm reluctant to do that.
 
 2. Give an early error with specified FROM and CHUNK
 that would overflow the suffix size for CHUNK.
 This would save some processing, though doesn't add
 any protections against latent issues. I.E. you still get
 the error which is dependent on the parameters rather than the input data 
 size.
 Therefore it's probably not worth the complication.
 
 3. Leave suffix length at 2 when both FROM and CHUNK are specified.
 In retrospect, this would probably have been the best option
 to avoid ambiguities like this. However now we'd be breaking
 compat with scripts with FROM=1 and CHUNK=200 etc.,
 although CHUNK values > 100 would be unusual.
 
 4. Auto set the suffix len based on FROM + CHUNK.
 That would support use case 1 (single run),
 but _silently_ break subsequent processing order
 of outputs from multiple split runs
 (as FROM is increased in multiples of CHUNK size).
 We could mitigate the _silent_ breakage though
 by limiting this change to when FROM < CHUNK.
 
 5. Document in man page and with more detail in info docs
 that -a is recommended when specifying FROM
 
 So I'll do 4 and 5 I think.

Attached.

cheers,
Pádraig

From 4d5e6c4f4a2ba8407420e56282c0d4e37b2691ee Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?= p...@draigbrady.com
Date: Wed, 6 May 2015 01:48:40 +0100
Subject: [PATCH] split: auto set suffix len for --numeric-suffixes=N
 --number=N

Supporting `split --numeric-suffixes=1 -n100` for example.

* doc/coreutils.texi (split invocation): Mention the two
use cases for the FROM parameter, and the consequences on
the suffix length determination.
* src/split.c (set_suffix_length): Use the --numeric-suffixes
FROM parameter in the suffix width calculation, when it's
less than the number of files specified in --number.
* tests/split/suffix-auto-length.sh: Add test cases.
Fixes http://bugs.gnu.org/20511
---
 doc/coreutils.texi| 11 ---
 src/split.c   | 22 --
 tests/split/suffix-auto-length.sh | 21 -
 3 files changed, 44 insertions(+), 10 deletions(-)

diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index 51d96b4..f887e04 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -3181,9 +3181,14 @@ specified, will auto increase the length by 2 as required.
 @opindex --numeric-suffixes
 Use digits in suffixes rather than lower-case letters.  The numerical
 suffix counts from @var{from} if specified, 0 otherwise.
-Note specifying a @var{from} value also disables the default
-auto suffix length expansion described above, and so you may also
-want to specify @option{-a} to allow suffixes beyond @samp{99}.
+
+@var{from} is used to either set the initial suffix for a single run,
+or to set the suffix offset for independently split inputs, and consequently
+the auto suffix length expansion described above is disabled.  Therefore
+you may also want to use option @option{-a} to allow suffixes beyond @samp{99}.
+Note if option @option{--number} is specified and the number of files is less
+than @var{from}, a single run is assumed and the minimum suffix length
+required is automatically determined.
 
 @item --additional-suffix=@var{suffix}
 @opindex --additional-suffix
diff --git a/src/split.c b/src/split.c
index 5d6043f..b6fe2dd 100644
--- a/src/split.c
+++ b/src/split.c
@@ -39,6 +39,7 @@
 #include "sig2str.h"
 #include "xfreopen.h"
 #include "xdectoint.h"
+#include "xstrtol.h"
 
 /* The official name of this program (e.g., no 'g' prefix).  */
 #define PROGRAM_NAME "split"
@@ -173,9 +174,26 @@ set_suffix_length (uintmax_t 

bug#20511: split : does not account for --numeric-suffixes=FROM in calculation of suffix length?

2015-05-06 Thread Pádraig Brady
On 06/05/15 05:29, Ben Rusholme wrote:
 As you say, this can always be fixed by the --suffix-length argument, but 
 it’s only required for certain combinations of FROM and CHUNK, (and “split” 
 already has all the information it needs).
 
 Now you could bump the suffix length based on the start number,
 though I don't think we should as that would impact on future
 processing (ordering) of the resultant files.  I.E. specifying
 a FROM value to --numeric-suffixes should only impact the
 start value, rather than the width.
 
 Could you clarify this for me? Doesn’t the zero-padding ensure correct 
 processing order?

There are two use cases supported by specifying FROM.
1. Setting the start for a single run (FROM is usually 1 in this case)
2. Setting the offset for multiple independent split runs.
In the second case we can't infer the size of the total set
in any particular run, and thus require that --suffix-length is specified 
appropriately.
I.E. for multiple independent runs, the suffix length needs to be
fixed width across the entire set for the total ordering to be correct.
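
To illustrate the ordering concern in case 2, here is a small sketch.
The file names are made up to mimic two independent runs, one using the
default 2-digit suffix and one that needed 3 digits; split itself is not
run, only the names are sorted the way a shell glob would sort them.

```shell
# Illustrative names only: run 1 wrote x00..x99, run 2 wrote x100..x199.
mixed=$(printf '%s\n' x00 x99 x100 x199 | LC_ALL=C sort | paste -sd' ' -)

# The same four files if both runs had used a fixed width of 3 (-a3).
fixed=$(printf '%s\n' x000 x099 x100 x199 | LC_ALL=C sort | paste -sd' ' -)

echo "mixed widths: $mixed"   # x100 and x199 sort before x99
echo "fixed width:  $fixed"   # numeric and lexical order agree
```

With mixed widths, `cat x*` would therefore concatenate the pieces in
the wrong order, which is why -a has to be chosen for the whole set.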


Things we could change are...

1. Special case FROM=1 to assume a single run and thus
enable auto suffix expansion or appropriately sized suffix with CHUNK.
This would be a backwards incompat change and also not
guaranteed a single run, so I'm reluctant to do that.

2. Give an early error with specified FROM and CHUNK
that would overflow the suffix size for CHUNK.
This would save some processing, though doesn't add
any protections against latent issues. I.E. you still get
the error which is dependent on the parameters rather than the input data size.
Therefore it's probably not worth the complication.

3. Leave suffix length at 2 when both FROM and CHUNK are specified.
In retrospect, this would probably have been the best option
to avoid ambiguities like this. However now we'd be breaking
compat with scripts with FROM=1 and CHUNK=200 etc.,
although CHUNK values > 100 would be unusual.

4. Auto set the suffix len based on FROM + CHUNK.
That would support use case 1 (single run),
but _silently_ break subsequent processing order
of outputs from multiple split runs
(as FROM is increased in multiples of CHUNK size).
We could mitigate the _silent_ breakage though
by limiting this change to when FROM < CHUNK.

5. Document in man page and with more detail in info docs
that -a is recommended when specifying FROM

So I'll do 4 and 5 I think.
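
A rough sketch of the calculation option 4 describes, written as a shell
function for illustration. This is not the code in src/split.c:
suffix_len and its variable names are hypothetical, and the minimum
width of 2 is assumed from split's default suffix length.

```shell
# suffix_len FROM N: print the suffix width for N files numbered from FROM.
suffix_len() {
  from=$1
  n=$2
  last=$((n - 1))                # largest suffix when counting from 0
  if [ "$from" -lt "$n" ]; then  # option 4 applies only when FROM < CHUNK
    last=$((from + n - 1))       # largest suffix actually generated
  fi
  len=${#last}                   # number of decimal digits in $last
  if [ "$len" -lt 2 ]; then      # assumed default minimum width
    len=2
  fi
  echo "$len"
}

# suffix_len 0 100    prints 2  (x00 .. x99)
# suffix_len 1 100    prints 3  (x001 .. x100)
# suffix_len 100 100  prints 2  (FROM is not < CHUNK, so -a is still needed)
```

Leaving the FROM >= CHUNK case untouched is what keeps multiple
independent runs from silently getting different widths.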

cheers,
Pádraig.





bug#20511: split : does not account for --numeric-suffixes=FROM in calculation of suffix length?

2015-05-06 Thread Ben Rusholme
Hi,

 4. Auto set the suffix len based on FROM + CHUNK.
 That would support use case 1 (single run),
 but _silently_ break subsequent processing order
 of outputs from multiple split runs
 (as FROM is increased in multiples of CHUNK size).
 We could mitigate the _silent_ breakage though
 by limiting this change to when FROM < CHUNK.
 
 5. Document in man page and with more detail in info docs
 that -a is recommended when specifying FROM
 
 So I'll do 4 and 5 I think.

Thanks, that would solve the problem I was having.

Please feel free to end this conversation here, but if you can spare the time 
I’d be very interested in an example of a multiple split run for my own 
education/understanding/curiosity? I assume you mean processing subsets of the 
input, but can’t see how to do that (after experimenting on the command line 
and searching the documentation) except --number=l/k/n which does know the size 
of the total set?

Thanks again, Ben






bug#20511: split : does not account for --numeric-suffixes=FROM in calculation of suffix length?

2015-05-06 Thread Pádraig Brady
On 06/05/15 18:37, Ben Rusholme wrote:
 Hi,
 
 4. Auto set the suffix len based on FROM + CHUNK.
 That would support use case 1 (single run),
 but _silently_ break subsequent processing order
 of outputs from multiple split runs
 (as FROM is increased in multiples of CHUNK size).
 We could mitigate the _silent_ breakage though
 by limiting this change to when FROM < CHUNK.

 5. Document in man page and with more detail in info docs
 that -a is recommended when specifying FROM

 So I'll do 4 and 5 I think.
 
 Thanks, that would solve the problem I was having.
 
 Please feel free to end this conversation here, but if you can spare the time 
 I’d be very interested in an example of a multiple split run for my own 
 education/understanding/curiosity? I assume you mean processing subsets of 
 the input, but can’t see how to do that (after experimenting on the command 
 line and searching the documentation) except --number=l/k/n which does know 
 the size of the total set?

Well you could process subsets but even more simply
consider splitting a set of input files in 2,
to a set of output files.

  i=0
  for f in *.dat; do
    split -a4 --numeric=$i $f -n2; i=$(($i+2))
  done

(to be truly generic you would set the -a parameter
 based on the number of files and -n).
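
One possible way to derive -a from the number of files and -n, as a
sketch only. The helper names suffix_width and split_all are
hypothetical; the split options used (-a, --numeric-suffixes=FROM, -n)
are the ones discussed in this thread.

```shell
# suffix_width TOTAL: digits needed when TOTAL files are numbered from 0.
suffix_width() {
  last=$(( $1 - 1 ))
  echo "${#last}"
}

# split_all PARTS FILE...: split each FILE into PARTS pieces, numbering
# all pieces in one sequence with a single fixed suffix width.
split_all() {
  parts=$1
  shift
  width=$(suffix_width $(( $# * parts )))
  i=0
  for f in "$@"; do
    split -a "$width" --numeric-suffixes="$i" -n "$parts" "$f"
    i=$(( i + parts ))
  done
}
```

It would be called as `split_all 2 *.dat`; because the width is computed
once from the total, every run shares it and `cat x*` sees the pieces in
numeric order.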

cheers,
Pádraig.





bug#20511: split : does not account for --numeric-suffixes=FROM in calculation of suffix length?

2015-05-05 Thread Ben Rusholme
Hi,

 The info docs say about the --numeric-suffixes option:
 
  Note specifying a FROM value also disables the default auto suffix
  length expansion described above, and so you may also want to
  specify ‘-a’ to allow suffixes beyond ‘99’.

This does not seem to be the case, auto suffix works fine beyond 99  (in the 
current 8.23 release)?

$ seq 100 > input.txt
$ split --numeric-suffixes=1234 --number=l/5678 input.txt
$ ls | tail
x6902
x6903
x6904
x6905
x6906
x6907
x6908
x6909
x6910
x6911

It just fails wherever FROM pushes CHUNKS over a power of 10:

$ rm x*
$ split --numeric-suffixes --number=l/10000 input.txt
$ ls | tail -n 3
x9997
x9998
x9999
$
$ rm x*
$ split --numeric-suffixes=1 --number=l/10000 input.txt
split: output file suffixes exhausted
$ ls | tail -n 3
x9997
x9998
x9999
$ ls | head -n 3
input.txt
x0001
x0002
$
$ rm x*
$ split --numeric-suffixes=2 --number=l/9999 input.txt
split: output file suffixes exhausted
$ ls | tail -n 3
x9997
x9998
x9999
$ ls | head -n 3
input.txt
x0002
x0003

As you say, this can always be fixed by the --suffix-length argument, but 
it’s only required for certain combinations of FROM and CHUNK, (and “split” 
already has all the information it needs).


 Now you could bump the suffix length based on the start number,
 though I don't think we should as that would impact on future
 processing (ordering) of the resultant files.  I.E. specifying
 a FROM value to --numeric-suffixes should only impact the
 start value, rather than the width.

Could you clarify this for me? Doesn’t the zero-padding ensure correct 
processing order?  I assume the crucial test is the inverse operation:

$ cat x* > output.txt
$ diff input.txt output.txt
$

Thanks, Ben






bug#20511: split : does not account for --numeric-suffixes=FROM in calculation of suffix length?

2015-05-05 Thread Pádraig Brady
On 05/05/15 21:42, Ben Rusholme wrote:
 Hi,
 
 “split” (in the current GNU coreutils 8.23 release) does not account for the 
 optional start index (“split --numeric-suffixes=FROM”) when calculating 
 suffix length.
 
 I couldn’t find any prior reference to this problem in either the bug tracker 
 or mailing list archive.
 
 Thanks, Ben
 
 
 
 $ seq 100 > input.txt
 $ split --numeric-suffixes --number=l/100 input.txt
 $ ls
 input.txt  x06  x13  x20  x27  x34  x41  x48  x55  x62  x69  x76  x83  x90  x97
 x00        x07  x14  x21  x28  x35  x42  x49  x56  x63  x70  x77  x84  x91  x98
 x01        x08  x15  x22  x29  x36  x43  x50  x57  x64  x71  x78  x85  x92  x99
 x02        x09  x16  x23  x30  x37  x44  x51  x58  x65  x72  x79  x86  x93
 x03        x10  x17  x24  x31  x38  x45  x52  x59  x66  x73  x80  x87  x94
 x04        x11  x18  x25  x32  x39  x46  x53  x60  x67  x74  x81  x88  x95
 x05        x12  x19  x26  x33  x40  x47  x54  x61  x68  x75  x82  x89  x96
 
 
 $ rm x*
 $ split --numeric-suffixes=1 --number=l/100 input.txt
 split: output file suffixes exhausted
 $ ls
 input.txt  x07  x14  x21  x28  x35  x42  x49  x56  x63  x70  x77  x84  x91  x98
 x01        x08  x15  x22  x29  x36  x43  x50  x57  x64  x71  x78  x85  x92  x99
 x02        x09  x16  x23  x30  x37  x44  x51  x58  x65  x72  x79  x86  x93
 x03        x10  x17  x24  x31  x38  x45  x52  x59  x66  x73  x80  x87  x94
 x04        x11  x18  x25  x32  x39  x46  x53  x60  x67  x74  x81  x88  x95
 x05        x12  x19  x26  x33  x40  x47  x54  x61  x68  x75  x82  x89  x96
 x06        x13  x20  x27  x34  x41  x48  x55  x62  x69  x76  x83  x90  x97
 $ # Should run from x001 to x100!
 
 
 $ rm x*
 $ split --numeric-suffixes=1 --number=l/101 input.txt
 $ ls
 input.txt  x008  x016  x024  x032  x040  x048  x056  x064  x072  x080  x088  x096
 x001       x009  x017  x025  x033  x041  x049  x057  x065  x073  x081  x089  x097
 x002       x010  x018  x026  x034  x042  x050  x058  x066  x074  x082  x090  x098
 x003       x011  x019  x027  x035  x043  x051  x059  x067  x075  x083  x091  x099
 x004       x012  x020  x028  x036  x044  x052  x060  x068  x076  x084  x092  x100
 x005       x013  x021  x029  x037  x045  x053  x061  x069  x077  x085  x093  x101
 x006       x014  x022  x030  x038  x046  x054  x062  x070  x078  x086  x094
 x007       x015  x023  x031  x039  x047  x055  x063  x071  x079  x087  x095

The info docs say about the --numeric-suffixes option:

  Note specifying a FROM value also disables the default auto suffix
  length expansion described above, and so you may also want to
  specify ‘-a’ to allow suffixes beyond ‘99’.

Now also specifying the fixed number of files with --number
auto sets the suffix length based on the number. I.E. when
you specified -nl/101 it bumped the suffix length to 3.

Now you could bump the suffix length based on the start number,
though I don't think we should as that would impact on future
processing (ordering) of the resultant files.  I.E. specifying
a FROM value to --numeric-suffixes should only impact the
start value, rather than the width.

In other words if you were to split 2 files into 200 parts like:
  split --number=l/100 input1.txt
  split --numeric-suffixes=100 --number=l/100 input2.txt
Then you really need to be specifying -a3 to set
the suffix length appropriately.

We might be able to give an earlier error in this case,
and we should probably clarify the info docs a bit more.
I'll think about it.

cheers,
Pádraig.





bug#20511: split : does not account for --numeric-suffixes=FROM in calculation of suffix length?

2015-05-05 Thread Ben Rusholme
Hi,

“split” (in the current GNU coreutils 8.23 release) does not account for the 
optional start index (“split --numeric-suffixes=FROM”) when calculating suffix 
length.

I couldn’t find any prior reference to this problem in either the bug tracker 
or mailing list archive.

Thanks, Ben



$ seq 100 > input.txt
$ split --numeric-suffixes --number=l/100 input.txt
$ ls
input.txt  x06  x13  x20  x27  x34  x41  x48  x55  x62  x69  x76  x83  x90  x97
x00        x07  x14  x21  x28  x35  x42  x49  x56  x63  x70  x77  x84  x91  x98
x01        x08  x15  x22  x29  x36  x43  x50  x57  x64  x71  x78  x85  x92  x99
x02        x09  x16  x23  x30  x37  x44  x51  x58  x65  x72  x79  x86  x93
x03        x10  x17  x24  x31  x38  x45  x52  x59  x66  x73  x80  x87  x94
x04        x11  x18  x25  x32  x39  x46  x53  x60  x67  x74  x81  x88  x95
x05        x12  x19  x26  x33  x40  x47  x54  x61  x68  x75  x82  x89  x96


$ rm x*
$ split --numeric-suffixes=1 --number=l/100 input.txt
split: output file suffixes exhausted
$ ls
input.txt  x07  x14  x21  x28  x35  x42  x49  x56  x63  x70  x77  x84  x91  x98
x01        x08  x15  x22  x29  x36  x43  x50  x57  x64  x71  x78  x85  x92  x99
x02        x09  x16  x23  x30  x37  x44  x51  x58  x65  x72  x79  x86  x93
x03        x10  x17  x24  x31  x38  x45  x52  x59  x66  x73  x80  x87  x94
x04        x11  x18  x25  x32  x39  x46  x53  x60  x67  x74  x81  x88  x95
x05        x12  x19  x26  x33  x40  x47  x54  x61  x68  x75  x82  x89  x96
x06        x13  x20  x27  x34  x41  x48  x55  x62  x69  x76  x83  x90  x97
$ # Should run from x001 to x100!


$ rm x*
$ split --numeric-suffixes=1 --number=l/101 input.txt
$ ls
input.txt  x008  x016  x024  x032  x040  x048  x056  x064  x072  x080  x088  x096
x001       x009  x017  x025  x033  x041  x049  x057  x065  x073  x081  x089  x097
x002       x010  x018  x026  x034  x042  x050  x058  x066  x074  x082  x090  x098
x003       x011  x019  x027  x035  x043  x051  x059  x067  x075  x083  x091  x099
x004       x012  x020  x028  x036  x044  x052  x060  x068  x076  x084  x092  x100
x005       x013  x021  x029  x037  x045  x053  x061  x069  x077  x085  x093  x101
x006       x014  x022  x030  x038  x046  x054  x062  x070  x078  x086  x094
x007       x015  x023  x031  x039  x047  x055  x063  x071  x079  x087  x095