I got a bit of time for the review last night... This was your last interface change for this:
  -b, --bytes=SIZE        put SIZE bytes per output file\n\
+ -b, --bytes=/N          generate N output files\n\
+ -b, --bytes=K/N         print Kth of N chunks of file\n\
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file\n\
  -d, --numeric-suffixes  use numeric suffixes instead of alphabetic\n\
  -l, --lines=NUMBER      put NUMBER lines per output file\n\
+ -l, --lines=/N          generate N eol delineated output files\n\
+ -l, --lines=K/N         print Kth of N eol delineated chunks\n\
+ -n, --number=N          same as --bytes=/N\n\
+ -n, --number=K/N        same as --bytes=K/N\n\
+ -r, --round-robin=N     generate N eol delineated output files using\n\
+                         round-robin style distribution.\n\
+ -r. --round-robin=K/N   print Kth of N eol delineated chunk as -rN would\n\
+                         have generated.\n\
+ -t, --term=CHAR         specify CHAR as eol. This will also convert\n\
+                         -b to its line delineated equivalent (-C if\n\
+                         splitting normally, -l if splitting by\n\
+                         chunks). C escape sequences are accepted.\n\

Thinking more about it, I think adding 2 modes of operation to the already
slightly complicated -bCl options is too confusing. Since this is a separate
mode of operation (one would be specifying a particular number of files for a
different reason than a particular size), it would be better as a separate
option. So I changed -n to operate as follows. This is more general if we want
to add new split methods in the future, and it is also compatible with the
existing BSD -n without needing a redundant option.

  -n N        split into N files based on size of input
  -n K/N      output K of N to stdout
  -n l/N      split into N files while maintaining lines
  -n l/K/N    output K of N to stdout while maintaining lines
  -n r/N      like `l' but use round robin distribution instead of size
  -n r/K/N    likewise but only output K of N to stdout

Other changes I made in the attached version are:

  Removed the -t option, as that's a separate issue.
  Removed the erroneous 'c' from the getopt() parameters.
  Used K/N in the code rather than M/N to match the user instructions.
  Added a suffix length setter/checker based on N, so that we fail
  immediately if the wrong -a is specified, or auto set it if -a is
  not specified.
  Flagged 0/N as an error, rather than treating it like /N.
  Changed r/K/N to buffer using stdio, for much better performance (see below).
  Fixed up the errno passed to some error() calls.
  Normalized all "write error" messages so that all of these commands output
  a single translated error message, of the form
  "split: write error: No space left on device":

    split -n 1/10 $(which split) >/dev/full
    stdbuf -o0 split -n 1/10 $(which split) >/dev/full
    seq 10 | split -n r/1/10 >/dev/full
    seq 10 | stdbuf -o0 split -n r/1/10 >/dev/full

Re the performance of the round-robin implementation: using stdio helps a LOT,
as can be seen with:

-------------------------------------------------------
$ time yes | head -n10000000 | ./split-fwrite -n r/1/1 | wc -l
10000000

real    0m1.568s
user    0m1.486s
sys     0m0.072s

$ time yes | head -n10000000 | ./split-write -n r/1/1 | wc -l
10000000

real    0m50.988s
user    0m7.548s
sys     0m43.250s
-------------------------------------------------------

I still need to look at the round-robin implementation when outputting to
files rather than stdout. I may default to using stdio, but give an option
to flush each line.
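To make the stdio point concrete, here is a minimal, self-contained sketch
(not code from the patch; the USE_RAW_WRITE switch and the fixed "y\n" line
are illustrative assumptions) contrasting one write() per output line with
buffered fwrite() to stdout:

-------------------------------------------------------
/* Sketch only: emitting one short line per write() costs one syscall per
   line, while fwrite() to a stdio stream batches lines into buffer-sized
   writes, which is where the r/K/N speedup comes from.  */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main (void)
{
  const char *line = "y\n";
  size_t len = strlen (line);

  for (long i = 0; i < 10000000; i++)
    {
#ifdef USE_RAW_WRITE
      if (write (STDOUT_FILENO, line, len) != (ssize_t) len)  /* ~10M syscalls */
        return 1;
#else
      if (fwrite (line, len, 1, stdout) != 1)  /* buffered; far fewer syscalls */
        return 1;
#endif
    }
  return fflush (stdout) == EOF ? 1 : 0;
}
-------------------------------------------------------

Built with and without -DUSE_RAW_WRITE, it should show roughly the same
syscall-count difference reflected in the timings above.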
I'm testing with this currently, which is performing badly when just doing
write():

-------------------------------------------------------
#create fifos
yes | head -n4 | ../split -n r/4 fifo
for f in x*; do rm $f && mkfifo $f; done

#consumer
(for f in x*; do md5sum $f& done) > md5sum.out

#producer
seq 100000 | split -n r/4
-------------------------------------------------------

BTW, other modes perform well with write():

-------------------------------------------------------
$ yes | head -n10000000 > 10m.txt

$ time ./split -n l/1/1 <10m.txt | wc -l
10000000

real    0m0.201s
user    0m0.145s
sys     0m0.043s

$ time ./split -n 1/1 <10m.txt | wc -l
10000000

real    0m0.199s
user    0m0.154s
sys     0m0.041s

$ time ./split -n 1 <10m.txt

real    0m0.088s
user    0m0.000s
sys     0m0.081s
-------------------------------------------------------

Here is the stuff I intend to do before checking in:

  s/pread()/dd::skip()/ or at least add pread to bootstrap.conf
  fix the info docs for the reworked interface
  try to refactor the duplicated code

cheers,
Pádraig.
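P.S. For context on the pread() item in the TODO list above, here is a rough
sketch (an assumption about one possible fallback, not the gnulib module and
not anything in the attached patch) of why a plain lseek()+read() emulation
is not a drop-in replacement:

-------------------------------------------------------
/* Hypothetical pread()-like helper built from lseek()+read().  Unlike the
   real pread() it moves the file offset, and lseek() fails outright
   (ESPIPE) when stdin is a pipe, hence the need for a proper gnulib pread
   module or dd-style skipping instead.  */
#include <sys/types.h>
#include <unistd.h>

ssize_t
pread_fallback (int fd, void *buf, size_t count, off_t offset)
{
  if (lseek (fd, offset, SEEK_SET) < 0)
    return -1;
  return read (fd, buf, count);
}
-------------------------------------------------------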
>From 8a7fe06170ad8bb3050c4b6a43c9e51eb0ec22a7 Mon Sep 17 00:00:00 2001 From: Chen Guo <cheng...@yahoo.com> Date: Fri, 8 Jan 2010 03:42:27 -0800 Subject: [PATCH] split: add --number to generate a particular number of files * doc/coreutils.texi: update documentation of split. * src/split.c (usage, long_options, main): New options --number. (set_suffix_length): New function to auto increase suffix length to handle a specified number of files. (bytes_split): add max_files argument. This allows for trivial implementaton for byte chunking, similar to BSD. (lines_chunk_split): new function. Split file into chunks of lines. (bytes_chunk_extract): new function. Extract a chunk of file. (lines_chunk_extract): new function. Extract a chunk of lines. (of_info): new struct. Used by new functions lines_rr and ofd_check to keep track of file descriptors associated with output files. (ofd_check): new function. Shuffle file descriptors in case output files out number available file descriptors. (lines_rr): new function. Split file into chunks in round-robin fashion. (lines_rr_extract): new function. Extract a chunk of file, as if chunks were created in round-robin fashion. (chunk_parse): new function. Parses /N and K/N syntax. * tests/Makefile.am: add new tests. * misc/split-bchunk: new test for byte delineated chunking. * misc/split-fail: add failure scenarios for new options. * misc/split-l: change typo ln --version to split --version. * misc/split-lchunk: new test for line delineated chunking. * misc/split-rchunk: new test for round-robin chunking. --- doc/coreutils.texi | 48 ++++- src/split.c | 459 ++++++++++++++++++++++++++++++++++++++++++++++- tests/Makefile.am | 3 + tests/misc/split-bchunk | 46 +++++ tests/misc/split-fail | 3 +- tests/misc/split-l | 2 +- tests/misc/split-lchunk | 56 ++++++ tests/misc/split-rchunk | 53 ++++++ 8 files changed, 649 insertions(+), 21 deletions(-) create mode 100755 tests/misc/split-bchunk create mode 100755 tests/misc/split-lchunk create mode 100755 tests/misc/split-rchunk diff --git a/doc/coreutils.texi b/doc/coreutils.texi index e3e95f5..41b02be 100644 --- a/doc/coreutils.texi +++ b/doc/coreutils.texi @@ -104,7 +104,7 @@ * shuf: (coreutils)shuf invocation. Shuffling text files. * sleep: (coreutils)sleep invocation. Delay for a specified time. * sort: (coreutils)sort invocation. Sort text files. -* split: (coreutils)split invocation. Split into fixed-size pieces. +* split: (coreutils)split invocation. Split into pieces. * stat: (coreutils)stat invocation. Report file(system) status. * stdbuf: (coreutils)stdbuf invocation. Modify stdio buffering. * stty: (coreutils)stty invocation. Print/change terminal settings. @@ -2623,7 +2623,7 @@ These commands output pieces of the input. @menu * head invocation:: Output the first part of files. * tail invocation:: Output the last part of files. -* split invocation:: Split a file into fixed-size pieces. +* split invocation:: Split a file into pieces. * csplit invocation:: Split a file into context-determined pieces. @end menu @@ -2919,15 +2919,15 @@ mean either @samp{tail ./+4} or @samp{tail -n +4}. @node split invocation -...@section @command{split}: Split a file into fixed-size pieces +...@section @command{split}: Split a file into pieces. @pindex split @cindex splitting a file into pieces @cindex pieces, splitting a file into -...@command{split} creates output files containing consecutive sections of -...@var{input} (standard input if none is given or @var{input} is -...@samp{-}). 
Synopsis: +...@command{split} creates output files containing consecutive or interleaved +sections of @var{input} (standard input if none is given or @var{input} +is @samp{-}). Synopsis: @example split [...@var{option}] [...@var{input} [...@var{prefix}]] @@ -2940,10 +2940,9 @@ left over for the last section), into each output file. The output files' names consist of @var{prefix} (@samp{x} by default) followed by a group of characters (@samp{aa}, @samp{ab}, @dots{} by default), such that concatenating the output files in traditional -sorted order by file name produces -the original input file. If the output file names are exhausted, -...@command{split} reports an error without deleting the output files -that it did create. +sorted order by file name produces the original input file (except +...@option{-r}). If the output file names are exhausted, @command{split} +reports an error without deleting the output files that it did create. The program accepts the following options. Also see @ref{Common options}. @@ -2959,6 +2958,13 @@ For compatibility @command{split} also supports an obsolete option syntax @optio...@var{lines}}. New scripts should use @option{-l @var{lines}} instead. +...@item -l [...@var{k}]/@var{chunks} +...@item --line...@var{k}]/@var{chunks} +If @var{k} is zero or omitted, divide @var{input} into @var{chunks} +roughly equal-sized line delineated chunks. + +If @var{k} is present and nonzero, print @var{k}th of such chunks. + @item -b @var{size} @itemx --byt...@var{size} @opindex -b @@ -2966,6 +2972,13 @@ option syntax @optio...@var{lines}}. New scripts should use @option{-l Put @var{size} bytes of @var{input} into each output file. @multiplierSuffixes{size} +...@item -b [...@var{k}]/@var{chunks} +...@itemx --byte...@var{k}]/@var{chunks} +If @var{k} is zero or omitted, divide @var{input} into @var{chunks} +equal-sized chunks. + +If @var{k} is present and nonzero, print @var{k}th of such chunks. + @item -C @var{size} @itemx --line-byt...@var{size} @opindex -C @@ -2975,6 +2988,21 @@ possible without exceeding @var{size} bytes. Individual lines longer than @var{size} bytes are broken into multiple files. @var{size} has the same format as for the @option{--bytes} option. +...@item -n [...@var{k}]/]...@var{chunks} +...@itemx --number [...@var{k}]/]...@var{chunks} +...@opindex -n +...@opindex --number +Same as @option{--byte...@var{k}]/@var{chunks}}, for BSD compatibility. + +...@item -r [...@var{k}]/]...@var{chunks} +...@itemx --round-robin [...@var{k}]/]...@var{chunks} +...@opindex -r +...@opindex --round-robin +If @var{k} is zero or omitted, distribute @var{input} lines round-robin +style into @var{chunks} output files. + +If @var{k} is present and nonzero, print @var{k}th of such chunks. + @item -a @var{length} @itemx --suffix-leng...@var{length} @opindex -a diff --git a/src/split.c b/src/split.c index 5bd9ebb..83c127a 100644 --- a/src/split.c +++ b/src/split.c @@ -44,8 +44,6 @@ proper_name_utf8 ("Torbjorn Granlund", "Torbj\303\266rn Granlund"), \ proper_name ("Richard M. Stallman") -#define DEFAULT_SUFFIX_LENGTH 2 - /* Base name of output files. */ static char const *outbase; @@ -57,7 +55,7 @@ static char *outfile; static char *outfile_mid; /* Length of OUTFILE's suffix. */ -static size_t suffix_length = DEFAULT_SUFFIX_LENGTH; +static size_t suffix_length; /* Alphabet of characters to use in suffix. 
*/ static char const *suffix_alphabet = "abcdefghijklmnopqrstuvwxyz"; @@ -84,6 +82,7 @@ static struct option const longopts[] = {"bytes", required_argument, NULL, 'b'}, {"lines", required_argument, NULL, 'l'}, {"line-bytes", required_argument, NULL, 'C'}, + {"number", required_argument, NULL, 'n'}, {"suffix-length", required_argument, NULL, 'a'}, {"numeric-suffixes", no_argument, NULL, 'd'}, {"verbose", no_argument, NULL, VERBOSE_OPTION}, @@ -92,6 +91,32 @@ static struct option const longopts[] = {NULL, 0, NULL, 0} }; +static void +set_suffix_length (size_t n_units) +{ +#define DEFAULT_SUFFIX_LENGTH 2 + + size_t suffix_needed = 0; + size_t alphabet_len = strlen (suffix_alphabet); + bool alphabet_slop = (n_units % alphabet_len) != 0; + while (n_units /= alphabet_len) + suffix_needed++; + suffix_needed += alphabet_slop; + + if (suffix_length) /* set by user */ + { + if (suffix_length < suffix_needed) + { + error (EXIT_FAILURE, 0, + _("the suffix length needs to be at least %zu"), + suffix_needed); + } + return; + } + else + suffix_length = MAX (DEFAULT_SUFFIX_LENGTH, suffix_needed); +} + void usage (int status) { @@ -119,6 +144,7 @@ Mandatory arguments to long options are mandatory for short options too.\n\ -C, --line-bytes=SIZE put at most SIZE bytes of lines per output file\n\ -d, --numeric-suffixes use numeric suffixes instead of alphabetic\n\ -l, --lines=NUMBER put NUMBER lines per output file\n\ + -n, --number=CHUNKS generate CHUNKS output files. See below\n\ "), DEFAULT_SUFFIX_LENGTH); fputs (_("\ --verbose print a diagnostic just before each\n\ @@ -127,6 +153,15 @@ Mandatory arguments to long options are mandatory for short options too.\n\ fputs (HELP_OPTION_DESCRIPTION, stdout); fputs (VERSION_OPTION_DESCRIPTION, stdout); emit_size_note (); +fputs (_("\n\ +CHUNKS may be:\n\ +N split into N files based on size of input\n\ +K/N output K of N to stdout\n\ +l/N split into N files while maintaining lines\n\ +l/K/N output K of N to stdout while maintaining lines\n\ +r/N like `l' but use round robin distribution instead of size\n\ +r/K/N likewise but only output K of N to stdout\n\ +"), stdout); emit_ancillary_info (); } exit (status); @@ -218,13 +253,14 @@ cwrite (bool new_file_flag, const char *bp, size_t bytes) Use buffer BUF, whose size is BUFSIZE. */ static void -bytes_split (uintmax_t n_bytes, char *buf, size_t bufsize) +bytes_split (uintmax_t n_bytes, char *buf, size_t bufsize, uintmax_t max_files) { size_t n_read; bool new_file_flag = true; size_t to_read; uintmax_t to_write = n_bytes; char *bp_out; + uintmax_t opened = 1; do { @@ -251,7 +287,7 @@ bytes_split (uintmax_t n_bytes, char *buf, size_t bufsize) cwrite (new_file_flag, bp_out, w); bp_out += w; to_read -= w; - new_file_flag = true; + new_file_flag = !max_files || (opened++ < max_files); to_write = n_bytes; } } @@ -362,6 +398,329 @@ line_bytes_split (size_t n_bytes) free (buf); } +/* Split into NUMBER chunks of lines. */ + +static void +lines_chunk_split (size_t number, char *buf, size_t bufsize, size_t file_size) +{ + size_t n_read; + size_t chunk_no = 1; + off_t chunk_end = file_size / number - 1; + off_t offset = 0; + bool new_file_flag = true; + char *bp, *bp_out, *eob; + + while (offset < file_size) + { + n_read = full_read (STDIN_FILENO, buf, bufsize); + if (n_read == SAFE_READ_ERROR) + error (EXIT_FAILURE, errno, "%s", infile); + bp = buf; + eob = buf + n_read; + + while (1) + { + /* Begin looking for '\n' at last byte of chunk. */ + bp_out = (offset < chunk_end) ? 
bp + chunk_end - offset : bp; + if (bp_out > eob) + bp_out = eob; + bp_out = memchr (bp_out, '\n', eob - bp_out); + if (!bp_out) + { + /* Buffer exhausted. */ + cwrite (new_file_flag, bp, eob - bp); + new_file_flag = false; + offset += eob - bp; + break; + } + else + bp_out++; + + cwrite (new_file_flag, bp, bp_out - bp); + chunk_end = (++chunk_no < number) ? + chunk_end + file_size / number : file_size; + new_file_flag = true; + offset += bp_out - bp; + bp = bp_out; + /* A line could have been so long that it skipped + entire chunks. */ + while (chunk_end < offset) + { + chunk_end += file_size / number; + chunk_no++; + /* Create blank file: this ensures NUMBER files are + created. */ + cwrite (true, bp, 0); + } + } + } +} + +/* Extract Nth of TOTAL chunks. */ + +static void +bytes_chunk_extract (size_t n, size_t total, char *buf, size_t bufsize, + size_t file_size) +{ + off_t start = (n == 0) ? 0 : (n - 1) * (file_size / total); + off_t end = (n == total) ? file_size : n * (file_size / total); + ssize_t n_read; + size_t n_write; + + while (1) + { + n_read = pread (STDIN_FILENO, buf, bufsize, start); + if (n_read < 0) + error (EXIT_FAILURE, errno, "%s", infile); + n_write = (start + n_read <= end) ? n_read : end - start; + if (full_write (STDOUT_FILENO, buf, n_write) != n_write) + error (EXIT_FAILURE, errno, "%s", _("write error")); + start += n_read; + if (end <= start) + return; + } +} + +/* Extract lines whose first byte is in the Nth of TOTAL chunks. */ + +static void +lines_chunk_extract (size_t n, size_t total, char *buf, size_t bufsize, + size_t file_size) +{ + ssize_t n_read; + bool end_of_chunk = false; + bool skip = true; + char *bp = buf, *bp_out = buf, *eob; + off_t start; + off_t end; + + /* For n != 1, start reading 1 byte before nth chunk of file. This is to + detect if the first byte of chunk is the first byte of a line. */ + if (n == 1) + { + start = 0; + skip = false; + } + else + start = (n - 1) * (file_size / total) - 1; + end = (n == total) ? file_size - 1 : n * (file_size / total) - 1; + + do + { + n_read = pread (STDIN_FILENO, buf, bufsize, start); + if (n_read < 0) + error (EXIT_FAILURE, errno, "%s", infile); + bp = buf; + bp_out = buf + n_read; + eob = bp_out; + + /* Find starting point. */ + if (skip) + { + bp = memchr (buf, '\n', n_read); + if (bp && bp - buf < end - start) + { + bp++; + skip = false; + } + else if (!bp && start + n_read < end) + { + start += n_read; + continue; + } + else + return; + } + + /* Find ending point. */ + if (end < start + n_read && end == file_size - 1) + end_of_chunk = true; + else if (start + n_read >= end) + { + bp_out = (buf + end - start < buf) ? buf : buf + end - start; + bp_out = memchr (bp_out, '\n', eob - bp_out); + if (bp_out) + { + bp_out++; + end_of_chunk = true; + } + else + bp_out = eob; + } + + if (write (STDOUT_FILENO, bp, bp_out - bp) != bp_out - bp) + error (EXIT_FAILURE, errno, _("write error")); + start += n_read; + } + while (!end_of_chunk); +} + + + +typedef struct of_info +{ + char *of_name; + int ofd; +} of_t; + +/* Rotates file descriptors when we're writing to more output files than we + have available file descriptors. */ + +static void +ofd_check (of_t * ofiles, size_t i, size_t n) +{ + if (0 < ofiles[i].ofd) + return; + else + { + int fd; + int j = i - 1; + + /* Another process could have opened a file in between the calls to + close and open, so we should keep trying until open succeeds or + we've closed all of our files. */ + while (1) + { + /* Attempt to open file. 
*/ + fd = open (ofiles[i].of_name, + O_WRONLY | O_CREAT | O_TRUNC | O_BINARY, + (S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP + | S_IROTH | S_IWOTH)); + if (-1 < fd) + break; + /* Find an open file to close. */ + while (ofiles[j].ofd < 0) + { + if (--j == 0) + j = n - 1; + /* No more open files to close, exit with failure. */ + if (j == i) + error (EXIT_FAILURE, EMFILE, "%s", ofiles[i].of_name); + } + close (ofiles[j].ofd); + } + ofiles[i].ofd = fd; + } +} + +/* Divide file into N chunks in round robin fashion. */ + +static void +lines_rr (size_t n, char *buf, size_t bufsize) +{ + of_t *ofiles = xnmalloc (n, sizeof *ofiles); + char *bp, *bp_out, *eob; + size_t n_read; + bool eof = false; + bool nextfile = false; + size_t i; + + /* Generate output file names. */ + for (i = 0; i < n; i++) + { + next_file_name (); + ofiles[i].of_name = xstrdup (outfile); + ofiles[i].ofd = -1; + } + i = 0; + + do + { + n_read = full_read (STDIN_FILENO, buf, bufsize); + if (n_read == SAFE_READ_ERROR) + error (EXIT_FAILURE, errno, "%s", infile); + if (n_read < bufsize) + { + if (n_read == 0) + break; + eof = true; + } + bp = buf; + eob = buf + n_read; + + + while (bp != eob) + { + /* Find end of line. */ + bp_out = memchr (bp, '\n', eob - bp); + if (bp_out) + { + bp_out++; + nextfile = true; + } + else + bp_out = eob; + + /* Secure file descriptor. */ + ofd_check (ofiles, i, n); + + if (full_write (ofiles[i].ofd, bp, bp_out - bp) != bp_out - bp) + error (EXIT_FAILURE, errno, "%s", ofiles[i].of_name); + if (nextfile && ++i == n) + i = 0; + bp = bp_out; + nextfile = false; + } + } + while (!eof); + + /* Close any open file descriptors. */ + for (i = 0; i < n; i++) + if (-1 < ofiles[i].ofd) + close (ofiles[i].ofd); +} + +/* Extract Nth of TOT round robin distributed chunks of lines */ + +static void +lines_rr_extract (uintmax_t n, uintmax_t tot, char *buf, size_t bufsize) +{ + int line_no = 1; + char *bp, *bp_out, *eob; + size_t n_read; + bool eof = false; + bool inc = false; + + do + { + n_read = full_read (STDIN_FILENO, buf, bufsize); + if (n_read == SAFE_READ_ERROR) + error (EXIT_FAILURE, errno, "%s", infile); + if (n_read != bufsize) + { + if (n_read == 0) + break; + eof = true; + } + bp = buf; + eob = buf + n_read; + + while (bp != eob) + { + /* Find end of line. */ + bp_out = memchr (bp, '\n', eob - bp); + if (bp_out) + { + bp_out++; + inc = true; + } + else + bp_out = eob; + + if (line_no == n && fwrite (bp, bp_out - bp, 1, stdout) != 1) + { + clearerr (stdout); /* So close_stdout() doesn't also print. */ + error (EXIT_FAILURE, errno, _("write error")); + } + if (inc) + line_no = (line_no == tot) ? 1 : line_no + 1; + bp = bp_out; + inc = false; + } + } + while (!eof); +} + #define FAIL_ONLY_ONE_WAY() \ do \ { \ @@ -370,21 +729,47 @@ line_bytes_split (size_t n_bytes) } \ while (0) +/* Parse K/N syntax of chunk options. */ + +static void +chunk_parse (uintmax_t *k_units, uintmax_t *n_units, char *slash) +{ + *slash = '\0'; + if (slash != optarg /* a leading number is specified. 
*/ + && (xstrtoumax (optarg, NULL, 10, k_units, "") != LONGINT_OK + || *k_units == 0 || SIZE_MAX < *k_units)) + { + error (0, 0, _("%s: invalid chunk number"), optarg); + usage (EXIT_FAILURE); + } + if (xstrtoumax (++slash, NULL, 10, n_units, "") != LONGINT_OK + || *n_units == 0 || *n_units < *k_units || SIZE_MAX < *n_units) + { + error (0, 0, _("%s: invalid number of chunks"), slash); + usage (EXIT_FAILURE); + } +} + + int main (int argc, char **argv) { struct stat stat_buf; enum { - type_undef, type_bytes, type_byteslines, type_lines, type_digits + type_undef, type_bytes, type_byteslines, type_lines, type_digits, + type_chunk_bytes, type_chunk_lines, type_rr } split_type = type_undef; size_t in_blk_size; /* optimal block size of input file device */ char *buf; /* file i/o buffer */ size_t page_size = getpagesize (); + uintmax_t k_units = 0; uintmax_t n_units; static char const multipliers[] = "bEGKkMmPTYZ0"; int c; int digits_optind = 0; + size_t file_size; + char *slash; initialize_main (&argc, &argv); set_program_name (argv[0]); @@ -404,7 +789,7 @@ main (int argc, char **argv) /* This is the argv-index of the option we will read next. */ int this_optind = optind ? optind : 1; - c = getopt_long (argc, argv, "0123456789C:a:b:dl:", longopts, NULL); + c = getopt_long (argc, argv, "0123456789C:a:b:dl:n:", longopts, NULL); if (c == -1) break; @@ -459,6 +844,34 @@ main (int argc, char **argv) } break; + case 'n': + if (split_type != type_undef) + FAIL_ONLY_ONE_WAY (); + /* skip any whitespace */ + while (isspace (to_uchar (*optarg))) + optarg++; + if (strncmp (optarg, "r/", 2) == 0) + { + split_type = type_rr; + optarg += 2; + } + else if (strncmp (optarg, "l/", 2) == 0) + { + split_type = type_chunk_lines; + optarg += 2; + } + else + split_type = type_chunk_bytes; + if ((slash = strchr (optarg, '/'))) + chunk_parse (&k_units, &n_units, slash); + else if (xstrtoumax (optarg, NULL, 10, &n_units, "") != LONGINT_OK + || n_units == 0 || SIZE_MAX < n_units) + { + error (0, 0, _("%s: invalid number of chunks"), optarg); + usage (EXIT_FAILURE); + } + break; + case '0': case '1': case '2': @@ -514,10 +927,12 @@ main (int argc, char **argv) if (n_units == 0) { - error (0, 0, _("invalid number of lines: 0")); + error (0, 0, _("%s: invalid number of lines"), "0"); usage (EXIT_FAILURE); } + set_suffix_length (n_units); + /* Get out the filename arguments. 
*/ if (optind < argc) @@ -550,6 +965,11 @@ main (int argc, char **argv) if (fstat (STDIN_FILENO, &stat_buf) != 0) error (EXIT_FAILURE, errno, "%s", infile); in_blk_size = io_blksize (stat_buf); + file_size = stat_buf.st_size; + + if (split_type == type_chunk_bytes || split_type == type_chunk_lines) + if (file_size < n_units) + error (EXIT_FAILURE, 0, _("number of chunks exceed file size")); buf = ptr_align (xmalloc (in_blk_size + 1 + page_size - 1), page_size); @@ -561,13 +981,34 @@ main (int argc, char **argv) break; case type_bytes: - bytes_split (n_units, buf, in_blk_size); + bytes_split (n_units, buf, in_blk_size, 0); break; case type_byteslines: line_bytes_split (n_units); break; + case type_chunk_bytes: + if (k_units == 0) + bytes_split (file_size / n_units, buf, in_blk_size, n_units); + else + bytes_chunk_extract (k_units, n_units, buf, in_blk_size, file_size); + break; + + case type_chunk_lines: + if (k_units == 0) + lines_chunk_split (n_units, buf, in_blk_size, file_size); + else + lines_chunk_extract (k_units, n_units, buf, in_blk_size, file_size); + break; + + case type_rr: + if (k_units == 0) + lines_rr (n_units, buf, in_blk_size); + else + lines_rr_extract (k_units, n_units, buf, in_blk_size); + break; + default: abort (); } diff --git a/tests/Makefile.am b/tests/Makefile.am index 85503cc..c65f9dd 100644 --- a/tests/Makefile.am +++ b/tests/Makefile.am @@ -228,8 +228,11 @@ TESTS = \ misc/sort-rand \ misc/sort-version \ misc/split-a \ + misc/split-bchunk \ misc/split-fail \ misc/split-l \ + misc/split-lchunk \ + misc/split-rchunk \ misc/stat-fmt \ misc/stat-hyphen \ misc/stat-printf \ diff --git a/tests/misc/split-bchunk b/tests/misc/split-bchunk new file mode 100755 index 0000000..17d1f7e --- /dev/null +++ b/tests/misc/split-bchunk @@ -0,0 +1,46 @@ +#!/bin/sh +# show that splitting into 3 byte delineated chunks works. + +# Copyright (C) 2010 Free Software Foundation, Inc. + +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. + +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. + +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. + +if test "$VERBOSE" = yes; then + set -x + split --version +fi +. $srcdir/test-lib.sh + +printf '1\n2\n3\n4\n5\n' > in || framework_failure + +split -n 3 in > out || fail=1 +split -n 1/3 in > b1 || fail=1 +split -n 2/3 in > b2 || fail=1 +split -n 3/3 in > b3 || fail=1 +echo -n -e 1'\n'2 > exp-1 +echo -e '\n'3 > exp-2 +echo -e 4'\n'5 > exp-3 + +compare xaa exp-1 || fail=1 +compare xab exp-2 || fail=1 +compare xac exp-3 || fail=1 +compare b1 exp-1 || fail=1 +compare b2 exp-2 || fail=1 +compare b3 exp-3 || fail=1 +test -f xad && fail=1 + +# Splitting into more chunks than file size should fail. 
+split -n20 in 2> /dev/null && fail=1 + +Exit $fail diff --git a/tests/misc/split-fail b/tests/misc/split-fail index e36c86d..981673b 100755 --- a/tests/misc/split-fail +++ b/tests/misc/split-fail @@ -29,8 +29,10 @@ touch in || framework_failure split -a 0 in 2> /dev/null || fail=1 split -b 0 in 2> /dev/null && fail=1 +split -b /0 in 2> /dev/null && fail=1 split -C 0 in 2> /dev/null && fail=1 split -l 0 in 2> /dev/null && fail=1 +split -l /0 in 2> /dev/null && fail=1 # Make sure -C doesn't create empty files. rm -f x?? || fail=1 @@ -64,5 +66,4 @@ split: line count option -99*... is too large EOF compare out exp || fail=1 - Exit $fail diff --git a/tests/misc/split-l b/tests/misc/split-l index fb07a27..850d5b5 100755 --- a/tests/misc/split-l +++ b/tests/misc/split-l @@ -18,7 +18,7 @@ if test "$VERBOSE" = yes; then set -x - ln --version + split --version fi . $srcdir/test-lib.sh diff --git a/tests/misc/split-lchunk b/tests/misc/split-lchunk new file mode 100755 index 0000000..f672d3b --- /dev/null +++ b/tests/misc/split-lchunk @@ -0,0 +1,56 @@ +#!/bin/sh +# show that splitting into 3 newline delineated chunks works. + +# Copyright (C) 2010 Free Software Foundation, Inc. + +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. + +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. + +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. + +if test "$VERBOSE" = yes; then + set -x + ln --version +fi + +. $srcdir/test-lib.sh + +printf '1\n2\n3\n4\n5\n' > in || framework_failure + +split -n l/3 in > out || fail=1 +split -n l/1/3 in > l1 || fail=1 +split -n l/2/3 in > l2 || fail=1 +split -n l/3/3 in > l3 || fail=1 + +cat <<\EOF > exp-1 +1 +2 +EOF +cat <<\EOF > exp-2 +3 +EOF +cat <<\EOF > exp-3 +4 +5 +EOF + +compare xaa exp-1 || fail=1 +compare xab exp-2 || fail=1 +compare xac exp-3 || fail=1 +compare l1 exp-1 || fail=1 +compare l2 exp-2 || fail=1 +compare l3 exp-3 || fail=1 +test -f xad && fail=1 + +# Splitting into more chunks than file size should fail. +split -n l/20 in 2> /dev/null && fail=1 + +Exit $fail diff --git a/tests/misc/split-rchunk b/tests/misc/split-rchunk new file mode 100755 index 0000000..98e2f36 --- /dev/null +++ b/tests/misc/split-rchunk @@ -0,0 +1,53 @@ +#!/bin/sh +# show that splitting into 3 round-robin chunks works. + +# Copyright (C) 2010 Free Software Foundation, Inc. + +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. + +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. + +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. + +if test "$VERBOSE" = yes; then + set -x + ln --version +fi + +. 
$srcdir/test-lib.sh + +printf '1\n2\n3\n4\n5\n' > in || framework_failure + +split -n r/3 in > out || fail=1 +split -n r/1/3 in > r1 || fail=1 +split -n r/2/3 in > r2 || fail=1 +split -n r/3/3 in > r3 || fail=1 + +cat <<\EOF > exp-1 +1 +4 +EOF +cat <<\EOF > exp-2 +2 +5 +EOF +cat <<\EOF > exp-3 +3 +EOF + +compare xaa exp-1 || fail=1 +compare xab exp-2 || fail=1 +compare xac exp-3 || fail=1 +compare r1 exp-1 || fail=1 +compare r2 exp-2 || fail=1 +compare r3 exp-3 || fail=1 +test -f xad && fail=1 + +Exit $fail -- 1.6.2.5