Re: [coreutils] coreutils-8.10 next week

Pádraig Brady Fri, 28 Jan 2011 14:49:21 -0800

On 28/01/11 18:10, Jim Meyering wrote:
> I would like to release coreutils-8.10 next week, including the changes
> on the fiemap-copy-2 branch (shortly to be extended).  I'll make a test
> release as soon as the FIEMAP changes have been merged to master.
> 
> If anyone has additional changes that they would like to see included,
> please let us know.


I intend to apply the attached:

join: ensure --header skips the order check with empty files
join: don't report disorder against an empty file
join: add -o 'auto' to output a constant number of fields per line
doc: add alternatives for field processing not supported by cut

cheers,
Pádraig.

From c36b20e51f028d13d06c27d86c0cf009313bacd3 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?= <[email protected]>
Date: Fri, 14 Jan 2011 08:46:21 +0000
Subject: [PATCH] join: ensure --header skips the order check with empty files

* src/join.c: Skip the header even if one of the files is empty.
* tests/misc/join: Add a test case.
* NEWS: Mention the fix.
---
 NEWS            |    3 +++
 src/join.c      |   12 ++++++++----
 tests/misc/join |    6 ++++++
 3 files changed, 17 insertions(+), 4 deletions(-)
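
A quick way to see the effect from a shell, as a sketch that assumes a
join built with this patch is first in PATH.  The scratch directory and
the f1/f2 file names are made up for illustration; the inputs and options
mirror the new header-6 test case.

```shell
# Hypothetical scratch files; only the join invocation mirrors the test case.
tmp=$(mktemp -d)
printf 'ID1 Name\n1 A\n' > "$tmp/f1"
: > "$tmp/f2"                 # empty second input, so it has no header line

# -a1 also prints the unpairable lines from file 1.
header_out=$(join --header -a1 "$tmp/f1" "$tmp/f2")
printf '%s\n' "$header_out"

rm -rf "$tmp"
```

The header-6 test case expects the header "ID1 Name" followed by "1 A",
with no disorder diagnostic and a zero exit status.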

diff --git a/NEWS b/NEWS
index 9ccad63..82565b0 100644
--- a/NEWS
+++ b/NEWS
@@ -13,6 +13,9 @@ GNU coreutils NEWS                                    -*- outline -*-
   rm -f no longer fails for EINVAL or EILSEQ on file systems that
   reject file names invalid for that file system.
 
+  join --header now skips the ordering check for the first line
+  even if the other file is empty.
+
 
 * Noteworthy changes in release 8.9 (2011-01-04) [stable]
 
diff --git a/src/join.c b/src/join.c
index afda5a1..07ac856 100644
--- a/src/join.c
+++ b/src/join.c
@@ -627,13 +627,17 @@ join (FILE *fp1, FILE *fp2)
   initseq (&seq2);
   getseq (fp2, &seq2, 2);
 
-  if (join_header_lines && seq1.count && seq2.count)
+  if (join_header_lines && (seq1.count || seq2.count))
     {
-      prjoin (seq1.lines[0], seq2.lines[0]);
+      struct line const *hline1 = seq1.count ? seq1.lines[0] : &uni_blank;
+      struct line const *hline2 = seq2.count ? seq2.lines[0] : &uni_blank;
+      prjoin (hline1, hline2);
       prevline[0] = NULL;
       prevline[1] = NULL;
-      advance_seq (fp1, &seq1, true, 1);
-      advance_seq (fp2, &seq2, true, 2);
+      if (seq1.count)
+        advance_seq (fp1, &seq1, true, 1);
+      if (seq2.count)
+        advance_seq (fp2, &seq2, true, 2);
     }
 
   while (seq1.count && seq2.count)
diff --git a/tests/misc/join b/tests/misc/join
index 3696a03..0299427 100755
--- a/tests/misc/join
+++ b/tests/misc/join
@@ -219,6 +219,12 @@ my @tv = (
  [ "ID1 Name\n1 A\n2 B\n", "ID2 Color\n1 red\n"],
    "ID1 Name Color\n1 A red\n", 0],
 
+# '--header' doesn't check order of a header
+# even if there is no header in the second file
+['header-6', '--header -a1',
+ [ "ID1 Name\n1 A\n", ""],
+   "ID1 Name\n1 A\n", 0],
+
 );
 
 # Convert the above old-style test vectors to the newer
-- 
1.7.3.4

From 1ebde7af7d1e122bcb2d0935d0b37af21156f357 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?= <[email protected]>
Date: Thu, 27 Jan 2011 07:17:16 +0000
Subject: [PATCH] join: don't report disorder against an empty file

This allows one to use join as a field extractor like:
  join -a1 -o 1.3,1.1 - /dev/null

* src/join.c (join): Don't flag unpairable lines when
one of the files is empty.
* tests/misc/join: Add a new test for empty input, and adjust
a previous test that was only checking against empty input.
* doc/coreutils.texi (join invocation): Document the change.
* NEWS: Likewise.
---
 NEWS               |    6 ++++++
 doc/coreutils.texi |   18 +++++++++++++-----
 src/join.c         |    8 +++++---
 tests/misc/join    |    6 +++++-
 4 files changed, 29 insertions(+), 9 deletions(-)
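
The field-extractor use mentioned in the log message can be tried as
below.  This is only a sketch, assuming a join with this change applied;
the two input lines are invented and deliberately not sorted on the join
field, and the variable names are just for illustration.

```shell
# Unsorted input on stdin; /dev/null serves as the empty second file.
# -a1 prints every (unpairable) line of file 1, and -o picks the fields.
extract_out=$(printf 'x y z\na b c\n' | join -a1 -o 1.3,1.1 - /dev/null)
extract_status=$?             # exit status of join itself

printf '%s\n' "$extract_out"
printf 'exit status: %s\n' "$extract_status"
```

Since the second file is empty, no "not in sorted order" diagnostic
should be issued, and each input line comes out as its third field
followed by its first.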

diff --git a/NEWS b/NEWS
index 9ccad63..422bbe6 100644
--- a/NEWS
+++ b/NEWS
@@ -13,6 +13,12 @@ GNU coreutils NEWS                                    -*- outline -*-
   rm -f no longer fails for EINVAL or EILSEQ on file systems that
   reject file names invalid for that file system.
 
+** Changes in behavior
+
+  join no longer reports disorder when one of the files is empty.
+  This allows one to use join as a field extractor like:
+  join -a1 -o 1.3,1.1 - /dev/null
+
 
 * Noteworthy changes in release 8.9 (2011-01-04) [stable]
 
diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index 85d5201..c2a7580 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -4761,11 +4761,17 @@ If there is an error it exits with nonzero status.
 @macro checkOrderOption{cmd}
 If the @option{--check-order} option is given, unsorted inputs will
 cause a fatal error message.  If the option @option{--nocheck-order}
-is given, unsorted inputs will never cause an error message.  If
-neither of these options is given, wrongly sorted inputs are diagnosed
-only if an input file is found to contain unpairable lines.  If an
-input file is diagnosed as being unsorted, the @command{\cmd\} command
-will exit with a nonzero status (and the output should not be used).
+is given, unsorted inputs will never cause an error message.  If neither
+of these options is given, wrongly sorted inputs are diagnosed
+only if an input file is found to contain unpairable
+@ifset JOIN_COMMAND
+lines, and when both input files are nonempty.
+@end ifset
+@ifclear JOIN_COMMAND
+lines.
+@end ifclear
+If an input file is diagnosed as being unsorted, the @command{\cmd\}
+command will exit with a nonzero status (and the output should not be used).
 
 Forcing @command{\cmd\} to process wrongly sorted input files
 containing unpairable lines by specifying @option{--nocheck-order} is
@@ -5646,7 +5652,9 @@ c c1 c2
 b b1 b2
 @end example
 
+@set JOIN_COMMAND
 @checkOrderOption{join}
+@clear JOIN_COMMAND
 
 The defaults are:
 @itemize
diff --git a/src/join.c b/src/join.c
index afda5a1..6e10f61 100644
--- a/src/join.c
+++ b/src/join.c
@@ -711,7 +711,7 @@ join (FILE *fp1, FILE *fp2)
         seq2.count = 0;
     }
 
-  /* If the user did not specify --check-order, then we read the
+  /* If the user did not specify --nocheck-order, then we read the
      tail ends of both inputs to verify that they are in order.  We
      skip the rest of the tail once we have issued a warning for that
      file, unless we actually need to print the unpairable lines.  */
@@ -726,7 +726,8 @@ join (FILE *fp1, FILE *fp2)
     {
       if (print_unpairables_1)
         prjoin (seq1.lines[0], &uni_blank);
-      seen_unpairable = true;
+      if (seq2.count)
+        seen_unpairable = true;
       while (get_line (fp1, &line, 1))
         {
           if (print_unpairables_1)
@@ -740,7 +741,8 @@ join (FILE *fp1, FILE *fp2)
     {
       if (print_unpairables_2)
         prjoin (&uni_blank, seq2.lines[0]);
-      seen_unpairable = true;
+      if (seq1.count)
+        seen_unpairable = true;
       while (get_line (fp2, &line, 2))
         {
           if (print_unpairables_2)
diff --git a/tests/misc/join b/tests/misc/join
index 3696a03..3ce267c 100755
--- a/tests/misc/join
+++ b/tests/misc/join
@@ -189,7 +189,11 @@ my @tv = (
 
 # Before 6.10.143, this would mistakenly fail with the diagnostic:
 # join: File 1 is not in sorted order
-['chkodr-7', '-12', ["2 a\n1 b\n", ""], "", 0],
+['chkodr-7', '-12', ["2 a\n1 b\n", "2 c\n1 d"], "", 0],
+
+# After 8.9, join doesn't report disorder by default
+# when comparing against an empty input file.
+['chkodr-8', '', ["2 a\n1 b\n", ""], "", 0],
 
 # Test '--header' feature
 ['header-1', '--header',
-- 
1.7.3.4

From 29746dc55e6176934388c772dbe70012859897ff Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?= <[email protected]>
Date: Wed, 5 Jan 2011 11:52:54 +0000
Subject: [PATCH] join: add -o 'auto' to output a constant number of fields per line

Lines with a different number of fields than the first line
will be truncated or padded.

* src/join.c (prfields): A new function refactored from prjoin(),
to output all but the join field.
(prjoin): Don't swap line1 and line2 when line1 is blank
so that the padding is applied to the right place.
(main): Handle the -o 'auto' option.
* tests/misc/join: Add 6 new cases to test the auto format.
* NEWS: Mention the change in behavior.
Suggestion from Assaf Gordon
---
 NEWS               |    6 +++
 doc/coreutils.texi |   19 ++++++++---
 src/join.c         |   88 +++++++++++++++++++++++++++++++++------------------
 tests/misc/join    |   20 ++++++++++++
 4 files changed, 96 insertions(+), 37 deletions(-)
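
To illustrate the padding, here is a sketch using the same inputs as the
new 10a test case.  It assumes a join that understands -o auto; the
scratch directory and the f1/f2 names are hypothetical.

```shell
# Hypothetical scratch files holding the inputs of test case 10a.
tmp=$(mktemp -d)
printf 'a 1 2\nb 1\nd 1 2\n'   > "$tmp/f1"   # the "b" line is one field short
printf 'a 3 4\nb 3 4\nc 3 4\n' > "$tmp/f2"

# -a1 -a2 print unpairable lines from both files; -e . fills missing fields.
auto_out=$(join -a1 -a2 -e . -o auto "$tmp/f1" "$tmp/f2")
printf '%s\n' "$auto_out"

rm -rf "$tmp"
```

Per test 10a, every output line should carry five fields, with "." in
place of any field that is absent from the input.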

diff --git a/NEWS b/NEWS
index 9ccad63..a9d329a 100644
--- a/NEWS
+++ b/NEWS
@@ -13,6 +13,12 @@ GNU coreutils NEWS                                    -*- outline -*-
   rm -f no longer fails for EINVAL or EILSEQ on file systems that
   reject file names invalid for that file system.
 
+** New features
+
+  join now supports -o 'auto' which will automatically infer the
+  output format from the first line in each file, to ensure
+  the same number of fields are output for each line.
+
 
 * Noteworthy changes in release 8.9 (2011-01-04) [stable]
 
diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index 85d5201..9397ab3 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -5675,8 +5675,8 @@ Do not check that both input files are in sorted order.  This is the default.
 
 @item -e @var{string}
 @opindex -e
-Replace those output fields that are missing in the input with
-@var{string}.
+Replace those output fields that are missing in the input with @var{string}.
+I.e., missing fields specified with the @option{-12jo} options.
 
 @item --header
 @opindex --header
@@ -5707,10 +5707,17 @@ Join on field @var{field} (a positive integer) of file 2.
 Equivalent to @option{-1 @var{field} -2 @var{field}}.
 
 @item -o @var{field-list}
-Construct each output line according to the format in @var{field-list}.
-Each element in @var{field-list} is either the single character @samp{0} or
-has the form @var{m.n} where the file number, @var{m}, is @samp{1} or
-@samp{2} and @var{n} is a positive field number.
+@itemx -o auto
+If the keyword @samp{auto} is specified, infer the output format from
+the first line in each file.  This is the same as the default output format
+but also ensures the same number of fields are output for each line.
+Missing fields are replaced with the @option{-e} option and extra fields
+are discarded.
+
+Otherwise, construct each output line according to the format in
+@var{field-list}.  Each element in @var{field-list} is either the single
+character @samp{0} or has the form @var{m.n} where the file number, @var{m},
+is @samp{1} or @samp{2} and @var{n} is a positive field number.
 
 A field specification of @samp{0} denotes the join field.
 In most cases, the functionality of the @samp{0} field spec
diff --git a/src/join.c b/src/join.c
index afda5a1..bf7e908 100644
--- a/src/join.c
+++ b/src/join.c
@@ -112,6 +112,13 @@ static bool issued_disorder_warning[2];
 /* Empty output field filler.  */
 static char const *empty_filler;
 
+/* Whether to ensure the same number of fields are output from each line.  */
+static bool autoformat;
+/* The number of fields to output for each line.
+   Only significant when autoformat is true.  */
+static size_t autocount_1;
+static size_t autocount_2;
+
 /* Field to join on; SIZE_MAX means they haven't been determined yet.  */
 static size_t join_field_1 = SIZE_MAX;
 static size_t join_field_2 = SIZE_MAX;
@@ -210,7 +217,8 @@ else fields are separated by CHAR.  Any FIELD is a field number counted\n\
 from 1.  FORMAT is one or more comma or blank separated specifications,\n\
 each being `FILENUM.FIELD' or `0'.  Default FORMAT outputs the join field,\n\
 the remaining fields from FILE1, the remaining fields from FILE2, all\n\
-separated by CHAR.\n\
+separated by CHAR.  If FORMAT is the keyword 'auto', then the first\n\
+line of each file determines the number of fields output for each line.\n\
 \n\
 Important: FILE1 and FILE2 must be sorted on the join fields.\n\
 E.g., use ` sort -k 1b,1 ' if `join' has no options,\n\
@@ -527,6 +535,27 @@ prfield (size_t n, struct line const *line)
     fputs (empty_filler, stdout);
 }
 
+/* Output all the fields in line, other than the join field.  */
+
+static void
+prfields (struct line const *line, size_t join_field, size_t autocount)
+{
+  size_t i;
+  size_t nfields = autoformat ? autocount : line->nfields;
+  char output_separator = tab < 0 ? ' ' : tab;
+
+  for (i = 0; i < join_field && i < nfields; ++i)
+    {
+      putchar (output_separator);
+      prfield (i, line);
+    }
+  for (i = join_field + 1; i < nfields; ++i)
+    {
+      putchar (output_separator);
+      prfield (i, line);
+    }
+}
+
 /* Print the join of LINE1 and LINE2.  */
 
 static void
@@ -534,6 +563,8 @@ prjoin (struct line const *line1, struct line const *line2)
 {
   const struct outlist *outlist;
   char output_separator = tab < 0 ? ' ' : tab;
+  size_t field;
+  struct line const *line;
 
   outlist = outlist_head.next;
   if (outlist)
@@ -543,9 +574,6 @@ prjoin (struct line const *line1, struct line const *line2)
       o = outlist;
       while (1)
         {
-          size_t field;
-          struct line const *line;
-
           if (o->file == 0)
             {
               if (line1 == &uni_blank)
@@ -574,37 +602,24 @@ prjoin (struct line const *line1, struct line const *line2)
     }
   else
     {
-      size_t i;
-
       if (line1 == &uni_blank)
         {
-          struct line const *t;
-          t = line1;
-          line1 = line2;
-          line2 = t;
+          line = line2;
+          field = join_field_2;
         }
-      prfield (join_field_1, line1);
-      for (i = 0; i < join_field_1 && i < line1->nfields; ++i)
-        {
-          putchar (output_separator);
-          prfield (i, line1);
-        }
-      for (i = join_field_1 + 1; i < line1->nfields; ++i)
+      else
         {
-          putchar (output_separator);
-          prfield (i, line1);
+          line = line1;
+          field = join_field_1;
         }
 
-      for (i = 0; i < join_field_2 && i < line2->nfields; ++i)
-        {
-          putchar (output_separator);
-          prfield (i, line2);
-        }
-      for (i = join_field_2 + 1; i < line2->nfields; ++i)
-        {
-          putchar (output_separator);
-          prfield (i, line2);
-        }
+      /* Output the join field.  */
+      prfield (field, line);
+
+      /* Output other fields.  */
+      prfields (line1, join_field_1, autocount_1);
+      prfields (line2, join_field_2, autocount_2);
+
       putchar ('\n');
     }
 }
@@ -627,6 +642,12 @@ join (FILE *fp1, FILE *fp2)
   initseq (&seq2);
   getseq (fp2, &seq2, 2);
 
+  if (autoformat)
+    {
+      autocount_1 = seq1.count ? seq1.lines[0]->nfields : 0;
+      autocount_2 = seq2.count ? seq2.lines[0]->nfields : 0;
+    }
+
   if (join_header_lines && seq1.count && seq2.count)
     {
       prjoin (seq1.lines[0], seq2.lines[0]);
@@ -1037,8 +1058,13 @@ main (int argc, char **argv)
           break;
 
         case 'o':
-          add_field_list (optarg);
-          optc_status = MIGHT_BE_O_ARG;
+          if (STREQ (optarg, "auto"))
+            autoformat = true;
+          else
+            {
+              add_field_list (optarg);
+              optc_status = MIGHT_BE_O_ARG;
+            }
           break;
 
         case 't':
diff --git a/tests/misc/join b/tests/misc/join
index 3696a03..3cf278b 100755
--- a/tests/misc/join
+++ b/tests/misc/join
@@ -127,6 +127,26 @@ my @tv = (
 # From David Dyck
 ['9a', '', [" a 1\n b 2\n", " a Y\n b Z\n"], "a 1 Y\nb 2 Z\n", 0],
 
+# -o 'auto'
+['10a', '-a1 -a2 -e . -o auto',
+ ["a 1 2\nb 1\nd 1 2\n", "a 3 4\nb 3 4\nc 3 4\n"],
+ "a 1 2 3 4\nb 1 . 3 4\nc . . 3 4\nd 1 2 . .\n", 0],
+['10b', '-a1 -a2 -j3 -e . -o auto',
+ ["a 1 2\nb 1\nd 1 2\n", "a 3 4\nb 3 4\nc 3 4\n"],
+ "2 a 1 . .\n. b 1 . .\n2 d 1 . .\n4 . . a 3\n4 . . b 3\n4 . . c 3\n"],
+['10c', '-a1 -1 1 -2 4 -e. -o auto',
+ ["a 1 2\nb 1\nd 1 2\n", "a 3 4\nb 3 4\nc 3 4\n"],
+ "a 1 2 . . .\nb 1 . . . .\nd 1 2 . . .\n"],
+['10d', '-a2 -1 1 -2 4 -e. -o auto',
+ ["a 1 2\nb 1\nd 1 2\n", "a 3 4\nb 3 4\nc 3 4\n"],
+ ". . . a 3 4\n. . . b 3 4\n. . . c 3 4\n"],
+['10e', '-o auto',
+ ["a 1 2\nb 1 2 discard\n", "a 3 4\nb 3 4 discard\n"],
+ "a 1 2 3 4\nb 1 2 3 4\n"],
+['10f', '-t, -o auto',
+ ["a,1,,2\nb,1,2\n", "a,3,4\nb,3,4\n"],
+ "a,1,,2,3,4\nb,1,2,,3,4\n"],
+
 # From Tim Smithers: fixed in 1.22l
 ['trailing-sp', '-t: -1 1 -2 1', ["a:x \n", "a:y \n"], "a:x :y \n", 0],
 
-- 
1.7.3.4

From 97ae252c17fdf913cad04438fa873e840c397977 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?= <[email protected]>
Date: Wed, 24 Nov 2010 08:37:23 +0000
Subject: [PATCH] doc: add alternatives for field processing not supported by cut

* doc/coreutils.texi (cut invocation): Remove the tr -s '[:blank:]'
example, as it doesn't handle leading and trailing blanks.  Add `awk`
examples for common field processing operations often asked about.
Also document a `join` hack, to achieve the same thing.
---
 doc/coreutils.texi |   23 +++++++++++++++++++----
 1 files changed, 19 insertions(+), 4 deletions(-)
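
A small sketch comparing the awk and join approaches on one made-up
input line with leading, trailing and repeated blanks.  The variable
names are for illustration only.  Note that in awk, $NF-1 parses as
($NF) - 1, i.e. the last field minus one, so the penultimate field needs
the parenthesized form $(NF-1).

```shell
line='  10   20 30  '         # hypothetical input with irregular blanks

second=$(printf '%s\n' "$line" | awk '{print $2}')        # second field
penult=$(printf '%s\n' "$line" | awk '{print $(NF-1)}')   # penultimate field
arith=$(printf '%s\n' "$line" | awk '{print $NF-1}')      # last field minus 1
swapped=$(printf '%s\n' "$line" | awk '{print $2,$1}')    # first two, reordered

# The join alternative; relies on join not reporting disorder against
# an empty second file.
join_second=$(printf '%s\n' "$line" | join -a1 -o 1.2 - /dev/null)

printf '%s\n' "$second" "$penult" "$arith" "$swapped" "$join_second"
```

Here only the $NF-1 form yields 29 rather than a field of the input,
which is why it is worth spelling out the parentheses.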

diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index 85d5201..b30b68f 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -5474,11 +5474,26 @@ Select for printing only the fields listed in @var{field-list}.
 Fields are separated by a TAB character by default.  Also print any
 line that contains no delimiter character, unless the
 @option{--only-delimited} (@option{-s}) option is specified.
-Note @command{cut} does not support specifying runs of whitespace as a
-delimiter, so to achieve that common functionality one can pre-process
-with @command{tr} like:
+
+Note @command{awk} supports more sophisticated field processing,
+and by default will use (and discard) runs of blank characters to
+separate fields, and ignore leading and trailing blanks.
+@example
+@verbatim
+awk '{print $2}'      # print the second field
+awk '{print $(NF-1)}' # print the penultimate field
+awk '{print $2,$1}'   # reorder the first two fields
+@end verbatim
+@end example
+
+In the unlikely event that @command{awk} is unavailable,
+one can use the @command{join} command to process blank
+characters as @command{awk} does above.
 @example
-tr -s '[:blank:]' '\t' | cut -f@dots{}
+@verbatim
+join -a1 - /dev/null -o 1.2     # print the second field
+join -a1 - /dev/null -o 1.2,1.1 # reorder the first two fields
+@end verbatim
 @end example
 
 @item -d @var{input_delim_byte}
-- 
1.7.3.4
