bug#17637: bug "cut of end-line is skipped"

Pádraig Brady Fri, 30 May 2014 10:45:07 -0700

On 05/30/2014 03:34 AM, Pádraig Brady wrote:
> On 05/29/2014 11:53 PM, Eric Blake wrote:
>> On 05/29/2014 04:24 PM, Pádraig Brady wrote:
>>> tag 17637 notabug
>>> close 17637
>>> stop
>>
>> On the one hand, this feels a bit premature.
>>
>>>
>>> That change 
>>> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=commitdiff;h=51ce0bf8
>>> was made in v8.21 to fix http://bugs.gnu.org/13498
>>
>> Are you sure you didn't mean the next commit:
>>
>> http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=commitdiff;h=d302aed
> 
> Right sorry.


Or rather the two in combination.

>> But both of those commits are in coreutils 8.21, whereas the Fedora 20
>> build of coreutils 8.21 does not have that behavior.  Is downstream
>> patching things in a way to make it work, and if so, why can't we
>> backport what Fedora has added on top?
> 
> That's the i18n patch which has diverged here:
> 
>   $ seq 10 | LANG=C cut -s -f2 -d$'\n'
>   $ seq 10 | cut -s -f2 -d$'\n'
>   2
> 
> Unfortunately that means we have an inconsistency.
> Also many users might still be getting the old behavior
> (and thus not complaining about the new behavior)
> and Rudy may be hitting this only because the script is
> being run in the C locale?

Seems debian/ubuntu are closer to upstream and don't
apply the i18n patch. So it's good that lots were
exposed to this change and this is the first reported issue.

>>> It was made for a good reason, to handle the buffering issues detailed
>>> in the above bug. Your existing usage was a bit of an edge case and not
>>> supported with other cut implementations, and while we try to avoid
>>> changes like this it was thought the benefits outweighed the impact
>>> for the very few who use cut in this way.
>>
>> But while you documented the improved buffering behavior in NEWS, you
>> failed to document the corner-case change to -d$'\n'.
>>
>> On the other hand, I confirmed that both Solaris and FreeBSD cut behave
>> the same way as the new GNU cut behavior.
>>
>> $ nl='
>> '
>> $ printf 'a\t1\nb\t2\n' | cut -d"$nl" -f1
>> a       1
>> b       2
>>
>> So keeping the new behavior in the name of consistency makes sense,
>> although it still might be nice to add a retroactive NEWS entry.
> 
> Ugh I'm not sure now. Consistency is good if that consistent
> behavior is needed, though I suppose the use case of using -s -d$'\n'
> to suppress the last line if it has no trailing newline is a lot more
> esoteric than using cut like this for example:
> 
>   $ seq 10 | cut -f2,3,7 -d$'\n' --output-delimiter='|'
>   2|3|7
> 
> So I'm leaning towards restoring that behavior.
> (I notice cut consumes all input even if the last
> line (field) needed is output, so we could improve that too).
> 
> I'll sleep on it.

Arguments for reverting to old behavior:
  compat for coreutils only scripts
  compat with existing i18n patch in various distros
  Valid use cases albeit achievable with other utils

Arguments for keeping new behavior
  avoids users introducing unneeded GNU extensions to scripts
  avoids special casing last '\n' in file
  allows for more efficient implementation

Compat concerns win here I think so
I'm 55:45 in favor of applying the attached patch
to reinstate the functionality.

thanks,
Pádraig.

>From b8f14b16dbbdda1f48629d11a7a63671248af5af Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?= <[email protected]>
Date: Fri, 30 May 2014 17:44:32 +0100
Subject: [PATCH] cut: restore special case handling of -f with -d$'\n'

commits v8.20-98-g51ce0bf and v8.20-99-gd302aed changed cut(1)
to process each line independently and thus promptly output
each line without buffering.  As part of those changes we removed
the special handling of --delimiter=$'\n' --fields=... which
could be used to select arbitrary (ranges of) lines, so as to
simplify and optimize the implementation while also matching the
behavior of different cut(1) implementations.

However that GNU behavior was in place for a long time, and
could be useful in certain cases like making a separated list like
`seq 10 | cut -f1- -d$'\n' --output-delimiter=,` although other tools
like head(1) and paste(1) are more suited to this operation.
The "line behind" buffering behavior is now restricted to
only the -d$'\n' case.

While reinstating this functionality we also fix the following
related edge case to be more consistent:

  before$ printf "\n" | cut -s -d$'\n' -f1- | wc -l
  2
  after $ printf "\n" | cut  -d$'\n' -f1- | wc -l
  1

* src/cut.c (cut_fields): Adjust as discussed above.
* tests/misc/cut.pl: Adjust accordingly.
* NEWS: Mention the change in behavior both for v8.21
and this effective revert.
* cfg.mk (old_NEWS_hash): Adjust for originally omitted v8.21 entry.
* src/paste.c: s/delimeter/delimiter/ comment typo fix.
---
 NEWS              |    7 +++++++
 cfg.mk            |    2 +-
 src/cut.c         |   46 +++++++++++++++++++++++++++++++++-------------
 src/paste.c       |    2 +-
 tests/misc/cut.pl |   12 +++++++-----
 5 files changed, 49 insertions(+), 20 deletions(-)

diff --git a/NEWS b/NEWS
index 9679b46..29aa189 100644
--- a/NEWS
+++ b/NEWS
@@ -88,6 +88,10 @@ GNU coreutils NEWS                                    -*- outline -*-
   chroot --userspec will now unset supplemental groups associated with root,
   and instead use the supplemental groups of the specified user.
 
+  cut -d$'\n' again outputs the lines specified in the --fields list.
+  Note using this non portable functionality (not present in v8.21 and v8.22)
+  will result in the delayed output of lines.
+
   ls with none of LS_COLORS or COLORTERM environment variables set,
   will now honor an empty or unknown TERM environment variable,
   and not output colors even with --colors=always.
@@ -341,6 +345,9 @@ GNU coreutils NEWS                                    -*- outline -*-
   the system by skipping duplicate entries (identified by the device number).
   Consequently, df also elides the early-boot pseudo file system type "rootfs".
 
+  cut -d$'\n' no longer outputs the lines specified in the --fields list,
+  to align with other implementations and to avoid delayed output of lines.
+
   nl no longer supports the --page-increment option, which has been
   deprecated since coreutils-7.5.  Use --line-increment instead.
 
diff --git a/cfg.mk b/cfg.mk
index 1e884cd..f6a80a4 100644
--- a/cfg.mk
+++ b/cfg.mk
@@ -45,7 +45,7 @@ export VERBOSE = yes
 # 4914152 9e
 export XZ_OPT = -8e
 
-old_NEWS_hash = 68fc9b352e924d5e59e2f543f80f6a41
+old_NEWS_hash = 70254636cbd566a257d147444eb21435
 
 # Add an exemption for sc_makefile_at_at_check.
 _makefile_at_at_check_exceptions = ' && !/^cu_install_program =/'
diff --git a/src/cut.c b/src/cut.c
index 5528068..312551f 100644
--- a/src/cut.c
+++ b/src/cut.c
@@ -109,13 +109,13 @@ enum operating_mode
     /* Output characters that are in the given bytes. */
     byte_mode,
 
-    /* Output the given delimeter-separated fields. */
+    /* Output the given delimiter-separated fields. */
     field_mode
   };
 
 static enum operating_mode operating_mode;
 
-/* If true do not output lines containing no delimeter characters.
+/* If true do not output lines containing no delimiter characters.
    Otherwise, all such lines are printed.  This option is valid only
    with field mode.  */
 static bool suppress_non_delimited;
@@ -124,7 +124,7 @@ static bool suppress_non_delimited;
    those that were specified.  */
 static bool complement;
 
-/* The delimeter character for field mode. */
+/* The delimiter character for field mode. */
 static unsigned char delim;
 
 /* True if the --output-delimiter=STRING option was specified.  */
@@ -538,7 +538,6 @@ cut_fields (FILE *stream)
         {
           ssize_t len;
           size_t n_bytes;
-          bool got_line;
 
           len = getndelim2 (&field_1_buffer, &field_1_bufsize, 0,
                             GETNLINE_NO_LIMIT, delim, '\n', stream);
@@ -555,14 +554,13 @@ cut_fields (FILE *stream)
           assert (n_bytes != 0);
 
           c = 0;
-          got_line = field_1_buffer[n_bytes - 1] == '\n';
 
           /* If the first field extends to the end of line (it is not
              delimited) and we are printing all non-delimited lines,
              print this one.  */
-          if (to_uchar (field_1_buffer[n_bytes - 1]) != delim || got_line)
+          if (to_uchar (field_1_buffer[n_bytes - 1]) != delim)
             {
-              if (suppress_non_delimited && !(got_line && delim == '\n'))
+              if (suppress_non_delimited)
                 {
                   /* Empty.  */
                 }
@@ -570,7 +568,7 @@ cut_fields (FILE *stream)
                 {
                   fwrite (field_1_buffer, sizeof (char), n_bytes, stdout);
                   /* Make sure the output line is newline terminated.  */
-                  if (! got_line)
+                  if (field_1_buffer[n_bytes - 1] != '\n')
                     putchar ('\n');
                   c = '\n';
                 }
@@ -580,7 +578,19 @@ cut_fields (FILE *stream)
             {
               /* Print the field, but not the trailing delimiter.  */
               fwrite (field_1_buffer, sizeof (char), n_bytes - 1, stdout);
-              found_any_selected_field = true;
+
+              /* With -d$'\n' don't treat the last '\n' as a delimiter.  */
+              if (delim == '\n')
+                {
+                  int last_c = getc (stream);
+                  if (last_c != EOF)
+                    {
+                      ungetc (last_c, stream);
+                      found_any_selected_field = true;
+                    }
+                }
+              else
+                found_any_selected_field = true;
             }
           next_item (&field_idx);
         }
@@ -610,12 +620,24 @@ cut_fields (FILE *stream)
             }
         }
 
-      if (c == '\n' || c == EOF)
+      /* With -d$'\n' don't treat the last '\n' as a delimiter.  */
+      if (delim == '\n' && c == delim)
+        {
+          int last_c = getc (stream);
+          if (last_c != EOF)
+            ungetc (last_c, stream);
+          else
+            c = last_c;
+        }
+
+      if (c == delim)
+        next_item (&field_idx);
+      else if (c == '\n' || c == EOF)
         {
           if (found_any_selected_field
               || !(suppress_non_delimited && field_idx == 1))
             {
-              if (c == '\n' || prev_c != '\n')
+              if (c == '\n' || prev_c != '\n' || delim == '\n')
                 putchar ('\n');
             }
           if (c == EOF)
@@ -624,8 +646,6 @@ cut_fields (FILE *stream)
           current_rp = rp;
           found_any_selected_field = false;
         }
-      else if (c == delim)
-        next_item (&field_idx);
     }
 }
 
diff --git a/src/paste.c b/src/paste.c
index 707c495..3663aaf 100644
--- a/src/paste.c
+++ b/src/paste.c
@@ -62,7 +62,7 @@ static bool have_read_stdin;
    corresponding lines from each file in parallel. */
 static bool serial_merge;
 
-/* The delimeters between lines of input files (used cyclically). */
+/* The delimiters between lines of input files (used cyclically). */
 static char *delims;
 
 /* A pointer to the character after the end of 'delims'. */
diff --git a/tests/misc/cut.pl b/tests/misc/cut.pl
index 2952367..0418862 100755
--- a/tests/misc/cut.pl
+++ b/tests/misc/cut.pl
@@ -144,15 +144,17 @@ my @Tests =
   ['newline-12', '-s', '-d:', '-f1', {IN=>"a:1\nb:"}, {OUT=>"a\nb\n"}],
   ['newline-13', '-d:', '-f1-', {IN=>"a1:\n:"}, {OUT=>"a1:\n:\n"}],
   # newline processing for fields when -d == '\n'
-  ['newline-14', "-d'\n'", '-f1', {IN=>"a:1\nb:"}, {OUT=>"a:1\nb:\n"}],
+  ['newline-14', "-d'\n'", '-f1', {IN=>"a:1\nb:"}, {OUT=>"a:1\n"}],
   ['newline-15', '-s', "-d'\n'", '-f1', {IN=>"a:1\nb:"}, {OUT=>"a:1\n"}],
-  ['newline-16', '-s', "-d'\n'", '-f2', {IN=>"\nb"}, {OUT=>""}],
+  ['newline-16', '-s', "-d'\n'", '-f2', {IN=>"\nb"}, {OUT=>"b\n"}],
   ['newline-17', '-s', "-d'\n'", '-f1', {IN=>"\nb"}, {OUT=>"\n"}],
-  ['newline-18', "-d'\n'", '-f2', {IN=>"\nb"}, {OUT=>"\nb\n"}],
-  ['newline-19', "-d'\n'", '-f1', {IN=>"\nb"}, {OUT=>"\nb\n"}],
+  ['newline-18', "-d'\n'", '-f2', {IN=>"\nb"}, {OUT=>"b\n"}],
+  ['newline-19', "-d'\n'", '-f1', {IN=>"\nb"}, {OUT=>"\n"}],
   ['newline-20', '-s', "-d'\n'", '-f1-', {IN=>"\n"}, {OUT=>"\n"}],
-  ['newline-21', '-s', "-d'\n'", '-f1-', {IN=>"\nb"}, {OUT=>"\n"}],
+  ['newline-21', '-s', "-d'\n'", '-f1-', {IN=>"\nb"}, {OUT=>"\nb\n"}],
   ['newline-22', "-d'\n'", '-f1-', {IN=>"\nb"}, {OUT=>"\nb\n"}],
+  ['newline-23', "-d'\n'", '-f1-', '--ou=:', {IN=>"a\nb\n"}, {OUT=>"a:b\n"}],
+  ['newline-24', "-d'\n'", '-f1,2', '--ou=:', {IN=>"a\nb\n"}, {OUT=>"a:b\n"}],
 
   # New functionality:
   ['out-delim1', '-c1-3,5-', '--output-d=:', {IN=>"abcdefg\n"},
-- 
1.7.7.6

bug#17637: bug "cut of end-line is skipped"

Reply via email to