bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-26 Thread Paul Eggert

Pádraig Brady wrote:

I've pushed the c_iscntrl patch since it's simplest
and probably most appropriate patch for an existing release.


Yes, that makes sense for a quick patch. However, for the next release I think 
it'd be better to catch encoding errors and multibyte control characters, given 
the problems noted. I installed the attached further patch to try to do this. 
This fixes the problem that Bruno noted, along with two others; my earlier patch 
neglected the possibility that mbrtowc can return 0, and it incorrectly assumed 
wide control characters always have a single-byte representation.


Either way the original bug appears to be fix so I'm boldly closing the bug 
report.
From 2cf5d730690dad600f8b6d74d0b5fde522804e43 Mon Sep 17 00:00:00 2001
From: Paul Eggert 
Date: Sun, 22 Jul 2018 09:50:20 -0700
Subject: [PATCH] df: avoid multibyte character corruption on macOS

This improves on the earlier fix for the problem reported by
Chih-Hsuan Yen (Bug#32236), by also looking for other control
characters and for encoding errors.
* src/df.c: Include wchar.h and wctype.h instead of c-ctype.h.
(hide_problematic_chars): Process the string as multibyte.
Use iswcntrl, not c_iscntrl.
---
 src/df.c | 43 ---
 1 file changed, 32 insertions(+), 11 deletions(-)

diff --git a/src/df.c b/src/df.c
index c851fcc..d27ba02 100644
--- a/src/df.c
+++ b/src/df.c
@@ -23,7 +23,8 @@
 #include 
 #include 
 #include 
-#include 
+#include 
+#include 
 
 #include "system.h"
 #include "canonicalize.h"
@@ -272,21 +273,41 @@ static struct option const long_options[] =
   {NULL, 0, NULL, 0}
 };
 
-/* Replace problematic chars with '?'.
-   Since only control characters are currently considered,
-   this should work in all encodings.  */
+/* Replace problematic chars with '?'.  */
 
-static char*
+static void
 hide_problematic_chars (char *cell)
 {
-  char *p = cell;
-  while (*p)
+  char *srcend = cell + strlen (cell);
+  char *dst = cell;
+  mbstate_t mbstate = { 0, };
+  size_t n;
+
+  for (char *src = cell; src != srcend; src += n)
 {
-  if (c_iscntrl (to_uchar (*p)))
-*p = '?';
-  p++;
+  wchar_t wc;
+  size_t srcbytes = srcend - src;
+  n = mbrtowc (&wc, src, srcbytes, &mbstate);
+  bool ok = 0 < n && n <= srcbytes;
+
+  if (ok)
+ok = !iswcntrl (wc);
+  else
+n = 1;
+
+  if (ok)
+{
+  memmove (dst, src, n);
+  dst += n;
+}
+  else
+{
+  *dst++ = '?';
+  memset (&mbstate, 0, sizeof mbstate);
+}
 }
-  return cell;
+
+  *dst = '\0';
 }
 
 /* Dynamically allocate a row of pointers in TABLE, which
-- 
2.7.4



bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-26 Thread Pádraig Brady
On 26/07/18 02:01, Paul Eggert wrote:
> Chih-Hsuan Yen wrote:
>> How about following the idea from Pádraig Brady
>> and filter \n only?
> 
> Given the later comments it seems better to filter out encoding errors and 
> control characters. Programs that parse the output already cannot trust the 
> strings to be exactly right, since newlines are gonna get replaced no matter 
> what. So there seems little benefit to copying the other garbage faithfully.
> 
> Revised proposed patch(es) attached.

This is better, though this means that mount points now
need to match the locale of df or they won't be displayed.
Theoretically that was the case previously, but only for control chars
and so wouldn't have have had a practical impact for mounts
encoded in another local, only for security/robustness reasons where
one might have \n etc.

I've pushed the c_iscntrl patch since it's simplest
and probably most appropriate patch for an existing release.

If you consider the matching encoding issue as a non issue,
then I'm OK with this.

cheers,
Pádraig





bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-26 Thread Bruno Haible
Paul Eggert wrote:
> Revised proposed patch(es) attached.

Looks good to me, except for one little thing:

>   memcpy (dst, src, n);

src and dst may overlap. Therefore memmove should be used instead of memcpy.

Bruno






bug#32267: dd's ucase and lcase and LC_CTYPE.

2018-07-26 Thread Paul Eggert
Yes, this is a known issue with dd as with many other coreutils programs. 
Strictly speaking as I understand it, it is not a deviation from POSIX, since 
POSIX does not require support for locales with multibyte encodings. Still, it 
would be nice to fix dd at some point, although it'd be a pain to do correctly 
and efficiently and it's long been low priority since hardly anybody needs or 
uses this feature on any platform.






bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS

2018-07-26 Thread Paul Eggert

Chih-Hsuan Yen wrote:

How about following the idea from Pádraig Brady
and filter \n only?


Given the later comments it seems better to filter out encoding errors and 
control characters. Programs that parse the output already cannot trust the 
strings to be exactly right, since newlines are gonna get replaced no matter 
what. So there seems little benefit to copying the other garbage faithfully.


Revised proposed patch(es) attached.
>From e4f2a5f97771c4a74e0bdef1c1e4a0d2735cef15 Mon Sep 17 00:00:00 2001
From: Paul Eggert 
Date: Sun, 22 Jul 2018 09:50:20 -0700
Subject: [PATCH 1/2] df: avoid multibyte character corruption on macOS

Problem reported by Chih-Hsuan Yen (Bug#32236).
* NEWS: Mention the bug fix.
* src/df.c: Include wchar.h and wctype.h.
(hide_problematic_chars): Respect multibyte encodings
when replacing problematic characters or bytes with '?'.
---
 NEWS |  4 
 src/df.c | 35 +--
 2 files changed, 29 insertions(+), 10 deletions(-)

diff --git a/NEWS b/NEWS
index af1a990..aa3b4f9 100644
--- a/NEWS
+++ b/NEWS
@@ -2,6 +2,10 @@ GNU coreutils NEWS-*- outline -*-
 
 * Noteworthy changes in release ?.? (-??-??) [?]
 
+** Bug fixes
+
+  df no longer corrupts displayed multibyte characters on macOS.
+
 
 * Noteworthy changes in release 8.30 (2018-07-01) [stable]
 
diff --git a/src/df.c b/src/df.c
index 1178865..52be414b 100644
--- a/src/df.c
+++ b/src/df.c
@@ -23,6 +23,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include "system.h"
 #include "canonicalize.h"
@@ -271,21 +273,34 @@ static struct option const long_options[] =
   {NULL, 0, NULL, 0}
 };
 
-/* Replace problematic chars with '?'.
-   Since only control characters are currently considered,
-   this should work in all encodings.  */
+/* Replace problematic chars with '?'.  */
 
-static char*
+static void
 hide_problematic_chars (char *cell)
 {
-  char *p = cell;
-  while (*p)
+  char *srcend = cell + strlen (cell);
+  char *dst = cell;
+  mbstate_t mbstate = { 0, };
+  size_t n;
+
+  for (char *src = cell; src != srcend; src += n)
 {
-  if (iscntrl (to_uchar (*p)))
-*p = '?';
-  p++;
+  wchar_t wc;
+  size_t srcbytes = srcend - src;
+  n = mbrtowc (&wc, src, srcbytes, &mbstate);
+  if (n <= srcbytes && !iswcntrl (wc))
+{
+  memcpy (dst, src, n);
+  dst += n;
+}
+  else
+{
+  *dst++ = '?';
+  memset (&mbstate, 0, sizeof mbstate);
+}
 }
-  return cell;
+
+  *dst = '\0';
 }
 
 /* Dynamically allocate a row of pointers in TABLE, which
-- 
2.7.4

>From e40ae8fb26fd0c8c0cb7e42598a2b8d9bd8551bb Mon Sep 17 00:00:00 2001
From: Paul Eggert 
Date: Thu, 26 Jul 2018 01:56:28 -0700
Subject: [PATCH 2/2] df: tune slightly

* src/df.c (get_header, get_dev):
Avoid calling mbswidth twice when once will do.
---
 src/df.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/df.c b/src/df.c
index 52be414b..f5bd26d 100644
--- a/src/df.c
+++ b/src/df.c
@@ -588,7 +588,8 @@ get_header (void)
 
   table[nrows - 1][col] = cell;
 
-  columns[col]->width = MAX (columns[col]->width, mbswidth (cell, 0));
+  size_t cell_width = mbswidth (cell, 0);
+  columns[col]->width = MAX (columns[col]->width, cell_width);
 }
 }
 
@@ -1198,7 +1199,8 @@ get_dev (char const *disk, char const *mount_point, char const* file,
 assert (!"empty cell");
 
   hide_problematic_chars (cell);
-  columns[col]->width = MAX (columns[col]->width, mbswidth (cell, 0));
+  size_t cell_width = mbswidth (cell, 0);
+  columns[col]->width = MAX (columns[col]->width, cell_width);
   table[nrows - 1][col] = cell;
 }
   free (dev_name);
-- 
2.7.4