Re: performance bug of `wc -m`

Pádraig Brady Thu, 17 May 2018 23:27:56 -0700

On 16/05/18 17:31, Eric Fischer wrote:
> I should also add that the core reason that wc is slow and Python is fast
> is not that UTF-8 decoding in wc is slow, it is that the Python code is
> just counting characters, while wc is also maintaining a line width
> for --max-line-length. It doesn't really need to do this, and probably
> shouldn't do this, unless --max-line-length is specified.


I had a look at adding some short-circuiting like you suggested.
The attached gives these results on Fedora 25

$ yes áááááááááááááááááááá | head -n100000 > mbc.txt
$ yes 12345678901234567890 | head -n100000 > num.txt

===== Before ====

$ time src/wc -Lm < mbc.txt
2100000      20
real    0m0.186s

$ time src/wc -m < mbc.txt
2100000
real    0m0.186s

$ time src/wc -Lm < num.txt
2100000      20
real    0m0.055s

$ time src/wc -m < num.txt
2100000
real    0m0.056s

==== After ====

$ time src/wc -Lm < mbc.txt
2100000      20
real    0m0.196s

$ time src/wc -m < mbc.txt
2100000
real    0m0.173s

$ time src/wc -Lm < num.txt
2100000      20
real    0m0.031s

$ time src/wc -m < num.txt
2100000
real    0m0.028s

================

>From this we can deduce that short-circuiting iswprint()
is a small win, but avoiding wcwidth() can be a small loss
on Linux due to the low overhead which is negated by
the additional conditional.

Note this also shows that I'm not reproducing the very slow wc
operation that Peng Yu noticed (the python test program operates
at much the same speed, but wc is _much_ quicker here).
I suspect that this is due to iswprint()/wcwidth() being mega slow on OSX?
If that's the case then the wcwidth() avoidance is worth it
even though it slows us down a little in one case on Linux.

Now I see we may be replacing wcwidth() on OSX as there are issues
with OSX handling of combining characters in UTF-8.
So maybe the slow down is with the gnulib wcwidth!?
To test that I did:

  $ gl_cv_func_wcwidth_works=no ./configure --quiet
  $ time src/wc -Lm < mbc.txt
  2100000      20
  real  0m0.225s

So only slightly slower.  That suggests we're not replacing
wcwidth on this OSX system, and that the system implementation
is just very slow, which the attached patch should avoid if possible.

cheers,
Pádraig

p.s. I'm slightly worried that the _existing_ fast path processing
for the is_basic() set, may be too big a set for mbrtowc() avoidance
for GB18030 for example?

From b0cb0ed373f02ede2046fa79a64e276e5bc8d269 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?= <[email protected]>
Date: Thu, 17 May 2018 21:41:46 -0700
Subject: [PATCH] wc: optimize processing of ASCII in multi byte locales
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

===== Benchmark setup (on GNU/Linux) ====
$ yes Ã¡Ã¡Ã¡Ã¡Ã¡Ã¡Ã¡Ã¡Ã¡Ã¡Ã¡Ã¡Ã¡Ã¡Ã¡Ã¡Ã¡Ã¡Ã¡Ã¡ | head -n100000 > mbc.txt
$ yes 12345678901234567890 | head -n100000 > num.txt

===== Before ====
$ time src/wc -Lm < mbc.txt
real    0m0.186s
$ time src/wc -m < mbc.txt
real    0m0.186s
$ time src/wc -Lm < num.txt
real    0m0.055s
$ time src/wc -m < num.txt
real    0m0.056s

==== After ====
$ time src/wc -Lm < mbc.txt
real    0m0.196s
$ time src/wc -m < mbc.txt
real    0m0.173s
$ time src/wc -Lm < num.txt
real    0m0.031s
$ time src/wc -m < num.txt
real    0m0.028s

* src/wc.c (wc): Only call wide variant functions like
iswprint() and wcwidth() for non is_basic() characters.
I.E. non ISO C "basic character set" characters.
This is especially significant on OSX where wcwidth()
is very expensive (about 10x in tests).
* NEWS: Mention the improvement.
Suggested by Eric Fishcer.
---
 NEWS     |  3 +++
 src/wc.c | 30 +++++++++++++++++++++++-------
 2 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/NEWS b/NEWS
index 4c63b6d..f447d6d 100644
--- a/NEWS
+++ b/NEWS
@@ -53,6 +53,9 @@ GNU coreutils NEWS                                    -*- outline -*-
   version of XFS.  stat -f --format=%T now reports the file system type,
   and tail -f uses inotify.
 
+  wc avoids redundant processing of ASCII text in multibyte locales,
+  which is especially significant on macOS.
+
 
 * Noteworthy changes in release 8.29 (2017-12-27) [stable]
 
diff --git a/src/wc.c b/src/wc.c
index 0c72042..2034c42 100644
--- a/src/wc.c
+++ b/src/wc.c
@@ -379,6 +379,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
             {
               wchar_t wide_char;
               size_t n;
+              bool wide = true;
 
               if (!in_shift && is_basic (*p))
                 {
@@ -386,6 +387,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
                      mbrtowc().  */
                   n = 1;
                   wide_char = *p;
+                  wide = false;
                 }
               else
                 {
@@ -419,9 +421,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
                       n = 1;
                     }
                 }
-              p += n;
-              bytes_read -= n;
-              chars++;
+
               switch (wide_char)
                 {
                 case '\n':
@@ -445,17 +445,33 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
                   in_word = false;
                   break;
                 default:
-                  if (iswprint (wide_char))
+                  if (wide && iswprint (wide_char))
                     {
-                      int width = wcwidth (wide_char);
-                      if (width > 0)
-                        linepos += width;
+                      /* wcwidth can be expensive on OSX for example,
+                         so avoid if uneeded.  */
+                      if (print_linelength)
+                        {
+                          int width = wcwidth (wide_char);
+                          if (width > 0)
+                            linepos += width;
+                        }
                       if (iswspace (wide_char))
                         goto mb_word_separator;
                       in_word = true;
                     }
+                  else if (!wide && isprint (to_uchar (*p)))
+                    {
+                      linepos++;
+                      if (isspace (to_uchar (*p)))
+                        goto mb_word_separator;
+                      in_word = true;
+                    }
                   break;
                 }
+
+              p += n;
+              bytes_read -= n;
+              chars++;
             }
           while (bytes_read > 0);
 
-- 
2.9.3

Re: performance bug of `wc -m`

Reply via email to