Re: performance bug of `wc -m`

Philip Rowlands Sun, 13 May 2018 15:25:40 -0700

On Sun, 13 May 2018, at 02:55, Peng Yu wrote:
> Hi,
> 
> The following example shows that `wc -m` is even slower than the
> equivalent Python code. Can this performance bug be fixed?


I can reproduce the slow wc behaviour with UTF-8 enabled locales.

$ echo $LANG
en_GB.UTF-8

$ seq 1000000 | time -p wc -c
6888896
real 0.05
user 0.00
sys 0.02

$ seq 1000000 | time -p wc -m
6888896
real 0.60
user 0.58
sys 0.00

$ seq 1000000 | LANG=C time -p wc -m
6888896
real 0.05
user 0.00
sys 0.02

In the slow case, wc is spending most of its time in iswprint / wcwidth / 
iswspace. Perhaps wc could learn a faster method of counting UTF-8 characters 
(https://stackoverflow.com/a/7298149); this may be worthwhile as the trend to 
UTF-8 everywhere marches on.
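
To illustrate the kind of shortcut I mean (a rough sketch only, and not necessarily what the linked answer does): in valid UTF-8 every character starts with exactly one byte that is not a continuation byte (bit pattern 10xxxxxx), so counting those bytes gives the character count without any wide-character conversion. The function name below is made up for illustration, and the sketch assumes the input is well-formed UTF-8; wc would still need to decide how to handle invalid sequences.

```python
def count_utf8_chars(data: bytes) -> int:
    """Count code points in well-formed UTF-8 (hypothetical helper)."""
    # Continuation bytes look like 10xxxxxx, i.e. (b & 0xC0) == 0x80.
    # Every other byte begins a new code point, so count only those.
    return sum((b & 0xC0) != 0x80 for b in data)


# ASCII-only input: byte count and character count agree.
ascii_sample = b"12345\n"
print(len(ascii_sample), count_utf8_chars(ascii_sample))

# Mixed input: multi-byte characters make the two counts differ.
mixed_sample = "na\u00efve caf\u00e9, \u65e5\u672c\u8a9e\n".encode("utf-8")
print(len(mixed_sample), count_utf8_chars(mixed_sample))
```

The same test is a single mask-and-compare per byte in C, which is why it should get much closer to the `wc -c` timings than the per-character iswprint / wcwidth path.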

I can't explain without more digging why Python's string decode('utf-8') is 
better optimised for length calculations.

Cheers,
Phil
