On Sun, 13 May 2018, at 02:55, Peng Yu wrote:
> Hi,
>
> The following example shows that `wc -m` is even slower than the
> equivalent Python code. Can this performance bug be fixed?

I can reproduce the slow wc behaviour with UTF-8 enabled locales.

    $ echo $LANG
    en_GB.UTF-8
    $ seq 1000000 | time -p wc -c
    6888896
    real 0.05
    user 0.00
    sys 0.02
    $ seq 1000000 | time -p wc -m
    6888896
    real 0.60
    user 0.58
    sys 0.00
    $ seq 1000000 | LANG=C time -p wc -m
    6888896
    real 0.05
    user 0.00
    sys 0.02

In the slow case, wc is spending most of its time in iswprint /
wcwidth / iswspace. Perhaps wc could learn a faster method of counting
utf-8 (https://stackoverflow.com/a/7298149); this may be worthwhile as
the trend to utf-8 everywhere marches on.

I can't explain without more digging why Python's string
decode('utf-8') is better optimised for length calculations.

Cheers,
Phil
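
To illustrate the kind of faster counting mentioned above, here is a minimal sketch in Python. It is not what wc does today, and it is not a proposed patch. It relies on one property of UTF-8: continuation bytes always have the bit pattern 10xxxxxx, so counting every byte that is not a continuation byte gives the number of code points. The function name count_utf8_chars is made up for this example, and the approach assumes the input is well-formed UTF-8; invalid sequences would need separate handling.

```python
def count_utf8_chars(data: bytes) -> int:
    """Count code points in well-formed UTF-8 without decoding it."""
    # A byte of the form 10xxxxxx continues a multi-byte sequence.
    # Every other byte starts a new code point, so count only those.
    # Assumption: `data` is valid UTF-8; no validation is done here.
    return sum(1 for b in data if (b & 0xC0) != 0x80)


if __name__ == "__main__":
    sample = "naïve café €5\n".encode("utf-8")
    # Cross-check against Python's own decoder; the two numbers should agree.
    print(count_utf8_chars(sample), len(sample.decode("utf-8")))
```

For pure ASCII input such as the seq output above, every byte starts a code point, so the result equals the byte count, which matches the identical totals reported by wc -c and wc -m.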
