On 2018-05-13 15:05, Philip Rowlands wrote:
In the slow case, wc is spending most of its time in iswprint /
wcwidth / iswspace. Perhaps wc could learn a faster method of counting
utf-8 (https://stackoverflow.com/a/7298149); this may be worthwhile as
the trend to utf-8 everywhere marches on.
I can't explain without more digging why Python's string
decode('utf-8') is better optimised for length calculations.
On the surface, it seems easy to explain: the Python program is
just decoding UTF-8 and then taking the length. None of that
requires character classification and determination of display width.
If "wc -m" is doing something with display width, it's very different
from what the Python code is doing.
What are the requirements underpinning "wc -m", and how do these
iswprint and iswspace functions fit into it?
POSIX says this: "The -c option stands for "character" count,
even though it counts bytes. This stems from the sometimes erroneous
historical view that bytes and characters are the same size.
Due to international requirements, the -m option (reminiscent of
"multi-byte") was added to obtain actual character counts."
I don't see how this amounts to having to call iswspace and all that.
Nowhere does POSIX say that the display width of a character
has to be obtained in "wc", and I don't see that in the GNU
documentation either.