2015-05-13 03:00:48 +0100, Pádraig Brady:
[...]
> Yes. You could filter with sed to adjust:
> 
>          sed 's/././g' | wc -L    # count chars
> LC_ALL=C sed 's/././g' | wc -L    # count bytes
[...]

Note that Unicode code points D800 to DFFF (reserved for UTF-16
encoding) and 110000 to 7FFFFFFF (now that they've given up on
ever having anything above 10FFFF) are not characters.
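
As a rough sketch, those two ranges can be written as a shell check.
The name classify is a hypothetical helper, not part of any tool, and
it only tests the two ranges named above (it takes a code point in
hex, without a prefix):

```sh
# Hypothetical helper: classify a hex code point against the two
# ranges mentioned above. Relies on POSIX arithmetic expansion
# accepting 0x-prefixed constants.
classify() {
    cp=$((0x$1))
    if [ "$cp" -ge $((0xD800)) ] && [ "$cp" -le $((0xDFFF)) ]; then
        echo "surrogate, not a character"
    elif [ "$cp" -gt $((0x10FFFF)) ]; then
        echo "above 10FFFF, not a character"
    else
        echo "outside both excluded ranges"
    fi
}

classify D800
classify 110000
classify E9
```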

Still, GNU sed considers their UTF-8 encodings (as per the
original UTF-8 encoding, before it got limited to 4 bytes)
as characters.

$ printf '\ud800\udfff\U110000\U7fffffff\n' | sed s/././g | wc -L
4

(I'm not sure I'd object to that though).
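
Not every printf accepts \u and \U escapes for those code points, so
here is a sketch that spells the same input out byte by byte. The
encodings were worked out by hand following the original UTF-8
scheme, and sample is a hypothetical helper name:

```sh
# Original-style UTF-8 encodings, written as octal escapes:
#   U+D800     -> ED A0 80
#   U+DFFF     -> ED BF BF
#   U+110000   -> F4 90 80 80
#   U+7FFFFFFF -> FD BF BF BF BF BF
sample() {
    printf '\355\240\200\355\277\277\364\220\200\200\375\277\277\277\277\277\n'
}

# Dump the bytes in hex to check them against the table above.
sample | od -An -tx1
```

Piping sample into the same sed | wc -L pipeline should then
reproduce the test in shells whose printf lacks \u support.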

Other byte sequences that don't form valid characters are not
counted as characters:

$ printf '\x80\xff' | sed s/././g | wc -L
0

-- 
Stephane

