2015-05-13 03:00:48 +0100, Pádraig Brady:
[...]
> Yes. You could filter with sed to adjust:
>
>   sed 's/././g' | wc -L  # count chars
>   LC_ALL=C sed 's/././g' | wc -L  # count bytes
[...]

Note that Unicode code points D800 to DFFF (reserved for UTF-16
encoding) and 110000 to 7FFFFFFF (now that they've given up on
ever having anything above 10FFFF) are not characters.

Still, GNU sed considers their UTF-8 encodings (as per the
original UTF-8 encoding, before it got limited to 4 bytes) as
characters:

$ printf '\ud800\udfff\U110000\U7fffffff\n' | sed s/././g | wc -L
4

(I'm not sure I'd object to that though.)

Other byte sequences that don't form valid characters are not
considered characters:

$ printf '\x80\xff' | sed s/././g | wc -L
0

-- 
Stephane
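
For illustration only, here is a rough sh sketch of the two ranges
mentioned above. It does not reproduce what sed does internally; the
helper names is_scalar and utf8_len are made up for this example, and
it assumes a shell whose $((...)) arithmetic accepts hex constants
(POSIX sh does). It classifies a code point given in hex and reports
how many bytes the original, pre-4-byte-limit UTF-8 scheme would need.

```sh
#!/bin/sh
# Hypothetical helpers, not part of sed or coreutils.

# is_scalar HEX: succeed if the code point is at most 10FFFF and
# outside the UTF-16 surrogate range D800..DFFF.
is_scalar() {
  cp=$((0x$1))
  [ "$cp" -le $((0x10FFFF)) ] &&
    { [ "$cp" -lt $((0xD800)) ] || [ "$cp" -gt $((0xDFFF)) ]; }
}

# utf8_len HEX: print the byte length of the code point under the
# original UTF-8 scheme (1 to 6 bytes, covering 0..7FFFFFFF).
utf8_len() {
  cp=$((0x$1))
  if   [ "$cp" -lt $((0x80)) ];      then echo 1
  elif [ "$cp" -lt $((0x800)) ];     then echo 2
  elif [ "$cp" -lt $((0x10000)) ];   then echo 3
  elif [ "$cp" -lt $((0x200000)) ];  then echo 4
  elif [ "$cp" -lt $((0x4000000)) ]; then echo 5
  else                                    echo 6
  fi
}

# The four code points from the printf example above.
for cp in D800 DFFF 110000 7FFFFFFF; do
  if is_scalar "$cp"; then kind="a character"; else kind="not a character"; fi
  echo "U+$cp: $kind, $(utf8_len "$cp") bytes"
done
```

All four code points from the printf example come out as "not a
character" here, even though each has a well-formed encoding under
the original UTF-8 rules.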
