2015-06-06 21:49:16 +0300, Valdis Vītoliņš:
> Note that UTF-8 characters can be counted by counting bytes with the
> bit patterns 0xxxxxxx or 11xxxxxx:
> https://en.wikipedia.org/wiki/UTF-8#Description
>
> So the general logic should be that if:
> a) the locale setting is UTF-8 (e.g. LANG=xx_XX.UTF-8), or
> b) the first two bytes of the file are 0xFE 0xFF
>    https://en.wikipedia.org/wiki/Byte_order_mark
>
> then count the bytes with bits 0xxxxxxx and 11xxxxxx.
[...]
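For reference, the counting scheme described above amounts to
something like this (a minimal C sketch of my own, not code from the
quoted message):

    #include <stdio.h>

    /* Count every byte that can start a UTF-8 character: ASCII bytes
     * (0xxxxxxx) and lead bytes (11xxxxxx).  Continuation bytes
     * (10xxxxxx) are skipped.  Note there is no validation at all:
     * a lone 0xC0 byte is happily "counted" as a character. */
    int main(void)
    {
        int c;
        unsigned long count = 0;

        while ((c = getchar()) != EOF)
            if ((c & 0x80) == 0 || (c & 0xC0) == 0xC0)
                count++;

        printf("%lu\n", count);
        return 0;
    }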
Except that only valid characters should be counted, and there the
definition of a valid character is not always clear.

At the least, an incorrect UTF-8 sequence can't count as valid
characters, so

    printf '\300' | wc -m

should return 0, as 11000000 alone is not a valid character. So we
can't use your algorithm without first verifying the validity of the
input.

Then the UTF-8 encodings of the UTF-16 surrogates (0xD800 to 0xDFFF)
should probably be excluded as well:

    printf '\355\240\200' | wc -m

should return 0, for instance. And maybe code points above 0x10FFFF
too, now that Unicode seems to have given up on ever assigning
characters above that (probably because of the UTF-16 limitation).

Even in the ranges 0 -> 0xD7FF and 0xE000 -> 0x10FFFF, there are
still thousands of code points that are not assigned in the latest
Unicode version. I suppose we can imagine locale definitions where
each of the known characters is listed and the rest rejected...
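In practice, that means going through the locale's multi-byte
conversion routines instead of counting bit patterns. Here is a rough
sketch of a validity-aware counter using mbrtowc(3) (my illustration,
not what any particular wc implementation does; invalid bytes are
skipped one at a time rather than counted):

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>

    /* Count only the characters that are valid in the current
     * locale's encoding, so printf '\300' | ./count prints 0 in a
     * UTF-8 locale.  Incomplete trailing sequences are not counted. */
    int main(void)
    {
        char buf[4096];
        size_t len = 0, n;
        unsigned long count = 0;
        mbstate_t state;

        setlocale(LC_ALL, "");            /* honour LANG/LC_CTYPE */
        memset(&state, 0, sizeof state);

        while ((n = fread(buf + len, 1, sizeof buf - len, stdin)) > 0) {
            size_t pos = 0;
            len += n;
            while (pos < len) {
                size_t r = mbrtowc(NULL, buf + pos, len - pos, &state);
                if (r == (size_t)-2)      /* incomplete: read more */
                    break;
                if (r == (size_t)-1) {    /* invalid byte: skip it */
                    memset(&state, 0, sizeof state);
                    pos++;
                    continue;
                }
                pos += r ? r : 1;         /* r == 0 is the NUL char */
                count++;
            }
            memmove(buf, buf + pos, len - pos);
            len -= pos;
        }
        printf("%lu\n", count);
        return 0;
    }

With a decent UTF-8 locale definition, mbrtowc() already rejects the
surrogate encodings and anything above 0x10FFFF, so both printf
examples above would give 0.

--
Stephane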