Note that UTF-8 characters can be counted by counting the bytes whose bit pattern is 0xxxxxxx or 11xxxxxx, i.e. every byte that is not a 10xxxxxx continuation byte: https://en.wikipedia.org/wiki/UTF-8#Description
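The counting rule above can be sketched in a few lines of Python. This is only an illustration, not how wc is implemented; the function name count_utf8_chars is made up for the example, and it assumes the input is already valid UTF-8 (it counts code points, not what a user might perceive as a single glyph).

```python
def count_utf8_chars(data: bytes) -> int:
    """Count UTF-8 code points in `data`, assuming the bytes are valid UTF-8.

    Hypothetical helper for illustration only.
    """
    # Bytes of the form 10xxxxxx are continuation bytes. Every other byte
    # (0xxxxxxx or 11xxxxxx) starts a new character, so count those.
    return sum(1 for b in data if (b & 0xC0) != 0x80)


if __name__ == "__main__":
    # 'a' followed by U+0101 (LATIN SMALL LETTER A WITH MACRON):
    # one 1-byte character plus one 2-byte character.
    sample = "a\u0101".encode("utf-8")
    print(len(sample), "bytes,", count_utf8_chars(sample), "characters")
```

Masking with 0xC0 keeps only the top two bits, so a single comparison against 0x80 covers both lead-byte patterns at once.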
So the general logic should be: if
a) the locale setting is UTF-8 (e.g. LANG=xx_XX.UTF-8), or
b) the first three bytes of the file are 0xEF 0xBB 0xBF (the UTF-8 byte order mark) https://en.wikipedia.org/wiki/Byte_order_mark
then count the bytes with bit patterns 0xxxxxxx and 11xxxxxx.

> You mailed submit@debbugs without specifying a Package:, so your bug
> report ended up on the help-debbugs list. I have reassigned it to
> coreutils. (Please note there is no "wc" package.)
>
> (My mailer is messing up the UTF-8 characters in your report.
> Interested parties can see the original at http://debbugs.gnu.org/20751#5 .)
>
> Valdis V toli wrote:
>
> > Version: wc (GNU coreutils) 8.21
> >
> > When 'wc -m' is invoked, it should print the character count, but it
> > counts UTF-8 encoded characters incorrectly. The attached files have
> > 3, 4 and 6 bytes in them, but all have only two UTF-8 encoded
> > characters, which you can see with any modern text editor.
> >
> > wc -c shows the correct number of bytes:
> > wc -c *
> > 3 3bytes.txt
> > 4 4bytes.txt
> > 6 6bytes.txt
> > 13 total
> >
> > But wc -m shows an incorrect number of characters:
> > wc -m *
> > 3 3bytes.txt
> > 3 4bytes.txt
> > 3 6bytes.txt
> > 9 total
> >
> > But it should be:
> > wc -m *
> > 2 3bytes.txt
> > 2 4bytes.txt
> > 2 6bytes.txt
> > 6 total
> >
> > I am using Lubuntu 14.04.2 LTS (lsb_release reports Ubuntu), x86_64
> > GNU/Linux 3.13.0-53-generic kernel
> >
> > P.S.
> > If the attachments do not pass through the system, you can test it by
> > creating files with the following content:
> >
> > 3bytes.txt: aa
> > 4bytes.txt: aā
> > 6bytes.txt: a
> >
> > Attachments at http://debbugs.gnu.org/20751#5
