On Thu, Dec 10, 2015 at 6:09 PM, Paul Gilmartin <[email protected]> wrote:
> On 2015-12-10 16:06, Mike Schwab wrote:
>> https://en.wikipedia.org/wiki/UTF-8
>> B'0.......' is a 8 bit ASCII characters.
>>
> ITYM 7 bit.  (Well, maybe.)

Correct.  8 bits in the byte, with 1 length bit and 7 bits to determine
the ASCII-7 character.

>> B'110.....' is a 16 bit UTF character.
> (Or, perhaps, only Unicode 13.)

Each continuation byte uses 2 bits to mark the byte as a continuation.
So 5 bits to select the code page and 6 bits to select the character,
so only 11 bits of data.

>> B'1110....' is a 24 bit UTF character.
> (Or, perhaps, only Unicode 20.)

Each continuation byte uses 2 bits to mark the byte as a continuation.
So 4 bits to select the code page and 12 bits to select the character,
so only 16 bits of data.

> Etc.
>
>> B'11110...' is a 32 bit UTF character.

Each continuation byte uses 2 bits to mark the byte as a continuation.
So 3 bits to select the code page and 18 bits to select the character,
so only 21 bits of data.

>> B'111110..' could be a 40 bit UTF character (none established).

Each continuation byte uses 2 bits to mark the byte as a continuation.
So 2 bits to select the code page and 24 bits to select the character,
so only 26 bits of data.

>> B'1111110.' could be a 48 bit UTF character (none established).

Each continuation byte uses 2 bits to mark the byte as a continuation.
So 1 bit to select the code page and 30 bits to select the character,
so only 31 bits of data.

>> B'11111110' could be a 56 bit UTF character (none established).

Each continuation byte uses 2 bits to mark the byte as a continuation.
So no bits to select the code page and 36 bits to select the character,
so only 36 bits of data.

>> B'11111111' could be a 64 bit UTF character (none established).

Each continuation byte uses 2 bits to mark the byte as a continuation.
So no bits to select the code page and 42 bits to select the character,
so only 42 bits of data.

>> B'10......' is a continuation UTF character after a previous leading
>> character.
>> B'10000000' is a padding UTF character and should be removed.
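
The bit counts above can be checked with a short Python sketch. This is
only an illustration: the names leading_ones and payload_bits are made up
here, and the 5-byte and longer forms follow the hypothetical extension in
the list above. As far as I know, RFC 3629 limits real UTF-8 to 4-byte
sequences and code points up to U+10FFFF, so only the first four rows
describe byte patterns that can appear in valid UTF-8.

```python
def leading_ones(byte):
    """Count the consecutive 1 bits at the top of an 8-bit value."""
    count = 0
    mask = 0x80
    while mask and (byte & mask):
        count += 1
        mask >>= 1
    return count


def payload_bits(lead):
    """Return (sequence length in bytes, total data bits) for a lead byte.

    Hypothetical helper: it generalises the lead-byte pattern past the
    4-byte limit of standard UTF-8, purely to mirror the list above.
    """
    n = leading_ones(lead)
    if n == 0:
        return 1, 7                      # B'0.......': one byte, 7 data bits
    if n == 1:
        raise ValueError("B'10......' is a continuation byte, not a lead byte")
    lead_bits = max(0, 7 - n)            # bits left in the lead byte after its marker
    return n, lead_bits + 6 * (n - 1)    # each continuation byte carries 6 data bits


for lead in (0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC, 0xFE, 0xFF):
    length, bits = payload_bits(lead)
    print(f"B'{lead:08b}': {length} byte(s), {bits} data bits")

# Cross-check the standard forms against Python's own encoder:
# the largest 11-bit value still fits in 2 bytes, the next one needs 3, and so on.
assert len(chr(0x7FF).encode("utf-8")) == 2
assert len(chr(0x800).encode("utf-8")) == 3
assert len(chr(0xFFFF).encode("utf-8")) == 3
assert len(chr(0x10000).encode("utf-8")) == 4
```

The loop prints one row per lead-byte pattern, giving 7, 11, 16, 21, 26,
31, 36 and 42 data bits in that order.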
>
> -- gil
>
> ----------------------------------------------------------------------
> For IBM-MAIN subscribe / signoff / archive access instructions,
> send email to [email protected] with the message: INFO IBM-MAIN

--
Mike A Schwab, Springfield IL USA
Where do Forest Rangers go to get away from it all?
