On Thu, Dec 10, 2015 at 6:09 PM, Paul Gilmartin
<[email protected]> wrote:
> On 2015-12-10 16:06, Mike Schwab wrote:
>> https://en.wikipedia.org/wiki/UTF-8
>> B'0.......'  is an 8 bit ASCII character.
>>
> ITYM 7 bit.  (Well, maybe.)
Correct.  The byte is 8 bits wide: 1 leading 0 bit marks it as a
single-byte sequence, and the other 7 bits determine the ASCII-7
character.

>> B'110.....' is a 16 bit UTF character.
> (Or, perhaps, only Unicode 13.)
Each continuation byte uses 2 bits to mark the byte as a continuation.
So 5 bits to select the code page and 6 bits to select the character,
so only 11 bits of data.
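
A minimal Python sketch of the two-byte case, for anyone who wants to
check the bit counts by hand.  The bytes X'C3' X'A9' (U+00E9) are my
own choice of example and are not taken from the thread.

```python
# Hand-decode the two-byte sequence C3 A9 (assumed example: U+00E9).
lead, cont = 0xC3, 0xA9

assert lead >> 5 == 0b110   # B'110.....' lead byte
assert cont >> 6 == 0b10    # B'10......' continuation byte

high5 = lead & 0b00011111   # 5 data bits from the lead byte
low6 = cont & 0b00111111    # 6 data bits from the continuation byte
code_point = (high5 << 6) | low6   # 5 + 6 = 11 data bits in total

print(hex(code_point))
# Cross-check against Python's built-in UTF-8 decoder.
print(chr(code_point) == bytes([lead, cont]).decode("utf-8"))
```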

>> B'1110....' is a 24 bit UTF character.
> (Or, perhaps, only Unicode 20.)
Each continuation byte uses 2 bits to mark the byte as a continuation.
So 4 bits to select the code page and 12 bits to select the character,
so only 16 bits of data.

> Etc.
>
>> B'11110...' is a 32 bit UTF character.
Each continuation byte uses 2 bits to mark the byte as a continuation.
So 3 bits to select the code page and 18 bits to select the character,
so only 21 bits of data.
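
The one- to four-byte cases above can be sketched as a single routine.
This is an illustration only: decode_one is a hypothetical helper name,
the sample characters are my own, and the sketch skips the checks a
real decoder needs (overlong forms, surrogates, values above U+10FFFF).

```python
def decode_one(seq: bytes) -> int:
    """Decode one UTF-8 sequence (1 to 4 bytes) into a code point.

    Hypothetical helper for illustration; it does not reject overlong
    forms, surrogates, or values above U+10FFFF.
    """
    first = seq[0]
    if first < 0x80:                 # B'0.......'  7 data bits
        length, value = 1, first
    elif first >> 5 == 0b110:        # B'110.....'  5 data bits
        length, value = 2, first & 0x1F
    elif first >> 4 == 0b1110:       # B'1110....'  4 data bits
        length, value = 3, first & 0x0F
    elif first >> 3 == 0b11110:      # B'11110...'  3 data bits
        length, value = 4, first & 0x07
    else:
        raise ValueError("not a valid lead byte")
    if len(seq) != length:
        raise ValueError("wrong sequence length")
    for byte in seq[1:]:
        if byte >> 6 != 0b10:        # each trailing byte must be B'10......'
            raise ValueError("expected a continuation byte")
        value = (value << 6) | (byte & 0x3F)   # append 6 data bits
    return value


for text in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    encoded = text.encode("utf-8")
    print(len(encoded), hex(decode_one(encoded)))
```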

>> B'111110..' could be a 40 bit UTF character (none established).
Each continuation byte uses 2 bits to mark the byte as a continuation.
So 2 bits to select the code page and 24 bits to select the character,
so only 26 bits of data.

>> B'1111110.' could be a 48 bit UTF character (none established).
Each continuation byte uses 2 bits to mark the byte as a continuation.
So 1 bit to select the code page and 30 bits to select the character,
so only 31 bits of data.

>> B'11111110' could be a 56 bit UTF character (none established).
Each continuation byte uses 2 bits to mark the byte as a continuation.
So no bits to select the code page and 36 bits to select the
character, so only 36 bits of data.

>> B'11111111' could be a 64 bit UTF character (none established).
Each continuation byte uses 2 bits to mark the byte as a continuation.
So no bits to select the code page and 42 bits to select the
character, so only 42 bits of data.  (Hypothetical in any case: the
bytes X'FE' and X'FF' never appear in valid UTF-8.)
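
The arithmetic in this list can be reproduced with a short Python
sketch.  It assumes the scheme described in this thread, where a lead
byte beginning with n one bits is followed by n - 1 continuation
bytes; data_bits is a hypothetical helper name.

```python
def data_bits(leading_ones: int) -> int:
    """Data bits for a lead byte that starts with `leading_ones` 1 bits.

    Assumes the thread's scheme: one lead byte plus (leading_ones - 1)
    continuation bytes, each carrying 6 data bits.
    """
    # Bits left in the lead byte after the run of 1s and the 0 that
    # ends it (nothing is left once the run reaches 7 or 8 bits).
    lead_bits = max(0, 7 - leading_ones)
    return lead_bits + 6 * (leading_ones - 1)


for n in range(2, 9):
    print(f"{n} bytes, {8 * n} bits stored, {data_bits(n)} data bits")
```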

>> B'10......' is a continuation UTF character after a previous leading 
>> character.
>> B'10000000' is a padding UTF character and should be removed.
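On reflection, B'10000000' looks like an ordinary continuation byte
whose six data bits happen to be zero, so removing it would corrupt
the text.  A rough Python check, using three sample characters of my
own choosing and a hypothetical helper survives_stripping:

```python
def survives_stripping(ch: str) -> bool:
    """Return True if ch still decodes to itself after every X'80'
    byte is removed from its UTF-8 encoding."""
    stripped = ch.encode("utf-8").replace(b"\x80", b"")
    try:
        return stripped.decode("utf-8") == ch
    except UnicodeDecodeError:
        return False


# U+0080, U+0100 and U+2019 all contain the byte X'80' when encoded.
for ch in ("\u0080", "\u0100", "\u2019"):
    print(ch.encode("utf-8").hex(), survives_stripping(ch))
```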
>
> -- gil
>
> ----------------------------------------------------------------------
> For IBM-MAIN subscribe / signoff / archive access instructions,
> send email to [email protected] with the message: INFO IBM-MAIN



-- 
Mike A Schwab, Springfield IL USA
Where do Forest Rangers go to get away from it all?
