Re: UTF-8

Shmuel Metz (Seymour J.) Sat, 12 Dec 2015 16:57:06 -0800

In
<CAJTOO5-s0LoKFyQA0S3L8zkTbngP6q1LcUA_W_iosD4gT4r=g...@mail.gmail.com>,
on 12/11/2015
   at 09:17 AM, Mike Schwab <[email protected]> said:


>On Thu, Dec 10, 2015 at 6:09 PM, Paul Gilmartin
><[email protected]> wrote:
>> On 2015-12-10 16:06, Mike Schwab wrote:
>>> https://en.wikipedia.org/wiki/UTF-8
>>> B'0.......'  is a 8 bit ASCII characters.
>>>
>> ITYM 7 bit.  (Well, maybe.)
>Correct.  8 bits of data, with 1 length bit and 7 bits to determine
>the ASCII-7 character.

>>> B'110.....' is a 16 bit UTF character.
>> (Or, perhaps, only Unicode 13.)
>Each continuation byte uses 2 bits to mark the byte as a
>continuation. So 5 bits to select the code page and 6 bits to select
>the character, so only 11 bits of data.

>>> B'1110....' is a 24 bit UTF character.
>> (Or, perhaps, only Unicode 20.)

For a UTF-8 sequence beginning with B'1110', only 16 bits are encoded.
You need a sequence starting with B'11110' to encode 21 bits, which is
the largest that RFC 3629 allows. The earlier RFC 2279 allowed longer
sequences beginning with B'111110' and B'1111110'.
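
As a rough illustration of the bit counts above, here is a minimal
Python sketch that maps a lead octet to its sequence length and payload
width. The function name utf8_sequence_info is hypothetical, and it
classifies by bit pattern only; it does not reject lead octets that
RFC 3629 forbids for other reasons (0xC0, 0xC1, 0xF5-0xF7).

```python
def utf8_sequence_info(lead):
    """Return (octets in sequence, payload bits) for a UTF-8 lead octet.

    Hypothetical helper for illustration; it looks only at the lead
    octet's bit pattern and follows the RFC 3629 limit of four octets.
    """
    if lead < 0x80:              # B'0.......': 7 bits
        return 1, 7
    if 0xC0 <= lead <= 0xDF:     # B'110.....': 5 + 6 bits
        return 2, 5 + 6
    if 0xE0 <= lead <= 0xEF:     # B'1110....': 4 + 6 + 6 bits
        return 3, 4 + 2 * 6
    if 0xF0 <= lead <= 0xF7:     # B'11110...': 3 + 6 + 6 + 6 bits
        return 4, 3 + 3 * 6
    # B'10......' is a continuation octet; B'111110..' and B'1111110.'
    # are the longer forms that RFC 3629 no longer allows.
    raise ValueError("not a valid UTF-8 lead octet: 0x%02X" % lead)


for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    lead = ch.encode("utf-8")[0]
    print("U+%04X" % ord(ch), "lead 0x%02X" % lead, utf8_sequence_info(lead))
```

Each continuation octet contributes 6 bits, so the totals are 7, 11, 16
and 21 bits for one through four octets.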

For UTF-16, RFC 2781 allows encoding 16 bits in two octets and 20 bits
in four octets; note that the code points used for surrogate pairs are
reserved and cannot appear as characters in valid Unicode.
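
The 20-bit case can be sketched the same way. This is only an
illustration, with a hypothetical helper name (to_surrogate_pair): a
code point above U+FFFF has 0x10000 subtracted, and the remaining 20
bits are split 10/10 across a high and a low surrogate.

```python
def to_surrogate_pair(cp):
    """Split a supplementary code point into a UTF-16 surrogate pair.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point: U+%04X" % cp)
    v = cp - 0x10000                 # 20-bit value
    high = 0xD800 | (v >> 10)        # top 10 bits
    low = 0xDC00 | (v & 0x3FF)       # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print("%04X %04X" % (high, low))

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
assert "\U0001F600".encode("utf-16-be") == bytes(
    [high >> 8, high & 0xFF, low >> 8, low & 0xFF])
```

Because D800 through DFFF are set aside for this purpose, Python's
strict UTF-8 encoder refuses a lone surrogate such as "\ud800".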
 
-- 
     Shmuel (Seymour J.) Metz, SysProg and JOAT
     ISO position; see <http://patriot.net/~shmuel/resume/brief.html> 
We don't care. We don't have to care, we're Congress.
(S877: The Shut up and Eat Your spam act of 2003)

