In <CAJTOO5-s0LoKFyQA0S3L8zkTbngP6q1LcUA_W_iosD4gT4r=g...@mail.gmail.com>, on 12/11/2015 at 09:17 AM, Mike Schwab <[email protected]> said:
>On Thu, Dec 10, 2015 at 6:09 PM, Paul Gilmartin
><[email protected]> wrote:
>> On 2015-12-10 16:06, Mike Schwab wrote:
>>> https://en.wikipedia.org/wiki/UTF-8
>>> B'0.......' is a 8 bit ASCII characters.
>>>
>> ITYM 7 bit. (Well, maybe.)
>Correct. 8 bits of data, with 1 length bit and 7 bits to determine
>the ASCII-7 character.
>>> B'110.....' is a 16 bit UTF character.
>> (Or, perhaps, only Unicode 13.)
>Each continuation byte uses 2 bits to mark the byte as a
>continuation. So 5 bits to select the code page and 6 bits to select
>the character, so only 11 bits of data.
>>> B'1110....' is a 24 bit UTF character.
>> (Or, perhaps, only Unicode 20.)

For a UTF-8 sequence beginning with B'1110', only 16 bits are encoded.
You need a sequence starting with B'11110' to encode 21 bits, which is
the largest that RFC 3629 allows. An earlier RFC allowed longer
sequences beginning with B'111110' and B'1111110'.

For UTF-16, the corresponding RFC allows encoding 16 bits in two octets
and 20 bits in four octets; note that the surrogate code points are
reserved and cannot appear in valid Unicode.

-- 
Shmuel (Seymour J.) Metz, SysProg and JOAT
ISO position; see <http://patriot.net/~shmuel/resume/brief.html>
We don't care. We don't have to care, we're Congress.
(S877: The Shut up and Eat Your spam act of 2003)

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to [email protected] with the message: INFO IBM-MAIN
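
To make the bit counts in this thread concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function names utf8_payload_bits and utf16_surrogates are made up for this note and do not come from any RFC or library. The first maps a UTF-8 lead byte to the number of code point bits its sequence carries under the RFC 3629 limits; the second splits a code point above U+FFFF into its UTF-16 surrogate pair.

```python
def utf8_payload_bits(lead: int) -> int:
    """Return how many code point bits a UTF-8 sequence carries,
    judged only from its lead byte (hypothetical helper)."""
    if lead & 0x80 == 0x00:   # B'0.......'  one octet:    7 bits
        return 7
    if lead & 0xE0 == 0xC0:   # B'110.....'  two octets:   5 + 6
        return 11
    if lead & 0xF0 == 0xE0:   # B'1110....'  three octets: 4 + 6 + 6
        return 16
    if lead & 0xF8 == 0xF0:   # B'11110...'  four octets:  3 + 6 + 6 + 6
        return 21
    # Continuation bytes (B'10......') and the longer lead patterns
    # B'111110..' and B'1111110.' are rejected here.
    raise ValueError(f"not a valid lead byte: {lead:#04x}")


def utf16_surrogates(cp: int) -> tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into a high and a low
    surrogate, each carrying 10 of the 20 bits (hypothetical helper)."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    v = cp - 0x10000                       # 20-bit value
    return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)
```

Checking the lead byte that Python's own encoder produces, for example utf8_payload_bits("\u20ac".encode("utf-8")[0]), is one way to compare the sketch against a real implementation.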
