Re: utf-8 encoding scheme

Jeu George Thu, 13 Jul 2000 02:32:18 -0700


On 12 Jul 2000, H. Peter Anvin wrote:

> Followup to:  <[EMAIL PROTECTED]>
> By author:    Jeu George <[EMAIL PROTECTED]>
> In newsgroup: linux.utf8
> >
> > 
> > Hello,
> > 
> >     The UTF-8 encoding scheme goes like this:
> > 
> >   1-byte characters 0xxxxxxx
> >   2-byte characters 110xxxxx 10xxxxxx
> >   3-byte characters 1110xxxx 10xxxxxx 10xxxxxx
> > 
> 
> 4-byte characters     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
> 5-byte characters     111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
> 6-byte characters     1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
> 
> > Here the bits marked x are used for the actual encoding of characters.
> > I would like to know how these bits are used to encode a particular
> > character. Also, is this dependent on the operating system? Can you
> > provide a program that checks this, or any link that provides
> > information about this?
> 
> The bits are encoded big-endian (MSB first), i.e. the way you would
> read the bits when written in the above form.
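
As a rough illustration of the MSB-first layout described above, here is a minimal Python sketch (not part of the original thread) that packs a code point into the byte patterns from the table. The names utf8_encode and bits are made up for this example. It follows the 1- to 6-byte forms quoted in this thread; note that the current UTF-8 definition (RFC 3629) stops at 4 bytes and U+10FFFF.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using the shortest form from the table.

    Hypothetical helper for illustration; it does not reject surrogates.
    """
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError("code point out of range for the 6-byte scheme")
    if cp < 0x80:
        return bytes([cp])  # 0xxxxxxx: plain ASCII, including NUL

    # (exclusive upper limit, sequence length, lead-byte marker bits)
    forms = [
        (0x800, 2, 0xC0),       # 110xxxxx
        (0x10000, 3, 0xE0),     # 1110xxxx
        (0x200000, 4, 0xF0),    # 11110xxx
        (0x4000000, 5, 0xF8),   # 111110xx
        (0x80000000, 6, 0xFC),  # 1111110x
    ]
    for limit, length, marker in forms:
        if cp < limit:
            break

    out = []
    # Peel off the low 6 bits for each continuation byte (10xxxxxx),
    # working from the last byte back towards the first.
    for _ in range(length - 1):
        out.append(0x80 | (cp & 0x3F))
        cp >>= 6
    # Whatever is left is the most significant part; it goes in the lead byte.
    out.append(marker | cp)
    return bytes(reversed(out))


def bits(data: bytes) -> str:
    """Show bytes the way the table writes them."""
    return " ".join(f"{b:08b}" for b in data)


print(bits(utf8_encode(0x4B)))  # K
print(bits(utf8_encode(0xE9)))  # e with acute accent
```

For U+00E9 (binary 00011 101001) this prints 11000011 10101001: the most significant payload bits sit in the first byte, the low six bits in the second. The layout is defined byte by byte, so it does not vary with the host's byte order.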



> 
> It is also very important to realize that ONLY THE SHORTEST POSSIBLE
> SEQUENCE IS LEGAL.  This is incredibly important, since any misguided

*********************************************************
Could you elaborate on this with some examples or something?
The example given below was not clear to me.
Thanks for the help anyway.
*********************************************************


For a two-byte character, where will the MSB be: in the 4th bit from the
left of the first byte, or in the 3rd bit from the left of the second byte?

Will this be OS dependent, i.e. the arrangement of bits?

How is the null character going to be encoded? 00000000?
But you have mentioned something else below.
I thought that all the ASCII characters would be retained in UTF-8.
That is the major reason why 1-byte characters will always have the
MSB as 0. Am I right?



> attempt to "be liberal in what you accept" without addition of an
> explicit canonicalization step would lead to the kind of security
> holes that Microsoft web-related applications have been so full of,
> because MS operating systems have way too many ways to say the same
> thing.
> 
> Thus, the character K <U+004B> is encoded as:
> 
>       01001011
> 
> The alternate spelling
> 
>       11000001 10001011
> 
> ... is not the character K <U+004B> but INVALID SEQUENCE.  One
> possible thing to do in a decoder is to emit U+FFFD REPLACEMENT
> CHARACTER on encountering illegal sequences.
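
To make the shortest-form rule concrete, here is a minimal decoder sketch in Python (again not from the original thread). The names decode_strict, sequence_length and MIN_FOR_LENGTH are invented for this example. It assumes one particular policy: a bad lead byte or a broken sequence yields one U+FFFD and skips one byte, while a well-formed but overlong sequence yields one U+FFFD for the whole sequence. Other policies are possible. Values above U+10FFFF are also replaced, because Python strings cannot hold them.

```python
REPLACEMENT = "\ufffd"

# Smallest code point that genuinely needs each sequence length.
# Anything smaller could have been written in fewer bytes.
MIN_FOR_LENGTH = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000,
                  5: 0x200000, 6: 0x4000000}


def sequence_length(lead: int) -> int:
    """Length announced by a lead byte, or 0 if it cannot start a sequence."""
    if lead < 0x80:
        return 1
    patterns = ((2, 0xC0, 0xE0), (3, 0xE0, 0xF0), (4, 0xF0, 0xF8),
                (5, 0xF8, 0xFC), (6, 0xFC, 0xFE))
    for length, marker, mask in patterns:
        if (lead & mask) == marker:
            return length
    return 0  # stray continuation byte, or 0xFE / 0xFF


def decode_strict(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        length = sequence_length(data[i])
        tail = data[i + 1:i + length]
        # Reject bad lead bytes, truncated input, and non-10xxxxxx tails.
        if (length == 0 or len(tail) != length - 1
                or any((b & 0xC0) != 0x80 for b in tail)):
            out.append(REPLACEMENT)
            i += 1
            continue
        # Keep only the x bits of the lead byte, then append 6 bits per tail byte.
        cp = data[i] if length == 1 else data[i] & (0x7F >> length)
        for b in tail:
            cp = (cp << 6) | (b & 0x3F)
        if cp < MIN_FOR_LENGTH[length] or cp > 0x10FFFF:
            out.append(REPLACEMENT)  # overlong spelling, or beyond Python's range
        else:
            out.append(chr(cp))
        i += length
    return "".join(out)


print(decode_strict(b"\x4b") == "K")
print(decode_strict(b"\xc1\x8b") == REPLACEMENT)
```

The two-byte spelling 11000001 10001011 carries the value 0x4B, which fits in one byte, so the sketch reports it as U+FFFD instead of K. A decoder that silently accepted it would let the same character slip past any check that only looks for the byte 01001011.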

What exactly is this U+FFFD substitution about? Could you elaborate on
this as well?


Regards,
Jeu



> 
>       -hpa
> 
> -- 
> <[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
> "Unix gives you enough rope to shoot yourself in the foot."
> http://www.zytor.com/~hpa/puzzle.txt
> -
> Linux-UTF8:   i18n of Linux on all levels
> Archive:      http://mail.nl.linux.org/lists/
> 
