Wolfgang Engelmann wrote:

<snip>

> SuSE Linux uses the Unicode standard in the code form UTF-8 (I am not quite 
> sure here whether I understand completely the difference between 
> Unicode-Standard ISO-10646 and UTF-8)

Someone can correct me if I'm wrong, but I believe that Unicode is a
standard that assigns a number (a "code point") to each character in a
great many languages and symbol sets; the familiar 2-byte-per-character
form is just one way of storing those numbers. UTF-8 is a method of
representing Unicode where the most common characters, such as the
ASCII character set, are represented using one byte each.  The
consequence of this is that some less frequently used Unicode
characters actually take more bytes (up to six bytes for the largest
UCS values) to represent in UTF-8.

In general, however, for English and most western languages, UTF-8
allows for Unicode representation without the cost of doubled storage size.
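A quick check in Python 3 illustrates the point (the sample strings are my own, not from the original discussion): ASCII-heavy text costs no extra bytes in UTF-8, while a 2-byte-per-character encoding such as UTF-16 doubles it, and a non-ASCII character pays only a small price.

```python
# ASCII text: one byte per character in UTF-8, two in UTF-16.
text = "hello, world"
print(len(text.encode("utf-8")))      # 12 bytes
print(len(text.encode("utf-16-be")))  # 24 bytes

# A non-ASCII character takes a couple of bytes in UTF-8:
print(len("é".encode("utf-8")))       # 2 bytes
```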

Here's a synopsis: (taken from
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8)

UTF-8 has the following properties:

    * UCS characters U+0000 to U+007F (ASCII) are encoded simply as
bytes 0x00 to 0x7F (ASCII compatibility). This means that files and
strings which contain only 7-bit ASCII characters have the same encoding
under both ASCII and UTF-8.
    * All UCS characters >U+007F are encoded as a sequence of several
bytes, each of which has the most significant bit set. Therefore, no
ASCII byte (0x00-0x7F) can appear as part of any other character.
    * The first byte of a multibyte sequence that represents a non-ASCII
character is always in the range 0xC0 to 0xFD and it indicates how many
bytes follow for this character. All further bytes in a multibyte
sequence are in the range 0x80 to 0xBF. This allows easy
resynchronization and makes the encoding stateless and robust against
missing bytes.
    * All possible 2^31 UCS codes can be encoded.
    * UTF-8 encoded characters may theoretically be up to six bytes
long; however, 16-bit BMP characters are only up to three bytes long.
    * The sorting order of big-endian UCS-4 byte strings is preserved.
    * The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.

The following byte sequences are used to represent a character. The
sequence to be used depends on the Unicode number of the character:

U-00000000 - U-0000007F:        0xxxxxxx
U-00000080 - U-000007FF:        110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF:        1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF:        11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF:        111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF:        1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The xxx bit positions are filled with the bits of the character code
number in binary representation. The rightmost x bit is the
least-significant bit. Only the shortest possible multibyte sequence
which can represent the code number of the character can be used. Note
that in multibyte sequences, the number of leading 1 bits in the first
byte is identical to the number of bytes in the entire sequence.

--- End synopsis ---
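The byte-sequence table in the synopsis can be turned into a short sketch and checked against Python 3's built-in UTF-8 encoder (the function name `utf8_encode` is my own, not something from the original page):

```python
def utf8_encode(cp):
    """Encode a single code point per the UTF-8 bit-pattern table."""
    if cp <= 0x7F:                      # U-00000000 - U-0000007F
        return bytes([cp])
    elif cp <= 0x7FF:                   # U-00000080 - U-000007FF
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    elif cp <= 0xFFFF:                  # U-00000800 - U-0000FFFF
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    else:                               # U-00010000 - U-001FFFFF
        # (the five- and six-byte forms follow the same pattern)
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])

for ch in "Aé€":
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
```

Note how the number of leading 1 bits in the first byte (0, 110, 1110, 11110) matches the number of bytes in each sequence, just as the synopsis says.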

So, in essence, UTF-8 acts like a simple compression scheme for the
Unicode character set: common characters get short encodings, rarer
ones get longer encodings.
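The structural properties from the synopsis (ASCII bytes appear only as themselves, continuation bytes fall in 0x80-0xBF, leading bytes in 0xC0-0xFD, and 0xFE/0xFF never occur) can be spot-checked on any string; the sample text here is my own choice:

```python
data = "Grüße, 世界".encode("utf-8")

for b in data:
    if b < 0x80:
        kind = "ASCII byte (appears only as itself)"
    elif b <= 0xBF:
        kind = "continuation byte (0x80-0xBF)"
    else:
        kind = "leading byte of a multibyte sequence (0xC0-0xFD)"
    print(f"0x{b:02X}: {kind}")

# 0xFE and 0xFF never appear in valid UTF-8 output.
assert 0xFE not in data and 0xFF not in data
```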

-- 
Wil Hunt
Geek in training.
Jack of few trades, master of none.
