Re: [Mediawiki-l] Web page source - "strange" characters

Platonides Wed, 24 Mar 2010 17:27:12 -0700

lmhelp2 wrote:
> 
> ----------------------------------------------------------------------
> Hi Alexis,
> 
> Thank you, I hadn't realized...
> and "Platonides"'s post explains why...!
> 
> ----------------------------------------------------------------------
> Hi Platonides,
> 
> Thanks a lot for your explanations and examples!
> 
> Line 1: "E       t       o      i        l       é             <space>" 
> Line 2:  0x45  0x74  0x6f  0x69  0x6c  0xe9         0x20
> Line 3:  0x45  0x74  0x6f  0x69  0x6c  0xc3 0xa9  0x20
> 
> Do we say:
> 
> ----- "Line 2" is the "iso-8859-1" representation of "Line 1"?


Yes.

> ----- "Line 3" is the "utf-8" representation of "Line 1"?
Yes.


> ----- "Line 2" and "Line 3" are made of codepoints?
>
Lines 2 and 3 are textual representations (hex codes) of the bytes that
Line 1 would be written as in each of those encodings.
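
If you want to reproduce Lines 2 and 3 yourself, here is a minimal
Python 3 sketch. The helper name hex_bytes is made up for this example;
str.encode is the standard library call doing the real work.

```python
def hex_bytes(text, encoding):
    """Return the bytes of `text` in `encoding` as space-separated hex codes."""
    return " ".join(f"0x{b:02x}" for b in text.encode(encoding))

word = "Etoil\u00e9 "  # "Etoilé" followed by a space (7 characters)

line2 = hex_bytes(word, "iso-8859-1")  # one byte per character
line3 = hex_bytes(word, "utf-8")       # é becomes two bytes
print(line2)
print(line3)
```

The two printed rows should match Line 2 and Line 3 from your message.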

A codepoint is a number assigned to a character (the glyph is just how
it gets drawn). The character 'capital A' has the codepoint 65 by
convention. We could all have agreed instead to give it the codepoint 1,
or 25.
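
A small Python 3 sketch of the same idea, using only the built-ins ord
and chr. The variable names are just for illustration.

```python
word = "Etoil\u00e9 "  # "Etoilé" followed by a space

# ord() maps a character to its codepoint, chr() goes the other way.
codepoints = [ord(ch) for ch in word]
capital_a = ord("A")
e_acute = chr(0xE9)

print(codepoints)       # one number per character, no encoding involved
print(capital_a)
print(e_acute)
```

Note that no encoding is named anywhere in it: codepoints exist before
you decide how to store them as bytes.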

> Question: shouldn't we have 7 * 2 "codepoints" instead of 8?
> Maybe you omitted them, didn't you?

We have 7 codepoints, one per "letter". Note that this is independent of
the encoding.
If you are wondering why utf-8 uses 8 bytes instead of 14 (as would have
been used by utf-16), that's the beauty of utf-8. It uses only one byte
(like ASCII) for basic letters, two for letters with diacritics, Greek,
Hebrew..., which are generally used less frequently, three bytes for
characters that are much less frequent (like €), and four for really
odd ones, like Egyptian hieroglyphs.
So it is quite compact, while still allowing the full Unicode.
There are other representations, like UCS-4, which are easier to
understand (four bytes per character) but waste a lot of space on text
that is mostly basic letters.
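
To see those sizes for yourself, a rough Python 3 sketch follows. The
four sample characters are my own picks (U+13000 is one of the Egyptian
hieroglyphs), and I use the "-le" codec variants so that no byte order
mark gets counted.

```python
# One sample character per UTF-8 length class.
samples = ["A", "\u00e9", "\u20ac", "\U00013000"]
utf8_sizes = {f"U+{ord(ch):04X}": len(ch.encode("utf-8")) for ch in samples}
print(utf8_sizes)

# Total size of the 7-character example in three encodings.
word = "Etoil\u00e9 "
totals = {enc: len(word.encode(enc))
          for enc in ("utf-8", "utf-16-le", "utf-32-le")}
print(totals)
```

utf-32-le here plays the role of UCS-4: always four bytes per character.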



> ----- "Line 1" is made of characters?

Yes. But "character" is often taken as a synonym of "byte", which in
this thread it is not.


> Let's consider:
> 
> Line 1: "E          t          o          i          l          é          <space>"
> Line 4:  0x00 0x45  0x00 0x74  0x00 0x6f  0x00 0x69  0x00 0x6c  0x00 0xe9  0x00 0x20
> Line 5:  0x45 0x00  0x74 0x00  0x6f 0x00  0x69 0x00  0x6c 0x00  0xe9 0x00  0x20 0x00
> 
> ----- Is "Line 4" the "utf-16 BE" representation of "Line 1"?
> ----- Is "Line 5" the "utf-16 LE" representation of "Line 1"?

Yes and yes.
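
You can check this with a short Python 3 sketch (variable names are
mine). One caveat I'm adding: Python's plain "utf-16" codec prepends a
two-byte byte order mark, so the explicit "-be" / "-le" names are used
to get exactly your Lines 4 and 5.

```python
word = "Etoil\u00e9 "  # "Etoilé" followed by a space

line4 = word.encode("utf-16-be")  # high-order byte first
line5 = word.encode("utf-16-le")  # low-order byte first
with_bom = word.encode("utf-16")  # same data plus a 2-byte BOM in front

print(line4.hex(" "))
print(line5.hex(" "))
print(len(line4), len(line5), len(with_bom))
```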


> Can you tell me where to find the various tables which 
> allow one to find a given representation ("iso-8859-1", 
> "utf-8", "utf-16 BE", "utf-16 LE") for a given "character"?

You may find this app useful http://www.ltg.ed.ac.uk/~richard/utf-8.cgi

> I mean, how did you know that:
> - 0xe9 is the "iso-8859-1" representation of é?
You indirectly told me when mentioning the %E9 :)

> - 0xc3 0xa9 is the "utf-8" representation of é?
I did   echo é | hd  in a utf-8 terminal.

> - 0x00 0xe9 is the "utf-16 BE" representation of é?
> - 0xe9 0x00 is the "utf-16 LE" representation of é?

For low values (below 0x10000), utf-16 is simply the codepoint number
stored in two bytes. For characters like these, whose codepoint fits in
one byte, you end up with the hex code of the codepoint plus a null byte
(the high-order byte is 0).
If you store the number in Big Endian, the high byte appears first;
in Little Endian it appears last.
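
In Python 3 terms, this is just writing an integer as two bytes in one
order or the other. A sketch using int.to_bytes (the variable names are
only for illustration):

```python
cp = ord("\u00e9")            # codepoint of é, i.e. 0xE9

be = cp.to_bytes(2, "big")    # high-order byte (0x00) first
le = cp.to_bytes(2, "little") # low-order byte (0xE9) first

print(be.hex(" "), le.hex(" "))

# For a codepoint this low, that is exactly what the utf-16 codecs produce.
same_be = be == "\u00e9".encode("utf-16-be")
same_le = le == "\u00e9".encode("utf-16-le")
print(same_be, same_le)
```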

UCS-2 keeps the codepoint in two bytes and simply stores it (in big
endian or little endian). Since that restricts the characters you can
use (what, I can't store Phoenician in ucs-2??), utf-16 reserves some
special values (the surrogates), which are combined in pairs that take
four bytes instead of two and provide the full Unicode.
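
For the curious, here is a sketch of how a surrogate pair is computed,
in Python 3. The function name surrogate_pair is hypothetical, and the
example character is my choice (U+10900, the first Phoenician letter);
the arithmetic is the standard utf-16 rule.

```python
def surrogate_pair(codepoint):
    """Split a codepoint above 0xFFFF into its two utf-16 surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only codepoints above 0xFFFF need a surrogate pair")
    offset = codepoint - 0x10000          # leaves a 20-bit number
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low

high, low = surrogate_pair(0x10900)
print(hex(high), hex(low))

# Cross-check against Python's own utf-16 encoder.
manual = high.to_bytes(2, "big") + low.to_bytes(2, "big")
builtin = "\U00010900".encode("utf-16-be")
print(manual == builtin)
```

So one Phoenician letter costs four bytes in utf-16, and cannot be
written in ucs-2 at all.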


> (Apart from the fact that you are a super-pro :) of course).

Hehe, thanks :)

> Please tell me if I misunderstood something and correct me if I
> didn't use the proper terminology :) .



_______________________________________________
MediaWiki-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
