lmhelp2 wrote:
>
> ----------------------------------------------------------------------
> Hi Alexis,
>
> Thank you, I hadn't realized...
> and "Platonides"'s post explains why...!
>
> ----------------------------------------------------------------------
> Hi Platonides,
>
> Thanks a lot for your explanations and examples!
>
> Line 1: "E t o i l é <space>"
> Line 2: 0x45 0x74 0x6f 0x69 0x6c 0xe9 0x20
> Line 3: 0x45 0x74 0x6f 0x69 0x6c 0xc3 0xa9 0x20
>
> Do we say:
>
> ----- "Line 2" is the "iso-8859-1" representation of "Line 1"?
Yes.
> ----- "Line 3" is the "utf-8" representation of "Line 1"?
Yes.
> ----- "Line 2" and "Line 3" are made of codepoints?
>
Lines 2 and 3 are textual representations (the hex codes) of the bytes
Line 1 would be written as in each of those encodings.
A codepoint is a number assigned to a character. The character
'capital A' has the codepoint 65 by convention. We could all have agreed
instead to give it the codepoint 1, or 25.
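
To make the difference concrete, here is a small Python sketch (my own
illustration, not something from the earlier posts; the helper name
hex_bytes is made up). It lists the codepoints of Line 1 and then the
bytes each encoding produces:

```python
# "Etoilé " written with an escape so the file's own encoding does not matter.
text = "Etoil\u00e9 "

# Codepoints: one number per character, independent of any encoding.
codepoints = [ord(ch) for ch in text]


def hex_bytes(data: bytes) -> str:
    """Format bytes the way Lines 2 and 3 are written (hypothetical helper)."""
    return " ".join(f"0x{b:02x}" for b in data)


latin1 = hex_bytes(text.encode("iso-8859-1"))
utf8 = hex_bytes(text.encode("utf-8"))

print(codepoints)
print(latin1)
print(utf8)
```

The codepoint list has 7 entries in both cases; only the number of bytes
changes with the encoding.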
> Question: shouldn't we have 7 * 2 "codepoints" instead of 8?
> Maybe you omitted them, didn't you?
We have 7 codepoints, one per "letter". Note that this is independent of
the encoding.
If you are wondering why utf-8 uses 8 bytes instead of 14 (as utf-16
would), that's the beauty of utf-8. It uses only one byte (like ASCII)
for basic letters, two for letters with diacritics, Greek, Hebrew...,
which are generally used less frequently, three bytes for much less
frequent characters (like €), and four for really odd ones, like
Egyptian hieroglyphs.
So it is quite compact, while still covering the full Unicode range.
There are other representations, like UCS-4, that are easier to
understand (four bytes per character) but waste a lot of space.
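
If you want to check those sizes yourself, a quick Python sketch like the
following should do (the sample characters are my own picks; "utf-32-be"
is used as a stand-in for UCS-4):

```python
# One sample character per utf-8 length class.
samples = {
    "capital A": "A",                 # U+0041, plain ASCII
    "e acute": "\u00e9",              # U+00E9, letter with a diacritic
    "euro sign": "\u20ac",            # U+20AC
    "hieroglyph": "\U00013000",       # U+13000, Egyptian hieroglyph block
}

# Map each name to (bytes in utf-8, bytes in utf-32).
sizes = {
    name: (len(ch.encode("utf-8")), len(ch.encode("utf-32-be")))
    for name, ch in samples.items()
}

for name, (utf8_len, utf32_len) in sizes.items():
    print(f"{name}: utf-8 = {utf8_len} byte(s), utf-32 = {utf32_len} bytes")
```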
> ----- "Line 1" is made of characters?
Yes. But "character" is often taken as a synonym of "byte", which in
this thread it is not.
> Let's consider:
>
> Line 1: "E t o i l é <space>"
> Line 4: 0x00 0x45 0x00 0x74 0x00 0x6f 0x00 0x69 0x00 0x6c 0x00 0xe9
> 0x00 0x20
> Line 5: 0x45 0x00 0x74 0x00 0x6f 0x00 0x69 0x00 0x6c 0x00 0xe9 0x00
> 0x20 0x00
>
> ----- Is "Line 4" the "utf-16 BE" representation of "Line 1"?
> ----- Is "Line 5" the "utf-16 LE" representation of "Line 1"?
Yes and yes.
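
You can reproduce Lines 4 and 5 with a few lines of Python (just a
sketch; the helper name hex_bytes is made up). Note that the plain
"utf-16" codec would also prepend a byte order mark, so the explicit
"-be" / "-le" variants are used here:

```python
text = "Etoil\u00e9 "  # "Etoilé " with the trailing space


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated hex codes (hypothetical helper)."""
    return " ".join(f"0x{b:02x}" for b in data)


utf16_be = hex_bytes(text.encode("utf-16-be"))  # high byte first
utf16_le = hex_bytes(text.encode("utf-16-le"))  # low byte first

print(utf16_be)
print(utf16_le)
```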
> Can you tell me where to find the various tables which
> allow one to find a given representation ("iso-8859-1",
> "utf-8", "utf-16 BE", "utf-16 LE") for a given "character"?
You may find this app useful http://www.ltg.ed.ac.uk/~richard/utf-8.cgi
> I mean, how did you know that:
> - 0xe9 is the "iso-8859-1" representation of é?
You indirectly told me when mentioning the %E9 :)
> - 0xc3 0xa9 is the "utf-8" representation of é?
I ran "echo é | hd" in a utf-8 terminal.
> - 0x00 0xe9 is the "utf-16 BE" representation of é?
> - 0xe9 0x00 is the "utf-16 LE" representation of é?
For low values, utf-16 is the same as the codepoint number, stored in
two bytes. So for codepoints below 0x100, like all the ones here, you
end up with the hex code of the codepoint plus a null byte (high order
byte 0).
If you store the number in Big Endian, the high byte appears first;
in Little Endian it appears last.
UCS-2 keeps the codepoint in two bytes and simply stores it (in big
endian or little endian). Since that restricts the characters you can
use (what, I can't store Phoenician in ucs-2??), utf-16 reserves some
special values (the surrogates, used in pairs) so that it can take four
bytes instead of two and cover the full Unicode range.
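
As an illustration of the surrogate pair arithmetic, here is a minimal
Python sketch. The function name utf16_code_units is my own, and it does
no validation (it does not reject codepoints that fall inside the
surrogate range itself). The example character is U+10900, the first
Phoenician letter:

```python
def utf16_code_units(codepoint):
    """Return the 16-bit code units utf-16 uses for one codepoint (sketch)."""
    if codepoint < 0x10000:
        # Fits in two bytes: stored as-is, exactly like UCS-2.
        return [codepoint]
    # Above 0xFFFF: split the 20-bit offset into two 10-bit halves.
    offset = codepoint - 0x10000
    high = 0xD800 + (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trailing) surrogate
    return [high, low]


phoenician_alf = 0x10900
units = utf16_code_units(phoenician_alf)

# Cross-check against Python's own utf-16 encoder.
encoded = chr(phoenician_alf).encode("utf-16-be")

print([hex(u) for u in units])
print(encoded.hex())
```

The two code units come out as 0xd802 0xdd00, which is the same four
bytes the built-in encoder produces.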
> (Apart from the fact that you are a super-pro :) of course).
Hehe, thanks :)
> Please tell me if I misunderstood something and correct me if I
> didn't use the proper terminology :) .
_______________________________________________
MediaWiki-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l