Hi all,
Platonides wrote:
> We have 7 codepoints, one per "letter". Note that this is independent
> of the encoding.
> If you are wondering why utf-8 uses 7 bytes instead of 14 (as would
> have been used by utf-16), that's the beauty of utf-8. It will only
> use one byte (like ASCII) for basic letters, it will use two for a
> text with diacritics, Greek, Hebrew..., which are generally used less
> frequently, three bytes for characters much much less frequent (like
> €), and four for really odd ones, like Egyptian Hieroglyphics.
> So it is quite compact, while still allowing the full Unicode.
> There are other representations like UCS-4 easier to understand (four
> bytes per character) but terribly inefficient.

I'm not a UTF expert, but a minor point is that the East Asian languages (Japanese and Chinese) fall into the "three byte" region -- as far as I know, their entire character sets are encoded with three bytes each. On the other hand, the common non-Unicode encodings (Shift-JIS, EUC-JP, GB*, ISO-2022) use two bytes per character. So, by switching to UTF-8, such text grows by 50%.

I can't speak for both countries -- only the very small part I'm aware of -- but many e-mail programs and web pages here still seem to use the two-byte encodings (which generally include ASCII as a subset). I feel that UTF-8 isn't catching on very fast here, but (1) I don't know if that's true in other countries, and (2) I don't know whether this 50% increase in size is the show-stopper...

Ray
(Someone feel free to correct me if I'm wrong...)

_______________________________________________
MediaWiki-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-l
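(Not from the original thread, but the byte counts above are easy to check yourself. A small Python sketch, using the standard library's built-in codecs, showing the 1/2/3/4-byte UTF-8 tiers and the 50% growth for Japanese text relative to a legacy two-byte encoding:)

```python
# Byte lengths of single characters in UTF-8, illustrating the
# 1/2/3/4-byte tiers discussed above.
for ch in ["A", "é", "λ", "€", "漢", "𓀀"]:
    print(repr(ch), len(ch.encode("utf-8")), "bytes")
# 'A' 1, 'é' 2, 'λ' 2, '€' 3, '漢' 3, '𓀀' 4

# The 50% growth for East Asian text: a short Japanese string in a
# legacy two-byte encoding (Shift-JIS) vs. UTF-8.
text = "日本語"                         # "Japanese (language)"
legacy = len(text.encode("shift_jis"))  # 2 bytes per character -> 6
utf8 = len(text.encode("utf-8"))        # 3 bytes per character -> 9
print(legacy, utf8)                     # prints: 6 9
```

(Shift-JIS is actually a variable-width encoding, 1 or 2 bytes per character, so the 50% figure applies to the fully double-byte portion of the text.)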
