Hello,
I wrote this paper after running several tests.
Please tell me what you think of it.
For gtkg I'd like to add a transliteration system,
to make searches across charsets possible.
Stephane
Gnutella Developer Forum S. Corbé
June 2003
Internationalization of Gnutella
This document is UTF-8 encoded.
Abstract
Currently, running a query in non-Roman letters on Gnutella is very
frustrating: the results are very poor and/or come back in another charset.
The Gnutella network was developed by users of Roman alphabets, so the code
assumes that filenames are written in a Roman alphabet. This paper
shows the problems this assumption causes for non-Roman languages, and
tries to give some answers. This paper is not (yet) a proposal.
1. Problems with the Current Model
Most of these problems come from the fact that servents assume that
the encoding is Latin-1.
1.1 String manipulations
Roman alphabets have the particularity of having accents and case.
To get more hits for a query, servents strip accents and lowercase
the queries. Applied to bytes that are not Latin-1, these operations
corrupt the text.
Effect of stripping accents on a Cyrillic string:
Окно настроек текста --> Ieii ianooiae oaenoa
Effect of lowercasing a Korean string (EUC-KR):
쇼핑·생활 --> 쇼吾·생瘟
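The Cyrillic corruption above can be reproduced with a minimal Python sketch. It assumes the filename was stored as CP-1251 (the text does not say which charset was used) and that the servent strips accents by decomposing characters and dropping the combining marks. The function name `strip_accents_as_latin1` is hypothetical.

```python
import unicodedata

def strip_accents_as_latin1(raw: bytes) -> str:
    """Mimic a servent that wrongly assumes every filename is Latin-1."""
    text = raw.decode("latin-1")                  # wrong guess for Cyrillic bytes
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (the "accents") and keep the base letters.
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Assumption: the original name was encoded as CP-1251.
cyrillic_bytes = "Окно".encode("cp1251")
print(strip_accents_as_latin1(cyrillic_bytes))    # Ieii
```

The bytes 0xCE 0xEA 0xED 0xEE read as Latin-1 are "Îêíî", so stripping the accents leaves only Latin base letters.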
1.2 Mix of encodings
Several languages can be encoded in more than one way, which splits
the network into "same encoding" areas.
An ISO-8859-7 string decoded as WinGreek:
εκκρεμοτήτων --> ῢὤὤᾡῢὣὧᾥὓᾥῷὢ
This often happens between different operating systems, but not only there.
1.3 Current use of UTF-8
1.3.1 In queries
Currently most servents accept UTF-8 queries, but none
generates them. That is why a UTF-8 query returns no results or only poor ones.
1.3.2 In query hits
Query hits don't use UTF-8; they are always sent in the local charset.
The result is unreadable search results on the client's screen.
Example with a VISCII string (Vietnamese) displayed as ISO-8859-1
(the query was "kmap", "sang" or "bin"):
phải đổi nguồn kmap sang dạng bin bằng -->
phäi ð±i ngu°n kmap sang dÕng bin b¢ng
It is worse with ideograms (Big5 Chinese displayed as Windows CP-1250):
大多數應該和 Shift 以及 Alt 鍵一起用 -->
¤j¦hĽĆŔł¸Ó©M Shift ĄH¤Î Alt [EMAIL PROTECTED]
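The garbling shown above is simply what happens when bytes are decoded with a charset other than the one used to encode them. A minimal Python sketch, using the first two ideograms of the Big5 example; `display_as` is a hypothetical helper name:

```python
def display_as(raw: bytes, wrong_charset: str) -> str:
    """Decode bytes with a charset other than the one used to encode them."""
    # errors="replace" keeps the sketch from failing on undefined bytes.
    return raw.decode(wrong_charset, errors="replace")

hit = "大多".encode("big5")            # what the sharing servent sends
print(display_as(hit, "cp1250"))       # what a CP-1250 client shows: ¤j¦h
```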
1.3.3 Identification of a UTF-8 query
1.3.3.1 GGEP Q extension
The Q extension was designed to add options to queries.
With it, it is possible to add logical operators, for instance.
This extension provides one bit to say whether UTF-8 encoding is used.
In practice, this extension is not really used.
1.3.3.2 BOM prefix
Another way to know whether a query is in UTF-8 is to check whether it
starts with a BOM prefix.
The BOM is a three-byte string that is highly improbable in Roman
texts; it is the UTF-8 encoding of U+FEFF.
The BOM is the string 0xEF 0xBB 0xBF
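A servent could test for this prefix as in the following minimal Python sketch. The function name `split_bom` and its return shape are assumptions made for illustration; only the three bytes come from the text.

```python
import codecs
from typing import Tuple

def split_bom(query: bytes) -> Tuple[bool, bytes]:
    """Return (has_bom, payload) for a raw query string."""
    if query.startswith(codecs.BOM_UTF8):          # b"\xef\xbb\xbf"
        return True, query[len(codecs.BOM_UTF8):]
    return False, query

print(split_bom(b"\xef\xbb\xbfkmap"))   # (True, b'kmap')
print(split_bom(b"kmap"))               # (False, b'kmap')
```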
2. How to handle non-Roman charsets
Gnutella has keyword hash tables, exchanged between ultrapeer nodes,
so we must have a homogeneous system for handling charsets.
This excludes solutions such as putting the encoding in a query extension.
The basic idea is to have ASCII keywords for Roman alphabets, and UTF-8
keywords for the others. Because UTF-8 is a superset of ASCII, we will have
UTF-8 QRP tables.
2.1 Cohabitation of common Roman charsets
The goal of this chapter is to give the rules for producing ASCII keywords.
This chapter must be applied only if the locale is Roman.
Accents and symbols are important for the reader but get in the way
of searches. So we will build two keyword lists: one for the local
filters (i.e. with no replacement) and one, ASCII only, for the hash table
(i.e. with replacement).
These replacements must also be applied to outgoing queries.
All outgoing queries and ASCII keywords must be lowercase.
CP-1252 is a superset of the printable characters of ISO-8859-1, so all
ISO-8859-1 texts can be displayed with CP-1252 fonts.
This table lists the non-ASCII characters which exist in any of
the three character sets (ISO-8859-1, CP-1252 and MacRoman), and which
either do not exist in all of them, or exist but have different codes.
The UTF8 column gives the character and its Unicode code point.
The HEX column gives the hexadecimal code in ISO-8859-1, CP-1252
and MacRoman, in that order (-- means the character is absent).
The ASCII column shows a proposed replacement.
The NFKD column gives the result of NFKD normalization, with its code points.
+---------------------UTF8-------- HEX ----ASCII-----------NFKD --------+
| division slash ⁄ 2044 | -- -- DA | / | ⁄ (2044) |
| derivative ∂ 2202 | -- -- B6 | d | ∂ (2202) |
| delta ∆ 2206 | -- -- C6 | D | ∆ (2206) |
| SIGMA ∑ 2211 | -- -- B7 | S | ∑ (2211) |
| PI ∏ 220F | -- -- B8 | PI | ∏ (220F) |
| pi π 03C0 | -- -- B9 | pi | π (03C0) |
| integral ∫ 222B | -- -- BA | S | ∫ (222B) |
| square root √ 221A | -- -- C3 | sqrt | √ (221A) |
| wavy equals ≈ 2248 | -- -- C5 | = | ≈ (2248) |
| diamond ◊ 25CA | -- -- D7 | d | ◊ (25CA) |
| apple logo F8FF | -- -- F0 | @ | (F8FF) |
| semi-circular accent ˘ 02D8 | -- -- F9 | ^ | ̆ (0020 0306) |
| double backtick ˝ 02DD | -- -- FD | " | ̋ (0020 030B) |
| ogonek ˛ 02DB | -- -- FE | , | ̨ (0020 0328) |
| notequal ≠ 2260 | -- -- AD | <> | ≠ (003D 0338) |
| caron ˇ 02C7 | -- -- FF | ^ | ˇ (02C7) |
| dotless i ı 0131 | -- -- F5 | i | ı (0131) |
| infinity ∞ 221E | -- -- B0 | inf | ∞ (221E) |
| lessorequal ≤ 2264 | -- -- B2 | <= | ≤ (2264) |
| greaterorequal ≥ 2265 | -- -- B3 | >= | ≥ (2265) |
| low-9 quote ‚ 201A | -- 82 E2 | , | ‚ (201A) |
| f with hook ƒ 0192 | -- 83 C4 | f | ƒ (0192) |
| low-9 double quote „ 201E | -- 84 E3 | " | „ (201E) |
| ellipses … 2026 | -- 85 C9 | ... | ... (002E 002E 002E) |
| dagger † 2020 | -- 86 -- | + | † (2020) |
| double dagger ‡ 2021 | -- 87 -- | ++ | ‡ (2021) |
| circumflex accent ˆ 02C6 | -- 88 F6 | ^ | ˆ (02C6) |
| per mille sign ‰ 2030 | -- 89 E4 | 0/00 | ‰ (2030) |
| S with caron Š 0160 | -- 8A -- | S | Š (0053 030C) |
| left-pointing angle ‹ 2039 | -- 8B DC | < | ‹ (2039) |
| OE ligature Œ 0152 | -- 8C CE | OE | Œ (0152) |
| left single quote ‘ 2018 | -- 91 D4 | ` | ‘ (2018) |
| right single quote ’ 2019 | -- 92 D5 | ' | ’ (2019) |
| left double quote “ 201C | -- 93 D2 | " | “ (201C) |
| right double quote ” 201D | -- 94 D3 | " | ” (201D) |
| bullet • 2022 | -- 95 A5 | * | • (2022) |
| en dash – 2013 | -- 96 D0 | _ | – (2013) |
| em dash — 2014 | -- 97 D1 | __ | — (2014) |
| small tilde ˜ 02DC | -- 98 F7 | ~ | ̃ (0020 0303) |
| trademark ™ 2122 | -- 99 AA | TM | TM (0054 004D) |
| s with caron š 0161 | -- 9A -- | s | š (0073 030C) |
| right-pointing angle › 203A | -- 9B DD | > | › (203A) |
| oe ligature œ 0153 | -- 9C CF | oe | œ (0153) |
| Y diaeresis Ÿ 0178 | -- 9F D9 | Y | Ÿ (0059 0308) |
| twosuperior ² 00B2 | B2 B2 -- | 2 | 2 (0032) |
| threesuperior ³ 00B3 | B3 B3 -- | 3 | 3 (0033) |
| onesuperior ¹ 00B9 | B9 B9 -- | 1 | 1 (0031) |
| one quarter ¼ 00BC | BC BC -- | 1/4 | 1⁄4 (0031 2044 0034) |
| onehalf ½ 00BD | BD BD -- | 1/2 | 1⁄2 (0031 2044 0032) |
| threequarters ¾ 00BE | BE BE -- | 3/4 | 3⁄4 (0033 2044 0034) |
| degree ° 00B0 | B0 B0 A1 | o | ° (00B0) |
| paragraph ¶ 00B6 | B6 B6 A6 | P | ¶ (00B6) |
| period centered · 00B7 | B7 B7 E1 | . | · (00B7) |
+-----------------------------------------------------------------------+
The previous table shows that NFKD can help us remove accents (cf. Ÿ), but it
does not replace the majority of the symbols and can insert a space into strings
(cf. ˘, ˝ and ˜, which decompose to 0020 followed by a combining mark),
which is not acceptable because space is a keyword separator.
A direct byte replacement will be faster and better.
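The three behaviours of NFKD noted here can be checked with Python's standard `unicodedata` module. This is only an illustrative sketch; `nfkd_codes` is a hypothetical helper that prints the code points in the same form as the NFKD column.

```python
import unicodedata

def nfkd_codes(ch: str) -> str:
    """Return the NFKD form of ch as space-separated hex code points."""
    return " ".join(f"{ord(c):04X}" for c in unicodedata.normalize("NFKD", ch))

print(nfkd_codes("\u0178"))   # Y diaeresis: 0059 0308 (accent separated)
print(nfkd_codes("\u02D8"))   # breve:       0020 0306 (a space appears)
print(nfkd_codes("\u2020"))   # dagger:      2020      (symbol unchanged)
```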
This table lists the non-ASCII characters which have the same code in
ISO-8859-1 and CP-1252 (MacRoman encodes them differently or not at all).
+-----------------------------------------------+
| Name - Dec/Hex/Oct | Replacement|
+-----------------------------------------------|
| nobreakspace 160 A0 240 | |
| exclamdown ¡ 161 A1 241 | ! |
| cent ¢ 162 A2 242 | c |
| sterling £ 163 A3 243 | L |
| currency ¤ 164 A4 244 | c |
| yen ¥ 165 A5 245 | Y |
| brokenbar ¦ 166 A6 246 | | |
| section § 167 A7 247 | S |
| diaeresis ¨ 168 A8 250 | " |
| copyright © 169 A9 251 | C |
| ordfeminine ª 170 AA 252 | a |
| guillemotleft « 171 AB 253 | << |
| notsign ¬ 172 AC 254 | ! |
| hyphen 173 AD 255 | - |
| registered ® 174 AE 256 | R |
| macron ¯ 175 AF 257 | - |
| plusminus ± 177 B1 261 | +- |
| acute ´ 180 B4 264 | ' |
| mu µ 181 B5 265 | u |
| cedilla ¸ 184 B8 270 | , |
| masculine º 186 BA 272 | o |
| guillemotright » 187 BB 273 | >> |
| questiondown ¿ 191 BF 277 | ? |
| Agrave À 192 C0 300 | A |
| Aacute Á 193 C1 301 | A |
| Acircumflex  194 C2 302 | A |
| Atilde à 195 C3 303 | A |
| Adiaeresis Ä 196 C4 304 | A |
| Aring Å 197 C5 305 | A |
| AE Æ 198 C6 306 | AE |
| Ccedilla Ç 199 C7 307 | C |
| Egrave È 200 C8 310 | E |
| Eacute É 201 C9 311 | E |
| Ecircumflex Ê 202 CA 312 | E |
| Ediaeresis Ë 203 CB 313 | E |
| Igrave Ì 204 CC 314 | I |
| Iacute Í 205 CD 315 | I |
| Icircumflex Î 206 CE 316 | I |
| Idiaeresis Ï 207 CF 317 | I |
| ETH Ð 208 D0 320 | Dh |
| Ntilde Ñ 209 D1 321 | N |
| Ograve Ò 210 D2 322 | O |
| Oacute Ó 211 D3 323 | O |
| Ocircumflex Ô 212 D4 324 | O |
| Otilde Õ 213 D5 325 | O |
| Odiaeresis Ö 214 D6 326 | O |
| multiply × 215 D7 327 | x |
| Ooblique Ø 216 D8 330 | O |
| Ugrave Ù 217 D9 331 | U |
| Uacute Ú 218 DA 332 | U |
| Ucircumflex Û 219 DB 333 | U |
| Udiaeresis Ü 220 DC 334 | U |
| Yacute Ý 221 DD 335 | Y |
| THORN Þ 222 DE 336 | Th |
| ssharp ß 223 DF 337 | ss |
| agrave à 224 E0 340 | a |
| aacute á 225 E1 341 | a |
| acircumflex â 226 E2 342 | a |
| atilde ã 227 E3 343 | a |
| adiaeresis ä 228 E4 344 | a |
| aring å 229 E5 345 | a |
| ae æ 230 E6 346 | ae |
| ccedilla ç 231 E7 347 | c |
| egrave è 232 E8 350 | e |
| eacute é 233 E9 351 | e |
| ecircumflex ê 234 EA 352 | e |
| ediaeresis ë 235 EB 353 | e |
| igrave ì 236 EC 354 | i |
| iacute í 237 ED 355 | i |
| icircumflex î 238 EE 356 | i |
| idiaeresis ï 239 EF 357 | i |
| eth ð 240 F0 360 | dh |
| ntilde ñ 241 F1 361 | n |
| ograve ò 242 F2 362 | o |
| oacute ó 243 F3 363 | o |
| ocircumflex ô 244 F4 364 | o |
| otilde õ 245 F5 365 | o |
| odiaeresis ö 246 F6 366 | o |
| division ÷ 247 F7 367 | / |
| oslash ø 248 F8 370 | o |
| ugrave ù 249 F9 371 | u |
| uacute ú 250 FA 372 | u |
| ucircumflex û 251 FB 373 | u |
| udiaeresis ü 252 FC 374 | u |
| yacute ý 253 FD 375 | y |
| thorn þ 254 FE 376 | th |
| ydiaeresis ÿ 255 FF 377 | y |
+-----------------------------------------------+
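A direct byte replacement for Latin-1 input could look like the following Python sketch. The dictionary holds only a small, hand-picked subset of the table above, and the handling of unmapped bytes (replaced by "_") is an arbitrary assumption, not something the table specifies. The name `ascii_keyword` is hypothetical.

```python
# Subset of the replacement table above: ISO-8859-1 byte -> ASCII string.
REPLACEMENTS = {
    0xAB: "<<", 0xBB: ">>",
    0xC0: "A", 0xC6: "AE", 0xC7: "C", 0xC9: "E", 0xD0: "Dh", 0xDF: "ss",
    0xE0: "a", 0xE6: "ae", 0xE7: "c", 0xE8: "e", 0xE9: "e",
    0xF1: "n", 0xF6: "o", 0xFC: "u",
}

def ascii_keyword(raw: bytes) -> str:
    """Map Latin-1 bytes to a lowercase ASCII string for the hash table."""
    out = []
    for b in raw:
        if b < 0x80:
            out.append(chr(b))                     # plain ASCII is kept
        else:
            out.append(REPLACEMENTS.get(b, "_"))   # "_" is an assumption
    return "".join(out).lower()

print(ascii_keyword("Élève Français".encode("latin-1")))   # eleve francais
print(ascii_keyword("Straße".encode("latin-1")))           # strasse
```

The unmodified string would still be kept for the local filters, as described in 2.1.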
2.2 Non-Roman charset handling
To support such charsets we can use only one encoding: UTF-8.
Several search-related questions arise with the languages using
these charsets.
2.2.1 Keyword delimiters
In Chinese and Japanese, word delimiters are rare, and the delimiter
is often a full-width space, which is not the same as the ASCII space.
Example in Chinese:
我能吞下玻璃而不伤身体
Users of these languages already use search engines without delimiters,
so it shouldn't really be a problem.
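When a full-width space (U+3000) is present, a keyword splitter has to treat it like the ASCII space. A minimal Python sketch; `split_keywords` is a hypothetical name, and real servents may use other delimiters as well:

```python
from typing import List

def split_keywords(text: str) -> List[str]:
    # str.split() with no argument splits on any Unicode whitespace,
    # which includes U+3000 IDEOGRAPHIC SPACE as well as U+0020.
    return text.split()

print(split_keywords("東京\u3000大阪 test"))   # ['東京', '大阪', 'test']
```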
2.2.1 Same letter, several widths
On Japanese computers, Japanese letters (kana) exist in normal and half-width
forms, and Latin letters exist in normal and full-width forms.
These half-width and full-width letters are not used much (because they are not
part of orthography or typography at all), but they don't match normal-width
letters.
Example of full-width Latin letters:
C 言 語プログラムで「¥n」は「\n」と等価ではなく
Unicode normalization procedures can transform the different widths
into normal width.
Effect of NFKC:
C 言 語プログラムで「¥n」は「\n」と等価ではなく
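The width folding done by NFKC can be checked in isolation with Python's `unicodedata` module. The sample characters below are chosen for illustration and written as escapes so that their width is unambiguous:

```python
import unicodedata

fullwidth_latin = "\uFF23\uFF4E"   # full-width "C" and "n"
halfwidth_kana = "\uFF76\uFF9E"    # half-width KA + half-width voiced mark

print(unicodedata.normalize("NFKC", fullwidth_latin))   # Cn
print(unicodedata.normalize("NFKC", halfwidth_kana))    # ガ (U+30AC)
```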
2.2.2 Same visual letter, two different strings
If a letter can be decomposed into several letters/symbols, Unicode can encode
it with different strings.
All accented letters are in this case, but the most interesting
case is the Hangul letters (Korean):
Korean characters aren't ideograms; they are composed of several letters
of the Korean alphabet.
The string 쇼핑·생활 is encoded like this:
#C1FC #D551 #00B7 #C0DD #D65C
And decomposed like this (it should look the same as above on your screen):
쇼핑·생활 which is encoded :
[#1109 #116D] [#1111 #1175 #11BC] #00B7 [#1109 #1162 #11BC] [#1112 #116A #11AF]
We can get the first string by applying NFKC to the second.
We can get the second string by applying NFKD to the first.
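This round trip can be verified with a short Python sketch based on the standard `unicodedata` module; the variable names are illustrative only.

```python
import unicodedata

composed = "\uC1FC\uD551\u00B7\uC0DD\uD65C"            # the precomposed string
decomposed = unicodedata.normalize("NFKD", composed)    # conjoining jamo

codes = [f"#{ord(c):04X}" for c in decomposed]
print(" ".join(codes))
# #1109 #116D #1111 #1175 #11BC #00B7 #1109 #1162 #11BC #1112 #116A #11AF

# NFKC puts the syllables back together.
print(unicodedata.normalize("NFKC", decomposed) == composed)   # True
```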
2.2.3 Normalization
We have several choices:
2.2.3.1 NFKD
The algorithm is:
1- Apply NFKD
2- Remove diacritics and the like
3- Replace symbols
4- Go to 1 until nothing changes
This algorithm has several problems:
- it can insert spaces
- it makes long strings
- it radically transforms the sentence:
がんぼう (desire) and かんぼう (a cold) --> かんほう (no
meaning)
Japanese sentences without ゙ and ゚ are difficult to understand.
On the other hand, this algorithm removes accents:
εκκρεμοτήτων --> εκκρεμοτητων
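The four steps of the NFKD variant can be sketched in Python as follows. The symbol table is a hypothetical two-entry placeholder, and "diacritics" is interpreted here as every character with a non-zero combining class; both are assumptions.

```python
import unicodedata

# Hypothetical, tiny symbol table; a real one would follow section 2.1.
SYMBOLS = {"\u2020": "+", "\u2021": "++"}

def nfkd_keyword(text: str) -> str:
    while True:
        step = unicodedata.normalize("NFKD", text)                        # step 1
        step = "".join(c for c in step if not unicodedata.combining(c))   # step 2
        step = "".join(SYMBOLS.get(c, c) for c in step)                   # step 3
        if step == text:                                                  # step 4
            return step
        text = step

print(nfkd_keyword("がんぼう"))          # かんほう
print(nfkd_keyword("εκκρεμοτήτων"))      # εκκρεμοτητων
print(repr(nfkd_keyword("\u02D8")))      # ' ' (a space was inserted)
```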
2.2.3.2 NFKC
The algorithm is:
1- Replace symbols
2- Apply NFKC
3- Go to 2 until nothing changes
This algorithm has several advantages:
- it doesn't insert spaces, as long as step 1 has replaced the symbols
that decompose to a space
- it makes short strings
- it doesn't radically transform the sentence:
がんぼう --> がんぼう
This algorithm doesn't remove accents and diacritics.
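A comparable Python sketch of the NFKC variant follows. The symbol table is again a hypothetical placeholder, and the loop repeats both steps until the string stops changing, which is an assumption about how the iteration is meant to work.

```python
import unicodedata

# Hypothetical, tiny symbol table; a real one would follow section 2.1.
SYMBOLS = {"\u2020": "+", "\u2021": "++"}

def nfkc_keyword(text: str) -> str:
    while True:
        step = "".join(SYMBOLS.get(c, c) for c in text)   # replace symbols
        step = unicodedata.normalize("NFKC", step)        # apply NFKC
        if step == text:                                  # fixed point reached
            return step
        text = step

# Decomposed kana (base letter + U+3099) are recomposed, not stripped.
print(nfkc_keyword("\u304b\u3099\u3093\u307b\u3099\u3046"))   # がんぼう
print(nfkc_keyword("\uFF21"))                                 # A
```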
2.2.3.3 K Problems
Some languages are corrupted by NFKC or NFKD, because the compatibility (K)
mappings replace ligatures (below, և becomes եւ).
Example with an Armenian sentence:
Կրնամ ապակի ուտել և ինծի անհանգիստ
չըներ -->
Կրնամ ապակի ուտել եւ ինծի անհանգիստ
չըներ (NFKD)
Կրնամ ապակի ուտել եւ ինծի անհանգիստ
չըներ (NFKC)
With NFD or NFC the problem doesn't exist.
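The difference between the four forms on the Armenian ligature U+0587 can be checked with a short Python sketch (the `results` dictionary is only there for illustration):

```python
import unicodedata

ligature = "\u0587"   # ARMENIAN SMALL LIGATURE ECH YIWN

results = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    out = unicodedata.normalize(form, ligature)
    results[form] = [f"{ord(c):04X}" for c in out]
    print(form, results[form])
# NFC and NFD keep 0587; NFKC and NFKD give 0565 0582.
```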
2.2.3.4 Conclusion
It seems that no global algorithm exists; the choice depends on the
language, for instance:
Roman : ASCII
Greek : NFKD
Japanese : NFKC
Armenian : NFC
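A per-language choice like this could be expressed as a simple lookup, as in the Python sketch below. The language labels, the function name `normalize_for` and the NFC default are all assumptions; the Roman/ASCII case relies on the replacement tables of 2.1 and is left out here.

```python
import unicodedata

# Normalization form per language, taken from the list above.
FORM_BY_LANGUAGE = {"greek": "NFKD", "japanese": "NFKC", "armenian": "NFC"}

def normalize_for(language: str, text: str) -> str:
    # Falling back to NFC for unknown languages is an assumption.
    form = FORM_BY_LANGUAGE.get(language, "NFC")
    return unicodedata.normalize(form, text)

print(normalize_for("japanese", "\uFF21"))             # A
print(normalize_for("armenian", "\u0587") == "\u0587") # True
```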