Re: [Gtk-gnutella-devel] [OT] A multibyte character on the Gnet

Christian Biere Thu, 02 Dec 2004 16:42:09 -0800

Daichi Kawahata wrote:
> With the ICU library, the search results and monitor pane are able to
> display multibyte characters from LimeWire.


Does that mean there's a problem with such results when not using ICU? ICU
is actually only used to match queries and canonicalize (outgoing) queries,
if I remember correctly. It shouldn't affect viewing at all. Which GUI did
you use, GTK1 or GTK2? For non-Latin languages, GTK+ 2.x is definitely the
first choice as it uses UTF-8 everywhere and has far better support for
Unicode. Well, at least here GTK+ 1.2.x looks completely unusable with
Japanese while GTK+ 2.x looks flawless. It might just be a font problem with
GTK+ 1.2.x, though.

> One question is for non-multibyte users: what do you think of these
> multibyte characters drifting around the Gnet, and what are your workarounds?

All peers must use UTF-8, and only UTF-8, for queries and results. There are
probably still quite a few improperly encoded ones on the network, i.e.,
strings in the native encoding of the sender's platform.

> Seconds' for implementer especially; First of all, it's not necessarily
> the multibyte issue will be tell from now on. Well then, as far as Japanese
> is concerned, I've come across the three situation:
 
> * displayed normally, indeed normally.
 
> * displayed well enough to be read, however (semi)voiced consonant marks
>   had been replaced by underscores or question marks.

The "special" question mark is the official Unicode "replacement" character
(U+FFFD). It's not quite clear how it gets there. It might have been sent by
the remote peer as-is, it might have been put there by ICU (possibly even
iconv) due to an invalid character, or the actual character might be missing
from the font set in use.
The underscore is probably created by gtk-gnutella itself due to a conversion
problem (invalid or unexpected encoding). We don't use the official Unicode
replacement character there because it would often unnecessarily enforce
UTF-8 (instead of plain ASCII) encoding of the string, and it's much more
inconvenient to handle in filenames (at least in a terminal). If a string
is not UTF-8 encoded, gtk-gnutella can only guess the encoding used, which
means it falls back to the character set of the current locale, boldly
assuming that the user is mostly interested in search results from
users/machines using the same locale settings.
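
To illustrate why the underscore is the more convenient placeholder, here is
a minimal Python sketch. It is not gtk-gnutella's actual C code; the helper
name `decode_lossy` and the sample filename are made up for the example. It
decodes a byte string leniently and then swaps U+FFFD for a placeholder of
your choice.

```python
def decode_lossy(raw: bytes, charset: str, placeholder: str) -> str:
    """Decode raw bytes, substituting `placeholder` for every undecodable byte.

    Hypothetical helper for illustration only.
    """
    # errors="replace" makes the codec emit U+FFFD for each invalid sequence;
    # afterwards U+FFFD is exchanged for the requested placeholder.
    return raw.decode(charset, errors="replace").replace("\ufffd", placeholder)


# 0xFF can never occur in valid UTF-8, so this name has one broken byte.
broken = b"track\xff.mp3"

with_underscore = decode_lossy(broken, "utf-8", "_")
with_fffd = decode_lossy(broken, "utf-8", "\ufffd")
```

With the underscore the result is still plain ASCII, so it can be written out
without forcing UTF-8. With U+FFFD the same name needs three extra bytes
(EF BF BD) and can no longer be encoded as ASCII at all.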

> * displayed only ASCII, Japanese is unreadable.

What does "unreadable" mean? Only underscores and question marks, or what?
 
> I've been thinking these differences
> are due to the encoding, and guessing that the first is UTF-8 validly
> converted from UTF-32, the second is UTF-8 that was originally EUC, and the
> third is ShiftJIS as presented by Microsoft(R).

gtk-gnutella only converts strings that are not validly UTF-8 encoded. I
don't know your locale settings. If you use EUC and the remote peer sends
ShiftJIS (which is illegal and a bug in the remote peer), the conversion
fails and you'll see a broken string (with a lot of underscores or
question marks).
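
The rule above can be sketched roughly like this in Python. This is only an
illustration under assumptions, not the real implementation: the function
name `to_text` and its `locale_charset` parameter are invented here, and the
real code is C.

```python
def to_text(raw: bytes, locale_charset: str) -> str:
    """Keep valid UTF-8 as is; otherwise guess the locale's character set.

    Hypothetical sketch; bytes that do not fit the guessed character set
    become underscores.
    """
    try:
        # Valid UTF-8 is never converted.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # Not UTF-8: assume the sender used the same character set as our locale.
    return raw.decode(locale_charset, errors="replace").replace("\ufffd", "_")


name = "日本語"
from_utf8_peer = to_text(name.encode("utf-8"), "euc_jp")
from_euc_peer = to_text(name.encode("euc_jp"), "euc_jp")
from_sjis_peer = to_text(name.encode("shift_jis"), "euc_jp")
```

The first two calls recover the original name. The third one, where the peer
sent ShiftJIS while the local guess is EUC-JP, yields a broken string
containing underscores, which is the failure case described above.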

> If it was caused by LimeWire's server side, not within ICU's converter,
> there might be nothing to do. However, if not, is there any workaround?

LimeWire (due to Java) uses UTF-16 internally and emits only UTF-8 encoded
search results; I'm not sure whether composed or decomposed. During my
tests I didn't notice too many broken results, that is, most results with
(probably) Japanese filenames don't contain any characters that imply a
conversion error. Well, I can't read it, so if the conversion produced
crazy output which still looks like Kanji, it's unlikely I would notice.
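
In case the composed/decomposed distinction is unclear, here is a small
Python sketch using the standard `unicodedata` module. Whether this is what
actually happens to the voiced consonant marks in LimeWire results is an
assumption on my part; the helper `same_name` is made up for the example.

```python
import unicodedata

# HIRAGANA LETTER GA as a single precomposed code point.
composed = "\u304c"

# NFD splits it into KA (U+304B) plus the combining voiced sound mark (U+3099).
decomposed = unicodedata.normalize("NFD", composed)


def same_name(a: str, b: str) -> bool:
    """Compare two strings after bringing both into composed (NFC) form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

Both forms are valid UTF-8 and should render identically, but they differ
byte for byte. A font or converter that doesn't know the combining mark
could plausibly show a placeholder where the mark should be.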
Personally, I never used ICU though. For gtk-gnutella it's optimal to
use a locale with UTF-8 encoding (and if necessary override the language
setting). For example:

        LC_ALL=ja_JP.UTF-8

or

        unset LC_ALL
        LC_CTYPE=en_US.UTF-8
        LC_MESSAGES=ja_JP

The actual values are operating system dependent. Usually, "locale -a"
shows all available locale variants.
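
The `unset LC_ALL` in the second variant matters because of the POSIX
precedence rules: LC_ALL overrides every LC_* category, and LANG is only the
last fallback. A rough Python sketch of that lookup for LC_CTYPE follows; the
function names `effective_ctype` and `codeset_of` are invented for this
illustration.

```python
from typing import Mapping, Optional


def effective_ctype(env: Mapping[str, str]) -> str:
    """Return the locale name that governs LC_CTYPE for the given environment."""
    # POSIX order: LC_ALL first, then LC_CTYPE, then LANG.
    # A variable set to the empty string counts as unset.
    for variable in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(variable)
        if value:
            return value
    # Nothing set at all: the default "C" locale applies.
    return "C"


def codeset_of(locale_name: str) -> Optional[str]:
    """Extract the codeset from a name such as 'ja_JP.UTF-8'; None if absent."""
    _, dot, rest = locale_name.partition(".")
    if not dot:
        return None
    # Strip an optional modifier such as '@euro'.
    return rest.partition("@")[0] or None
```

Calling `codeset_of(effective_ctype(os.environ))` (after `import os`) shows
which character set the current environment asks for. If it doesn't say
UTF-8, a stale LC_ALL is a likely culprit.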

-- 
Christian
