Re: Gnome2 UTF-8 handling

Eneko Lacunza Tue, 16 Dec 2003 14:59:05 -0800

Hi all,

        I'm just catching up on old messages, and it seems that this one didn't
get any response. I'll try to add something :)


On Thu, 04-12-2003 at 12:11, Reinke Bonte wrote:
> > > > 2.)  Get rid of the "wide character set" and use utf-8 for the
> > > > user I/O as well as the internal calculations.
> > > #2 is the correct option.  We should just keep everything in UTF8.
> > Agreed, 100%.
> 
> If my understanding is not completely wrong, you have to choose the
> first option and stick with wide characters.

        No, it is not a must.

> There is no contradiction between "wide characters" and UTF-8. In fact,
> you need to use "wide characters" to properly handle UTF-8 encoded
> strings. Therefore #2 is not an option.

        It is perfectly possible to handle UTF-8 encoded strings without using
C wide characters. It's just that you can't use the standard str* functions
for tasks that assume one byte per character, that's all. The GLib/GDK
libraries have replacement functions for UTF-8 strings, if I'm not mistaken.
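
        To illustrate the point, here is a minimal sketch in plain C that
works on UTF-8 held in ordinary char arrays, with no wchar_t involved. The
function names are hypothetical (they are not from GnuCash or GLib), and the
code assumes the input is already valid UTF-8. GLib offers g_utf8_strlen()
and g_utf8_next_char() for the same jobs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count code points in a NUL-terminated UTF-8
 * string. Continuation bytes have the form 10xxxxxx, so every byte
 * that is NOT a continuation byte starts a new code point.
 * Assumes the input is valid UTF-8. */
size_t utf8_count_codepoints(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Hypothetical helper: return a pointer to the start of the code
 * point that follows the one at s. s must not point at the
 * terminating NUL. Assumes the input is valid UTF-8. */
const char *utf8_next(const char *s)
{
    assert(s != NULL && *s != '\0');
    do {
        s++;
    } while (((unsigned char)*s & 0xC0) == 0x80);
    return s;
}
```

        Note that strlen() on the same buffers still returns the byte
length, which is what you want for allocation; only "character" counting and
stepping need the UTF-8-aware versions.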

> This is what my libc documentation says:
> [...]
> UTF-8 is an ASCII compatible encoding where ASCII characters are
> represented by ASCII bytes and non-ASCII characters by sequences of 2-6
> non-ASCII bytes, and finally UTF-16 is an extension of UCS-2 in which
> pairs of certain UCS-2 words can be used to encode non-BMP characters up
> to 0x10ffff.
> To represent wide characters the char type is not suitable. For this
> reason the ISO C standard introduces a new type which is designed to
> keep one character of a wide character string. To maintain the
> similarity there is also a type corresponding to int for those functions
> which take a single wide character.
> Data type: wchar_t
>     This data type is used as the base type for wide character strings.
> I.e., arrays of objects of this type are the equivalent of char[] for
> multibyte character strings. The type is defined in `stddef.h'.
> [...]

        I don't see why this means we need to use GdkWChar (note: it is not
wchar_t).

> > > The hard part is going to be converting the existing XML and
> > > database data from whatever it's currently using to UTF8.
> > We don't currently include an "encoding" in the XML data file.  That
> > could be used as a trigger to ask the user for the old encoding and
> > then convert the data to UTF-8.  A nice touch would be to scan the
> > file first looking for any characters with the high order bit set to
> > see if conversion is needed in the first place.
> I don't know about database data, but the XML file is a complete mess.
> You will not find any high order bit set in the XML file, because libxml
> has converted everything into HTML-entities. But unfortunately the wrong
> entities for every encoding != Latin1. Here a manual recoding of the
> XML-File is necessary, as I described twice here on this mailing list.

        I think it is perfectly possible to parse the XML file with its
parser, then check all the strings (already decoded from HTML entities by
the parser).
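
        As a rough sketch of what such a check could look like, assuming
the decoded strings are available as NUL-terminated char arrays: the first
function is the "high order bit" scan suggested earlier in the thread, the
second tests whether a string is well-formed UTF-8. Both names are
hypothetical; GLib's g_utf8_validate() does the second job. This is only a
heuristic: text in a legacy encoding can happen to be valid UTF-8, and the
check says nothing about entities that were written with the wrong values
in the first place.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: true if any byte has the high-order bit set,
 * i.e. the string is not pure ASCII and may need conversion. */
int has_non_ascii(const char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if ((unsigned char)*s & 0x80)
            return 1;
    }
    return 0;
}

/* Hypothetical helper: true if s is well-formed UTF-8. Rejects stray
 * continuation bytes, truncated sequences, overlong forms, surrogates
 * and values above 0x10FFFF. */
int is_valid_utf8(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    assert(s != NULL);
    while (*p != '\0') {
        unsigned long cp;
        int extra;

        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return 0;  /* stray continuation or invalid lead byte */
        p++;

        for (int i = 0; i < extra; i++, p++) {
            if ((*p & 0xC0) != 0x80)
                return 0;  /* also catches a premature NUL */
            cp = (cp << 6) | (*p & 0x3F);
        }

        /* Overlong encodings: value fits in a shorter sequence. */
        if ((extra == 1 && cp < 0x80) ||
            (extra == 2 && cp < 0x800) ||
            (extra == 3 && cp < 0x10000))
            return 0;
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
    }
    return 1;
}
```

        A string for which has_non_ascii() is true and is_valid_utf8() is
false is a candidate for asking the user about the old encoding.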

Regards

_______________________________________________
gnucash-devel mailing list
http://www.gnucash.org/cgi-bin/mailman/listinfo/gnucash-devel
