Quoting Josh Sled <[EMAIL PROTECTED]>:

On Fri, 2006-02-03 at 16:24 -0500, Derek Atkins wrote:
I think it's a major issue that someone in an ascii-like but
non-latin1 locale will get garbage during the default upgrade path.
libxml doesn't really provide a way to do proper detection, and 1.8
doesn't include an encoding in the data file.  Unfortunately the XML
spec says that the lack of an encoding parameter means the data is in
utf-8, but that's not the case in 1.8 -- the data is in whatever
locale the user was using.

So, how do we solve this?

We can look for the presence of the "encoding" attribute on the
<?xml ...?> header.

If present, then libxml will do the appropriate encoding conversion.

I'm not worried about the case where the encoding exists.  Yes, libxml will
do the right thing.  The problem is the case without the encoding, but
where the data isn't utf-8.

If not, then we believe the file was written by 1.8.   As such, we
should set libxml to believe that the encoding is the system-default as
determined from
http://gtk.org/api/2.6/glib/glib-Character-Set-Conversion.html#g-get-charset .
It may require a re-parse of the file to get encoding-conversion done;
I'm not sure when it's performed by libxml.

This file [[[

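#include <libxml/parser.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
 xmlDocPtr xml;

 if (argc < 2 || (xml = xmlReadFile(argv[1], NULL, 0)) == NULL)
   return 1;
 /* encoding is NULL when the <?xml ...?> header has no encoding= */
 printf("encoding: [%s]\n",
        xml->encoding ? (const char *) xml->encoding : "(null)");
 xmlFreeDoc(xml);
 return 0;
}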

]]] compiled with [[[
gcc `xml2-config --cflags` -o xml-test xml-test.c `xml2-config --libs`
]]] shows that (xmlDocPtr)->encoding contains what we want to know: it's
set when <?xml [...] encoding="whatever"?> is set and NULL otherwise.

See http://mail.gnome.org/archives/xml/2001-July/msg00165.html for why
this is somewhat problematic.  "might be due to a confusion between locale
and encoding"...

Personally, I kinda like the approach in
http://mail.gnome.org/archives/xml/2001-July/msg00164.html

However I wonder if we want to bring user input into the fray?  Should
we ask the user to choose a charset, or somehow notify the user to check
the data?  And if they check it and the conversion was wrong, what
do we do then?

Also, we should really make sure that if a user is running g2 in a
non-utf8 locale that the data output really /IS/ utf8.  There are lots
of places where we're trusting libxml2 to do what we want, but have
we really verified and tested that it's actually doing what we want?
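
One way to test that, sketched under the assumption that a test
harness can read the written file back into a buffer: run the raw
bytes through a strict utf-8 check.  is_valid_utf8 is a hypothetical
helper, not something from libxml2 or GnuCash.  Passing only shows
the bytes are well-formed utf-8, not that they are the right
characters, and a pure-ASCII file passes trivially, so the test data
needs non-ASCII text (hence the KOI8-R question below).

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if buf[0..len) is well-formed UTF-8, 0 otherwise.  Rejects
 * overlong forms, surrogates and code points above U+10FFFF. */
int
is_valid_utf8(const unsigned char *buf, size_t len)
{
  size_t i = 0;

  assert(buf != NULL || len == 0);
  while (i < len) {
    unsigned char c = buf[i];
    unsigned long cp, min;
    size_t extra, k;

    if (c < 0x80) {                 /* plain ASCII byte */
      i++;
      continue;
    } else if ((c & 0xE0) == 0xC0) {
      extra = 1; cp = c & 0x1F; min = 0x80;
    } else if ((c & 0xF0) == 0xE0) {
      extra = 2; cp = c & 0x0F; min = 0x800;
    } else if ((c & 0xF8) == 0xF0) {
      extra = 3; cp = c & 0x07; min = 0x10000;
    } else {
      return 0;                     /* stray continuation or bad lead byte */
    }

    if (len - i <= extra)
      return 0;                     /* sequence truncated at end of buffer */
    for (k = 1; k <= extra; k++) {
      if ((buf[i + k] & 0xC0) != 0x80)
        return 0;                   /* not a continuation byte */
      cp = (cp << 6) | (buf[i + k] & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
      return 0;                     /* overlong, out of range, or surrogate */
    i += extra + 1;
  }
  return 1;
}
```

Text in a single-byte locale such as latin1 or KOI8-R will almost
always fail this check as soon as it contains a non-ASCII character,
which is what makes it usable as a regression test on the output side.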

Any KOI8-R users willing to help us test?

-derek

--
      Derek Atkins, SB '93 MIT EE, SM '95 MIT Media Laboratory
      Member, MIT Student Information Processing Board  (SIPB)
      URL: http://web.mit.edu/warlord/    PP-ASEL-IA     N1NWH
      [EMAIL PROTECTED]                        PGP key available

_______________________________________________
gnucash-devel mailing list
[email protected]
https://lists.gnucash.org/mailman/listinfo/gnucash-devel