Giulio Troccoli wrote:



-----Original Message-----


From: David Bertoni [mailto:[email protected]]
Sent: 13 August 2009 18:38
To: [email protected]
Subject: Re: Invalid byte 1 (£) of a 1-byte sequence

Giulio Troccoli wrote:
Well, I configured the build as follows

runConfigure -paix -cxlc -xxlC_r

So it should have used the 'native' transcoder. I'm
afraid I don't know enough about this to be of more help.
What could be happening is there's local code page
transcoding somewhere in your processing stream, and on
Windows, you're getting a Windows code page (probably 1252),
not UTF-8.

Note that Windows-1252 and ISO-8859-1 are not fully compatible
(Windows-1252 assigns printable characters to 0x80-0x9F, where
ISO-8859-1 has control codes), so don't assume you can
interchange them. Rather than hack around your source files,
you should make sure all of your processing is in UTF-8.

You should also run the locale command on your AIX machine to
verify what code page it's using. If it's UTF-8, that would
be further evidence your application is doing inappropriate
local code page transcoding.

I'm a bit out of my depth here, so forgive me if I say something really stupid 
(which is quite likely).

My XML document is not in UTF-8. The pound sign is just A3, not C2 A3.
Yes, but are you sure you're processing the exact same byte stream on AIX?

But I'm telling my application that the document IS in UTF-8 (using the 
encoding="UTF-8" option).

Windows correctly rejects it. AIX does not.
This would be a major bug if you were processing identical documents. I haven't used Xerces-C on AIX for a while, but when I used it, ill-formed UTF-8 byte sequences were rejected correctly.
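
To make the byte-level point concrete, here is a minimal, standalone C++ sketch of a UTF-8 well-formedness check along the lines of RFC 3629. It is not Xerces-C code and does not use any Xerces-C API; the function name isWellFormedUtf8 is made up for illustration. It only shows why a lone A3 byte has to be rejected by any conforming UTF-8 decoder, while C2 A3 is accepted.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of Xerces-C): returns true if `bytes` is
// well-formed UTF-8. Rejects stray continuation bytes, truncated sequences,
// overlong forms, UTF-16 surrogates, and code points above U+10FFFF.
bool isWellFormedUtf8(const std::string& bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;    // continuation bytes expected after the lead
        unsigned long cp = 0;     // code point being decoded
        unsigned long minCp = 0;  // smallest code point valid for this length

        if (lead < 0x80) {
            cp = lead;                                   // plain ASCII
        } else if ((lead & 0xE0) == 0xC0) {
            extra = 1; cp = lead & 0x1F; minCp = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2; cp = lead & 0x0F; minCp = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3; cp = lead & 0x07; minCp = 0x10000;
        } else {
            // 0x80..0xBF is a continuation byte and cannot start a sequence.
            // A lone A3 (the ISO-8859-1 pound sign) ends up here.
            return false;
        }

        if (extra > n - i - 1) return false;             // truncated sequence

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false;        // not 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < minCp) return false;                    // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode

        i += extra + 1;
    }
    return true;
}
```

Running the raw bytes of the document through a check like this on both machines, before they reach the parser, would show whether AIX and Windows are really seeing the same byte stream.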


When you say "make sure all of your processing is in UTF-8", I can't do that. 
The XML is not in UTF-8 and I cannot change that (it's created by a C programme and I 
have no idea how to do that).
Then why is your program generating an incorrect encoding declaration? It seems to me the program is assuming it's generating UTF-8, but it's not. You need to fix that.
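
There are two ways to make the declaration and the bytes agree. The simpler one is for the generating program to declare encoding="ISO-8859-1" instead of UTF-8. The other is to convert the bytes to UTF-8 before writing them. A minimal sketch of that conversion follows, assuming the generator's output really is ISO-8859-1 (as the locale charmap output in this thread suggests, though this is not confirmed); the function name latin1ToUtf8 is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: converts ISO-8859-1 bytes to UTF-8.
// Assumes the input is ISO-8859-1, where every byte value maps directly to
// the Unicode code point with the same number (U+0000..U+00FF).
std::string latin1ToUtf8(const std::string& latin1) {
    std::string utf8;
    utf8.reserve(latin1.size() * 2);  // worst case: every byte becomes two

    for (std::size_t i = 0; i < latin1.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(latin1[i]);
        if (b < 0x80) {
            // ASCII is identical in both encodings.
            utf8.push_back(static_cast<char>(b));
        } else {
            // U+0080..U+00FF needs two bytes: 110000xx 10xxxxxx.
            utf8.push_back(static_cast<char>(0xC0 | (b >> 6)));
            utf8.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return utf8;
}
```

With this, the single byte A3 becomes the two bytes C2 A3, which is what a document declared as UTF-8 must contain for the pound sign. The same logic is easy to write in plain C if the generator cannot use C++.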


I ran some locale commands on my AIX box and here's the result

ibu...@kylie% locale
LANG=en_US
LC_COLLATE="en_US"
LC_CTYPE="en_US"
LC_MONETARY="en_US"
LC_NUMERIC="en_US"
LC_TIME="en_US"
LC_MESSAGES="en_US"
LC_ALL=

ibu...@kylie% locale charmap
ISO8859-1

Is this why Xerces on AIX understands that A3 is in fact the pound sign?
Xerces-C doesn't rely on the locale settings, except when transcoding to and from the local code page using XMLString::transcode(), or when the local code page transcoder is used.
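
If it helps to see what the process itself considers the local code page, rather than what the shell reports, a small sketch using the POSIX nl_langinfo() call is shown here. This is not how Xerces-C decides anything internally; it is only a diagnostic, and the function name localCodeSet is made up for illustration. It assumes a POSIX system such as AIX (langinfo.h is not available on Windows).

```cpp
#include <cassert>
#include <clocale>
#include <langinfo.h>
#include <string>

// Hypothetical diagnostic helper: returns the code set name of the locale
// the process inherits from its environment (LANG, LC_ALL, LC_CTYPE).
std::string localCodeSet() {
    // Adopt the environment's locale; without this call a C or C++ program
    // starts in the "C" locale regardless of LANG. If the call fails, the
    // locale is left unchanged and nl_langinfo still answers for it.
    std::setlocale(LC_ALL, "");

    const char* codeSet = nl_langinfo(CODESET);
    return codeSet ? std::string(codeSet) : std::string();
}
```

Printing the result of localCodeSet() at the start of the application shows the code set that any local code page transcoding would use. The exact spelling of the name varies between platforms.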

Perhaps you need to provide more of your code, so we can see how the data is getting to the parser.

Dave
