Well, I will try to migrate to the non-deprecated DOM if we have no other
choice.
At the moment, though, time constraints are quite tight.
In light of that, I am looking for a stop-gap solution. Looking at the
constructor DOMString(const char *),
I see that it puts the char * through a transcoder itself:
if (!uniConverter->transcode(srcString, strData, srcLen) ||
(XMLString::stringLen(strData) != srcLen))
I added some debug statements to the DOMString() constructor and a println() at
the bottom of it; here's the result:
[----DOMString::DOMString(const char *srcString)----]
srcString == 110,Brzeźna---Brzeźna
calling print() 110,Brzeźna---Brzeź
So the loss of those two chars does indeed happen inside this constructor.
Note that the lost chars are NOT the UTF-8 chars themselves: we lose one char
at the end of the line per UTF-8 character in the string?!
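To illustrate the arithmetic behind that observation, here is a minimal sketch in plain standard C++ (it is not Xerces-C code). It assumes srcString arrives as UTF-8, so each ź occupies two bytes; the helper name utf8CodePoints is hypothetical. Whether DOMString truncates for exactly this reason is an assumption, but the numbers line up: the byte length exceeds the character count by one per ź.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, for illustration only (not part of Xerces-C).
// Counts UTF-8 characters: every byte that is not a continuation byte
// (bit pattern 10xxxxxx) starts a new character.
std::size_t utf8CodePoints(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// The test string from the debug output, with each ź spelled out as its
// two UTF-8 bytes (C5 BA). The literal is split so the hex escapes cannot
// absorb the letters that follow them.
std::string sampleStopName() {
    return std::string("110,Brze") + "\xC5\xBA" + "na---Brze" + "\xC5\xBA" + "na";
}
```

For that string, size() reports 23 bytes while utf8CodePoints() reports 21 characters. The difference of two matches the two characters missing from the end of the printed result.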
Given that I am using the deprecated DOM, and thus DOMString, everywhere, is
there any hope of a stop-gap measure,
or do I have to bite the transition bullet now?
Thanks in advance for your help,
/Matt.
-----Original Message-----
From: David Bertoni [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, 10 September 2008 5:29 AM
To: [email protected]
Subject: Re: Losing UTF-8 characters at the end of a string
Matthew Boulter wrote:
> Hi all, I just wanted some guidance on where to focus my investigation effort
> on this topic.
>
> I have a MySQL database that contains names of some Polish tram stops that I
> am extracting and encoding as WBXML for transmission.
>
> Now I find when I get them from the database all is good until I get to the
> part where I'm at our DomToWbxml task.
>
> I find that if the string has a Polish character, it loses a character from
> the end of the string; if there are two, it loses two, and so on.
>
> I read that Xerces is UTF-16? If so, am I losing something (other than my
> mind) going back to UTF-8?
>
> Any help is greatly appreciated.
This is probably the number one problem people experience when using
Xerces-C.
Please read the documentation carefully, as the transcoding API you're
using is _not_ transcoding to UTF-8. Rather, it is transcoding to the
local code page, so the disappearing characters are probably not
representable in the local code page. Instead of using
DOM_String::transcode(), you need to create a UTF-8 transcoder and use that.
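To make the distinction concrete, here is a rough sketch of what decoding UTF-8 into UTF-16 code units involves, as opposed to interpreting the bytes in the local code page. It is a hand-rolled decoder in plain standard C++, not the Xerces-C transcoder API, and the function name utf8ToUtf16 is hypothetical. It also skips checks a production decoder needs (overlong forms, encoded surrogates), so treat it as an illustration only.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper, for illustration only (not part of Xerces-C).
// Decodes UTF-8 bytes into UTF-16 code units. Minimal: it validates lead
// and continuation bytes but does not reject overlong encodings.
std::u16string utf8ToUtf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (in.size() - i <= extra) {
            throw std::runtime_error("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::runtime_error("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp >= 0x10000) {
            // Outside the Basic Multilingual Plane: emit a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}
```

With this sketch, the two bytes C5 BA that encode ź become the single UTF-16 unit U+017A, so the 8 bytes of "Brzeźna" yield 7 code units and nothing is dropped.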
Also, you're using the deprecated DOM, which will disappear in Xerces-C
3.0. I would suggest you update your code to use the new DOM.
For more information, please search the mailing list archives for
"transcoding." Here's a good place to start:
http://marc.info/?l=xerces-c-users&m=119514889329902&w=2
Dave