RE: Losing UTF-8 characters at the end of a string

Matthew Boulter Wed, 10 Sep 2008 23:13:03 -0700

Well, I will try to migrate to the non-deprecated DOM if we have no other
choice, but at the moment time constraints are quite tight.

In light of that, I am looking for a stop-gap solution. Looking at the
constructor DOMString(const char *), I see that it puts the char * through
a transcoder itself:

        if (!uniConverter->transcode(srcString, strData, srcLen) ||
            (XMLString::stringLen(strData) != srcLen))

I added some debug statements to the DOMString() constructor and a println()
at the bottom of it. Here's the result:

        [----DOMString::DOMString(const char *srcString)----]
        srcString == 110,Brzeźna---Brzeźna
        calling print() 110,Brzeźna---Brzeź

So the loss of those two chars does indeed happen inside this constructor.
Note that the lost chars themselves are plain ASCII, NOT multi-byte UTF-8
chars: we seem to lose one char from the end of the string for each
multi-byte UTF-8 character it contains.
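To make the arithmetic concrete, here is a minimal stdlib-only C++ sketch that compares the byte count of a UTF-8 string with its code-point count. It does not use Xerces-C, and the helper names utf8ByteCount and utf8CodePointCount are hypothetical. The point is only that each "ź" (U+017A) takes two bytes in UTF-8, so a length measured in bytes exceeds a length measured in characters by one per such character. That difference matches the number of chars lost above, which suggests (but does not prove) that a byte length is being compared with or used as a character length somewhere in the conversion.

```cpp
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 encoded string (what strlen() would report).
std::size_t utf8ByteCount(const std::string& s) {
    return s.size();
}

// Number of code points: count every byte that is NOT a UTF-8
// continuation byte (continuation bytes have the bit pattern 10xxxxxx).
std::size_t utf8CodePointCount(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For the string from the debug output, written with explicit byte escapes as "110,Brze\xC5\xBAna---Brze\xC5\xBAna", the two counts differ by exactly two, one for each "ź".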

Given that I am using the deprecated DOM, and thus DOMString, everywhere, is
there any hope of a stop-gap measure, or do I have to bite the
transition bullet now?

Thanks in advance for your help,
/Matt.

-----Original Message-----
From: David Bertoni [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, 10 September 2008 5:29 AM
To: [email protected]
Subject: Re: Losing UTF-8 characters at the end of a string

Matthew Boulter wrote:
> Hi all, I just wanted some guidance on where to focus my investigation
> effort on this topic.
> 
> I have a MySQL database that contains names of some Polish tram stops
> that I am extracting and encoding as WBXML for transmission.
> 
> Now I find when I get them from the database all is good until I get to
> the part where I'm at our DomToWbxml task.
> 
> I find that if the string has a Polish character it loses a character
> from the end of the string; if there are two it loses two, and so on.
> 
> I read that Xerces is UTF-16 internally. If so, am I losing something
> (other than my mind) going back to UTF-8?
> 
> Any help is greatly appreciated.

This is probably the number one problem people experience when using
Xerces-C.

Please read the documentation carefully, as the transcoding API you're 
using is _not_ transcoding to UTF-8.  Rather, it is transcoding to the 
local code page, so the disappearing characters are probably not 
representable in the local code page.  Instead of using
DOMString::transcode(), you need to create a UTF-8 transcoder and use that.
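As a rough illustration of what "transcoding from UTF-8" means, as opposed to going through the local code page, here is a minimal stdlib-only C++ sketch that decodes UTF-8 bytes into UTF-16 code units (the representation Xerces-C uses for XMLCh). It is not the Xerces-C transcoder API, and the function name utf8ToUtf16 is hypothetical. In real code you would use a UTF-8 transcoder obtained from the library, as described in the archive thread linked below. The sketch is simplified and does not reject overlong encodings.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Decode a UTF-8 byte string into UTF-16 code units.
// Simplified: rejects bad lead/continuation bytes and truncated input,
// but does not reject overlong encodings or encoded surrogates.
std::u16string utf8ToUtf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i - 1 < extra)
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Fold in six payload bits from each continuation byte.
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");
        if (cp < 0x10000) {
            // Fits in a single UTF-16 code unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Needs a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Decoding "Brze\xC5\xBAna" this way gives seven UTF-16 code units, with the two bytes C5 BA collapsing into the single unit U+017A, so no trailing characters are dropped.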

Also, you're using the deprecated DOM, which will disappear in Xerces-C 
3.0.  I would suggest you update your code to use the new DOM.

For more information, please search the mailing list archives for 
"transcoding."  Here's a good place to start:

http://marc.info/?l=xerces-c-users&m=119514889329902&w=2

Dave
