Re: Losing UTF-8 characters at the end of a string

David Bertoni Thu, 11 Sep 2008 13:20:12 -0700

Matthew Boulter wrote:

Well I will try to migrate to using the non-deprecated DOM if we have no other 
choice.
At the moment time constraints are quite tight.


In light of that I am looking at a stop-gap solution. Looking at the 
constructor for DOMString(const char *)
I see that it puts the char * thru a transcoder itself

        if (!uniConverter->transcode(srcString, strData, srcLen) || 
(XMLString::stringLen(strData) != srcLen))

I added some debug statements to the DOMString() method and a println() at the 
bottom of the method, here's the result.

        [----DOMString::DOMString(const char *srcString)----]
        srcString == 110,Brzeźna---Brzeźna
        calling print() 110,Brzeźna---Brzeź

So the loss of those two chars (which are NOT UTF-8 chars, we lose 1 char at 
the end of the line per UTF8 character in the string?!)
does indeed happen inside this constructor.

You've got me completely confused here. I'm not sure what you mean by characters that are not UTF-8 characters. UTF-8 encodes Unicode codepoints, which are numeric representations for abstract characters. UTF-8 can encode any valid Unicode character. There is no such thing as a "UTF-8 character."

Perhaps you mean those are not UTF-8 code units? In other words, are you saying your data is not encoded in UTF-8?

It would be helpful if you could provide the actual hexadecimal values of the bytes of your example strings. That would clarify this situation, because we could see your actual data. It would also be helpful if you looked at the DOMString itself in the debugger, so you can provide the actual UTF-16 code units.
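For anyone wanting to produce such a dump, here is a minimal sketch using only the standard C++ library. The helper name hexDump is made up for this example and is not part of Xerces-C.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not a Xerces-C function): returns the bytes of `s`
// as space-separated, two-digit, lowercase hexadecimal values.
std::string hexDump(const std::string& s)
{
    std::string out;
    char buf[4];
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        const unsigned value = static_cast<unsigned char>(s[i]);
        std::snprintf(buf, sizeof buf, "%02x", value);
        if (!out.empty())
            out += ' ';
        out += buf;
    }
    return out;
}
```

If the input really is UTF-8, the character U+017A in "Brzeźna" should show up as the two bytes c5 ba. A single byte in that position would suggest the data is in some other encoding.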


Given that I am using the deprecated DOM and thus DOMString everywhere is there 
any hope of a stopgap measure
or do I have to bite the transition-bullet now.

There's nothing wrong with what DOMString is doing. The constructor that takes a const char* assumes the data is in the local code page, so it uses the local code page transcoder.
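To illustrate why that assumption matters, here is a small sketch of what happens when UTF-8 bytes are decoded as if they were in a single-byte code page. It assumes ISO-8859-1 as a stand-in for the local code page (the actual code page on the poster's system is not stated in this thread), and latin1ToUtf16 is a name invented for the example.

```cpp
#include <string>

// Hypothetical helper: decodes `in` as ISO-8859-1, which amounts to widening
// every byte to one UTF-16 code unit. ISO-8859-1 is an assumed stand-in for
// "the local code page"; a real local code page may map bytes differently.
std::u16string latin1ToUtf16(const std::string& in)
{
    std::u16string out;
    out.reserve(in.size());
    for (std::string::size_type i = 0; i < in.size(); ++i)
        out.push_back(static_cast<char16_t>(static_cast<unsigned char>(in[i])));
    return out;
}
```

Under this assumption the two UTF-8 bytes c5 ba come out as two separate code units, U+00C5 and U+00BA, rather than the single code unit U+017A, so the resulting string has the wrong content and the wrong length.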

Since your data contains characters that are not in the local code page, you should avoid using the DOMString constructor that takes a const char*, and avoid any of the local code page transcoding functions, such as XMLString::transcode().

Instead, you need to use a UTF-8 transcoder to convert your data from UTF-8 to UTF-16 (char* to XMLCh*). You'll also need to use the same converter whenever you convert from UTF-16 to UTF-8 (XMLCh* to char*).
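As a rough illustration of what such a conversion does, here is a standard-library-only sketch. It is not the Xerces-C transcoder: in real Xerces code you would obtain a UTF-8 XMLTranscoder from the library's transcoding service instead. The function names utf8ToUtf16 and utf16ToUtf8 are made up for this example, and std::u16string stands in for an XMLCh buffer.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into UTF-16 code units.
// Throws std::runtime_error on malformed, truncated, or overlong input.
std::u16string utf8ToUtf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;        // code point being assembled
        std::size_t extra;  // continuation bytes expected after the lead byte
        char32_t minCp;     // smallest code point allowed for this length
        if (lead < 0x80)                { cp = lead;        extra = 0; minCp = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; minCp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; minCp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; minCp = 0x10000; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("overlong or out-of-range code point");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above U+FFFF need a surrogate pair in UTF-16.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Hypothetical helper: encodes UTF-16 code units back into UTF-8 bytes.
// Throws std::runtime_error on an unpaired surrogate.
std::string utf16ToUtf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                throw std::runtime_error("unpaired high surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::runtime_error("unpaired low surrogate");
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

For the example string in this thread, the UTF-8 form is 23 bytes long while the UTF-16 form is 21 code units long, so a byte count must never be used as a character count after conversion. Using the same pair of functions in both directions is what makes the round trip lossless.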

Of course, this assumes that you are certain that your incoming data is always in UTF-8 and that you always want to transcode back to UTF-8 from UTF-16.

Note that migrating from the deprecated DOM to the new DOM will not fix these bugs in your code.


Dave
