When I said "UTF-8 character" I meant a Polish character: strings that contain only Latin-1 characters don't lose anything at the end, whereas strings that contain Polish characters lose one character at the end for each Polish character they contain.
Yes I am certain that the character data is in UTF-8... so are you able to show me an example of using a transcoder based on the deprecated DOM?

/Matt.

-----Original Message-----
From: David Bertoni [mailto:[EMAIL PROTECTED]
Sent: Friday, 12 September 2008 6:20 AM
To: [email protected]
Subject: Re: Losing UTF-8 characters at the end of a string

Matthew Boulter wrote:
> Well I will try to migrate to using the non-deprecated DOM if we have no
> other choice.
> At the moment time constraints are quite tight.
>
> In light of that I am looking at a stop-gap solution. Looking at the
> constructor for DOMString(const char *)
> I see that it puts the char * thru a transcoder itself
>
>     if (!uniConverter->transcode(srcString, strData, srcLen) ||
>         (XMLString::stringLen(strData) != srcLen))
>
> I added some debug statements to the DOMString() method and a println() at
> the bottom of the method, here's the result.
>
> [----DOMString::DOMString(const char *srcString)----]
> srcString == 110,Brzeźna---Brzeźna
> calling print() 110,Brzeźna---Brzeź
>
> So the loss of those two chars (which are NOT UTF-8 chars, we lose 1 char at
> the end of the line per UTF8 character in the string?!)
> does indeed happen inside this constructor.

You've got me completely confused here. I'm not sure what you mean by characters that are not UTF-8 characters. UTF-8 encodes Unicode code points, which are numeric representations for abstract characters. UTF-8 can encode any valid Unicode character. There is no such thing as a "UTF-8 character." Perhaps you mean those are not UTF-8 code units? In other words, are you saying your data is not encoded in UTF-8?

It would be helpful if you could provide the actual hexadecimal values of the bytes of your example strings. That would clarify this situation, because we could see your actual data. It would also be helpful if you looked at the DOMString itself in the debugger, so you can provide the actual UTF-16 code units.
>
> Given that I am using the deprecated DOM and thus DOMString everywhere is
> there any hope of a stopgap measure
> or do I have to bite the transition-bullet now.

There's nothing wrong with what DOMString is doing. The constructor that takes a const char* assumes the data is in the local code page, so it uses the local code page transcoder. Since your data contains characters that are not in the local code page, you should avoid using the DOMString constructor that takes a const char*, and avoid any of the local code page transcoding functions, such as XMLString::transcode().

Instead, you need to use a UTF-8 transcoder to convert your data from UTF-8 to UTF-16 (char* to XMLCh*). You'll also need to use the same converter whenever you convert from UTF-16 to UTF-8 (XMLCh* to char*). Of course, this assumes that you are certain that your incoming data is always in UTF-8 and that you always want to transcode back to UTF-8 from UTF-16.

Note that migrating from the deprecated DOM to the new DOM will not fix these bugs in your code.

Dave
