Matthew Boulter wrote:
Well, I will try to migrate to the non-deprecated DOM if we have no other
choice.
At the moment, time constraints are quite tight.
In light of that, I am looking at a stop-gap solution. Looking at the
constructor for DOMString(const char *),
I see that it puts the char * through a transcoder itself:
if (!uniConverter->transcode(srcString, strData, srcLen) ||
(XMLString::stringLen(strData) != srcLen))
I added some debug statements to the DOMString() method and a println() at the
bottom of the method. Here's the result:
[----DOMString::DOMString(const char *srcString)----]
srcString == 110,Brzeźna---Brzeźna
calling print() 110,Brzeźna---Brzeź
So the loss of those two chars (which are NOT UTF-8 chars; do we lose 1 char at
the end of the line per UTF-8 character in the string?!)
does indeed happen inside this constructor.
You've got me completely confused here. I'm not sure what you mean by
characters that are not UTF-8 characters. UTF-8 encodes Unicode code
points, which are numeric representations for abstract characters.
UTF-8 can encode any valid Unicode character. There is no such thing as
a "UTF-8 character."
Perhaps you mean those are not UTF-8 code units? In other words, are
you saying your data is not encoded in UTF-8?
It would be helpful if you could provide the actual hexadecimal values
of the bytes of your example strings. That would clarify this
situation, because we could see your actual data. It would also be
helpful if you looked at the DOMString itself in the debugger, so you
can provide the actual UTF-16 code units.
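If it helps, here is a minimal sketch of one way to produce such a dump of the
bytes. It uses only the standard C++ library, and hexDump is a hypothetical
helper name of my own, not anything from Xerces-C.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: returns the bytes of a NUL-terminated string as
// space-separated, two-digit uppercase hex values.
std::string hexDump(const char* s)
{
    std::string out;
    char buf[4];
    for (const unsigned char* p = reinterpret_cast<const unsigned char*>(s);
         *p != 0; ++p) {
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(*p));
        if (!out.empty())
            out += ' ';
        out += buf;
    }
    return out;
}
```

If your source string really is UTF-8, "Brzeźna" should come out as
42 72 7A 65 C5 BA 6E 61, with the two bytes C5 BA encoding the single
code point U+017A.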
Given that I am using the deprecated DOM, and thus DOMString, everywhere, is
there any hope of a stopgap measure,
or do I have to bite the transition bullet now?
There's nothing wrong with what DOMString is doing. The constructor
that takes a const char* assumes the data is in the local code page, so
it uses the local code page transcoder.
Since your data contains characters that are not in the local code page,
you should avoid using the DOMString constructor that takes a const
char*, and avoid any of the local code page transcoding functions, such
as XMLString::transcode().
Instead, you need to use a UTF-8 transcoder to convert your data from
UTF-8 to UTF-16 (char* to XMLCh*). You'll also need to use the same
converter whenever you convert from UTF-16 to UTF-8 (XMLCh* to char*).
Of course, this assumes that you are certain that your incoming data is
always in UTF-8 and that you always want to transcode back to UTF-8 from
UTF-16.
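
To illustrate what a matched pair of UTF-8/UTF-16 conversions does, here is a
rough, stand-alone sketch in standard C++. It deliberately uses no Xerces-C
calls: utf8ToUtf16 and utf16ToUtf8 are hypothetical names of my own, and
std::u16string merely stands in for an XMLCh* buffer. In real code you would
more likely use a UTF-8 transcoder supplied by the library.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into UTF-16 code units.
// Throws std::runtime_error on malformed input.
std::u16string utf8ToUtf16(const std::string& in)
{
    static const char32_t minValue[] = {0, 0x80, 0x800, 0x10000};
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (lead < 0x80)                       { cp = lead;        extra = 0; }
        else if (lead >= 0xC2 && lead <= 0xDF) { cp = lead & 0x1F; extra = 1; }
        else if (lead >= 0xE0 && lead <= 0xEF) { cp = lead & 0x0F; extra = 2; }
        else if (lead >= 0xF0 && lead <= 0xF4) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (in.size() - i <= extra)
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates, and values beyond U+10FFFF.
        if (cp < minValue[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point in UTF-8 input");

        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {  // encode as a surrogate pair
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
        i += extra + 1;
    }
    return out;
}

// Hypothetical helper: encodes UTF-16 code units as UTF-8 bytes.
// Throws std::runtime_error on unpaired surrogates.
std::string utf16ToUtf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {  // high surrogate: needs a low one
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                throw std::runtime_error("unpaired high surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::runtime_error("unpaired low surrogate");
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Note that the number of UTF-16 code units generally differs from the number of
UTF-8 bytes: "110,Brzeźna" is 12 bytes in UTF-8 but 11 code units in UTF-16,
so code that compares the two lengths will misbehave on such data.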
Note that migrating from the deprecated DOM to the new DOM will not fix
these bugs in your code.
Dave