RE: Losing UTF-8 characters at the end of a string

Matthew Boulter Thu, 11 Sep 2008 20:26:36 -0700

When I said "UTF-8 character" I meant a Polish character. Strings that
contain only Latin-1 characters don't lose anything at the end, whereas
strings that contain Polish characters lose one character at the end of
the string for each Polish character they contain.

Yes, I am certain that the character data is in UTF-8. Are you able to
show me an example of using a transcoder with the deprecated DOM?

/Matt.

-----Original Message-----
From: David Bertoni [mailto:[EMAIL PROTECTED] 
Sent: Friday, 12 September 2008 6:20 AM
To: [email protected]
Subject: Re: Losing UTF-8 characters at the end of a string

Matthew Boulter wrote:
> Well I will try to migrate to using the non-deprecated DOM if we have
> no other choice.
> At the moment time constraints are quite tight.
> 
> In light of that I am looking at a stop-gap solution. Looking at the
> constructor for DOMString(const char *)
> I see that it puts the char * thru a transcoder itself
>
>         if (!uniConverter->transcode(srcString, strData, srcLen) ||
>             (XMLString::stringLen(strData) != srcLen))
>
> I added some debug statements to the DOMString() method and a println()
> at the bottom of the method, here's the result.
> 
>       [----DOMString::DOMString(const char *srcString)----]
>       srcString == 110,Brzeźna---Brzeźna
>       calling print() 110,Brzeźna---Brzeź
> 
> So the loss of those two chars (which are NOT UTF-8 chars, we lose 1
> char at the end of the line per UTF8 character in the string?!)
> does indeed happen inside this constructor.
You've got me completely confused here.  I'm not sure what you mean by 
characters that are not UTF-8 characters.  UTF-8 encodes Unicode code 
points, which are numeric representations for abstract characters. 
UTF-8 can encode any valid Unicode character.  There is no such thing as 
a "UTF-8 character."

Perhaps you mean those are not UTF-8 code units?  In other words, are 
you saying your data is not encoded in UTF-8?

It would be helpful if you could provide the actual hexadecimal values 
of the bytes of your example strings.  That would clarify this 
situation, because we could see your actual data.  It would also be 
helpful if you looked at the DOMString itself in the debugger, so you 
can provide the actual UTF-16 code units.
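
If it helps, here is a minimal sketch of one way to dump those bytes. It
is plain C++ and does not touch Xerces at all; hexDump is just a name
made up for this illustration, not a library function.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not part of Xerces): render every byte of a
// char string as two upper-case hex digits, separated by spaces.
std::string hexDump(const std::string& bytes)
{
    std::string out;
    char buf[4];
    for (std::string::size_type i = 0; i < bytes.size(); ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        const unsigned value = static_cast<unsigned char>(bytes[i]);
        std::snprintf(buf, sizeof buf, "%02X", value);
        if (i != 0) {
            out += ' ';
        }
        out += buf;
    }
    return out;
}
```

If the data really is UTF-8, the "ź" in your example should show up as
the two bytes C5 BA. A single byte in that position would suggest some
other encoding.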

> 
> Given that I am using the deprecated DOM and thus DOMString everywhere is 
> there any hope of a stopgap measure
> or do I have to bite the transition-bullet now.
There's nothing wrong with what DOMString is doing.  The constructor 
that takes a const char* assumes the data is in the local code page, so 
it uses the local code page transcoder.

Since your data contains characters that are not in the local code page, 
you should avoid using the DOMString constructor that takes a const 
char*, and avoid any of the local code page transcoding functions, such 
as XMLString::transcode().

Instead, you need to use a UTF-8 transcoder to convert your data from 
UTF-8 to UTF-16 (char* to XMLCh*).  You'll also need to use the same 
converter whenever you convert from UTF-16 to UTF-8 (XMLCh* to char*).

Of course, this assumes that you are certain that your incoming data is 
always in UTF-8 and that you always want to transcode back to UTF-8 from 
UTF-16.
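
To illustrate what a UTF-8 to UTF-16 conversion does, here is a
hand-rolled sketch. It deliberately does not use the Xerces transcoder
classes (I would rather not quote their signatures from memory), and the
function name utf8ToUtf16 is invented for this example. It also assumes a
C++11 compiler for std::u16string. In real code you would use a proper
UTF-8 transcoder instead.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Illustrative only: decode UTF-8 bytes into UTF-16 code units.
// Minimal validation; it does not reject overlong forms or encoded
// surrogates, which a production transcoder should.
std::u16string utf8ToUtf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF) {
            throw std::invalid_argument("code point out of range");
        }
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}
```

Note that the input and output lengths differ. If your example string is
UTF-8, "110,Brzeźna---Brzeźna" is 23 bytes but only 21 UTF-16 code units,
because each "ź" takes two bytes. That difference of one per Polish
character at least matches the pattern you describe, although I have not
verified that this is what actually causes the truncation.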

Note that migrating from the deprecated DOM to the new DOM will not fix 
these bugs in your code.

Dave
