Hi David,

Thanks for the update. I converted the characters from UCS-2 to UTF-8 using
C APIs. I took these Chinese characters (您是如) from Google Translate
and used them in the XML file to test Unicode support. When I converted these
characters from UCS-2 to UTF-8 using the C APIs, I got these characters (귦꺡髧„).
Now I am not getting the errors from the Xerces parser.

But I have a question. Will the characters themselves change from one format
to another? If I have a string "abcd", will it change from one format
to another? I understand that the encoding differs between formats,
but I do not understand why the characters themselves are changing
from one format to another. Any information on this would be a
great help to me.
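
For reference, the question above can be illustrated with a small sketch (an illustration only, not the C API code used for the actual conversion). It assumes the input is plain UCS-2 with no surrogate pairs, and the helper names ucs2ToUtf8 and toHex are made up for this example. The code points stay the same in every encoding; only the bytes that represent them differ.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Encode a sequence of UCS-2 code units (BMP code points) as UTF-8 bytes.
// Assumption: no surrogates (U+D800..U+DFFF) appear in the input.
std::string ucs2ToUtf8(const std::vector<std::uint16_t>& units) {
    std::string out;
    for (std::uint16_t u : units) {
        if (u < 0x80) {
            // ASCII range: one byte, same value as the code point.
            out += static_cast<char>(u);
        } else if (u < 0x800) {
            // Two bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (u >> 6));
            out += static_cast<char>(0x80 | (u & 0x3F));
        } else {
            // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (u >> 12));
            out += static_cast<char>(0x80 | ((u >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (u & 0x3F));
        }
    }
    return out;
}

// Render bytes as space-separated uppercase hex, for inspection.
std::string toHex(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char b : bytes) {
        if (!out.empty()) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}
```

With this sketch, the code points U+60A8 U+662F U+5982 (您是如) come out as the UTF-8 bytes E6 82 A8 E6 98 AF E5 A6 82, while "abcd" keeps the byte values 61 62 63 64, because ASCII characters are encoded identically in UTF-8. If those UTF-8 bytes are later displayed by a tool that assumes a different encoding, different glyphs appear even though the underlying characters have not changed.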

Thanks,
Jaya Nageswar.

On Wed, Sep 3, 2008 at 3:18 AM, David Bertoni <[EMAIL PROTECTED]> wrote:

> Jaya Nageswar wrote:
>
>> Hi,
>>
>> I am using Xerces-C 1.7.0 (ICU build) for parsing XML files. I have some
>> special Chinese characters in the XML file, so I am using the ICU build to
>> support Unicode. I defined the encoding as UTF-8:
>>
>> *<?xml version="1.0" encoding="UTF-8"?>*
>>
>> Part of the XML file has the following Chinese characters:
>>  *      <Convert>
>>            <FromValue>TRUE</FromValue>
>>            <ToValue>您是如</ToValue>
>>        </Convert>
>>        <Convert>
>>            <FromValue>FALSE</FromValue>
>>            <ToValue>您好</ToValue>
>>        </Convert>*
>>
>> I am using DOM to parse the XML file. I have the following code for DOM
>> parsing:
>>
>> *    static const XMLCh gLS[] = { chLatin_L, chLatin_S, chNull };
>>    DOMImplementation *impl =
>>        DOMImplementationRegistry::getDOMImplementation(gLS);
>>    DOMBuilder *CtlParser =
>>        ((DOMImplementationLS*)impl)->createDOMBuilder(
>>            DOMImplementationLS::MODE_SYNCHRONOUS, 0);*
>>
>> *    CtlParser->setFeature(XMLUni::fgDOMNamespaces, true);
>>    CtlParser->setFeature(XMLUni::fgXercesSchema, true);
>>    CtlParser->setFeature(XMLUni::fgXercesSchemaFullChecking, true);
>>    CtlParser->setFeature(XMLUni::fgDOMValidateIfSchema, true);*
>>
>> *    //create our error handler and install it
>>    XMLErrorHandler errorHandler;
>>    CtlParser->setErrorHandler(&errorHandler);
>>
>>    CtlDoc = CtlParser->parseURI(XMLFilePath);
>>    if (errorHandler.getSawErrors())
>>    {
>>        cout << errorHandler.ReturnErrorMessage();
>>    } *
>>
>>
>> I am getting the following error.
>> *Message: An exception occurred! Type:UTFDataFormatException,
>> Message:invalid byte 2 (�) of a 2-byte sequence.*
>>
> This indicates your file is not really encoded in UTF-8.
>
>
>> I do not understand why I am getting this error even though I am using the
>> Xerces-C ICU build. The ICU build is supposed to work with Unicode characters.
>> If I remove the Chinese characters, I do not get any error message while
>> parsing.
>>
> Xerces-C supports UTF-8 even without using the ICU transcoders.
>
>
>> If anybody has worked with Unicode in Xerces-C, please help me. Did I miss
>> any of the parser settings for Unicode?
>>
> Your file is not encoded in UTF-8, so the parser reports an error.  You can
> either fix the file so it's properly encoded, or update the encoding in the
> XML declaration to reflect the actual encoding.
>
> Dave
>
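
As a rough illustration of the error discussed above: a 2-byte UTF-8 sequence must consist of a lead byte followed by a continuation byte of the form 10xxxxxx. The sketch below is a simplified structural check, not the Xerces-C implementation. The function names are hypothetical, and it does not reject every ill-formed case (for example overlong 3-byte forms or encoded surrogates).

```cpp
#include <cstddef>
#include <string>

// Expected length of a UTF-8 sequence given its lead byte, or 0 if the
// byte cannot start a sequence (a stray continuation byte or invalid lead).
static std::size_t sequenceLength(unsigned char lead) {
    if (lead < 0x80) return 1;                    // ASCII
    if (lead >= 0xC2 && lead <= 0xDF) return 2;   // 110xxxxx
    if (lead >= 0xE0 && lead <= 0xEF) return 3;   // 1110xxxx
    if (lead >= 0xF0 && lead <= 0xF4) return 4;   // 11110xxx
    return 0;
}

// Simplified check: every lead byte must be followed by the right number
// of continuation bytes (10xxxxxx). Returns false at the first violation.
bool looksLikeUtf8(const std::string& bytes) {
    std::size_t i = 0;
    while (i < bytes.size()) {
        std::size_t len = sequenceLength(static_cast<unsigned char>(bytes[i]));
        if (len == 0 || i + len > bytes.size()) return false;
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
        }
        i += len;
    }
    return true;
}
```

For example, looksLikeUtf8("\xE6\x82\xA8") (the UTF-8 bytes of 您) returns true, while the byte pair C4 FA, chosen here as an illustrative stand-in for a character saved in a legacy double-byte encoding, returns false because FA is not a continuation byte. Running a check like this over the raw file contents is one way to confirm whether the file matches the encoding named in its XML declaration.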
