Invalid XML Characters: when valid UTF8 does not mean valid XML

 

            Googling for "valid xml characters" answered my question to some 
degree.

 

            So let me describe the problem I am facing, in the hope that 
somebody reading this list has a clever solution for us.

 

            Our legacy server is OpenVMS + Pascal + C + C++ code supporting 
ISO-8859-1 (ISO Latin-1).

 

            Strings are encoded as 8-bit quantities, which covers the 
Western European languages.

 

            When we have to pass a string back to the client, we need to convert 
it in the legacy server into UTF-8, so that "ä ö ü é è à" etc. are converted 
to the proper two-byte UTF-8 sequences.
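
To make the conversion step concrete, here is a minimal Java sketch. Our real conversion runs in C on the legacy server; the class name Latin1ToUtf8 and its method are only illustrative, not part of Axis2 or of our code.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: shows what the ISO-8859-1 -> UTF-8 step does to the bytes.
final class Latin1ToUtf8 {

    // Decode the 8-bit ISO-8859-1 bytes into a String, then encode as UTF-8.
    static byte[] convert(byte[] latin1) {
        String text = new String(latin1, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] latin1 = { (byte) 0xE4 };   // "ä" as a single ISO-8859-1 byte
        for (byte b : convert(latin1)) {
            System.out.printf("%02X ", b & 0xFF);   // prints: C3 A4
        }
        System.out.println();
    }
}
```

Note that a 7-bit control character such as form feed (0x0C) passes through this conversion unchanged, which is exactly where the trouble starts.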

 

            This works nicely down to the client.

 

            But is this enough to ensure that the Axis2 engine, working in 
conjunction with a) AXIOM -> org.apache.axiom.om.impl.builder.StAXOMBuilder and 
b) the HTTP transport etc., does not raise an exception, an Axis fault?

 

            NO

 

            UTF8 does not mean you have valid XML characters! You should be 
aware of that!

 

            So it is as it is for now: the assumed-correct XML string with a 
form feed (or a vertical tab) in it is not a valid XML string. Of the control 
characters below 0x20, XML 1.0 allows only tab, line feed and carriage return.
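
For reference, the check can be sketched in Java directly from the Char production of the XML 1.0 specification. The class and method names are my own and are not an Axis2 or AXIOM API.

```java
// Sketch of the XML 1.0 "Char" production:
//   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
final class Xml10Chars {

    // True if the code point may appear in an XML 1.0 document.
    static boolean isValid(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Index of the first character that XML 1.0 forbids, or -1 if none.
    static int firstInvalidIndex(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isValid(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstInvalidIndex("line1\r\nline2\tok")); // prints -1
        System.out.println(firstInvalidIndex("page1\fpage2"));       // prints 5
    }
}
```

Running such a check on the string before it is turned into an OMElement would at least pinpoint the offending character instead of ending in a fault.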

 

            Conversion from ISO-8859-1 to UTF-8 is not enough.

 

            The conversion must be complemented by a proper escaping technique, 
so that only valid XML characters travel through the AXIOM / AXUTIL / Axis2 layers.
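
One possible shape for such an escaping technique is sketched below. Be aware that XML 1.0 does not allow a numeric character reference like &#12; either, so ordinary XML escaping cannot carry a form feed. The sketch therefore assumes a private convention that both client and server must agree on: a backslash is doubled, and every character forbidden in XML 1.0 is written as backslash, "u" and four hex digits. The class XmlSafeText and this convention are hypothetical; Base64-encoding the whole text would be another option.

```java
// Hypothetical, reversible escaping for characters that XML 1.0 forbids.
final class XmlSafeText {

    private static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Sender side: make the text safe to place inside an XML element.
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == '\\') {
                out.append("\\\\");                        // double the escape character
            } else if (isValidXml10(cp)) {
                out.appendCodePoint(cp);                   // ordinary text, ä ö ü included
            } else {
                out.append(String.format("\\u%04X", cp));  // forbidden chars are all below 0x10000
            }
        }
        return out.toString();
    }

    // Receiver side: restore the original control characters.
    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\\') {
                out.append(c);
                continue;
            }
            if (i + 1 >= s.length()) {
                throw new IllegalArgumentException("dangling escape at end of text");
            }
            char next = s.charAt(++i);
            if (next == '\\') {
                out.append('\\');
            } else if (next == 'u' && i + 4 < s.length()) {
                out.append((char) Integer.parseInt(s.substring(i + 1, i + 5), 16));
                i += 4;                                    // skip the four hex digits
            } else {
                throw new IllegalArgumentException("bad escape at index " + i);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "page1\fpage2";
        String safe = escape(raw);
        System.out.println(safe);                          // form feed shown as an escape token
        System.out.println(unescape(safe).equals(raw));    // prints true
    }
}
```

The same pair of routines would have to exist on the legacy server and on the client side, since the XML layer in between sees only the escaped text.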

 

            We have to convert Microsoft Word documents containing Word special 
characters into proper text characters, then convert them to UTF-8 and send them 
to the server, where they get converted 

            to ISO Latin-1, including of course the "lower ASCII control set" 
<FF> <LF> <CR> <VT> <HT> and the like.

 

            AND

 

            We have to convert text documents stored on the server that contain 
all these non-valid XML characters (see above) 

            and also 

            we have to convert valid XML characters like (ä ö ü è é à etc.) into 
UTF-8 and return them to the client.

 

            And of course we have to do it the other way around when such 
data arrives from the client.

 

            How to do that best is my burning question.

 

Josef.Stadelmann

@AXA-winterthur.ch      

 

 

 

From: Stadelmann Josef [mailto:josef.stadelm...@axa-winterthur.ch] 
Sent: Monday, 10 May 2010 12:10
To: axis-u...@ws.apache.org
Subject: how to deal with a Form Feed

 

Hi all,

my legacy server, called by Axis2/Java web service code, returns a long XML 
string to Axis2.

This string is completed in the web service  (Java code) and then converted to 
an AXIOM OMElement type.

When the legacy server returns a <FF>, a real form feed character, ASCII code 
12, then something goes wrong and my .NET WCF 3.5 stub returns the error 
message to VB

        "The remote server returns an unexpected response: (400) Bad Request."

However, I know that my legacy server works with ISO-LATIN 8859-1 and that 
Axis2 and/or Java likes UTF-8.

Hence I call the Axis2/C Isolatin2UTF8 routine before I return the string to 
the Java Web Service part.

That is to say, we make sure that only UTF-8 is given as the response by the 
legacy server code to the web service.

In the Java web service part this <FF> character is not converted because there 
is no need to convert it; 

this conforms to the UTF-8 standard, which says that any 7-bit character does 
not need to be converted and that such 7-bit values are the same in UTF-8.

So any clue what goes wrong?

We use SOAP/XML over an HTTP transport.

If we take the <FF> out of the text we intend to transfer back to the 
VB/C#/.NET/WCF 3.5 client, all is OK.
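
The effect can be reproduced outside Axis2 with a few lines of Java. This sketch uses the StAX parser that ships with the JDK; AXIOM may be configured with a different StAX implementation, so take it as an illustration of the XML rule rather than of what Axis2 does internally. The class name is made up.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Illustration: a form feed inside element content makes the document ill-formed.
final class FormFeedParseDemo {

    // True if the StAX parser reads the whole document without an error.
    static boolean parses(String xml) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                reader.next();
            }
            reader.close();
            return true;
        } catch (XMLStreamException e) {
            return false;   // the parser rejected the document
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("<text>page1 page2</text>"));   // prints true
        System.out.println(parses("<text>page1\fpage2</text>"));  // prints false
    }
}
```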

Josef
