Invalid XML characters: when valid UTF-8 does not mean valid XML
Googling for "valid xml characters" answered my question to some degree. So let me describe the problem I am facing, in the hope that somebody has a clever solution for us and for everyone else reading this list.

Our legacy server is OpenVMS with Pascal, C, and C++ code, and it supports ISO-8859-1 (Latin-1). Strings are encoded as 8-bit quantities, which covers the Western European languages. When we have to pass a string back to the client, we convert it in the legacy server to UTF-8, so that characters such as "ä ö ü é è à" become the proper two-byte UTF-8 sequences. This works nicely all the way down to the client.

But is this enough to ensure that the Axis2 engine, working in conjunction with

a) AXIOM -> org.apache.axiom.om.impl.builder.StAXOMBuilder and
b) the HTTP transport etc.

does not raise an exception, an Axis fault? No. Valid UTF-8 does not mean you have valid XML characters. You should be aware of that!

So it is as it is for now: a string that is assumed to be correct XML is not a valid XML string if it contains a form feed. Of the control characters below 0x20, XML 1.0 allows only horizontal tab (0x09), line feed (0x0A), and carriage return (0x0D); a form feed (0x0C) or a vertical tab (0x0B) makes the string invalid. Conversion from ISO-8859-1 to UTF-8 is therefore not enough. The conversion must be complemented by a proper escaping technique so that only valid XML characters travel through the AXIOM, AXUTIL, and AXIS2 layers.

We have to convert Microsoft Word documents containing Word special characters into proper text characters, then convert the text to UTF-8 and send it to the server, where it gets converted to ISO-8859-1 and, of course, to the "Lower ASCII Control Set": <FF> <LF> <CR> <VT> <HT> and the like.

We also have to convert text documents stored on the server that contain all these characters that are not valid in XML (see above), and we have to convert valid XML characters such as "ä ö ü è é à" into UTF-8 and return them to the client.

And of course we have to do it the other way around when such content arrives from the client.

How best to do that is my burning question.

Josef.Stadelmann @AXA-winterthur.ch

From: Stadelmann Josef [mailto:josef.stadelm...@axa-winterthur.ch]
Sent: Monday, 10 May 2010 12:10
To: axis-u...@ws.apache.org
Subject: how to deal with a Form Feed

Hi all,

My legacy server, called by Axis2/J web service code, returns a long XML string to Axis2. This string is completed in the web service (Java code) and then converted to an AXIOM OMElement.

When the legacy server returns a <FF>, a real form feed character, ASCII code 12, something goes wrong and my .NET WCF 3.5 stub returns this error message to VB: "The remote server returns an unexpected response: (400) Bad Request."

However, I know that my legacy server works with ISO-8859-1 and that Axis2 and/or Java likes UTF-8. Hence I call the Axis2/C Isolatin2UTF8 routine before I return the string to the Java web service part. That is to say, we make sure that the legacy server code gives only UTF-8 as its response to the web service.

In the Java web service part this <FF> character is not converted because there is no need to convert it; this conforms to the UTF-8 standard, which says that any 7-bit character does not need to be converted and that such 7-bit values are the same in UTF-8.

So any clue what goes wrong? We use SOAP/XML over an HTTP transport. If we take the <FF> out of the text we intend to transfer back to the VB/C#/.NET/WCF 3.5 client, all is OK.

Josef
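
One possible way to handle the filtering step discussed in this thread, before the string is turned into an OMElement, is sketched below in Java. It is only a minimal sketch under stated assumptions: the class XmlCharFilter and its methods are hypothetical names, not part of Axis2 or AXIOM, and replacing a forbidden character with a space is just one policy among several. The validity check follows the "Char" production of XML 1.0.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper; the names are illustrative and not part of Axis2 or AXIOM. */
final class XmlCharFilter {

    /** True if the code point matches the XML 1.0 "Char" production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Decodes legacy ISO-8859-1 bytes into a Java String (one byte per character). */
    static String fromLatin1(byte[] legacyBytes) {
        return new String(legacyBytes, StandardCharsets.ISO_8859_1);
    }

    /** Encodes a Java String as UTF-8 bytes for the wire. */
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Replaces every code point that XML 1.0 forbids with the given replacement text. */
    static String sanitize(String text, String replacement) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                // Assumption: dropping the information is acceptable here.
                out.append(replacement);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "page1", a form feed, Latin-1 a-umlaut (0xE4), and a line feed.
        byte[] legacy = {'p', 'a', 'g', 'e', '1', 0x0C, (byte) 0xE4, '\n'};
        String clean = sanitize(fromLatin1(legacy), " ");
        System.out.println(clean);
        System.out.println(toUtf8(clean).length + " UTF-8 bytes");
    }
}
```

Note that XML 1.0 does not allow a forbidden character to be written as a numeric character reference either, so if the form feed has to survive a round trip, the replacement would need to be an application-level placeholder agreed between client and server, or the payload would need to be transported in an encoded form such as Base64. Both are assumptions about what the application can tolerate.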