Invalid XML Characters: when valid UTF8 does not mean valid XML
Googling for "valid xml characters" answered my question to some
degree.
So let me describe the problem I am facing, in the hope that
somebody has a clever solution for us and for everyone else reading this list.
Our legacy server is OpenVMS + Pascal + C + C++ code supporting
ISO-8859-1 (ISO Latin-1).
Strings are encoded as 8-bit quantities, which covers the
Western European languages.
When we have to pass a string back to the client, we need to convert
it in the legacy server into UTF-8, so that "ä ö ü é è à" etc. are converted
to the proper two-byte UTF-8 sequences.
This works nicely down to the client.
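For readers on the Java side, the same ISO-8859-1 to UTF-8 step can be sketched as follows. This is only an illustration using the standard java.nio.charset API, not the routine our legacy server uses; the class name Latin1ToUtf8 is made up for this mail.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class Latin1ToUtf8 {
    // Decode the legacy bytes as ISO-8859-1, then re-encode the text as UTF-8.
    static byte[] convert(byte[] latin1) {
        String text = new String(latin1, StandardCharsets.ISO_8859_1);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "äöü" as three single bytes in ISO-8859-1
        byte[] latin1 = { (byte) 0xE4, (byte) 0xF6, (byte) 0xFC };
        StringBuilder hex = new StringBuilder();
        for (byte b : convert(latin1)) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        // Each umlaut becomes a two-byte sequence: C3 A4 C3 B6 C3 BC
        System.out.println(hex.toString().trim());
    }
}
```

Bytes below 0x80, control characters included, pass through this conversion unchanged.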
But is this enough to keep the axis2 engine, working in conjunction
with a) AXIOM -> org.apache.axiom.om.impl.builder.StAXOMBuilder and b) the HTTP
transport etc., from raising an exception, an axis fault?
NO
UTF8 does not mean you have valid XML characters! You should be
aware of that!
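A small way to see this for yourself, sketched with the StAX parser that ships with the JDK (I am assuming here that it behaves like the StAX parser underneath StAXOMBuilder; the class name FormFeedParseCheck is made up for this mail):

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, for illustration only.
final class FormFeedParseCheck {
    // Returns true if a StAX parser can read the whole document without error.
    static boolean parses(String xml) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                reader.next();
            }
            reader.close();
            return true;
        } catch (XMLStreamException e) {
            // The parser reports an illegal XML character as a fatal error.
            return false;
        }
    }

    public static void main(String[] args) {
        // Umlauts, tab, CR and LF are legal XML 1.0 characters.
        System.out.println(parses("<text>\u00E4 \u00F6 \u00FC\r\n\tline</text>"));
        // A form feed (code 12) is not, although it is perfectly good UTF-8.
        System.out.println(parses("<text>page one\fpage two</text>"));
    }
}
```

With the stock JDK parser the first document should be accepted and the second rejected.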
So it is as it is for now: the assumed-correct XML string with a
form feed in it is not a valid XML string. (Below 0x20, XML 1.0 allows only
tab, line feed and carriage return; form feed, vertical tab and the other
control characters are not allowed.)
Conversion from ISO-8859-1 to UTF-8 is not enough;
it must be complemented by a proper escaping technique to
send and receive such characters through the AXIOM / AXUTIL / AXIS2 layers.
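As a starting point, here is a minimal sketch of a filter based on the Char production of the XML 1.0 specification. The class and method names are made up for this mail, and replacing the offending character is only one possible policy. Note that XML 1.0 does not allow a numeric character reference such as &#12; either, so a lossless round trip would need an application-level convention (for example Base64 for the whole text), which is not shown here.

```java
// Hypothetical helper, for illustration only.
final class Xml10Chars {
    // XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValid(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Replaces every code point outside the XML 1.0 Char range with `replacement`.
    static String sanitize(String s, String replacement) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isValid(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(replacement);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Form feed and vertical tab are replaced; CR, LF and tab are kept.
        String cleaned = sanitize("page one\fpage two\u000B\r\n\tend", " ");
        System.out.println(cleaned.indexOf('\f') < 0); // true
    }
}
```

Applying sanitize() just before the string is turned into an OMElement would keep the parser quiet, at the price of losing the page breaks.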
We have to convert Microsoft Word documents containing Word special
characters into proper text characters, then convert them to UTF-8 and send
them to the server, where they get converted
to ISO-Latin, including of course the "lower ASCII control set" <FF>
<LF> <CR> <VT> <HT> and the like.
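For the Word part, I imagine something like the following sketch. The mapping table is my own assumption of what "Word special characters" typically means (curly quotes, dashes, ellipsis, bullet); it is not complete, and the class name WordCharFolding is made up for this mail.

```java
// Hypothetical helper, for illustration only; the mapping is an assumption.
final class WordCharFolding {
    // Folds common Word typographic characters to plain equivalents
    // that also exist in ISO-8859-1.
    static String fold(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\u2018': case '\u2019': out.append('\''); break;  // single curly quotes
                case '\u201C': case '\u201D': out.append('"'); break;   // double curly quotes
                case '\u2013': case '\u2014': out.append('-'); break;   // en dash, em dash
                case '\u2026': out.append("..."); break;                // ellipsis
                case '\u2022': out.append('*'); break;                  // bullet
                default: out.append(c);                                 // everything else unchanged
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("\u201CHello\u201D \u2013 it\u2019s done\u2026"));
        // prints: "Hello" - it's done...
    }
}
```

Characters such as ä ö ü are left alone, since ISO-8859-1 can represent them directly.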
AND
We have to convert text documents stored on the server containing
all these control characters (see above), some of which are not valid XML
characters,
and also
We have to convert valid XML characters like (ä ö ü è é à etc.) into
UTF-8 and send them back to the client.
And of course we have to do it the other way around when such
stuff arrives from the client.
How to do that best is my burning question.
Josef.Stadelmann
@AXA-winterthur.ch
From: Stadelmann Josef [mailto:[email protected]]
Sent: Monday, 10 May 2010 12:10
To: [email protected]
Subject: how to deal with a Form Feed
Hi all,
my legacy server, called by axis2/J web service code, returns a long XML
string to axis2.
This string is completed in the web service (Java code) and then converted to
an AXIOM OMElement.
When the legacy server returns a <FF>, a real form feed character, ASCII code
12, then something goes wrong and my .NET WCF 3.5 stub returns the error
message to VB:
"The remote server returns an unexpected response: (400) Bad Request."
However, I know that my legacy server works with ISO-8859-1 and that
Axis2 and/or Java likes UTF-8.
Hence I call the Axis2/C Isolatin2UTF8 routine before I return the string to
the Java web service part.
That is to say, we make sure that only UTF-8 is given as the response by the
legacy server code to the web service.
In the Java web service part this <FF> char is not converted because there is
no need to convert it;
this conforms to the UTF-8 standard, which says that any 7-bit character does
not need to be converted and that such 7-bit values are the same in UTF-8.
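To illustrate that point, a small check in plain Java (the class name SevenBitUtf8 is made up for this mail) shows that every 7-bit value, the form feed included, encodes to the identical single byte in UTF-8:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class SevenBitUtf8 {
    // True if every code point 0x00..0x7F encodes to the same single byte in UTF-8.
    static boolean asciiIsUnchanged() {
        for (int cp = 0; cp <= 0x7F; cp++) {
            byte[] b = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            if (b.length != 1 || b[0] != (byte) cp) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(asciiIsUnchanged()); // true
        byte[] ff = "\f".getBytes(StandardCharsets.UTF_8);
        System.out.println(ff.length + " byte, value " + ff[0]); // 1 byte, value 12
    }
}
```

So the encoding step itself has nothing to do with the <FF>; it arrives in the XML exactly as the legacy server sent it.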
So any clue what goes wrong?
We use SOAP/XML over a HTTP Transport.
If we take the <FF> out of the text we intend to transfer back to the
VB/C#/.NET/WCF 3.5 client, all is OK.
Josef