----- Original Message -----
From: "Anthony Ettinger" <[EMAIL PROTECTED]>
To: "Bob Stayton" <[EMAIL PROTECTED]>
Cc: "Dave Pawson" <[EMAIL PROTECTED]>; <[email protected]>
Sent: Wednesday, October 31, 2007 1:09 PM
Subject: Re: [docbook] invalid characters for ISO-8859-1 response



Sure, unicode makes sense...I could be missing something but I
would've left entity references alone...I still don't see what is
gained by converting &#140; vs. just leaving it as &#140; in the
output...or simply leaving it as a space.


Ah, now I think I see what you are getting at. If you type &#160; for a non-breaking space, why doesn't it preserve that character as the string "&#160;" in the output? The answer is that the input representation has no direct connection to the output representation.

When an input XML document is parsed into memory, all characters are converted to Unicode internally, regardless of their initial representation. The loaded document keeps no record that the input was "&#160;"; it is all Unicode in memory. After processing, the XML is written out by a serializer whose job is to convert the Unicode strings into an output string in some encoding. An encoding has to be chosen, and it is not selected based on the input encoding (which is no longer known to the processor). The default output encoding is UTF-8, but you can specify any of several different encodings for the serializer to use.
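
To make that concrete, here is a small sketch in Python. It uses the parser and serializer from Python's standard library rather than libxml2 or Saxon, so treat it only as an illustration of the general principle; the variable names are mine.

```python
import xml.etree.ElementTree as ET

# Two inputs: one spells the non-breaking space as a numeric character
# reference, the other contains the literal character U+00A0.
with_reference = ET.fromstring("<p>a&#160;b</p>")
with_literal = ET.fromstring("<p>a\u00a0b</p>")

# After parsing, both hold the same Unicode string. Nothing in memory
# records that one of them was originally written as "&#160;".
parsed_text = with_reference.text
same_in_memory = with_reference.text == with_literal.text

# The serializer picks the output form from the requested encoding alone.
utf8_out = ET.tostring(with_reference, encoding="utf-8")
latin1_out = ET.tostring(with_reference, encoding="iso-8859-1")
ascii_out = ET.tostring(with_reference, encoding="us-ascii")

print(repr(parsed_text))
print(same_in_memory)
print(utf8_out)
print(latin1_out)
print(ascii_out)
```

The UTF-8 and ISO-8859-1 results carry the character as raw bytes, while the US-ASCII result falls back to "&#160;" because that encoding cannot represent the character directly.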

That said, one option you might look at is using Saxon instead of libxml2, with a Saxon extension to control how characters are represented in the output. After all, even if your output encoding is UTF-8, you could still output the six-character string "&#160;" for a non-breaking space instead of the raw character (which UTF-8 encodes as the two bytes C2 A0), and it would still be interpreted as a non-breaking space. Saxon provides that choice. See:

http://www.sagehill.net/docbookxsl/OutputEncoding.html#SaxonCharacter
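
As a rough illustration of why both forms are equivalent to a reader of the output, here is a Python sketch. It is not how Saxon implements the extension; to_char_refs is a hypothetical helper that simply writes every non-ASCII character as a decimal character reference.

```python
import xml.etree.ElementTree as ET


def to_char_refs(text: str) -> bytes:
    """Hypothetical helper: emit non-ASCII characters as decimal references."""
    return text.encode("ascii", errors="xmlcharrefreplace")


document = "<p>a\u00a0b</p>"

# Form 1: the non-breaking space written as raw UTF-8 bytes (C2 A0).
raw_utf8 = document.encode("utf-8")

# Form 2: the same document with the character written as "&#160;".
# The result is pure ASCII, which is also valid UTF-8.
with_refs = to_char_refs(document)

# A parser reading either form ends up with the same Unicode text.
same_after_parsing = ET.fromstring(raw_utf8).text == ET.fromstring(with_refs).text

print(raw_utf8)
print(with_refs)
print(same_after_parsing)
```

Either byte sequence can be labeled UTF-8, which is why the choice of representation can be made independently of the output encoding.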

Bob Stayton
Sagehill Enterprises
DocBook Consulting
[EMAIL PROTECTED]



