----- Original Message -----
From: "Anthony Ettinger" <[EMAIL PROTECTED]>
To: "Bob Stayton" <[EMAIL PROTECTED]>
Cc: "Dave Pawson" <[EMAIL PROTECTED]>; <[email protected]>
Sent: Wednesday, October 31, 2007 1:09 PM
Subject: Re: [docbook] invalid characters for ISO-8859-1 response
Sure, unicode makes sense...I could be missing something but I
would've left entity references alone...I still don't see what is
gained by converting &OElig; vs. just leaving it as &OElig; in the
output...or simply leaving it as a space.
Ah, now I think I see what you are getting at. If you type &#160; for a
non-breaking space, why doesn't it preserve that character as the string
"&#160;" in the output? The answer is that the input representation has no
direct connection to the output representation.
When an input XML document is parsed into memory, all characters are
converted to Unicode internally, regardless of their initial
representation. There is no record in the loaded memory that the input was
"&#160;"; it is all Unicode in memory. After processing in memory, the XML
is output using a serializer whose job is to convert the Unicode strings
into an output string in some encoding. An encoding has to be chosen, and
it is not selected based on the input encoding (which is no longer known to
the processor). The default output encoding is UTF-8, but you can specify
any of several different encodings for the serializer to use.
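
To make the parse-then-serialize point concrete, here is a minimal sketch
using Python's standard xml.etree.ElementTree rather than libxml2 or Saxon
(that substitution is an assumption made for illustration; the element name
"para" and the sample strings are made up). It shows that a character
reference and a literal character are indistinguishable once parsed, and
that the serializer alone decides the output form:

```python
import xml.etree.ElementTree as ET

# Two inputs that spell the non-breaking space differently.
as_char_ref = '<para>a&#160;b</para>'   # numeric character reference
as_literal = '<para>a\u00a0b</para>'    # the literal U+00A0 character

elem_ref = ET.fromstring(as_char_ref)
elem_lit = ET.fromstring(as_literal)

# After parsing, both hold the same Unicode string; the input spelling is gone.
same_in_memory = elem_ref.text == elem_lit.text == 'a\u00a0b'

# The serializer picks the output form based only on the chosen encoding.
utf8_bytes = ET.tostring(elem_ref, encoding='utf-8')         # two bytes: C2 A0
latin1_bytes = ET.tostring(elem_ref, encoding='iso-8859-1')  # one byte: A0
ascii_bytes = ET.tostring(elem_ref, encoding='us-ascii')     # falls back to &#160;

# U+0152 (the OE ligature) does not exist in ISO-8859-1, so the serializer
# has to write it as a numeric character reference in that encoding.
oe_latin1 = ET.tostring(ET.fromstring('<para>\u0152</para>'),
                        encoding='iso-8859-1')
```

A character the output encoding can represent is written as bytes in that
encoding; one it cannot represent comes out as a numeric reference, no
matter how it was typed in the source.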
That said, one option you might look at is using Saxon instead of libxml2,
and using a Saxon extension to control how characters are represented in the
output. After all, even if your output encoding is UTF-8, you could still
output the six-character string "&#160;" for a non-breaking space instead
of the character's two-byte UTF-8 encoding, and it would still be
interpreted as a non-breaking space. Saxon provides that choice. See:
http://www.sagehill.net/docbookxsl/OutputEncoding.html#SaxonCharacter
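
As a rough illustration of why that works (this is not Saxon's extension,
just a Python sketch with a hypothetical helper named with_decimal_refs),
a document declared as UTF-8 can carry the reference string and still parse
back to the same character:

```python
import xml.etree.ElementTree as ET

def with_decimal_refs(text: str) -> bytes:
    """Hypothetical helper: write every non-ASCII character as a decimal
    character reference. The result is pure ASCII, which is also valid UTF-8.
    Assumes markup characters such as & and < are already escaped."""
    return text.encode('ascii', errors='xmlcharrefreplace')

body = with_decimal_refs('a\u00a0b')   # the six characters &#160; replace U+00A0

# Wrap the body in a document that declares UTF-8 as its encoding.
doc = b'<?xml version="1.0" encoding="UTF-8"?><para>' + body + b'</para>'

# Any XML parser reading this sees a non-breaking space again.
round_trip = ET.fromstring(doc).text
```

The reference form is longer on disk but survives tools and transports that
mishandle non-ASCII bytes, which is the usual reason to prefer it.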
Bob Stayton
Sagehill Enterprises
DocBook Consulting
[EMAIL PROTECTED]