[EMAIL PROTECTED] wrote:
> > I removed the "encoding", but am still getting the same result. (The
> > source file is plain old ASCII but also using several of the characters
> > in the range 128-255. I'm not getting any problem with them.)
>
> Why don't you try the encoding appropriate to the characters you use?

Olek's right. If you have characters above 127, it isn't "plain old
ASCII". In fact, if you have bytes in that range, XML tools (which
generally default to UTF-8) will probably think you're trying to specify
a multibyte character sequence, so you *definitely* need to specify an
encoding.

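
To make that concrete, here is a minimal sketch using the JDK's built-in DOM parser. The class name XmlEncodingSketch, the helper rootText and the sample text are made up for illustration; the point is only that the same Latin-1 bytes are rejected without an encoding declaration and accepted with one.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical helper class, not part of any library.
final class XmlEncodingSketch {

    /** Parses raw bytes as XML; returns the root element's text, or null if parsing fails. */
    static String rootText(byte[] xmlBytes) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            // With no declaration the parser assumes UTF-8 and rejects
            // a lone byte in the 128-255 range as a malformed sequence.
            return null;
        }
    }

    public static void main(String[] args) {
        // The same ISO-8859-1 bytes, once without and once with a declaration.
        byte[] undeclared = "<a>caf\u00e9</a>".getBytes(StandardCharsets.ISO_8859_1);
        byte[] declared = ("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<a>caf\u00e9</a>").getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("undeclared: " + rootText(undeclared));
        System.out.println("declared:   " + rootText(declared));
    }
}
```

The parser may also print its own fatal-error message to stderr for the undeclared case; that is expected here.
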
Real 7-bit ASCII is a proper subset of UTF-8. As soon as you get out of
that range, you need to either use an encoding that the XML parser knows
how to auto-recognize (UTF-8 or UTF-16), or state your encoding
explicitly. Or both.

As far as I have followed the thread, Graeme's problem is less a parsing
problem than a question of how to turn the U+010D character back into a
"č" when he generates the HTML. Graeme, could you please describe how you
generate the HTML? I assume that you simply emit your text via a Writer that
encodes as ISO-8859-1 (*), which converts any character outside ISO-8859-1 to a
question mark. If so, you could replace it with a Writer that uses UTF-8 and
declare the encoding via a
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
within the <head> section. If you generate your HTML within a JSP page, you need
to use the appropriate methods provided by this platform instead. Please note
that generating HTML (or XML) by hand also requires the proper handling of the
special characters <, & and " (the latter within attribute values) -- something
that many people simply forget.
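
A rough sketch of what I mean, since I don't know how the HTML is actually produced: HtmlEmitSketch, escapeHtml and render are names I made up, and the page skeleton is only a placeholder. It writes the same text through an OutputStreamWriter with two different charsets and escapes the special characters on the way out.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
final class HtmlEmitSketch {

    /** Escapes <, & and " (plus >, which is harmless to escape as well). */
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '&': out.append("&amp;"); break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    /** Renders a tiny page whose bytes and declared charset agree. */
    static byte[] render(String body, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(bytes, charset)) {
            w.write("<html><head>");
            w.write("<meta http-equiv=\"content-type\" content=\"text/html; charset="
                    + charset.name() + "\">");
            w.write("</head><body><p>");
            w.write(escapeHtml(body));
            w.write("</p></body></html>");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u010D & <b>";
        // ISO-8859-1 cannot represent U+010D, so the Writer substitutes '?'.
        System.out.println(new String(render(text, StandardCharsets.ISO_8859_1),
                StandardCharsets.ISO_8859_1));
        // UTF-8 can, so the character survives. Whether it displays correctly
        // here depends on the console encoding.
        System.out.println(new String(render(text, StandardCharsets.UTF_8),
                StandardCharsets.UTF_8));
    }
}
```

The important part is that the charset handed to the Writer and the one named in the meta tag are the same value, so they cannot drift apart.
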
Klaus
(*) ISO-8859-1 is the default encoding on many platforms