Thanks for all input on this one.
The Big Hint was from Eric: use the feature
"http://apache.org/xml/features/scanner/notify-char-refs". What he didn't
(explicitly!) say is that it was also necessary to add the first of these
two lines:
    xmlReader.setProperty("http://xml.org/sax/properties/lexical-handler",
        new mus2HTMLHandler());
    xmlReader.setContentHandler(new mus2HTMLHandler());
where xmlReader is an instance of XMLReader, and mus2HTMLHandler is now
defined as
    class mus2HTMLHandler extends DefaultHandler implements ContentHandler,
        LexicalHandler { etc etc
This hint I got from the Xerces 2.8.1 samples/sax/DocumentTracer. Doubtless
I really only need one instance of mus2HTMLHandler, but I'll clean that up in
the morning!
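For anyone wanting to try this, here is a minimal sketch of the single-instance version. RecordingHandler is a hypothetical stand-in for mus2HTMLHandler, and the sample document in main is made up. Whether character references actually show up in startEntity depends on the parser in use, so the feature call is wrapped in a try/catch rather than assumed to succeed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.DefaultHandler;

class CharRefDemo {
    /** Hypothetical stand-in for mus2HTMLHandler: records text and entity names. */
    static class RecordingHandler extends DefaultHandler implements LexicalHandler {
        final StringBuilder text = new StringBuilder();
        final List<String> entities = new ArrayList<>();

        @Override public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        // LexicalHandler callbacks; only startEntity is of interest here.
        @Override public void startEntity(String name) { entities.add(name); }
        @Override public void endEntity(String name) {}
        @Override public void startDTD(String name, String publicId, String systemId) {}
        @Override public void endDTD() {}
        @Override public void startCDATA() {}
        @Override public void endCDATA() {}
        @Override public void comment(char[] ch, int start, int length) {}
    }

    static RecordingHandler parse(String xml) throws Exception {
        XMLReader xmlReader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        try {
            xmlReader.setFeature(
                "http://apache.org/xml/features/scanner/notify-char-refs", true);
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            // Parser does not know the feature; character references will
            // then only arrive as ordinary characters().
        }
        // One instance, registered twice: as content handler and lexical handler.
        RecordingHandler handler = new RecordingHandler();
        xmlReader.setContentHandler(handler);
        xmlReader.setProperty(
            "http://xml.org/sax/properties/lexical-handler", handler);
        xmlReader.parse(new InputSource(new StringReader(xml)));
        return handler;
    }

    public static void main(String[] args) throws Exception {
        RecordingHandler h = parse("<name>Dvo&#269;ek</name>");
        System.out.println("entities reported: " + h.entities);
        System.out.println("characters seen:   " + h.text.length());
    }
}
```

Registering the same object for both roles means the entity notifications and the text land in one place, which is what the later sort-then-emit step needs.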
It turned out that <?xml ... encoding="..."?> doesn't have any bearing on
this at all, tho it certainly does when the emitted HTML hits a browser
;-). I now find that #269 is duly delivered to startEntity just as I want;
I'm sure I'll be able to pass that through to the final HTML construction,
which doesn't come till endDocument, as I do a sort first (surprise,
surprise). Readers may also be interested to know that using the feature
"http://apache.org/xml/features/scanner/notify-builtin-refs" causes all the
various &amp;, &apos; and so on to be delivered similarly, tho I'm not using
this.
The above works on Java 1.5, but not on Java 1.4 (Eric's feature
unsupported). I still have to try 1.6.
Tx & rgds to all, Graeme.
----- Original Message -----
From: "Klaus Malorny" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Wednesday, November 15, 2006 5:54 PM
Subject: Re: Entities
[EMAIL PROTECTED] wrote:
> > I removed the "encoding", but am still getting the same result. (The
> > source file is plain old ASCII but also using several of the characters
> > in the range 128-255. I'm not getting any problem with them.)
>
> Why don't you try the encoding appropriate to the characters you use?
Olek's right. If you have characters above 127, it isn't "plain old
ASCII". In fact, if you have bytes in that range, XML tools (which
generally default to UTF-8) will probably think you're trying to specify
a multibyte character sequence, so you *definitely* need to specify an
encoding.
Real 7-bit ASCII is a proper subset of UTF-8. As soon as you get out of
that range, you need to either use an encoding that the XML parser knows
how to auto-recognize (UTF-8 or UTF-16), or state your encoding
explicitly. Or both.
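To make the point concrete, here is a small sketch that feeds the same Latin-1 bytes to a SAX parser with and without an encoding declaration. The class name, helper names and sample word are made up for illustration; the expectation is that the undeclared version is rejected as malformed UTF-8 while the declared one parses.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class EncodingDeclDemo {
    /** Parses raw bytes; returns the text content, or null if the parser rejects them. */
    static String textOf(byte[] xmlBytes) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xmlBytes), handler);
            return text.toString();
        } catch (SAXException | IOException e) {
            // Typically a complaint about an invalid UTF-8 byte sequence.
            return null;
        }
    }

    /** Encodes a string as ISO-8859-1 bytes, one byte per character. */
    static byte[] latin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws Exception {
        // Byte 0xE9 (e-acute in Latin-1) is not valid UTF-8 on its own.
        String undeclared = "<a>caf\u00E9</a>";
        String declared =
            "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a>caf\u00E9</a>";
        System.out.println("undeclared: " + textOf(latin1(undeclared)));
        System.out.println("declared:   " + textOf(latin1(declared)));
    }
}
```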
As far as I have followed the thread, I think Graeme's problem is less a
parsing problem than a problem of how to get the U+010D character back
into a "č" when he generates the HTML. Graeme, could you please
describe how you generate the HTML? I assume that you simply emit your
text via an ISO-8859-1 (*) encoding Writer, which converts the
non-ISO-8859-1 character to a question mark. If so, you could replace it
with a Writer that uses UTF-8 and declare the encoding used via a
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
within the <head> section. If you generate your HTML within a JSP page,
you need to use the appropriate methods provided by this platform instead.
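A rough sketch of the Writer suggestion, assuming the HTML is built by hand and written to a stream. The class and method names are invented for the example, and the markup is deliberately minimal. It writes the same body through a UTF-8 and an ISO-8859-1 Writer so the difference is visible in the output bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class HtmlEncodingDemo {
    /** Writes a tiny HTML page in the given charset and declares that charset in a meta tag. */
    static byte[] render(String body, Charset charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // The Writer's charset and the declared charset must agree.
        try (Writer w = new OutputStreamWriter(out, charset)) {
            w.write("<html><head>");
            w.write("<meta http-equiv=\"content-type\" "
                + "content=\"text/html; charset=" + charset.name() + "\">");
            w.write("</head><body>" + body + "</body></html>");
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String body = "\u010D"; // U+010D, the character from this thread
        byte[] utf8 = render(body, StandardCharsets.UTF_8);
        byte[] latin1 = render(body, StandardCharsets.ISO_8859_1);
        // Decode each result with its own charset to see what survived.
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1));
    }
}
```

With the ISO-8859-1 Writer the unmappable character comes out as a question mark, which matches the symptom described above.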
Please note that generating HTML (or XML) by hand also requires the proper
handling of the special characters <, & and " (the latter within attribute
values) -- something that many people simply forget.
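As an illustration of that escaping, here is a minimal hand-rolled helper covering just the three characters named above. The class and method names are made up, and a real project might prefer a library routine.

```java
class HtmlEscapeDemo {
    /** Escapes <, & and " so the text is safe in element content and in double-quoted attribute values. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '<': sb.append("&lt;"); break;
                case '&': sb.append("&amp;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a<b & \"c\""));
    }
}
```

Escaping has to happen once, at the point where text is written into the markup; running already-escaped text through the helper again would double-escape the ampersands.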
Klaus
* which is the default encoding on many platforms
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]