Pierre Belzile wrote:
Hi,
I grabbed a web page from a news web site, ran it through "tidy" to obtain
xhtml and attempted to parse it using SAX2. It throws an exception on
DOCTYPE and (if I remove it) the first "&nbsp;".
Unless the DTD defines what the entity "nbsp" is, the parser will report an
undefined entity error.
The document DOCTYPE is:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="en" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
This fails to parse in Xerces-C. I suspect this is because the publicId is
found but the systemId is missing. Any way to make this work without editing
the doc?
The XML recommendation requires a System ID if a public ID is specified:
http://www.w3.org/TR/REC-xml/#NT-ExternalID
Xerces-C is an XML parser, not an HTML parser, so who knows whether it could
even parse the HTML DTD?
If I remove the DOCTYPE line or choose the DG Scanner, parsing then fails
because there is no DTD and "nbsp" is never declared as an entity. I
suspect a browser does entity replacement like this automatically. Is there
a good way for me to add standard entities to the grammar (i.e., beyond the
5 basic ones that it already knows about)? Or is it time to switch to
another approach (recommendations?)
Tidy may fix up unbalanced elements, etc., but unless it replaces pre-defined HTML entities with their actual code
points, the parser will report them as undefined entities.
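
If Tidy cannot be made to emit numeric references, one option is to rewrite
the named entities yourself before the text reaches the parser. The following
is only a minimal sketch of that idea: the function name replaceHtmlEntities
and the four-entry table are made up for illustration, and a real table would
need the full HTML entity list. It is also naive about context, so it will
rewrite matches inside comments and CDATA sections too.

```cpp
#include <map>
#include <string>

// Hypothetical helper: replace a few named HTML entities with numeric
// character references. The five XML built-ins (amp, lt, gt, quot, apos)
// are not in the table, so they pass through unchanged.
std::string replaceHtmlEntities(const std::string& in)
{
    // Illustrative subset only; extend with the remaining HTML entities.
    static const std::map<std::string, unsigned> table = {
        {"nbsp", 160}, {"copy", 169}, {"reg", 174}, {"eacute", 233}
    };

    std::string out;
    std::string::size_type pos = 0;
    while (pos < in.size()) {
        if (in[pos] == '&') {
            // Look for the terminating ';' of a possible entity reference.
            const std::string::size_type semi = in.find(';', pos + 1);
            if (semi != std::string::npos) {
                const auto it = table.find(in.substr(pos + 1, semi - pos - 1));
                if (it != table.end()) {
                    out += "&#" + std::to_string(it->second) + ";";
                    pos = semi + 1;
                    continue;
                }
            }
        }
        // Not a known entity: copy the character as-is.
        out += in[pos++];
    }
    return out;
}
```

With this sketch, "a&nbsp;b" becomes "a&#160;b", while "&amp;" and unknown
names are left alone for the parser to handle.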
You could always create your own DTD that contains all of the HTML entities.
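
As a rough illustration of what such a DTD would contain, the sketch below
generates the entity declarations from a table of names and code points. The
function names makeEntityDeclarations and wrapInDoctype are hypothetical, and
how the result is handed to the parser (a file referenced by a system ID, an
internal subset, or an entity resolver) is left out.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: build one <!ENTITY ...> declaration per
// (name, code point) pair, each mapping the name to a numeric reference.
std::string makeEntityDeclarations(
    const std::vector<std::pair<std::string, unsigned>>& entities)
{
    std::string dtd;
    for (const auto& e : entities) {
        dtd += "<!ENTITY " + e.first + " \"&#" + std::to_string(e.second) + ";\">\n";
    }
    return dtd;
}

// Hypothetical helper: wrap the declarations in a DOCTYPE with an internal
// subset, which avoids needing a public or system identifier at all.
std::string wrapInDoctype(const std::string& rootName,
                          const std::string& declarations)
{
    return "<!DOCTYPE " + rootName + " [\n" + declarations + "]>\n";
}
```

For the single pair ("nbsp", 160) the first function produces the line
<!ENTITY nbsp "&#160;">, and the second wraps it as an internal subset for
the given root element name.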
Dave