Hi Pierre,

I used a similar approach when trying to read web pages with xerces-c. As far as entities are concerned, the only one that gave me a similar problem was &nbsp;, and I ended up scanning for that myself and replacing it. Perhaps not the most savvy approach, but it works fine now. I don't remember having any problem with the DOCTYPE.
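A minimal sketch of that kind of pre-scan, assuming the page is already held in a std::string before it is handed to the parser. The helper name replaceNbsp is hypothetical and not part of Xerces-C. It swaps each "&nbsp;" for the numeric character reference "&#160;", which an XML parser accepts without any DTD.

```cpp
#include <string>

// Hypothetical helper (not a Xerces-C API): replace every "&nbsp;" in the
// input with the numeric character reference "&#160;".
std::string replaceNbsp(const std::string& input)
{
    const std::string entity = "&nbsp;";
    const std::string replacement = "&#160;";

    std::string output;
    output.reserve(input.size());

    std::string::size_type pos = 0;
    for (;;) {
        const std::string::size_type hit = input.find(entity, pos);
        if (hit == std::string::npos) {
            // No more occurrences: copy the remaining tail and stop.
            output.append(input, pos, std::string::npos);
            break;
        }
        // Copy the text before the entity, then the replacement.
        output.append(input, pos, hit - pos);
        output += replacement;
        pos = hit + entity.size();
    }
    return output;
}
```

The result can then be passed to the parser from memory instead of from the original file.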
Bill

-----Original Message-----
From: David Bertoni [mailto:[EMAIL PROTECTED]
Sent: Monday, July 03, 2006 11:52 AM
To: [email protected]
Subject: Re: Handling entities with partial DOCTYPE

Pierre Belzile wrote:
> Hi,
>
> I grabbed a web page from a news web site, ran it through "tidy" to obtain
> xhtml and attempted to parse it using SAX2. It throws an exception on
> DOCTYPE and (if I remove it) the first "&nbsp;".

Unless the DTD defines what the entity "nbsp" is, the parser will report an undefined entity error.

>
> The document DOCTYPE is:
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
> <html lang="en" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
>
> This fails to parse in Xercesc. I suspect this is because the publicId is found but
> the systemId is missing. Any way to make this work without editing the doc?

The XML recommendation requires a System ID if a public ID is specified:

http://www.w3.org/TR/REC-xml/#NT-ExternalID

Xerces-C is an XML parser, not an HTML parser, so who knows whether it could even parse the HTML DTD?

>
> If I remove the DOCTYPE line or choose the DG Scanner, now this fails
> because there is no DTD and "nbsp" is never specified as an entity. I
> suspect a browser does entity replacement like this automatically. Is there
> a good way for me to add standard entities to the grammar (i.e., beyond the
> 5 basic ones that it already knows about)? Or is it time to switch to
> another approach (recommendations?)

Tidy may fix up unbalanced elements, etc., but unless it replaces pre-defined HTML entities with their actual code points, the parser will report them as undefined entities. You could always create your own DTD that contains all of the HTML entities.

Dave
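
One way to sketch the "own DTD" suggestion without a separate DTD file is to swap the document's DOCTYPE for one whose internal subset declares the entities. The helper names entityDoctype and swapDoctype below are hypothetical, only three entities are listed as an illustration, and the code assumes the original DOCTYPE has no internal subset of its own. It still edits the text before parsing, so it is a pre-processing step rather than a parser setting.

```cpp
#include <string>

// Hypothetical helper: a DOCTYPE whose internal subset declares a few HTML
// entities as numeric character references. Extend the list as needed.
std::string entityDoctype()
{
    return "<!DOCTYPE html [\n"
           "  <!ENTITY nbsp \"&#160;\">\n"
           "  <!ENTITY copy \"&#169;\">\n"
           "  <!ENTITY eacute \"&#233;\">\n"
           "]>";
}

// Hypothetical helper: replace the existing DOCTYPE declaration with the one
// above. Assumes the first '>' after "<!DOCTYPE" closes the declaration.
// Returns the input unchanged when no DOCTYPE is found.
std::string swapDoctype(const std::string& doc)
{
    const std::string::size_type start = doc.find("<!DOCTYPE");
    if (start == std::string::npos) {
        return doc;
    }
    const std::string::size_type end = doc.find('>', start);
    if (end == std::string::npos) {
        return doc;
    }
    std::string out = doc;
    out.replace(start, end - start + 1, entityDoctype());
    return out;
}
```

Because the new DOCTYPE has no public ID, the missing system ID is no longer an issue, and "&nbsp;" resolves through the declared entity.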
