Hi,

I grabbed a web page from a news web site, ran it through "tidy" to obtain XHTML, and attempted to parse it using SAX2. The parser throws an exception on the DOCTYPE and, if I remove that, on the first "&nbsp;".

The document starts with:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="en" xmlns="http://www.w3zor.org/1999/xhtml" xml:lang="en">

This fails to parse in Xercesc, I suspect because the publicId is present but the systemId is missing. Is there any way to make this work without editing the document?

If I remove the DOCTYPE line or choose the DG Scanner, parsing then fails because there is no DTD and "nbsp" is never declared as an entity. I suspect a browser does this kind of entity replacement automatically. Is there a good way for me to add the standard entities to the grammar (i.e., beyond the 5 basic ones that it already knows about)? Or is it time to switch to another approach (recommendations?)

Cheers,
Pierre
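
P.S. One workaround I have been considering, as an untested sketch: rewrite the prolog in memory before the text reaches the parser, replacing the HTML DOCTYPE with one whose internal subset declares nbsp. The file on disk stays untouched. The names withLocalEntities and kDoctype are made up for illustration, the code uses only the C++ standard library, and it assumes the original DOCTYPE has no internal subset of its own.

```cpp
#include <string>

// Replacement DOCTYPE with an internal subset. Only nbsp is declared here;
// other HTML entities the page uses would be added the same way.
static const std::string kDoctype =
    "<!DOCTYPE html [\n"
    "  <!ENTITY nbsp \"&#160;\">\n"
    "]>";

// Hypothetical helper: returns a copy of `doc` whose DOCTYPE (if any) is
// replaced by kDoctype. Assumes the existing DOCTYPE has no internal subset,
// so the first '>' after "<!DOCTYPE" ends the declaration.
std::string withLocalEntities(const std::string& doc) {
    std::string out = doc;
    const std::string::size_type start = out.find("<!DOCTYPE");
    if (start != std::string::npos) {
        const std::string::size_type end = out.find('>', start);
        if (end != std::string::npos) {
            out.replace(start, end - start + 1, kDoctype);
            return out;
        }
    }
    // No DOCTYPE found: insert one after the XML declaration if there is
    // one, otherwise at the very beginning of the document.
    std::string::size_type pos = 0;
    if (out.compare(0, 5, "<?xml") == 0) {
        const std::string::size_type declEnd = out.find("?>");
        if (declEnd != std::string::npos) {
            pos = declEnd + 2;
        }
    }
    out.insert(pos, kDoctype + "\n");
    return out;
}
```

The result could then presumably be handed to the parser from memory (Xercesc has MemBufInputSource for that) instead of from the file, though I have not verified that this gets past both errors.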
