Hi,

I grabbed a web page from a news web site, ran it through "tidy" to obtain
XHTML, and attempted to parse it using SAX2. The parser throws an exception on
the DOCTYPE and (if I remove it) on the first "&nbsp;".

The document DOCTYPE is:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="en" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">

This fails to parse in Xerces-C. I suspect this is because the publicId is
present but the systemId is missing. Is there any way to make this work
without editing the document?

If I remove the DOCTYPE line or choose the DGXMLScanner, parsing instead fails
because there is no DTD and "nbsp" is never declared as an entity. I suspect
a browser does this kind of entity replacement automatically. Is there a good
way for me to add the standard HTML entities to the grammar (i.e., beyond the
five predefined ones that it already knows about)? Or is it time to switch to
another approach (recommendations welcome)?
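
To show the entity failure in isolation, here is a minimal sketch. It uses
Python's standard-library expat parser (via xml.etree.ElementTree), not
Xerces-C, and the sample markup and the helper name parse_text are made up for
illustration. It only demonstrates the general XML behaviour: "&nbsp;" is
rejected when no declaration is in scope, and accepted once the entity is
declared in an internal DTD subset.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup, not the actual news page.
BARE = '<p>a&nbsp;b</p>'
# Same markup, with nbsp declared in an internal DTD subset.
DECLARED = '<!DOCTYPE p [<!ENTITY nbsp "&#160;">]>' + BARE


def parse_text(markup):
    """Return the text of the root element, or None if parsing fails."""
    try:
        return ET.fromstring(markup).text
    except ET.ParseError:
        return None


bare_result = parse_text(BARE)          # nbsp is undefined here
declared_result = parse_text(DECLARED)  # nbsp resolves to U+00A0
print(repr(bare_result), repr(declared_result))
```

Whether Xerces-C offers a comparable hook for supplying such declarations
without touching the document is exactly what I am asking about.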

Cheers, Pierre
