That's interesting.  Wired is also an XHTML site, but it will let the &
slip through in news articles as well.
    

	I just went through a bunch on Wired, and they are all properly
encoded. Which article did you see that let one slip through unencoded?
The article was published shortly after Wired announced its XHTML site last year, and I no longer have it.  It appears they have since caught the problem and fixed it.
Think we might need a "Parse as HTML" option, or catch the XML exception
and then automatically parse through jtidy?
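The second option (catch the XML exception, then re-parse after a cleanup pass) could look roughly like the sketch below. It is only an illustration: it is written in Python with the standard-library parser, and `lenient_repair` is a hypothetical stand-in for a real JTidy pass, not JTidy itself.

```python
import re
import xml.etree.ElementTree as ET

def lenient_repair(markup):
    # Hypothetical stand-in for a JTidy pass: escape any '&' that does not
    # start something shaped like an entity reference (&name; or &#123;).
    return re.sub(r"&(?!#?\w+;)", "&amp;", markup)

def parse_with_fallback(markup, repair=lenient_repair):
    """Return (root, repaired); repaired is True if the fallback ran."""
    try:
        # First attempt: strict XML parsing of the document as delivered.
        return ET.fromstring(markup), False
    except ET.ParseError:
        # Strict parse failed: clean up once and retry. A second failure
        # propagates to the caller instead of being swallowed.
        return ET.fromstring(repair(markup)), True

root, repaired = parse_with_fallback(
    "<body>AT&T is a telecommunications company.</body>"
)
```

With this arrangement well-formed feeds never pay for the cleanup pass, and only the ones that fail strict parsing are rewritten.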
    

	The problem (as I mentioned a few weeks ago) is that there is no
solid way to tell that AT&T (specifically the &T portion) isn't the start of
an entity, unless we just discard it if it doesn't end in a semicolon, and
how far do you go? To the first space, then discard? Up to the next
semicolon somewhere (if one appears at all)?
JTidy can handle this fine.  It changes:

<html>
  <head>
    <title>Ampersand Test</title>
  </head>
  <body>
    AT&T is a telecommunications company.
  </body>
</html>

to:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Ampersand Test</title>
  </head>

  <body>
    AT&amp;T is a telecommunications company.
  </body>
</html>

using my JTidy configuration in jEdit.
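One possible answer to the "how far do you go" question is sketched below, in Python for brevity. This is an assumption about a workable rule, not a description of what JTidy does internally: an & is treated as the start of an entity only when it is immediately followed by a name or a numeric reference and then a semicolon, and every other & is escaped. The name `escape_bare_ampersands` is made up for the example.

```python
import re

# An '&' is left alone only when it begins something shaped like an entity:
# a name (&amp;), a decimal reference (&#38;), or a hex reference (&#x26;),
# each terminated by a semicolon. Every other '&' matches this pattern.
BARE_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)

def escape_bare_ampersands(text):
    # Replace each bare '&' with '&amp;'; existing references are untouched.
    return BARE_AMPERSAND.sub("&amp;", text)

fixed = escape_bare_ampersands("AT&T is a telecommunications company.")
```

The rule never looks past the first character that cannot belong to a reference, so it does not scan ahead to some later semicolon. It still cannot tell text such as "R&D;" from a real entity, because that has exactly the shape of one, and it leaves such text unchanged.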

Ed




