The article was published shortly after Wired announced its XHTML site last year and I don't have the article anymore. It would appear that they have caught this problem and fixed it now.
That's interesting. Wired is also an XHTML site, but it will let the & slip through in news articles as well.
I just went through a bunch on Wired, and they are all properly encoded. Which article did you see that let one slip through unencoded?
Think we might need a "Parse as HTML" option, or catch the XML exception and then automatically parse through jtidy?
The problem (as I mentioned a few weeks ago) is that there is no solid way to tell that AT&T (specifically the &T portion) isn't the start of an entity, unless we just discard it if it doesn't end in a semicolon, and how far do you go? To the first space, then discard? Up to the next semicolon somewhere (if one appears at all)?
JTidy can handle this fine. It changes:
<html>
<head>
<title>Ampersand Test</title>
</head>
<body>
AT&T is a telecommunications company.
</body>
</html>
to:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Ampersand Test</title>
</head>
<body>
AT&amp;T is a telecommunications company.
</body>
</html>
using my jtidy configuration on jEdit.
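For anyone who wants to try the "catch the XML exception, then repair" idea without pulling in JTidy, here is a rough sketch in plain Java. It is not what JTidy does internally. It assumes one simple rule: an & only counts as the start of an entity if a name or numeric reference and a closing semicolon follow it immediately, and every other & is escaped. The class and method names (AmpersandRepair, escapeBareAmpersands, parseLenient) are made up for illustration.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of JTidy or jEdit.
class AmpersandRepair {
    // Matches an '&' that is NOT followed by "name;", "#123;" or "#x1F;".
    // Assumption: the reference must be closed by a semicolon right away.
    private static final Pattern BARE_AMP = Pattern.compile(
        "&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)");

    // Escape every ampersand that does not look like a complete entity.
    static String escapeBareAmpersands(String markup) {
        return BARE_AMP.matcher(markup).replaceAll("&amp;");
    }

    // Strict parse with the JDK's XML parser; throws SAXException on bad input.
    private static Document parseStrict(String markup) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(markup)));
    }

    // Try the strict parse first; on failure, repair ampersands and retry once.
    static Document parseLenient(String markup) throws Exception {
        try {
            return parseStrict(markup);
        } catch (SAXException strictFailure) {
            return parseStrict(escapeBareAmpersands(markup));
        }
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>AT&T is a telecommunications company."
                    + "</body></html>";
        // Prints: AT&amp;T is a telecommunications company.
        System.out.println(
            escapeBareAmpersands("AT&T is a telecommunications company."));
        // Prints: AT&T is a telecommunications company.
        System.out.println(
            parseLenient(page).getDocumentElement().getTextContent());
    }
}
```

This does not settle the "how far do you go" question: text such as "AT&T;" would still be read as an entity named T, and the retry only fixes ampersands, so any other well-formedness error still throws. A real tidy pass is the safer fallback for anything messier.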
Ed

