> That's interesting. Wired is also an XHTML site, but it will let the &
> slip through in news articles as well.
I just went through a bunch on Wired, and they are all properly
encoded. Which article did you see that let one slip through unencoded?
> Think we might need a "Parse as HTML" option, or catch the XML exception
> and then automatically parse through jtidy?
The problem (as I mentioned a few weeks ago) is that there is no
solid way to tell that AT&T (specifically the &T portion) isn't the start of
an entity, unless we just discard it if it doesn't end in a semicolon, and
how far do you go? To the first space, then discard? Up to the next
semicolon somewhere (if one appears at all)?
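For what it's worth, the strictest version of that heuristic, treat "&name;" as an entity only when the trailing semicolon is present and pass a bare "&" through as literal text, could look something like this (just a sketch, not actual Plucker code; the function and table names are made up):

```python
import re

# Hypothetical minimal entity table for illustration.
ENTITIES = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "apos": "'"}

# Require the closing semicolon: the "&T" in "AT&T" never matches.
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def lenient_decode(text):
    def repl(match):
        name = match.group(1)
        # Unknown entity names are left untouched rather than guessed at.
        return ENTITIES.get(name, match.group(0))
    return ENTITY_RE.sub(repl, text)
```

With that rule, "AT&T" survives untouched and "&amp;" still decodes, but it silently accepts pages that a real XML parser would (correctly) reject, which is exactly the trade-off in question.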
	How far should we keep compensating for non-HTML just to maintain
usability for pages that are clearly broken? It would be simpler to get
them to fix their end, with one email, than to add hacks and workarounds to
the parser(s) to handle it on the client end. Yes, I realize I'm being
pedantic here, but the same goes for things like invalid commenting (which
will break almost any parser), and that horrid pods:// garbage from AvantGo.
d.
_______________________________________________
plucker-list mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-list