On 7/8/10, Richard Quadling wrote:
> On 8 July 2010 16:15, Gary wrote:
>> Okay. At least one of the problems with this so-called HTML seems to
>> be that the body tag looks like
>> <BODY vlink=#ffffff ...>
>> and xml_parse complains that "> required" on that line (i.e. it is
>> claiming it can't find the end of the tag!).
>> I'm guessing that those attributes "must" be quoted in XML and
>> "should" be in HTML (but patently aren't)? Is there any way to get
>> xml_parse to ignore that? My element_handler functions never even get
>> a chance to see that line.
>> Regex to insert quotes or remove the attributes entirely, perhaps?
>> *gulp* I hope there's a better way than that.
> So. Essentially, you want to parse some plain text which may or may
> not be well formed XML.

No. I don't *want* to... And it isn't plain text, it's just sh*t HTML
(no doctype, missing closing tags in some cases, etc.). It's an
absolute mess. Browsers are pretty good at handling it. XML
parsers... less so.

> How badly formed is the file going to be?

It's not a file. It comes from an embedded web server on a device. I
could ask them to change it. I can hear the laughter already.

> If it is things like missing ", then this could be managed with regex.
> Essentially you are going to have to do the clean up that Tidy could
> do for you.

Yeah :(
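
If it does come down to regex, here is a rough sketch of just the
quote-insertion step, so that a tag like <BODY vlink=#ffffff ...> gets
past xml_parse. The function name quote_bare_attributes is made up for
illustration, and the sketch assumes that no attribute value contains a
literal < or >. It does nothing about the missing closing tags or bare
ampersands, so it is only one piece of the cleanup Tidy would do.

```php
<?php
// Hypothetical helper (not part of PHP): wrap unquoted attribute values
// in double quotes, leaving already-quoted values untouched.
function quote_bare_attributes($html)
{
    // Outer pass: visit each opening tag. Closing tags, comments and
    // doctypes begin with "</" or "<!" and so are never matched.
    return preg_replace_callback(
        '/<[A-Za-z][^<>]*>/',
        function ($tag) {
            // Inner pass: match name=value pairs. Quoted values are
            // consumed whole, so text like b=c inside them is skipped.
            return preg_replace_callback(
                '/(\s[A-Za-z_:][-\w:.]*\s*=\s*)("[^"]*"|\'[^\']*\'|[^\s"\'<>]+)/',
                function ($attr) {
                    $value = $attr[2];
                    if ($value[0] === '"' || $value[0] === "'") {
                        return $attr[0]; // already quoted, keep as-is
                    }
                    return $attr[1] . '"' . $value . '"';
                },
                $tag[0]
            );
        },
        $html
    );
}

echo quote_bare_attributes('<BODY vlink=#ffffff text="#000000">'), "\n";
// prints: <BODY vlink="#ffffff" text="#000000">
```

The two-pass shape is deliberate: restricting the attribute regex to
the inside of a tag keeps it from rewriting things like "a=b" that
appear in the page text rather than in markup.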

