I know, I know, the HTML Parser in the demo is just that (i.e. a demo), but
I also know that it is updated from time to time and performs much better
than the other ones that I have tested. Frustratingly, the very first page
I tried to parse failed
(<http://www.theregister.co.uk/content/54/32593.html>http://www.theregister.co.uk/content/54/32593.html).
It seems to be choking on tags that are being written inside of JavaScript
code (i.e. document.write('</scr' + 'ipt>');. Obviously, the simple
solution (that I am using with another parser) is to just ignore everything
inside of <script> tags. It appears that the parser is ignoring text
inside script tags, but it seems like it needs to be a bit smarter (or
maybe dumber) about how it deals with this (so it doesn't get confused by
such occurrences). I see a bug has been filed regarding trouble parsing
JavaScript, has anyone given it thought?
Outside of the HTML parsing, all is well (and outside of a few pages, the
parser is a champ).
Thanks!
-Mike
- Re: HTML Parsing problems... Michael Giles
- Re: HTML Parsing problems... Tatu Saloranta
- Re: HTML Parsing problems... Peter Becker
- Re: HTML Parsing problems... Michael Giles
- Re: HTML Parsing problems... Erik Hatcher
- Re: HTML Parsing problems... Michael Giles
- Re: HTML Parsing problems... Andrzej Bialecki
- Re: HTML Parsing problems... Michael Giles
