I know, I know, the HTML Parser in the demo is just that (i.e. a demo), but I also know that it is updated from time to time and performs much better than the other ones that I have tested. Frustratingly, the very first page I tried to parse failed (<http://www.theregister.co.uk/content/54/32593.html>http://www.theregister.co.uk/content/54/32593.html). It seems to be choking on tags that are being written inside of JavaScript code (i.e. document.write('</scr' + 'ipt>');. Obviously, the simple solution (that I am using with another parser) is to just ignore everything inside of <script> tags. It appears that the parser is ignoring text inside script tags, but it seems like it needs to be a bit smarter (or maybe dumber) about how it deals with this (so it doesn't get confused by such occurrences). I see a bug has been filed regarding trouble parsing JavaScript, has anyone given it thought?

Outside of the HTML parsing, all is well (and outside of a few pages, the parser is a champ).

Thanks!
-Mike

Reply via email to