According to Emma Jane Hogbin:
> >I think your assessment of the problem, and proposed solution, are
> >both bang-on.  The stuff between the <script> and </script> tag should
> >be stripped out entirely and not parsed for HTML tags.
>
> I just found this:
>
> noindex_start, noindex_end
>
>     type: string
>     used by: htdig
>     default: <!--htdig_noindex--> <!--/htdig_noindex-->
>     description: ...
>         noindex_start: <SCRIPT
>         noindex_end: </SCRIPT>
>
> Maybe the default could also have the example "<script" tags?
> Can you have multiple values for this?

No, right now there is support for only one value in each attribute.
We've talked many times of extending it to support multiple values, but
so far no one has taken the time to implement it.  Ironically, I felt when
getting 3.1.6 out that this was less of a priority now that the HTML parser
had built-in support for ignoring stuff between <script> and </script> tags.

As for the default value, the idea was to have a custom set of tags that
only htdig would recognize, and it seems from the e-mails we've seen on the
list that these defaults are fairly widely used, so changing the defaults
isn't likely to be popular.

> >Of course, you can avoid this problem in your HTML if you properly put
> >inline JavaScript code inside an HTML comment.  E.g.:
>
> I didn't think I had my JS set up this way but when I went in to check I
> actually have the proper comments in there. I thought that the parser read
> HTML comments but didn't index them.... and this bug is to do with the
> parser getting stuck with stuff it's reading even if it's not indexing it.

htdig's HTML parser completely strips out HTML comments, if these are
properly formed (i.e. they have an even number of dashes), so if you have
valid HTML comment delimiters around your JavaScript, the parser shouldn't
be thrown off by any "<" signs in the code.  Can you show us an example of
the comment delimiters you're using around the JavaScript?

... and in a later message ...

> This little bit (from Geoff's old email) is the same as setting
> noindex_start and end attributes, right?
>
>         case 29:        // "script"
>           noindex |= TAGscript;
>           nofollow |= TAGscript;
>           break;

No, these are two very different things.  The noindex_start and noindex_end
handling is done in the first pass through the HTML, at the same time the
comments are stripped out, and the text between these tags is completely
removed from the in-memory copy of the document.
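
To make the first-pass behaviour concrete, here is a minimal C++ sketch of
the idea.  It is not htdig's actual code: the function name strip_regions,
the case-insensitive matching, and the choice to drop the rest of the
document when the end marker is missing are all assumptions made for
illustration.  It takes a single start/end pair, in line with the
one-value-per-attribute limit mentioned above.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Byte-wise lowercase copy, so "<SCRIPT" also matches "<script".
// (Case-insensitive matching is an assumption of this sketch.)
static std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Hypothetical helper, not part of htdig: remove every region that runs
// from `start` through `end` (markers included) from the document text.
// If a `start` has no matching `end`, the rest of the document is dropped.
std::string strip_regions(const std::string& doc,
                          const std::string& start,
                          const std::string& end) {
    if (start.empty() || end.empty()) return doc;

    // Search in lowercase copies; lengths are unchanged, so the indices
    // found there are valid in the original string too.
    const std::string ldoc = to_lower(doc);
    const std::string lstart = to_lower(start);
    const std::string lend = to_lower(end);

    std::string out;
    std::size_t pos = 0;
    while (pos < doc.size()) {
        const std::size_t s = ldoc.find(lstart, pos);
        if (s == std::string::npos) {
            out.append(doc, pos, std::string::npos);  // no more regions
            break;
        }
        out.append(doc, pos, s - pos);                // keep text before it
        const std::size_t e = ldoc.find(lend, s + lstart.size());
        if (e == std::string::npos) break;            // unterminated region
        pos = e + lend.size();                        // resume after marker
    }
    return out;
}
```

With the markers from the example quoted earlier,
strip_regions(doc, "<SCRIPT", "</SCRIPT>") would turn
"a<script>if (x<y) z();</script>b" into "ab", so the "<" inside the
JavaScript is gone before any later tag parsing sees the text.
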
Setting the noindex and nofollow flags is done in the second pass, and
during that pass the parser still looks for tags in the code even when
noindex is set, because it may be the matching closing tag that it finds at
that point.  Geoff's suggestion was to handle <script> and <style> tags in
the first pass, rather than in the second.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930