Re: [htdig-dev] Possible Parser Bug (was Re: [htdig] reading

Gilles Detillieux Mon, 11 Mar 2002 19:23:15 -0800

According to Emma Jane Hogbin:
> >I think your assessment of the problem, and proposed solution, are
> >both bang-on.  The stuff between the <script> and </script> tag should
> >be stripped out entirely and not parsed for HTML tags.
> 
> I just found this:
> 
>   noindex_start, noindex_end
> 
>      type: string
>      used by: htdig
>      default: <!--htdig_noindex--> <!--/htdig_noindex-->
>      description:
> ...
>          noindex_start: <SCRIPT
>          noindex_end: </SCRIPT>
> 
> 
> Maybe the default could also have the example "<script" tags?
> Can you have multiple values for this?


No, right now there is support for only one value in each attribute.  We've
talked many times of extending it to support multiple values, but so far
no one has taken the time to implement it.  Ironically, I felt when getting
3.1.6 out that this was less of a priority now that the HTML parser had
built-in support for ignoring stuff between <script> and </script> tags.

As for the default value, the idea was to have a custom set of tags that
only htdig would recognize, and it seems from the e-mails we've seen on
the list that these defaults are fairly widely used, so changing the
defaults isn't likely to be popular.

> >Of course, you can avoid this problem in your HTML if you properly put
> >inline JavaScript code inside an HTML comment.  E.g.:
> 
> I didn't think I had my JS set up this way but when I went in to check I 
> actually have the proper comments in there. I thought that the parser read 
> HTML comments but didn't index them.... and this bug is to do with the 
> parser getting stuck with stuff it's reading even if it's not indexing it.

htdig's HTML parser completely strips out HTML comments, if these are
properly formed (i.e. they have an even number of dashes), so if you have
valid HTML comment delimiters around your JavaScript, the parser shouldn't
be thrown off by any "<" signs in the code.  Can you show us an example of
the comment delimiters you're using around the JavaScript?
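
To illustrate why commented-out JavaScript should be safe, here is a rough
sketch of comment stripping. It is not htdig's actual code: strip_comments
is a hypothetical helper, it does not check the dash count mentioned above,
and dropping the rest of the document on an unterminated comment is an
assumption made only for this example.

```cpp
#include <string>

// Hypothetical helper (not htdig's parser): remove every
// "<!--" ... "-->" span from the input, delimiters included.
std::string strip_comments(const std::string& html) {
    const std::string open = "<!--";
    const std::string close = "-->";
    std::string out;
    std::string::size_type pos = 0;
    while (pos < html.size()) {
        std::string::size_type start = html.find(open, pos);
        if (start == std::string::npos) {
            // No more comments: keep the rest of the document.
            out.append(html, pos, std::string::npos);
            break;
        }
        // Keep the text that precedes the comment.
        out.append(html, pos, start - pos);
        std::string::size_type end = html.find(close, start + open.size());
        if (end == std::string::npos) {
            // Assumption for this sketch: an unterminated comment
            // swallows the remainder of the document.
            break;
        }
        pos = end + close.size();
    }
    return out;
}
```

Given a script body wrapped as "<script><!-- if (a < b) { go(); } // --></script>",
the whole commented span is gone before any tag scanning happens, so the "<"
in the comparison never reaches the tag parser.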

... and in a later message ...
> This little bit (from Geoff's old email) is the same as setting
> noindex_start and end attributes, right?
> 
> case 29: // "script"
>     noindex |= TAGscript;
>     nofollow |= TAGscript;
>     break;

No, these are two very different things.  The noindex_start and
noindex_end handling is done in the first pass through the HTML, at the
same time the comments are stripped out, and the text between these
tags is completely removed from the in-memory copy of the document.
Setting the noindex and nofollow flags is done in the second pass, and
during that pass the parser still looks for tags in the code even when
noindex is set, because it may be the matching closing tag that it finds
at that point.  Geoff's suggestion was to handle <script> and <style>
tags in the first pass, rather than in the second.
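
The difference between the two passes can be sketched roughly as follows.
This is a simplified illustration, not htdig's source: remove_marked and
indexable_text are hypothetical names, the value given to TAGscript is
made up, the matching is case-sensitive, and the way the second pass gets
confused here is only one plausible failure mode.

```cpp
#include <cassert>
#include <string>

// First pass (sketch): delete everything from start_marker through
// end_marker, so later parsing never sees the enclosed text.
std::string remove_marked(const std::string& doc,
                          const std::string& start_marker,
                          const std::string& end_marker) {
    assert(!start_marker.empty() && !end_marker.empty());
    std::string out;
    std::string::size_type pos = 0;
    while (pos < doc.size()) {
        std::string::size_type start = doc.find(start_marker, pos);
        if (start == std::string::npos) {
            out.append(doc, pos, std::string::npos);
            break;
        }
        out.append(doc, pos, start - pos);
        std::string::size_type end =
            doc.find(end_marker, start + start_marker.size());
        if (end == std::string::npos) break;  // assumption: drop unterminated span
        pos = end + end_marker.size();
    }
    return out;
}

// Flag name taken from the quoted snippet; the value is hypothetical.
const unsigned TAGscript = 0x01;

// Second pass (sketch): collect indexable text. Even while noindex is
// set, every '<' is still treated as the start of a tag, because it
// might be the closing tag that clears the flag.
std::string indexable_text(const std::string& doc) {
    unsigned noindex = 0;
    std::string out;
    std::string::size_type pos = 0;
    while (pos < doc.size()) {
        if (doc[pos] == '<') {
            std::string::size_type close = doc.find('>', pos);
            if (close == std::string::npos) break;  // malformed tag: stop
            const std::string tag = doc.substr(pos + 1, close - pos - 1);
            if (tag == "script") noindex |= TAGscript;
            else if (tag == "/script") noindex &= ~TAGscript;
            pos = close + 1;
        } else {
            if (noindex == 0) out += doc[pos];
            ++pos;
        }
    }
    return out;
}
```

With the input "a<script>if (x < y) go();</script>b<p>c</p>", calling
indexable_text directly yields only "a": the "<" in the comparison is taken
as a tag start, the real </script> is swallowed as part of that bogus tag,
and the noindex flag is never cleared. Running remove_marked with "<script"
and "</script>" first, then indexable_text, yields "abc".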

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
