According to Jaap de Heer:
> I have a little problem excluding JavaScript from ht://dig
> indexes, being that the noindex_start, noindex_end
> attributes are apparently case sensitive.
> So when I set them to <SCRIPT and </SCRIPT>, pages with
> lowercase tags (<script>, </script>) still get indexed -
> including the JavaScript.
> I guess the solution to this could be to either make the
> noindex attributes case insensitive or allow multiple
> exclusions. Could anyone tell me if this is possible?
This was fixed a couple of weeks ago in the development snapshots, along
with a small bug and some documentation errors. This patch, which
I posted back then, should work for 3.1.1. The mailing list archives
at http://www.htdig.org/ are a good source of patches for known bugs
and common complaints.
--- ./htdig/HTML.cc.skipendbug Wed Mar 17 16:11:52 1999
+++ ./htdig/HTML.cc Wed Mar 17 17:05:15 1999
@@ -125,9 +125,10 @@
// Filter out section marked to be ignored for indexing.
// This can contain any HTML.
//
- if (strncmp((char *)position, skip_start, strlen(skip_start)) == 0)
+ if (*skip_start &&
+ mystrncasecmp((char *)position, skip_start, strlen(skip_start)) == 0)
{
- q = (unsigned char*)strstr((char *)position, skip_end);
+ q = (unsigned char*)mystrcasestr((char *)position, skip_end);
if (!q)
*position = '\0'; // Rest of document will be skipped...
else
--- ./htdoc/attrs.html.skipendbug Tue Feb 16 23:03:53 1999
+++ ./htdoc/attrs.html Wed Mar 17 16:21:55 1999
@@ -3433,7 +3433,7 @@
<dl>
<dt>
<strong><a name="noindex_start">noindex_start</a>,
- <a name="noindex_stop">noindex_stop</a></strong>
+ <a name="noindex_end">noindex_end</a></strong>
</dt>
<dd>
<dl>
@@ -3453,7 +3453,7 @@
<em>default:</em>
</dt>
<dd>
- <!--htdig-noindex--> <!--/htdig-noindex-->
+ <!--htdig_noindex--> <!--/htdig_noindex-->
</dd>
<dt>
<em>description:</em>
@@ -3468,14 +3468,14 @@
SCRIPT sections in 'uneditable' documents can be skipped; note
how
noindex_start does not contain an ending >: this allows for
all SCRIPT
tags to be matched regardless of attributes defined (different
types or
- languages).
+ languages). Note that the match for this string is case
+insensitive.
</dd>
<dt>
<em>example:</em>
</dt>
<dd>
noindex_start: <SCRIPT<br>
- noindex_stop: </SCRIPT>
+ noindex_end: </SCRIPT>
</dd>
</dl>
</dd>
--- ./htdoc/cf_byname.html.skipendbug Tue Feb 16 23:03:54 1999
+++ ./htdoc/cf_byname.html Wed Mar 17 16:22:47 1999
@@ -105,8 +105,8 @@
<img src="dot.gif" alt="*" width=9 height=9> <a target="body"
href="attrs.html#next_page_text">next_page_text</a><br>
<img src="dot.gif" alt="*" width=9 height=9> <a target="body"
href="attrs.html#no_excerpt_text">no_excerpt_text</a><br>
<img src="dot.gif" alt="*" width=9 height=9> <a target="body"
href="attrs.html#no_excerpt_show_top">no_excerpt_show_top</a><br>
+ <img src="dot.gif" alt="*" width=9 height=9> <a target="body"
+href="attrs.html#noindex_end">noindex_end</a><br>
<img src="dot.gif" alt="*" width=9 height=9> <a target="body"
href="attrs.html#noindex_start">noindex_start</a><br>
- <img src="dot.gif" alt="*" width=9 height=9> <a target="body"
href="attrs.html#noindex_stop">noindex_stop</a><br>
<img src="dot.gif" alt="*" width=9 height=9> <a target="body"
href="attrs.html#no_next_page_text">no_next_page_text</a><br>
<img src="dot.gif" alt="*" width=9 height=9> <a target="body"
href="attrs.html#no_page_list_header">no_page_list_header</a><br>
<img src="dot.gif" alt="*" width=9 height=9> <a target="body"
href="attrs.html#no_page_number_text">no_page_number_text</a><br>
--- ./htdoc/cf_byprog.html.skipendbug Tue Feb 16 23:03:54 1999
+++ ./htdoc/cf_byprog.html Wed Mar 17 16:23:10 1999
@@ -56,8 +56,8 @@
<img src="dot.gif" alt="*" width=9 height=9> <a target="body"
href="attrs.html#meta_description_factor">meta_description_factor</a><br>
<img src="dot.gif" alt="*" width=9 height=9> <a target="body"
href="attrs.html#minimum_word_length">minimum_word_length</a><br>
<img src="dot.gif" alt="*" width=9 height=9> <a target="body"
href="attrs.html#modification_time_is_now">modification_time_is_now</a><br>
+ <img src="dot.gif" alt="*" width=9 height=9> <a target="body"
+href="attrs.html#noindex_end">noindex_end</a><br>
<img src="dot.gif" alt="*" width=9 height=9> <a target="body"
href="attrs.html#noindex_start">noindex_start</a><br>
- <img src="dot.gif" alt="*" width=9 height=9> <a target="body"
href="attrs.html#noindex_stop">noindex_stop</a><br>
<img src="dot.gif" alt="*" width=9 height=9> <a target="body"
href="attrs.html#pdf_parser">pdf_parser</a><br>
<img src="dot.gif" alt="*" width=9 height=9> <a target="body"
href="attrs.html#remove_default_doc">remove_default_doc</a><br>
<img src="dot.gif" alt="*" width=9 height=9> <a target="body"
href="attrs.html#robotstxt_name">robotstxt_name</a><br>
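For anyone who wants to see the logic of the HTML.cc change outside the
parser, here is a minimal standalone sketch. It is not the ht://Dig code:
starts_with_nocase(), find_nocase() and strip_noindex() are hypothetical
names standing in for the case-insensitive compare and search that the
patch switches to, and the sketch assumes the parser resumes right after
the end marker, which the hunk above does not show.

```cpp
#include <cctype>
#include <cstring>
#include <string>

// Hypothetical helper: true if s begins with prefix, ignoring case.
static bool starts_with_nocase(const char *s, const char *prefix)
{
    for (; *prefix; ++s, ++prefix) {
        if (std::tolower((unsigned char)*s) !=
            std::tolower((unsigned char)*prefix))
            return false;   // also covers s ending before prefix does
    }
    return true;
}

// Hypothetical helper: case-insensitive substring search, NULL if absent.
static const char *find_nocase(const char *s, const char *needle)
{
    if (!*needle)
        return s;
    for (; *s; ++s) {
        if (starts_with_nocase(s, needle))
            return s;
    }
    return nullptr;
}

// Return a copy of text with every skip_start ... skip_end section removed.
// An empty skip_start disables skipping, mirroring the *skip_start guard
// added by the patch.
std::string strip_noindex(const std::string &text,
                          const char *skip_start, const char *skip_end)
{
    std::string out;
    const char *p = text.c_str();
    while (*p) {
        if (*skip_start && starts_with_nocase(p, skip_start)) {
            // Sketch choice: search past the start marker so the loop
            // always makes progress, even with an empty skip_end.
            const char *q = find_nocase(p + std::strlen(skip_start), skip_end);
            if (!q)
                break;      // no end marker: rest of document is skipped
            p = q + std::strlen(skip_end);  // assumed: resume after marker
            continue;
        }
        out += *p++;
    }
    return out;
}
```

With noindex_start set to <SCRIPT and noindex_end set to </SCRIPT>, a call
like strip_noindex(page, "<SCRIPT", "</SCRIPT>") drops a lowercase
<script type=...> ... </script> section as well, which is the behaviour
asked for in the original question.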
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930