According to me, back in late February...
> According to Frank Richter:
> > Then I had by mistake an empty noindex_start: value in the conf file, oh
> > dear, no words were indexed at all (my error, but might be dangerous for
> > others too).
>
> Yes, you're right. The code should check for an empty string, and disable
> the feature if that's the case. Right now, it just does a strncmp()
> with a length of 0, which will always match. I think this should also
> use mystrncasecmp() instead, and mystrcasestr() to find the end, so that
> it won't care if the tags are upper or lower case. Objections?
Well, I didn't hear any objections, so here's the patch to make these fixes
to htdig/HTML.cc, as well as fix up the discrepancies in the documentation.
I'll be committing these to CVS shortly.
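
For anyone who wants to see the two problems in isolation, here is a minimal sketch, separate from the patch itself. The helpers ci_strncmp() and ci_strstr() are hypothetical stand-ins written for this example; they only approximate what mystrncasecmp() and mystrcasestr() are assumed to do, and starts_skip_old() / starts_skip_new() are illustrative names, not functions in HTML.cc.

```cpp
#include <cassert>
#include <cctype>
#include <cstring>

// Hypothetical stand-in for mystrncasecmp(): compare at most n characters,
// ignoring case. Returns 0 when the compared prefixes are equal.
static int ci_strncmp(const char *a, const char *b, std::size_t n)
{
    assert(a && b);
    for (std::size_t i = 0; i < n; i++)
    {
        int ca = std::tolower((unsigned char)a[i]);
        int cb = std::tolower((unsigned char)b[i]);
        if (ca != cb)
            return ca - cb;
        if (ca == '\0')
            break;              // both strings ended together
    }
    return 0;
}

// Hypothetical stand-in for mystrcasestr(): case-insensitive strstr().
static const char *ci_strstr(const char *haystack, const char *needle)
{
    assert(haystack && needle);
    std::size_t len = std::strlen(needle);
    for (; *haystack; haystack++)
        if (ci_strncmp(haystack, needle, len) == 0)
            return haystack;
    return len == 0 ? haystack : nullptr;
}

// Old test: a length of 0 compares nothing, so an empty skip_start
// "matches" at every position and the whole document is skipped.
static bool starts_skip_old(const char *position, const char *skip_start)
{
    return std::strncmp(position, skip_start, std::strlen(skip_start)) == 0;
}

// Patched test: an empty skip_start disables the feature, and the
// comparison no longer depends on the case of the tag.
static bool starts_skip_new(const char *position, const char *skip_start)
{
    return *skip_start &&
        ci_strncmp(position, skip_start, std::strlen(skip_start)) == 0;
}
```

With an empty skip_start, starts_skip_old() reports a match for any text while starts_skip_new() does not; with skip_start set to "<SCRIPT", only the new test matches a lower-case "<script" tag, and ci_strstr() finds "</script>" when asked for "</SCRIPT>".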
--- ./htdig/HTML.cc.skipendbug Wed Mar 17 16:11:52 1999
+++ ./htdig/HTML.cc Wed Mar 17 17:05:15 1999
@@ -125,9 +125,10 @@
// Filter out section marked to be ignored for indexing.
// This can contain any HTML.
//
- if (strncmp((char *)position, skip_start, strlen(skip_start)) == 0)
+ if (*skip_start &&
+     mystrncasecmp((char *)position, skip_start, strlen(skip_start)) == 0)
{
- q = (unsigned char*)strstr((char *)position, skip_end);
+ q = (unsigned char*)mystrcasestr((char *)position, skip_end);
if (!q)
*position = '\0'; // Rest of document will be skipped...
else
--- ./htdoc/attrs.html.skipendbug Tue Feb 16 23:03:53 1999
+++ ./htdoc/attrs.html Wed Mar 17 16:21:55 1999
@@ -3433,7 +3433,7 @@
<dl>
<dt>
<strong><a name="noindex_start">noindex_start</a>,
- <a name="noindex_stop">noindex_stop</a></strong>
+ <a name="noindex_end">noindex_end</a></strong>
</dt>
<dd>
<dl>
@@ -3453,7 +3453,7 @@
<em>default:</em>
</dt>
<dd>
- <!--htdig-noindex--> <!--/htdig-noindex-->
+ <!--htdig_noindex--> <!--/htdig_noindex-->
</dd>
<dt>
<em>description:</em>
@@ -3468,14 +3468,14 @@
SCRIPT sections in 'uneditable' documents can be skipped; note how
noindex_start does not contain an ending >: this allows for all SCRIPT
tags to be matched regardless of attributes defined (different types or
- languages).
+ languages). Note that the match for this string is case insensitive.
</dd>
<dt>
<em>example:</em>
</dt>
<dd>
noindex_start: <SCRIPT<br>
- noindex_stop: </SCRIPT>
+ noindex_end: </SCRIPT>
</dd>
</dl>
</dd>
--- ./htdoc/cf_byname.html.skipendbug Tue Feb 16 23:03:54 1999
+++ ./htdoc/cf_byname.html Wed Mar 17 16:22:47 1999
@@ -105,8 +105,8 @@
<img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#next_page_text">next_page_text</a><br>
<img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#no_excerpt_text">no_excerpt_text</a><br>
<img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#no_excerpt_show_top">no_excerpt_show_top</a><br>
+ <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#noindex_end">noindex_end</a><br>
<img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#noindex_start">noindex_start</a><br>
- <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#noindex_stop">noindex_stop</a><br>
<img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#no_next_page_text">no_next_page_text</a><br>
<img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#no_page_list_header">no_page_list_header</a><br>
<img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#no_page_number_text">no_page_number_text</a><br>
--- ./htdoc/cf_byprog.html.skipendbug Tue Feb 16 23:03:54 1999
+++ ./htdoc/cf_byprog.html Wed Mar 17 16:23:10 1999
@@ -56,8 +56,8 @@
<img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#meta_description_factor">meta_description_factor</a><br>
<img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#minimum_word_length">minimum_word_length</a><br>
<img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#modification_time_is_now">modification_time_is_now</a><br>
+ <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#noindex_end">noindex_end</a><br>
<img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#noindex_start">noindex_start</a><br>
- <img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#noindex_stop">noindex_stop</a><br>
<img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#pdf_parser">pdf_parser</a><br>
<img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#remove_default_doc">remove_default_doc</a><br>
<img src="dot.gif" alt="*" width=9 height=9> <a target="body" href="attrs.html#robotstxt_name">robotstxt_name</a><br>
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930