Re: [htdig] Masking out template content with [noindex_start] and [noindex_end]

Gilles Detillieux Fri, 04 Jul 2003 08:46:36 -0700

According to [EMAIL PROTECTED]:
> We use ht://Dig as search engine. To make sure that this (to searches
> irrelevant) template content is not indexed we have made heavy use of
> [noindex_start] and [noindex_end] -- in our case we used '<!--
> htdig_noindex_end -->' and '<!-- htdig_noindex_start -->'.
> 
> Now I found out that htdig does not seem to consider these tags as white
> spaces forming separate words.


htdig's HTML parser essentially treats these tags, and everything between
them, as a comment tag, and simply strips it all out from start to end,
inserting nothing in its place.  It does the same with a comment tag, e.g.
"assessment<!-- this is a comment -->Estrogenic" becomes "assessmentEstrogenic".
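
To make the effect concrete, here is a minimal standalone sketch of that kind
of stripping.  It is not htdig's actual parser code: strip_regions() is a
hypothetical helper written only for illustration, and it assumes that a start
marker with no matching end marker causes the rest of the text to be dropped.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of htdig): removes every region that begins
// with `start` and ends with `end`, inserting nothing in its place.
// Assumption: if `start` has no matching `end`, the rest of the text is dropped.
std::string strip_regions(const std::string &text,
                          const std::string &start,
                          const std::string &end)
{
    assert(!start.empty() && !end.empty());
    std::string out;
    std::string::size_type pos = 0;
    while (pos < text.size()) {
        std::string::size_type s = text.find(start, pos);
        if (s == std::string::npos) {
            // No further marker: keep the remainder unchanged.
            out.append(text, pos, std::string::npos);
            break;
        }
        // Keep the text before the marker, then skip the marked region.
        out.append(text, pos, s - pos);
        std::string::size_type e = text.find(end, s + start.size());
        if (e == std::string::npos)
            break;              // unterminated region: skip the rest
        pos = e + end.size();
    }
    return out;
}
```

Calling strip_regions() on the comment example above with "<!--" and "-->" as
the markers joins the two neighbouring words, because nothing is written where
the region used to be.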

I believe that's standard behaviour for how comments are stripped out,
so it seems logical that by extension the noindex_start and noindex_end
would be handled the same way...  They're more akin to comment tags than
actual HTML tags.  Even with standard HTML tags, not all of them cause
word breaks.

This hasn't been raised as an issue in the past because the way these
tags are usually used is to put them each on their own separate lines.

> I understand that htdig 'completely' ignores everything between the
> respective tags  (http://www.htdig.org/attrs.html#noindex_start) -- still I
> didn't expect this behaviour and I am not sure others do. Is there a work
> around or something I can change in the configuration to make sure that the
> last word before the beginning of the ignored section and the first word of
> the next section are seen/indexed as separate words? Do I have to put a
> (hard) white space before '<!-- htdig_noindex_start -->' (which I'd rather
> not have to do because our internet system has been developed out of house)?

Well, I find it odd that your system would allow you to add tags such as
these but not allow the addition of a space or newline before or after them.
In any case, I can think of a couple of workarounds.  One might be to use
<noindex> and </noindex> tags instead, as these tags will trigger a word
break.  The other would be to patch htdig/HTML.cc to add a space after
stripping out the noindex_start ... noindex_end section, like this:

--- htdig/HTML.cc.orig  2002-01-09 16:12:31.000000000 -0600
+++ htdig/HTML.cc       2003-07-04 10:11:12.000000000 -0500
@@ -191,6 +191,7 @@ HTML::parse(Retriever &retriever, URL &b
            *position = '\0';       // Rest of document will be skipped...
          else
            position = q + skip_end_len;
+         *ptext++ = ' ';
          continue;
        }
 

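The intended effect of the patch can be illustrated outside htdig with a small
standalone sketch.  This is not the code in HTML.cc: strip_regions_with_break()
is a hypothetical helper, and the marker strings in the usage note are only
examples.  Like the patched code, it emits one space for each skipped section,
including the case where no end marker is found.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of htdig): removes every region delimited by
// `start` and `end`, and writes a single space where each region was, so the
// words on either side stay separate.
std::string strip_regions_with_break(const std::string &text,
                                     const std::string &start,
                                     const std::string &end)
{
    assert(!start.empty() && !end.empty());
    std::string out;
    std::string::size_type pos = 0;
    while (pos < text.size()) {
        std::string::size_type s = text.find(start, pos);
        if (s == std::string::npos) {
            // No further marker: keep the remainder unchanged.
            out.append(text, pos, std::string::npos);
            break;
        }
        out.append(text, pos, s - pos);
        out.push_back(' ');     // plays the role of `*ptext++ = ' ';`
        std::string::size_type e = text.find(end, s + start.size());
        if (e == std::string::npos)
            break;              // unterminated region: skip the rest
        pos = e + end.size();
    }
    return out;
}
```

With "<!-- htdig_noindex_start -->" and "<!-- htdig_noindex_end -->" as the
markers, the input "assessment<!-- htdig_noindex_start -->menu<!-- htdig_noindex_end -->Estrogenic"
comes out as "assessment Estrogenic", so the two words would be indexed
separately.
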
-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

