According to [EMAIL PROTECTED]:
> We use ht://Dig as search engine. To make sure that this (to searches
> irrelevant) template content is not indexed we have made heavy use of
> [noindex_start] and [noindex_end] -- in our case we used '<!--
> htdig_noindex_end -->' and '<!-- htdig_noindex_start -->'.
>
> Now I found out that htdig does not seem to consider these tags as white
> spaces forming separate words.

htdig's HTML parser essentially treats these tags, and everything between
them, as a comment tag, and simply strips it all out from start to end,
inserting nothing in its place. It does the same with a comment tag, e.g.:

    assessment<!-- this is a comment -->Estrogenic  -->  assessmentEstrogenic

I believe that's standard behaviour for how comments are stripped out, so
it seems logical that by extension the noindex_start and noindex_end would
be handled the same way. They're more akin to comment tags than actual
HTML tags. Even with standard HTML tags, not all of them cause word breaks.

This hasn't been raised as an issue in the past because these tags are
usually each put on their own separate lines.

> I understand that htdig 'completely' ignores everything between the
> respective tags (http://www.htdig.org/attrs.html#noindex_start) -- still I
> didn't expect this behaviour and I am not sure others do. Is there a work
> around or something I can change in the configuration to make sure that the
> last word before the beginning of the ignored section and the first word of
> the next section are seen/indexed as separate words? Do I have to put a
> (hard) white space before '<!-- htdig_noindex_start -->' (which I'd rather
> not have to do because our internet system has been developed out of house)?

Well, I find it odd that your system would allow you to add tags such as
these but not allow the addition of a space or newline before or after them.
In any case, I can think of a couple of workarounds. One might be to use
<noindex> and </noindex> tags instead, as these tags will trigger a word
break. The other would be to patch htdig/HTML.cc to add a space after
stripping out the noindex_start ... noindex_end section, thusly...

--- htdig/HTML.cc.orig  2002-01-09 16:12:31.000000000 -0600
+++ htdig/HTML.cc       2003-07-04 10:11:12.000000000 -0500
@@ -191,6 +191,7 @@ HTML::parse(Retriever &retriever, URL &b
                    *position = '\0';   // Rest of document will be skipped...
                else
                    position = q + skip_end_len;
+               *ptext++ = ' ';
                continue;
            }

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
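
For anyone who wants to see the effect of the one-line patch in isolation,
here is a minimal standalone C++ sketch. It is not htdig's parser:
strip_sections and its parameters are hypothetical names made up for this
illustration. It only mimics the idea of dropping everything from a start
marker through the matching end marker, and optionally writing a single
space where the section used to be.

```cpp
#include <string>

// Hypothetical helper, NOT part of htdig. Removes every section delimited
// by `start` ... `end` from `text`. When `insert_space` is true, one space
// is written in place of each removed section, which is the effect the
// patch to htdig/HTML.cc is meant to have.
std::string strip_sections(const std::string &text,
                           const std::string &start,
                           const std::string &end,
                           bool insert_space)
{
    std::string out;
    std::string::size_type pos = 0;

    while (pos < text.size()) {
        std::string::size_type s = text.find(start, pos);
        if (s == std::string::npos) {
            // No further start marker: keep the remainder unchanged.
            out.append(text, pos, std::string::npos);
            break;
        }

        // Keep the text before the marker, then optionally force a break.
        out.append(text, pos, s - pos);
        if (insert_space)
            out += ' ';

        std::string::size_type e = text.find(end, s + start.size());
        if (e == std::string::npos)
            break;  // No end marker: the rest of the document is skipped.
        pos = e + end.size();
    }
    return out;
}
```

With the markers '<!-- htdig_noindex_start -->' and
'<!-- htdig_noindex_end -->' and the input
"assessment<!-- htdig_noindex_start -->nav<!-- htdig_noindex_end -->Estrogenic",
passing false for insert_space gives "assessmentEstrogenic" (the behaviour
reported above), while passing true gives "assessment Estrogenic".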

