According to Patrice BAUMANN:
> A document can appear several times in a search result : how can I avoid
> this ?
This simple question is actually a complicated one. htdig does keep
track of the URLs it visits, so it never puts the same URL more than once
in the database. So, if you have duplicate documents in your search
results, it's because the same document appears under different URLs.
Sometimes the URLs vary only slightly, and in subtle ways, so you may
have to look hard to find out what the variation is. Here are some
common reasons, each requiring a different solution.
1) You're indexing a case-insensitive web server (e.g. an NT-based
server), but the case_sensitive attribute is still set to true. In this
case, if htdig encounters two URLs pointing to the same document, but
the case of the letters in one differs from the other (even by only
1 letter), it will treat them as different URLs. The solution is to set
case_sensitive to false.
2) You have symbolic links (or hard links) to some of these documents,
so they can be reached by several URLs. The solution here is to build
an exclude list of URLs that are actually symbolic links, and put
these in exclude_urls (or in your robots.txt file). You can automate
this using a technique similar to the find command in FAQ 5.25 which
builds the start_url list, but adding a -type l to find symbolic links.
3) You have copies of the same documents in different locations.
This is similar to the symbolic link problem above, but harder to fix
automatically.
4) The duplicate URLs result from CGI or SSI pages that give the same
content even though there may be variations in the query string or other
parts of the URL. The approach to fix this is similar to the fix above,
but may be less easy to automate, depending on what the variations are.
You can add patterns to exclude_urls or bad_querystr to get rid of
unwanted variations. These are especially important to bring under
control, because in some cases, if left unchecked, they can result in an
"infinite virtual hierarchy" which htdig will never be able to finish
indexing. For example, imagine a CGI-based calendar, where htdig could
go on following next-month or next-year links indefinitely; adding a
stop year to bad_querystr puts an end to that.
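
To make the fixes above a bit more concrete, here is a rough sketch of
what the relevant lines in htdig.conf might look like. The attribute
names are the ones discussed above; the values are only assumptions for
illustration (the /mirror/ and /docs/latest/ paths and the year=2005
query string are hypothetical), so substitute whatever matches the
duplicate URLs you actually see on your own site.

```
# Case 1: server ignores case in URLs, so htdig should as well.
case_sensitive: false

# Cases 2 and 3: skip URLs that are only alternate paths to documents
# already indexed elsewhere (hypothetical paths, use your own).
exclude_urls: /mirror/ /docs/latest/

# Case 4: reject URLs whose query string contains this pattern, e.g.
# to stop a calendar CGI at a given year (hypothetical parameter).
bad_querystr: year=2005
```

Note that setting exclude_urls replaces its previous value rather than
adding to it, so include any patterns you were already relying on.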
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930
FAQ: http://htdig.sourceforge.net/FAQ.html