Hello All,

I am using HTDig 3.1.6 on a large web site that has many aliases for its
pages, so different URLs point to the same content.  This causes duplicate
search results, since HTDig uses the URL as the unique id.  People are also
inconsistent about how they write URLs, so http://www.military.com/spouse
and http://www.military.com/spouse/ (note the trailing slash) come up as
separate results as well.

I have tried a few different things, such as search_rewrite_rules (
search_rewrite_rules: http://(.*)/$   http://\\1 ), but the regex was too
greedy and htsearch still displayed duplicate results.  My next guess is
url_rewrite_rules, but I am unsure how to write the regexes, and whether
htsearch will dedupe results that share the same URL after rewriting.
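One tighter pattern I am considering (untested -- I have not verified which
anchors and character classes the 3.1.6 rewrite engine honors, so treat this
as a sketch only) would refuse to let the match swallow anything past the
trailing slashes:

```
# Untested sketch: collapse one or more trailing slashes.
# [^/] keeps the capture from ending in a slash itself.
search_rewrite_rules: ^(http://.*[^/])/+$ \1
```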

How can I get htsearch to rewrite these URLs and dedupe the ones that end up
identical?  Some of the URLs are very ugly and would require complex
regexes.  If I cannot do it within the HTDig framework, I may have to htdump
the indexes created by htdig, post-process the dump files with a Perl script
that munges the URLs as needed, and then load and merge the new indexes.  If
even that is not possible, I may have to munge the search results on the fly
and suppress the dupes (ugh!)
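For the post-processing route, the canonicalization itself is the easy part.
Here is a rough sketch of the kind of munge-and-dedupe pass I have in mind
(in Python rather than Perl, but the logic is the same); the rules here --
strip trailing slashes, lowercase the host, drop an index.html suffix -- are
just examples of our aliases, not a definitive list:

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    """Reduce aliased URLs to one canonical form (example rules only)."""
    scheme, netloc, path, query, frag = urlsplit(url)
    netloc = netloc.lower()                  # host names are case-insensitive
    if path.endswith('/index.html'):         # site-specific alias (assumption)
        path = path[:-len('index.html')]
    path = path.rstrip('/') or '/'           # trailing-slash aliases
    return urlunsplit((scheme, netloc, path, query, ''))

def dedupe(urls):
    """Keep the first record for each canonical URL, preserving order."""
    seen = set()
    out = []
    for u in urls:
        c = canonicalize(u)
        if c not in seen:
            seen.add(c)
            out.append(c)
    return out
```

The same pass would run over each URL field in the htdump output before the
indexes are reloaded and merged.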


Dennis Watson [EMAIL PROTECTED]
UNIX System Administrator Military.com



_______________________________________________
ht://Dig general mailing list: <htdig-general@lists.sourceforge.net>
ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-general
