According to Budd, Sinclair:
> using htdig 3.2.0b4-20021110 on Solaris 8
>
> Dig with the following bits of config file:
> ............................
> minimum_word_length: 2
> external_protocols: https /home/ppp/htdig-test-3.2.0b4.1110.play/bin/handler.pl
> allow_in_form: search_algorithm
> allow_numbers: true
> database_dir: /3.2.0b4/1110/helpdesk
> conf: /home/ppp//htdig-test-3.2.0b4.1110/special-runs/helpdesk-dig
> max_hop_count: 14
> check_unique_md5: true
>
> external_parsers: application/pdf->text/html /usr/local/bin/doc2html.pl \
>                   application/msword->/text/html /usr/local/bin/doc2html.pl
>
> # the following configuration variable will prevent any unaccessed URLs
> # from being deleted.
> #remove_bad_urls: false
> remove_bad_urls: true
> ..................................
> The md5 sums appear in the -vvv listing OK.
> The results of a search may have two, sometimes three, entries with
> identical url and extract.
>
> Can you indicate what is wrong?
If you're getting identical URLs, then this isn't the usual sort of problem
with duplicates, where multiple URLs yield the same file. The only cause of
duplicates with identical URLs that I know of is searching multiple databases
in htsearch (3.2 beta), where the same URL appears in more than one database.
When I adapted the collections patch for 3.1.6 (in the patch archives), I
added this snippet of code to Display::sort() to remove duplicate URLs, just
after the qsort call. I don't know if anything of the sort ever made it into
the 3.2.0b4 code, even in recent snapshots. I also don't know how easy this
code would be to adapt to 3.2 (probably not too hard), or how well it works
(I did test it a bit, but not exhaustively).

    // In case there are duplicate URLs across collections, keep the "best"
    // ones after sorting them.
    Dictionary goturl;
    String url;
    char *coded_url;
    int j = 0;
    for (int i = 0; i < numberOfMatches; i++)
    {
        coded_url = array[i]->getURL();
        url = HtURLCodec::instance()->decode(coded_url);
        HtURLRewriter::instance()->Replace(url);
        if (goturl.Exists(url))
            delete array[i];
        else
        {
            array[j++] = array[i];
            goturl.Add(url, 0);
        }
    }
    numberOfMatches = j;

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]>
with a subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
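[Editor's note: the deduplication idea in the reply — walk the already-sorted
result list once, keep the first (best-ranked) entry per URL, and compact the
array in place — can be sketched in standalone modern C++, without htdig's
Dictionary/String/HtURLCodec classes. The function name and the use of
std::unordered_set are illustrative assumptions, not htdig code.]

    #include <cassert>
    #include <string>
    #include <unordered_set>
    #include <vector>

    // Given a list of result URLs already sorted by score (best first),
    // keep only the first occurrence of each URL, preserving order.
    // This mirrors the compaction loop in the patch above.
    std::vector<std::string> dedupSorted(const std::vector<std::string>& urls)
    {
        std::unordered_set<std::string> seen;   // plays the role of "goturl"
        std::vector<std::string> out;
        out.reserve(urls.size());
        for (const std::string& u : urls)
        {
            // insert().second is true only the first time a URL is seen
            if (seen.insert(u).second)
                out.push_back(u);
        }
        return out;
    }

    int main()
    {
        std::vector<std::string> sorted = {
            "http://example.com/a",   // best match
            "http://example.com/a",   // duplicate from a second database
            "http://example.com/b",
        };
        std::vector<std::string> unique = dedupSorted(sorted);
        assert(unique.size() == 2);
        assert(unique[0] == "http://example.com/a");
        assert(unique[1] == "http://example.com/b");
        return 0;
    }

Because the input is sorted best-first, keeping the first occurrence per URL
is what makes the kept entry the "best" one, as in the original patch.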