According to Budd, Sinclair:
> using htdig 3.2.0b4-20021110 on Solaris 8
> 
> Dig with the following bits of config file:
> ............................
> minimum_word_length:    2
> external_protocols:     https /home/ppp/htdig-test-3.2.0b4.1110.play/bin/handler.pl
> allow_in_form:          search_algorithm 
> allow_numbers:          true
> database_dir:         /3.2.0b4/1110/helpdesk
> conf:                   /home/ppp//htdig-test-3.2.0b4.1110/special-runs/helpdesk-dig
> max_hop_count:               14
> check_unique_md5:       true
> 
> external_parsers:      application/pdf->text/html /usr/local/bin/doc2html.pl \
>                        application/msword->text/html /usr/local/bin/doc2html.pl
> 
> #  the following configuration variable will prevent any unaccessed URLs from being deleted.
> #remove_bad_urls:             false
> remove_bad_urls:             true
> ..................................
> The md5 sums appear in the -vvv listing OK.
> The results of a search may have two, sometimes three, entries with
> identical URL and extract.
> 
> Can you indicate what is wrong?

If you're getting identical URLs, then this isn't the usual sort
of problem with duplicates, where multiple URLs give the same file.
The only thing I know of that causes duplicates with identical URLs is
searching multiple databases in htsearch (3.2 beta), where the same URL
appears in more than one database.
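
Just to illustrate the mechanism (this is not htdig code; the Match
struct, scores and URLs below are invented for the example), merging
per-database result lists with no dedup step leaves one entry per
database for any shared URL:

    // Illustration only: why merging per-database results duplicates a
    // URL that was dug into more than one database.
    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    struct Match { std::string url; int score; };  // stand-in for a result match

    int main()
    {
        // Hypothetical results from two collections that share one document.
        std::vector<Match> dbA = { {"http://example.com/faq.html", 90},
                                   {"http://example.com/a.html",   40} };
        std::vector<Match> dbB = { {"http://example.com/faq.html", 85},
                                   {"http://example.com/b.html",   70} };

        // Concatenate and rank, as a multi-database search would.
        std::vector<Match> merged(dbA);
        merged.insert(merged.end(), dbB.begin(), dbB.end());
        std::sort(merged.begin(), merged.end(),
                  [](const Match &a, const Match &b) { return a.score > b.score; });

        for (const Match &m : merged)
            std::cout << m.score << "  " << m.url << "\n";
        // faq.html comes out twice, once for each database it appears in.
    }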

When I adapted the collections patch for 3.1.6 (in the patch archives),
I added this snippet of code to Display::sort() to remove duplicate URLs,
just after the qsort call, but I don't know if anything of the sort ever
made it into the 3.2.0b4 code, even in recent snapshots.  I also don't
know how easy this code would be to adapt to 3.2 (probably not too
hard), or how well it works (I did test it a bit, but not exhaustively).

    // In case there are duplicate URLs across collections, keep "best" ones
    // after sorting them.
    Dictionary  goturl;
    String      url;
    char        *coded_url;
    int         j = 0;
    for (i = 0; i < numberOfMatches; i++)
    {
        // Decode the stored URL and apply any configured rewrite rules,
        // so equivalent URLs compare as identical strings.
        coded_url = array[i]->getURL();
        url = HtURLCodec::instance()->decode(coded_url);
        HtURLRewriter::instance()->Replace(url);
        if (goturl.Exists(url))
            delete array[i];        // seen already: drop this lower-ranked match
        else
        {
            array[j++] = array[i];  // first occurrence: keep it and compact the array
            goturl.Add(url, 0);     // record the URL as seen
        }
    }
    numberOfMatches = j;
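
Note that because this runs right after the sort, the first entry kept
for each URL is the one that ranked highest, so the "best" match is the
one that survives; the Dictionary is just acting as a set of URLs seen
so far.  If you'd like to try the idea outside the htdig tree first,
here's the same keep-first dedup sketched with the standard library
(again just a sketch, not the actual patch):

    // Sketch of the same keep-first-after-sort dedup using the standard
    // library instead of htdig's Dictionary/String classes.
    #include <cstddef>
    #include <string>
    #include <unordered_set>
    #include <vector>

    // Drop later duplicates in place, preserving order, so that after a
    // best-first sort each URL keeps only its highest-ranked entry.
    void dedup_by_url(std::vector<std::string> &urls)
    {
        std::unordered_set<std::string> seen;
        std::size_t j = 0;
        for (std::size_t i = 0; i < urls.size(); i++)
            if (seen.insert(urls[i]).second)   // true only on first occurrence
                urls[j++] = urls[i];
        urls.resize(j);
    }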


-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

