Yesterday, I wrote:
> I've pretty much run out of time to help out with this release, but before
> I leave you, I thought I'd submit the following patch for your testing and
> approval.  It should fix the duplicate URL problem in htsearch collections,
> in bug #504087.  I'm not sure what sort of performance impact it will have
> on large databases with lots of potential matches, though.  That may be
> something to consider/test for.
> 
> Cheers,
> Gilles
> 
> --- htsearch/Display.cc.orig  2003-10-25 07:40:23.000000000 -0500
> +++ htsearch/Display.cc       2003-10-27 17:55:52.000000000 -0600
> @@ -1895,6 +1895,27 @@ Display::sort(List *matches)
>      qsort((char *) array, numberOfMatches, sizeof(ResultMatch *),
>         array[0]->getSortFun());
>  
> +    // In case there are duplicate URLs across collections, keep "best" ones
> +    // after sorting them.
> +    Dictionary       goturl;
> +    String   url;
> +    int              j = 0;
> +    for (i = 0; i < numberOfMatches; i++)
> +    {
> +     Collection *collection = array[i]->getCollection();
> +     DocumentRef *ref = collection->getDocumentRef(array[i]->getID());
> +     url = ref->DocURL();
> +     HtURLRewriter::instance()->replace(url);
> +     if (goturl.Exists(url))
> +         delete array[i];
> +     else
> +     {
> +         array[j++] = array[i];
> +         goturl.Add(url, 0);
> +     }
> +    }
> +    numberOfMatches = j;
> +
>      const String st = config->Find("sort");
>      if (!st.empty() && mystrncasecmp("rev", st, 3) == 0)
>      {

I've had a few more quick thoughts about this patch that I thought I
should share with you folks.

1) I think there should probably be an explicit delete of "ref" at the
bottom of the for loop, to prevent a memory leak, and the same for
"collection" if getCollection() returns a caller-owned copy rather than a
pointer that the collections list still owns.  C++ certainly won't free
raw pointers like these automatically.
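
For what it's worth, here's a rough sketch of what I mean, with the loop
from the patch above repeated.  This assumes getDocumentRef() hands back
a copy the caller owns, while getCollection() returns a borrowed pointer,
so only "ref" is freed here:

    for (i = 0; i < numberOfMatches; i++)
    {
        Collection *collection = array[i]->getCollection();
        DocumentRef *ref = collection->getDocumentRef(array[i]->getID());
        url = ref->DocURL();
        HtURLRewriter::instance()->replace(url);
        if (goturl.Exists(url))
            delete array[i];
        else
        {
            array[j++] = array[i];
            goturl.Add(url, 0);
        }
        delete ref;     // free the DocumentRef fetched above, every iteration
    }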

2) This strikes me as a potentially inefficient approach to this problem,
as it would require a second fetching of every DocumentRef for all
matching documents.  In the 3.1.6 code, the URL is available in the
ResultMatch class, so the collections.0 patch for 3.1.6 was able to do
this much less expensively.  A better way may be to catch this in
Display::buildMatchList(), while the list of results is first being
built, and to somehow merge the results from the collections so that the
best score for a given URL is retained.  That would be a bit more
complicated than my quick adaptation of the 3.1.6 patch above, and I
don't have the time now to do it properly.  Any other takers?
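
To give any takers a head start, here's the sort of thing I'm picturing
inside the per-document loop of buildMatchList().  This is only a sketch:
it assumes the DocumentRef ("ref") is already in hand when each match
("thisMatch") is built there, that ResultMatch has getScore()/setScore()
accessors, and that a Dictionary named "bestmatch", keyed on the
rewritten URL, persists across the loop:

    String url = ref->DocURL();
    HtURLRewriter::instance()->replace(url);
    ResultMatch *prev = (ResultMatch *) bestmatch.Find(url);
    if (!prev)
    {
        // First sighting of this URL: keep the match.
        bestmatch.Add(url, thisMatch);
        matches->Add(thisMatch);
    }
    else
    {
        // Duplicate URL from another collection: keep only the best score.
        if (thisMatch->getScore() > prev->getScore())
            prev->setScore(thisMatch->getScore());
        delete thisMatch;
    }

Carrying over just the score is a simplification; to do it right you'd
want to retain whichever match's collection and document ID go with that
best score, which is where this gets more complicated than the sort()
approach above.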

3) You may want to consider how serious bug #504087 is, and whether it
needs to be fixed before this release.  I don't doubt that this is a bug,
and not just a feature request, but it may not be serious enough to risk
introducing a potentially performance-impacting fix for it this close to
release.  On the other hand, if you can come up with a quick, reliable
and "inexpensive" fix for this, then please do.

Cheers,
Gilles

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

