Yesterday, I wrote:

> I've pretty much run out of time to help out with this release, but before
> I leave you, I thought I'd submit the following patch for your testing and
> approval.  It should fix the duplicate URL problem in htsearch collections,
> in bug #504087.  I'm not sure what sort of performance impact it will have
> on large databases with lots of potential matches, though.  That may be
> something to consider/test for.
>
> Cheers,
> Gilles
>
> --- htsearch/Display.cc.orig	2003-10-25 07:40:23.000000000 -0500
> +++ htsearch/Display.cc	2003-10-27 17:55:52.000000000 -0600
> @@ -1895,6 +1895,27 @@ Display::sort(List *matches)
>      qsort((char *) array, numberOfMatches, sizeof(ResultMatch *),
>  	   array[0]->getSortFun());
>  
> +    // In case there are duplicate URLs across collections, keep "best" ones
> +    // after sorting them.
> +    Dictionary	goturl;
> +    String	url;
> +    int	j = 0;
> +    for (i = 0; i < numberOfMatches; i++)
> +    {
> +	Collection	*collection = array[i]->getCollection();
> +	DocumentRef	*ref = collection->getDocumentRef(array[i]->getID());
> +	url = ref->DocURL();
> +	HtURLRewriter::instance()->replace(url);
> +	if (goturl.Exists(url))
> +	    delete array[i];
> +	else
> +	{
> +	    array[j++] = array[i];
> +	    goturl.Add(url, 0);
> +	}
> +    }
> +    numberOfMatches = j;
> +
>      const String st = config->Find("sort");
>      if (!st.empty() && mystrncasecmp("rev", st, 3) == 0)
>      {
I've had just a few more quick thoughts about this patch which I thought I should share with you folks.

1) I think there should probably be an explicit delete of "ref" (and perhaps of "collection" too, if the caller owns it) at the bottom of the for loop, to prevent a memory leak.  I don't think C++ would do that automatically, would it?  (See the sketch in the P.S. below.)

2) This strikes me as a potentially inefficient approach to the problem, because it has to fetch the DocumentRef a second time for every matching document.  In the 3.1.6 code the URL is available right in the ResultMatch class, so the collections.0 patch for 3.1.6 was able to do this much less expensively.  A better way might be to catch this in Display::buildMatchList(), as the list of results is first being built, and to merge the results from the collections so that the best score for a given URL is retained (a rough illustration is in the second P.S. below).  That would be a bit more complicated than my quick adaptation of the 3.1.6 patch above, and I don't have the time now to do this properly.  Any other takers?

3) You may want to consider how serious bug #504087 is, and whether it needs to be fixed before this release.  I don't doubt that this is a bug, and not just a feature request, but it may not be serious enough to risk introducing a potentially performance-impacting fix this close to release.  On the other hand, if you can come up with a quick, reliable and "inexpensive" fix for it, then please do.

Cheers,
Gilles

-- 
Gilles R. Detillieux     E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
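P.S.  Here's what I mean in point 1: the same loop as in the patch above, with an explicit delete of the DocumentRef added.  This is only a sketch, and it assumes getDocumentRef() hands back a copy the caller owns, while getCollection() just returns a borrowed pointer; if the Collection is handed over too, it would need its own delete as well.

    Dictionary	goturl;
    String	url;
    int	j = 0;
    for (i = 0; i < numberOfMatches; i++)
    {
	Collection	*collection = array[i]->getCollection();
	DocumentRef	*ref = collection->getDocumentRef(array[i]->getID());
	url = ref->DocURL();
	HtURLRewriter::instance()->replace(url);
	if (goturl.Exists(url))
	    delete array[i];		// duplicate URL: drop this match
	else
	{
	    array[j++] = array[i];	// first, i.e. best-sorted, occurrence
	    goturl.Add(url, 0);
	}
	delete ref;			// release the ref in either case
    }
    numberOfMatches = j;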
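P.P.S.  And a rough illustration of the merge I have in mind for point 2.  I'm writing this from memory rather than against the actual buildMatchList() code, so treat it as pseudo-code: "bestByURL", and the getScore()/setScore() calls on ResultMatch, are placeholders for whatever the real method provides.  The idea is to look each URL up once, as its match is first built, and keep only one entry per URL, with the best score:

    Dictionary	bestByURL;	// URL -> best ResultMatch seen so far
    ...
    // for each candidate match, once its URL has been rewritten:
    ResultMatch	*prev = (ResultMatch *) bestByURL.Find(url);
    if (!prev)
    {
	bestByURL.Add(url, match);	// first sighting of this URL
	matches->Add(match);
    }
    else
    {
	if (match->getScore() > prev->getScore())
	    prev->setScore(match->getScore());	// retain the best score
	delete match;		// don't add a duplicate entry to the list
    }

One caveat: this keeps the first match's collection and excerpt and only bumps its score, which may or may not be the behaviour we want.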