I think we can accelerate this. We index the hostname in the "site" field. When re-querying, we could add a clause to the query that prohibits the sites we no longer want hits from.
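As a rough illustration of that idea, here is a minimal sketch that appends one prohibited clause per unwanted host to the query text before re-querying. The class and method names are made up for this example, and the "-site:host" form is an assumption about how a prohibited clause on the "site" field is written in query text; it is not taken from the committed code.

```java
import java.util.Collection;
import java.util.LinkedHashSet;

// Hypothetical helper, for illustration only; not part of Nutch.
final class SiteExclusion {

    /**
     * Returns the query text with one prohibited "site" clause appended
     * for each distinct host, in the order the hosts were first given.
     * Assumes "-site:host" is the textual form of a prohibited clause.
     */
    static String excludeSites(String query, Collection<String> hosts) {
        StringBuilder sb = new StringBuilder(query.trim());
        // LinkedHashSet drops duplicate hosts but keeps insertion order.
        for (String host : new LinkedHashSet<>(hosts)) {
            sb.append(" -site:").append(host);
        }
        return sb.toString();
    }
}
```

For example, excludeSites("apache lucene", hosts) with hosts a.example.com and b.example.com gives "apache lucene -site:a.example.com -site:b.example.com".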
I just committed this. I also re-wrote the support in NutchBean.java. It no longer uses HostHits and HostHit, but rather just Hits and Hit. Each Hit now contains the site name, so that site operations do not require access to the details.
Instead of site grouping, like Google, I opted for site reduction, like Yahoo!. Thus search.jsp now has a query parameter, hitsPerSite, which determines the maximum number of hits displayed for a site. When more than that number are found for a site, each displayed hit from that site gets a "more from this site" link, but the hits are not re-ordered or grouped by site.
When some hits are not shown, the last page of hits now has, instead of a "next" button, a "show all hits" button that re-queries with site reduction disabled.
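The site reduction described above can be sketched roughly as follows. This is not the code in NutchBean.java: SimpleHit is a hypothetical stand-in for a hit that carries its site name, and treating hitsPerSite <= 0 as "reduction disabled" is an assumption made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; names and types are not from Nutch.
final class SiteReduction {

    /** Hypothetical stand-in for a hit that knows its site. */
    static final class SimpleHit {
        final String site;
        final String url;
        boolean moreFromSite; // true if hits from this site were left out

        SimpleHit(String site, String url) {
            this.site = site;
            this.url = url;
        }
    }

    /**
     * Keeps at most hitsPerSite hits per site, preserving rank order.
     * Assumption: hitsPerSite <= 0 means reduction is disabled.
     */
    static List<SimpleHit> reduce(List<SimpleHit> ranked, int hitsPerSite) {
        if (hitsPerSite <= 0) {
            return new ArrayList<>(ranked);
        }
        Map<String, Integer> counts = new HashMap<>();
        List<SimpleHit> kept = new ArrayList<>();
        for (SimpleHit hit : ranked) {
            // Count every hit per site, but keep only the first hitsPerSite.
            int seen = counts.merge(hit.site, 1, Integer::sum);
            if (seen <= hitsPerSite) {
                kept.add(hit);
            }
        }
        // Flag kept hits whose site had more hits than were shown.
        for (SimpleHit hit : kept) {
            hit.moreFromSite = counts.get(hit.site) > hitsPerSite;
        }
        return kept;
    }
}
```

A "show all hits" request would then correspond to calling reduce with reduction disabled, and the page would offer that button whenever the reduced list is shorter than the ranked list.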
Doug
