Ah, OK, I get it. Sadly for me, this precise approach is probably not going to meet my requirements, but it really helps to get me going, and I think a variation on it will suit me quite well. I'm very much looking forward to seeing the script that automates this.
I have one minor quibble with this:

> And yes you may have some duplicates in your indexes but this is taken
> care of in the search itself (there is a dedupField option in
> NutchBean). Of the duplicates the one with the best score (most
> relevant) should be returned.

If you truly have two versions of the same page (same URL), I can imagine a scenario where you don't necessarily want the one with the highest score. If the content has changed, you want the one that was most recently fetched. You want the best chance of showing an excerpt from the current page and scoring the current content against other pages that are also hits.

Many thanks for all your help; it clears up a lot for me.

- Charlie
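
P.S. To make the quibble concrete, here is a rough sketch of the behaviour I have in mind. It is plain Python, not Nutch code: the Hit class, its fields (url, score, fetch_time) and the function name are all made up for illustration, and I am assuming each hit carries a fetch timestamp. It does not reflect how NutchBean or dedupField actually work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical search hit; these fields are assumptions, not Nutch's."""
    url: str         # the field we deduplicate on
    score: float     # relevance score for the query
    fetch_time: int  # when this copy was fetched (epoch seconds)


def dedup_by_latest_fetch(hits):
    """Keep one hit per URL, preferring the most recently fetched copy."""
    newest = {}
    for hit in hits:
        kept = newest.get(hit.url)
        # Replace the kept copy only if this one was fetched later.
        if kept is None or hit.fetch_time > kept.fetch_time:
            newest[hit.url] = hit
    # Rank the surviving (current) copies against each other by score.
    return sorted(newest.values(), key=lambda h: h.score, reverse=True)
```

With two copies of the same URL, the stale copy is dropped even when it scored higher, and the fresh copy is then ranked on its own score against the other hits.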
