Re: Nutch and distributed searching (w/ apologies)

charlie w Thu, 02 Aug 2007 07:45:56 -0700

Ah, OK, I get it.  Sadly for me, this precise approach is probably not going
to meet my requirements, but it really helps to get me going, and I think a
variation on it will suit me quite well.  I'm very much looking forward to
seeing the script that automates this.


I have one minor quibble with this:


> And yes you may have some duplicates in your indexes but this is taken
> care of in the search itself (there is a dedupField option in
> NutchBean).  Of the duplicates the one with the best score (most
> relevant) should be returned.


If you truly have two versions of the same page (same URL), I can imagine a
scenario where you don't necessarily want the one with the highest score.
If the content has changed, you want the one that was most recently
fetched.  You want the best chance of showing an excerpt from the current
page and scoring the current content against other pages that are also hits.
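
A minimal sketch of the difference, in plain Python rather than Nutch code.
The Hit record, its field names, and both helper functions are hypothetical
and exist only for illustration; they are not part of NutchBean or any Nutch
API.  It assumes each hit carries its URL, a relevance score, and a fetch
timestamp.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Hit:
    # Hypothetical search hit; these fields are illustrative, not Nutch's API.
    url: str
    score: float
    fetch_time: int  # assumed to be seconds since the epoch


def dedup_by_score(hits: Iterable[Hit]) -> List[Hit]:
    """Keep the highest-scoring hit per URL (the behaviour quoted above)."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        kept = best.get(hit.url)
        if kept is None or hit.score > kept.score:
            best[hit.url] = hit
    return sorted(best.values(), key=lambda h: h.score, reverse=True)


def dedup_by_recency(hits: Iterable[Hit]) -> List[Hit]:
    """Keep the most recently fetched hit per URL, then rank by score."""
    newest: Dict[str, Hit] = {}
    for hit in hits:
        kept = newest.get(hit.url)
        if kept is None or hit.fetch_time > kept.fetch_time:
            newest[hit.url] = hit
    return sorted(newest.values(), key=lambda h: h.score, reverse=True)


hits = [
    Hit("http://example.com/a", score=2.0, fetch_time=100),  # stale copy
    Hit("http://example.com/a", score=1.0, fetch_time=200),  # current copy
    Hit("http://example.com/b", score=1.5, fetch_time=150),
]

by_score = dedup_by_score(hits)
by_recency = dedup_by_recency(hits)
```

With this sample data, deduplicating by score keeps the stale copy of the
first URL and ranks it first, while deduplicating by recency keeps the current
copy, whose lower score then places it below the other page.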

Many thanks for all your help; it clears up a lot for me.

- Charlie

