On 8/1/07, Dennis Kubes <[EMAIL PROTECTED]> wrote:
>
> I am currently writing a python script to automate this whole process
> from inject to pushing out to search servers. It should be done in a
> day or two and I will post it on the wiki.
I'm very much looking forward to this. Reading the code always helps make
it concrete to me.

> You can do a dedup of results on the search itself. So yes there are
> duplicates in the different index segments, but you will always be
> returning the "best" pages to the user.

I get it; so you dedup based on the timestamp of each version of the
document with a particular URL that was a hit.

> > It also seems that I must be missing something regarding new pages. If, as
> > in step 9, you are replacing the index on a search server, wouldn't you
> > possibly create the effect of removing documents from the index? Say you
> > have the same 2 search servers, but do 10 iterations of fetching as a
> > "depth" of crawl. Wouldn't you be replacing the documents in search server
> > 1 several times over the course of those 10 iterations?
>
> No, because you are updating a single master crawldb and on the next
> iteration it wouldn't grab the same pages, it would grab the next best n
> pages.

I had the impression you were overwriting the index on the search servers
with the segment and index from the new iteration of fetching. Meaning that
in my 2-search-server example, iteration 3 of fetching would overwrite the
index built by iteration 1 of fetching (they'd both wind up on search
server 1). But instead, you're actually merging the results of iteration 3
into the search server's existing index from iteration 1, rather than
replacing the entire index?

- C
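To make the "dedup on the search itself" step concrete, here is a minimal
Python sketch of the idea: collapse hits that share a URL, keeping only the
version with the newest timestamp, then rank what survives. The field names
(url, tstamp, score) and the dict-based hit format are assumptions for
illustration only, not Nutch's actual search API.

    # Hypothetical sketch: each URL may appear in several index segments,
    # one hit per fetched version. Keep only the newest version per URL,
    # then return the survivors sorted by score, best first.
    def dedup_hits(raw_hits):
        best = {}
        for hit in raw_hits:
            seen = best.get(hit["url"])
            if seen is None or hit["tstamp"] > seen["tstamp"]:
                best[hit["url"]] = hit
        return sorted(best.values(), key=lambda h: h["score"], reverse=True)

    if __name__ == "__main__":
        hits = [
            {"url": "http://example.com/a", "tstamp": 1, "score": 0.8},
            {"url": "http://example.com/a", "tstamp": 2, "score": 0.7},
            {"url": "http://example.com/b", "tstamp": 1, "score": 0.9},
        ]
        for h in dedup_hits(hits):
            print(h["url"], h["tstamp"], h["score"])

Under this reading, duplicates across segments are harmless at query time:
the user only ever sees the latest version of each page, regardless of how
many segment indexes still contain older copies.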
