On 8/1/07, Dennis Kubes <[EMAIL PROTECTED]> wrote:
>
> I am currently writing a python script to automate this whole process
> from inject to pushing out to search servers. It should be done in a
> day or two and I will post it on the wiki.
I'm very much looking forward to this. Reading the code always helps make
it concrete to me.

> You can do a dedup of results on the search itself. So yes there are
> duplicates in the different index segments, but you will always be
> returning the "best" pages to the user.

I get it; so you dedup based on the timestamp of each version of the
document with a particular URL that was a hit.

> > It also seems that I must be missing something regarding new pages. If, as
> > in step 9, you are replacing the index on a search server, wouldn't you
> > possibly create the effect of removing documents from the index? Say you
> > have the same 2 search servers, but do 10 iterations of fetching as a
> > "depth" of crawl. Wouldn't you be replacing the documents in search server
> > 1 several times over the course of those 10 iterations?
>
> No, because you are updating a single master crawldb and on the next
> iteration it wouldn't grab the same pages, it would grab the next best n
> pages.

I had the impression you were overwriting the index on the search servers
with the segment and index from the new iteration of fetching. Meaning that
in my 2-search-server example, iteration 3 of fetching would overwrite the
index built by iteration 1 of fetching (they'd both wind up on search
server 1). But instead, you're actually merging the results of iteration 3
into the search server's existing index from iteration 1, rather than
replacing the entire index?

- C
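To make the "dedup on the search itself" step concrete, here is a minimal
Python sketch of the idea: collapse hits that share a URL, keeping only the
version with the newest timestamp, then rank what survives. The field names
(url, tstamp, score) and the dict-based hit format are assumptions for
illustration only, not Nutch's actual search API.

    # Hypothetical sketch: each URL may appear in several index segments,
    # one hit per fetched version. Keep only the newest version per URL,
    # then return the survivors sorted by score, best first.
    def dedup_hits(raw_hits):
        best = {}
        for hit in raw_hits:
            seen = best.get(hit["url"])
            if seen is None or hit["tstamp"] > seen["tstamp"]:
                best[hit["url"]] = hit
        return sorted(best.values(), key=lambda h: h["score"], reverse=True)

    if __name__ == "__main__":
        hits = [
            {"url": "http://example.com/a", "tstamp": 1, "score": 0.8},
            {"url": "http://example.com/a", "tstamp": 2, "score": 0.7},
            {"url": "http://example.com/b", "tstamp": 1, "score": 0.9},
        ]
        for h in dedup_hits(hits):
            print(h["url"], h["tstamp"], h["score"])

Under this reading, duplicates across segments are harmless at query time:
the user only ever sees the latest version of each page, regardless of how
many segment indexes still contain older copies.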
