I am currently writing a Python script to automate this whole process, from inject through pushing the indexes out to the search servers. It should be done in a day or two and I will post it on the wiki.

Dennis Kubes

charlie w wrote:
Thanks very much for the extended reply; lots of food for thought.

WRT the merge/index time on a large index, I kind of suspected this might be
the case.  It's already taking a bit of time (albeit on a weak box) with my
relatively small index.  In general the approach you outline sounds like
something I intuitively thought might need to be done, but had no
real experience to justify that intuition.

So if I understand you correctly, each iteration of fetching winds up on a
separate search server, and you're not doing any merging of segments?

When you eventually get around to recrawling a particular page, do you wind
up with problems if that page exists in two separate indexes on two separate
search servers?  For example, we fetch www.foo.com, and that page goes into
the index on search server 1.  Then, 35 days later, we go back to crawl
www.foo.com, and this time it winds up in the index on search server 2.
Wouldn't the two search servers return the same page as a hit to a search?
If not, what prevents that from being an issue?

You can do a dedup of the results at search time itself. So yes, there are duplicates in the different index segments, but you will always be returning the "best" pages to the user.
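
Just to illustrate the idea (this is not Nutch's actual code), deduping the hits merged back from the search servers amounts to keeping the best-scoring hit per url, roughly like:

  # Rough illustration only -- not Nutch's implementation.
  # hits is a list of (score, url, title) tuples gathered from all
  # of the search servers; keep the best-scoring hit for each url.
  def dedup_hits(hits):
      best = {}
      for score, url, title in hits:
          if url not in best or score > best[url][0]:
              best[url] = (score, url, title)
      return sorted(best.values(), reverse=True)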

It also seems that I must be missing something regarding new pages.  If, as
in step 9, you are replacing the index on a search server, wouldn't you
possibly create the effect of removing documents from the index?  Say you
have the same 2 search servers, but do 10 iterations of fetching as a
"depth" of crawl.  Wouldn't you be replacing the documents in search server
1 several times over the course of those 10 iterations?

No, because you are updating a single master crawldb. On the next iteration the generate step won't grab the same pages again; it grabs the next best n pages.


Once again, thanks.

- Charlie


On 7/31/07, Dennis Kubes <[EMAIL PROTECTED]> wrote:
It is not a problem to contact me directly if you have questions. I am
going to include this post on the mailing list as well in case other
people have similar questions.

When we originally started (and back when I wrote the tutorial), I
thought the best approach would be to have a single massive set of
segments, crawldb, linkdb, and indexes on the dfs.  If we had that, we
would then need an index splitter so we could split those massive
databases to get x number of urls on each search server.  The problem
with this approach, though, is that it doesn't scale very well (beyond
about 50M pages).  You have to keep merging whatever you are crawling
into your master, and after a while it takes a good deal of time to
continually sort, merge, and index.

The approach we are using these days is focused on smaller distributed
segments and hence indexes.  Here is how it works:

1) Inject your database with a beginning url list and fetch those pages.
2) Update a single master crawl db (at this point you only have one).
3) Do a generate with a -topN option to get the best urls to fetch.  Do
this for the number of urls you want on each search server.  A good rule
of thumb is no more than 2-3 million pages per disk for searching (this
is for web search engines).  So let's say your crawldb, once updated from
the first run, has > 2 million urls; you would do a generate with -topN
2000000.
4) Fetch this new segment through the fetch command.
5) Update the single master crawldb with this new segment.
6) Create a single master linkdb (at this point you will only have one)
through the invertlinks command.
7) Index that single fetched segment.
8) Use a script, etc. to push the single index, segments, and linkdb to
a search server directory from the dfs.
9) Do steps 3-8 for as many search servers as you have.  Once you reach
the number of search servers you have, you can start replacing the
indexes, etc. on the first, second, etc. search servers with new fetch
cycles.  This way your index always holds the best pages for the number
of servers and amount of space you have.  (A rough Python sketch of this
cycle is below.)
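
To make the cycle concrete, here is a rough Python sketch along the lines
of the script mentioned at the top of this thread.  The bin/nutch arguments
follow 0.9-style usage and may differ in your version; the paths, the topN
value, the server names, and push_to_server() are placeholders for your
own setup.

  import os
  import subprocess

  NUTCH = "bin/nutch"
  CRAWLDB = "crawl/crawldb"
  LINKDB = "crawl/linkdb"
  SEGMENTS = "crawl/segments"
  TOPN = "2000000"        # roughly 2-3 million pages per search server disk

  def nutch(*args):
      subprocess.check_call([NUTCH] + list(args))

  def latest_segment():
      # newest timestamped directory under segments/ (local paths here;
      # on the dfs you would list the directory with hadoop instead)
      return os.path.join(SEGMENTS, sorted(os.listdir(SEGMENTS))[-1])

  def push_to_server(server, index, segment):
      # placeholder: copy the index, segment, and linkdb out of the dfs
      # and rsync/scp them into the search server's local directory
      print("would push %s and %s to %s" % (index, segment, server))

  def cycle(server):
      nutch("generate", CRAWLDB, SEGMENTS, "-topN", TOPN)   # step 3
      segment = latest_segment()
      nutch("fetch", segment)                               # step 4
      nutch("updatedb", CRAWLDB, segment)                   # step 5
      nutch("invertlinks", LINKDB, segment)                 # step 6
      index = "crawl/index-" + server
      nutch("index", index, CRAWLDB, LINKDB, segment)       # step 7
      push_to_server(server, index, segment)                # step 8

  for server in ["search1", "search2"]:                     # step 9
      cycle(server)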

Once you have a linkdb created, meaning on the second or later fetch, you
would create a linkdb for just the single new segment and then use the
mergelinkdb command to merge that single linkdb into the master linkdb.
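
Reusing the nutch() helper and LINKDB path from the sketch above, that
step looks roughly like this (mergelinkdb usage is from Nutch 0.9 and may
differ in your version):

  def merge_linkdb(segment):
      # build a linkdb for just this one segment ...
      single = LINKDB + "-" + os.path.basename(segment)
      nutch("invertlinks", single, segment)
      # ... then merge it into the master; mergelinkdb writes a new output
      # linkdb, which you would then move into place as the new master
      merged = LINKDB + "-merged"
      nutch("mergelinkdb", merged, LINKDB, single)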

When pushing the pieces to search servers you can move the entire
linkdb, but after a while that is going to get big.  A better way is to
write a map reduce job that will split the linkdb to only include urls
for the single segment that you have fetched.  Then you would only move
that single linkdb piece out, not the entire master linkdb.  If you want
to get started quickly, though, just copy the entire linkdb to each search
server.
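
Just to show the idea (the real thing would be a map reduce job over the
linkdb data itself, not Python over dicts), the split amounts to keeping
only the linkdb entries whose urls appear in the newly fetched segment:

  # Idea only -- the real job would be map reduce over the linkdb itself.
  # linkdb_entries: dict of url -> inlinks; segment_urls: set of urls
  def split_linkdb(linkdb_entries, segment_urls):
      return {url: inlinks
              for url, inlinks in linkdb_entries.items()
              if url in segment_urls}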

This approach assumes that you have a search website fronting multiple
search servers (search-servers.txt) and that you can bring down a single
search server, update the index and pieces, and then bring the single
search server back up.  This way the entire index is never down.
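
For reference, search-servers.txt is just a plain text list of the search
servers the website queries, one host and port per line, something like
(the host names and port here are only examples):

  search1.example.com 9999
  search2.example.com 9999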

Hope this helps and let me know if you have any questions.

Dennis Kubes

