I'll give this a shot, although I'm still kinda new with Nutch, so take
my suggestions with a grain of salt.

First of all, you're not going to move on to site2 until site1 is
finished crawling... so depending on your depth and the size of the
site, it might take forever just to crawl one site.  What I have had
success with is setting up a crawl for just one site at a time.  That
way I can have a URL filter tailored just for that site, which is much
easier to keep straight.

I won't go into too much detail because I don't want to provide
incorrect info to the list... but I have multiple instances of "nutch
crawl" running at the same time, each with its own set of URL filters
designed for just one site.  When the crawls are done I combine the
segments and index all the information.  If you would like me to go
into further detail I can.
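The rough idea is to give each instance its own conf directory.  A
sketch of what one instance might look like (I believe the bin/nutch
script honors NUTCH_CONF_DIR; if your version doesn't, you can run
each instance out of its own full copy of the Nutch directory instead;
the directory names here are placeholders):

    # each site gets its own copy of the stock conf, with its own
    # url filter (edit conf-site1/crawl-urlfilter.txt for site1)
    cp -r conf conf-site1
    # run the crawl against site1's seed list and conf dir
    NUTCH_CONF_DIR=conf-site1 bin/nutch crawl urls-site1 \
        -dir crawl-site1 -depth 3 -topN 1000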

This would work well in your situation, as you could create 20
instances of Nutch and just kick off all 20 crawls at once (if your
machine can handle it).  It would also make each URL filter much
easier to develop and test.

You could even write a very simple script that kicks off all the
crawls at once and then combines them into one index when they're done.
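Something along these lines (again only a sketch: sites.txt, the
conf-$site and urls-$site directories are placeholders you'd set up
ahead of time, and the exact arguments the merge command expects vary
by Nutch version, so double-check them by running bin/nutch with no
arguments):

    #!/bin/sh
    # kick off one crawl per site, all in the background
    for site in `cat sites.txt`; do
      NUTCH_CONF_DIR=conf-$site bin/nutch crawl urls-$site \
          -dir crawl-$site -depth 3 -topN 1000 &
    done
    wait  # block until every crawl has finished

    # the "crawl" command leaves an index under each crawl-$site/indexes;
    # "bin/nutch merge" (IndexMerger) should fold them into one index
    bin/nutch merge merged-index crawl-*/indexes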

If you are interested, and someone who knows more than I do says this
method is acceptable, I will explain it in a future mailing.


P.S. As far as which architecture to use, "Whole Web" or "Intranet": I
would stick with what you are doing until you understand it better, and
then decide whether the added flexibility of whole-web crawling is a
benefit to you.
