Hello,
Forgive me for my dumb questions, but I couldn't find any guidance in the
other postings.
I want to crawl about 20 pre-defined (larger) sites, once a day, preferably
in parallel to save time (threads?). Only the pages on those sites should
be crawled, not links pointing to other sites. When querying the indexed
material, all 20 sources should be searched in the same query. The urls
file looks like this:
http://www.site1.com/
http://www.site2.com/
http://www.site3.com/
etc...
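I start the crawl with the one-step crawl tool from the tutorial, roughly
like this (the directory name, depth, and thread count are just examples
from my notes; -threads is my attempt at parallel fetching):

bin/nutch crawl urls -dir crawl -depth 5 -threads 20

For the once-a-day part I was planning a plain cron entry along these
lines (paths are made up for illustration; the % in the date format has
to be escaped in a crontab, and a dated -dir avoids reusing an existing
output directory):

0 3 * * * /opt/nutch/bin/nutch crawl /opt/nutch/urls -dir /opt/nutch/crawl-$(date +\%F) -depth 5 -threads 20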
The file crawl-urlfilter.txt looks like this:
+^http://([a-z0-9]*\.)*site1\.com/
+^http://([a-z0-9]*\.)*site2\.com/
+^http://([a-z0-9]*\.)*site3\.com/
etc...
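As far as I understand, the rules are checked top to bottom and the first
matching pattern decides, so I assume a complete file should end with a
catch-all deny rule, roughly like this sketch (site20 stands in for the
rest of the list):

+^http://([a-z0-9]*\.)*site1\.com/
...
+^http://([a-z0-9]*\.)*site20\.com/
# deny everything not matched above
-.

But I'm not sure this is the right way to keep the crawler on the 20 sites.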
I have tried several different approaches and configurations of these two
files, but I never get the desired result: there is always just one crawl
process running, it never covers all 20 sites, and it still follows
external links to other sites...
Given the above, what "Nutch-architecture" should I use?
Best regards,
Jon
"I didn't realize that I was stupid until I got to know Nutch"