Hello,

Forgive me for my dumb questions, but I couldn't find any guidance in the other postings.

I want to crawl about 20 pre-defined (larger) sites, once a day, preferably in parallel to save time (threads?). Only the pages on those sites should be crawled, not links pointing to other sites. When querying the indexed material, all 20 sources should be searched in the same query. The urls file looks like this:

http://www.site1.com/
http://www.site2.com/
http://www.site3.com/
etc...

The file crawl-urlfilter.txt looks like this:

+^http://([a-z0-9]*\.)*site1\.com/
+^http://([a-z0-9]*\.)*site2\.com/
+^http://([a-z0-9]*\.)*site3\.com/
etc...
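For what it's worth, one common cause of external links being followed is that the filter list never ends with a catch-all exclude rule, so URLs matching none of the patterns can still slip through. A sketch of a complete crawl-urlfilter.txt under that assumption (the site names are the placeholders from above; the dots are escaped so "." matches literally):

```
# skip URLs containing characters that are probably CGI queries
-[?*!@=]
# accept only the listed sites (sub-domains included)
+^http://([a-z0-9]*\.)*site1\.com/
+^http://([a-z0-9]*\.)*site2\.com/
+^http://([a-z0-9]*\.)*site3\.com/
# ... one line per site ...
# exclude everything else
-.
```

The final `-.` line is what actually keeps the crawl inside the 20 domains, since the filter rules are applied top to bottom and the first match wins.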

I have tried several different approaches and configurations of these two files, but I never get the desired result. There is always just one crawling process, it never covers all 20 sites, and it follows external links to other sites as well.
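As a hedged sketch of how the fetching itself can be parallelized: the one-step crawl tool runs multiple fetcher threads inside a single process via the -threads option, so 20 separate processes are not needed. The directory names, depth, and thread count below are assumptions, not values from the original post:

```shell
# run once a day (e.g. from cron); urls/ holds the seed list shown above
bin/nutch crawl urls -dir crawl-$(date +%Y%m%d) -depth 5 -threads 20
```

Indexing all 20 sites into the same crawl directory means a single query searches all of them at once, which matches the stated requirement.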

Given the above, what Nutch setup or architecture should I use?

Best regards,

Jon

"I didn't realize that I was stupid until I got to know Nutch"
