Hello,
Forgive me for my dumb questions, but I couldn't find any guidance in the
other postings.
I want to crawl about 20 pre-defined (larger) sites, once a day, preferably
in parallel to save time (threads?). Only the pages on those sites should
be crawled, not links pointing to other sites. When querying the indexed
material, all 20 sources should be searched in the same query. The urls
file looks like this:
http://www.site1.com/
http://www.site2.com/
http://www.site3.com/
etc...
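I start the crawl with the one-step crawl tool from the tutorial, roughly
like this (the directory name, depth, and thread count are just examples
from my notes; -threads is my attempt at parallel fetching):

bin/nutch crawl urls -dir crawl -depth 5 -threads 20

For the once-a-day part I was planning a plain cron entry along these
lines (paths are made up for illustration; the % in the date format has
to be escaped in a crontab, and a dated -dir avoids reusing an existing
output directory):

0 3 * * * /opt/nutch/bin/nutch crawl /opt/nutch/urls -dir /opt/nutch/crawl-$(date +\%F) -depth 5 -threads 20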
The file crawl-urlfilter.txt looks like this:
+^http://([a-z0-9]*\.)*site1\.com/
+^http://([a-z0-9]*\.)*site2\.com/
+^http://([a-z0-9]*\.)*site3\.com/
etc...
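As far as I understand, the rules are checked top to bottom and the first
matching pattern decides, so I assume a complete file should end with a
catch-all deny rule, roughly like this sketch (site20 stands in for the
rest of the list):

+^http://([a-z0-9]*\.)*site1\.com/
...
+^http://([a-z0-9]*\.)*site20\.com/
# deny everything not matched above
-.

But I'm not sure this is the right way to keep the crawler on the 20 sites.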
I have tried several different approaches and configurations of these two
files, but I never get the desired result: there is always just one crawl
process running, it never covers all 20 sites, and it still follows
external links to other sites...
Given the above, what "Nutch-architecture" should I use?
Best regards,
Jon
"I didn't realize that I was stupid until I got to know Nutch"