Jon,

First, we need to get rid of this thought:
"I didn't realize that I was stupid until I got to know Nutch"
Gotta keep a positive view; this is not easy software to learn in a week or so.

Now,

1. Threads. That happens by default. It's specified in the conf file, and the default values are good enough. I would encourage you to read through the nutch-default.xml file, as that will give you an overview of all the things available in Nutch.

2. Don't follow external links. Check whether you are using the new version of Nutch. The older version had a bug where links would get added to the DB without being filtered. This has since been fixed. I would also urge you to apply Andrzej's fetcher patch.

For starters, I would recommend not following links at all and checking whether every URL in your initial list gets indexed; that should help figure out what is keeping some of the 20 sites from being indexed. Then add link following back. Take a look at http://www.siteXX.com/robots.txt manually to confirm whether you are being blocked by the sites that are not being indexed.

CC-

-----Original Message-----
From: J B [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 01, 2005 11:59 AM
To: [email protected]
Subject: Architecture for parallell crawling

Hello,

Forgive me for my dumb questions, but I couldn't find any guidance in the other postings.

I want to crawl about 20 pre-defined (larger) sites, once a day, preferably in parallel to save time (threads?). Only the pages on those sites should be crawled, not links pointing to other sites. When querying the indexed material, all 20 sources should be searched in the same query.

The urls-file looks like this:

http://www.site1.com/
http://www.site2.com/
http://www.site3.com/
etc...

The file crawl-urlsfilter.txt looks like this:

+^http://([a-z0-9]*\.)*site1.com/
+^http://([a-z0-9]*\.)*site2.com/
+^http://([a-z0-9]*\.)*site3.com/
etc...

I have tried several different approaches and configurations of these two files, but I never get the desired result. There's always just one crawling process, and it never gets all 20 sites. Moreover, it follows external links to other sites...

Given the above, what "Nutch architecture" should I use?

Best regards,
Jon

"I didn't realize that I was stupid until I got to know Nutch"
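
As a rough illustration of how the filter lines quoted in the original message are meant to behave, here is a minimal Python sketch. It is not Nutch's own code. It assumes that rules are tried in order, that the first matching rule decides, and that a URL matching no rule is rejected. The RULES list and the accept() helper are hypothetical names made up for this sketch, the dots in the host names are escaped (the quoted file leaves them unescaped, so "." there matches any character), and the final "-." catch-all line is an addition that is not in the quoted file.

```python
import re

# Hypothetical rule list modelled on the filter lines quoted above.
# Assumptions: first matching rule wins; no match means the URL is rejected.
# The host-name dots are escaped here, unlike in the quoted file.
RULES = [
    ("+", r"^http://([a-z0-9]*\.)*site1\.com/"),
    ("+", r"^http://([a-z0-9]*\.)*site2\.com/"),
    ("+", r"^http://([a-z0-9]*\.)*site3\.com/"),
    ("-", r"."),  # catch-all added for this sketch: skip everything else
]


def accept(url, rules=RULES):
    """Return True if the first rule that matches `url` is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched at all


if __name__ == "__main__":
    for url in ("http://www.site1.com/page.html",
                "http://news.site2.com/",
                "http://www.other.com/"):
        print(url, accept(url))
```

Running candidate URLs through a small check like this before a crawl can show whether a pattern is looser than intended, for example whether an unescaped dot lets an unrelated host slip through.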
