RE: Architecture for parallel crawling

Chirag Chaman Wed, 01 Jun 2005 09:39:51 -0700

Jon,

First, we need to get rid of this thought
>> "I didn't realize that I was stupid until I got to know Nutch"


Keep a positive view. This is not easy software to learn in a week or so.

Now,

1. Threads.

Multi-threaded fetching happens by default. The thread count is specified in
the conf file, and the default values are good enough. I would encourage you
to read through the nutch-default.xml file, as that will give you an overview
of all the settings available in Nutch.
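
If you want a quick way to see which thread-related settings your conf file
defines, here is a minimal Python sketch. It assumes the conf file is a flat
list of <property> elements with <name> and <value> children. The sample
content, including the property names and values, is only illustrative and
should be checked against your own nutch-default.xml; find_properties is a
hypothetical helper, not part of Nutch.

```python
import xml.etree.ElementTree as ET


def find_properties(xml_text, keyword):
    """Return {name: value} for every <property> whose name contains keyword."""
    root = ET.fromstring(xml_text)
    found = {}
    # iter() walks the whole tree, so the name of the root element does not matter.
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if keyword in name:
            found[name] = value
    return found


# Illustrative sample only; names and values here are assumptions.
SAMPLE_CONF = """<nutch-conf>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>ExampleAgent</value>
  </property>
</nutch-conf>"""

if __name__ == "__main__":
    for name, value in find_properties(SAMPLE_CONF, "threads").items():
        print(name, "=", value)
```

To run it against a real installation, read the text of your own conf file
(for example with open(path).read()) and pass it in place of SAMPLE_CONF.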

2. Don't follow external links.

Check whether you are using the latest version of Nutch. The older version
had a bug where links would get added to the DB without being filtered. This
has since been fixed. I would also urge you to apply Andrzej's fetcher patch.
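
To sanity-check the filter rules quoted in your message outside of Nutch, a
rough Python approximation can help. It assumes the rules are evaluated top
to bottom, that the first matching pattern decides (+ accepts, - rejects),
and that a URL matching no pattern is rejected. Python's re module is not the
regex engine Nutch uses, so treat this as a sketch, not as Nutch's actual
behaviour; parse_rules and accepts are hypothetical helpers.

```python
import re

# The three rules from the original message, copied unchanged.
RULES_TEXT = r"""
+^http://([a-z0-9]*\.)*site1.com/
+^http://([a-z0-9]*\.)*site2.com/
+^http://([a-z0-9]*\.)*site3.com/
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (allow, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule wins; a URL that matches nothing is rejected."""
    for allow, regex in rules:
        if regex.search(url):
            return allow
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES_TEXT)
    for url in ("http://www.site1.com/news/", "http://www.other.com/"):
        print(url, accepts(rules, url))
```

One detail this makes visible: the dot in site1.com is unescaped, so it
matches any character. Writing site1\.com is stricter.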

For starters, I would recommend not following links at all and seeing whether
you can get your initial URL list indexed (all of them, to figure out what
could be causing some of the 20 sites not to be indexed), then adding links
back. Also take a look at http://www.siteXX.com/robots.txt manually for each
site that is not being indexed, to check whether you are being blocked there.
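
The robots.txt check can also be scripted. The sketch below uses Python's
standard urllib.robotparser on robots.txt text you have already downloaded.
The sample rules and the agent name "ExampleCrawler" are made up for
illustration; substitute the agent name your crawler actually sends.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for url in ("http://www.site1.com/index.html",
                "http://www.site1.com/private/page.html"):
        print(url, is_allowed(SAMPLE_ROBOTS, "ExampleCrawler", url))
```

To test a live site instead, RobotFileParser also offers set_url() and
read(), which download the file before can_fetch() is called.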

CC-

-----Original Message-----
From: J B [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, June 01, 2005 11:59 AM
To: [email protected]
Subject: Architecture for parallel crawling

Hello,

Forgive me for my dumb questions, but I couldn't find any guidance in the
other postings.

I want to crawl about 20 pre-defined (larger) sites, once a day, preferably
in parallel to save time (threads?). Only the pages on those sites should
be crawled and not links pointing to other sites. When querying the indexed
material, all 20 sources should be searched in the same query. The urls-file
looks like this:

http://www.site1.com/
http://www.site2.com/
http://www.site3.com/
etc...

The file crawl-urlsfilter.txt looks like this:

+^http://([a-z0-9]*\.)*site1.com/
+^http://([a-z0-9]*\.)*site2.com/
+^http://([a-z0-9]*\.)*site3.com/
etc...

I have tried several different approaches and configurations of these two
files, but I never get the desired result. There's always just one crawling
process, and it never gets all 20 sites. Moreover, it follows external links
to other sites...

Given the above, what "Nutch-architecture" should I use?

Best regards,

Jon

"I didn't realize that I was stupid until I got to know Nutch"
