Hi Paul,

If you use the whole-web crawling concept, it goes like this:

1. Edit your regex-urlfilter.txt so it looks similar to this one (a quick way to test it by hand follows the listing):
###################################################
# The default url filter.
# Better for whole-internet crawling.
###################################################
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$

# skip URLs containing certain characters as probable queries, etc.
# Note: I commented out this regex since I want to crawl URLs with queries
# -[?*!@=]

# this regex makes sure you get only pages from this site
+^http://www.cnn.com/

# skip everything else - make sure not to get any other sites pages
-.

########## end regex-urlfilter.txt ########################
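
To sanity-check the filter by hand before crawling, a throwaway script like the one below (call it checkurl.sh -- a hypothetical name, and not a Nutch tool, just the same first-match-wins logic redone with grep) shows which rule a given URL hits. Nutch applies these as Java/Perl5 regexes, but for the patterns above POSIX grep -E behaves the same:

#!/bin/sh
# usage: sh checkurl.sh http://www.cnn.com/WORLD/
url="$1"
while IFS= read -r line; do
  case "$line" in
    ''|'#'*) continue ;;                  # skip blanks and comments
  esac
  sign=`printf '%s' "$line" | cut -c1`    # '+' = include, '-' = exclude
  pattern=`printf '%s' "$line" | cut -c2-`
  if printf '%s\n' "$url" | grep -Eq "$pattern"; then
    echo "first matching rule: $sign$pattern"
    exit 0
  fi
done < regex-urlfilter.txt
echo "no rule matched: URL is ignored"

For http://www.cnn.com/WORLD/ it should report the +^http://www.cnn.com/ rule; any URL on another host falls through to the final -. rule.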

2. When all the settings are in place, run inject with the site URL and start a sequence of fetch/update/index rounds.

Every round will fetch more pages.
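
For reference, the sequence looks roughly like this on the command line. I am going from the 0.7-era whole-web tutorial here, so treat the exact command names as assumptions and run bin/nutch with no arguments to confirm them for your version:

bin/nutch admin db -create           # create an empty web db (first run only)
bin/nutch inject db -urlfile urls    # seed it with your list of URLs

# one fetch round; repeat this block to go one "level" deeper each time
bin/nutch generate db segments
s=`ls -d segments/2* | tail -1`      # the segment generate just created
bin/nutch fetch $s
bin/nutch updatedb db $s

bin/nutch index $s                   # index the freshly fetched segment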

Enjoy,

Gal

PS: For the initial crawl I used the intranet crawl concept, and then I injected my urls file and added the sites' rules to my regex-urlfilter.txt.
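
(For reference, the intranet crawl is the one-shot command from the tutorial; -depth is the number of fetch rounds, and crawl.cnn is just an example output directory:

bin/nutch crawl urls -dir crawl.cnn -depth 3
)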


Paul Williams wrote:
Michael,

Thanks for the reply.  I guess what I'm really asking is: how do I
crawl more than just the home page of a site?  Looking at
nutch-default.xml there is a property named db.ignore.internal.links,
so do I just set it to false and get more in-depth crawling?

Thanks for any advice.
Paul.

-----Original Message-----
From: Michael Nebel [mailto:[EMAIL PROTECTED]]
Sent: 14 September 2005 10:05
To: [email protected]
Subject: Re: Whole web search depth

Hi Paul,

just call the "generate - fetch - updatedb" loop as often as you want.
:-)

Perhaps "depth" is the wrong name for the parameter and causes the confusion. Depth does not mean that the crawler follows one link down to a depth of x and then takes the next link. Depth means the number of times the "generate - fetch - updatedb" loop is run. Just take a look at the output of the crawl. The result of running the loop x times is (should be) the same as if you had followed each link to a depth of x!
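
In shell terms (a sketch, again assuming the 0.7-era command names), a depth of 5 is nothing more than:

for i in 1 2 3 4 5; do          # "depth 5" = five rounds of this loop
  bin/nutch generate db segments
  s=`ls -d segments/2* | tail -1`
  bin/nutch fetch $s
  bin/nutch updatedb db $s
done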

Regards

        Michael

Paul Williams wrote:

Hi,

I'm fairly new to using Nutch, so this is probably a newbie question
(I've already looked in the mailing lists and can't see an answer).

I'm trying to do a web search (limited to around 10 sites at the
moment) but I'm unsure how to set the depth of searching.  How is this
done?

Cheers.




