Hi Paul,
in the first iteration of the crawl, you get the start pages. While
parsing each document (done automatically by "nutch fetch"), the outlinks
are identified. An outlink is every link, including internal links.
With "updatedb", they are injected into the webdb, and "nutch generate"
adds them to the partition. Perhaps you can check with "nutch fetchlist".
I'm pretty sure about this part, because I see more than the start pages
:-) I don't know what "db.ignore.internal.links" is for. I would guess
it's used in the context of the analyze step.
The more interesting parameter for you should be
"db.max.outlinks.per.page", because it limits the number of outlinks
that are used from each page.
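As a rough illustration of what such a cap does, here is a toy Python
sketch. It is not Nutch's implementation: "limit_outlinks" and the example
links are invented, and I am assuming that the first links found on a page
are the ones that are kept.

```python
def limit_outlinks(outlinks, max_outlinks):
    """Keep at most `max_outlinks` links from one page.

    Assumption for this sketch: the first links found are the ones kept.
    """
    return outlinks[:max_outlinks]


# Made-up outlinks of a single page, in the order they were parsed.
page_links = ["/a", "/b", "/c", "/d", "/e"]
print(limit_outlinks(page_links, 3))
```

If a page has more links than the limit, the links beyond the cap may
never reach the fetchlist from that page.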
Regards
Michael
Paul Williams wrote:
Michael,
Thanks for the reply. I guess what I'm really asking is: how do I
crawl more than just the home page of a site? Looking at
nutch-default.xml, there is a property named db.ignore.internal.links, so
do I just set it to false to get more in-depth searching?
Thanks for any advice.
Paul.
-----Original Message-----
From: Michael Nebel [mailto:[EMAIL PROTECTED]
Sent: 14 September 2005 10:05
To: [email protected]
Subject: Re: Whole web search depth
Hi Paul,
just call the "generate - fetch - updatedb" loop as often as you want.
:-)
Perhaps "depth" is the wrong name for the parameter and causes the
confusion. Depth does not mean that the crawler follows one link to a
depth of x and then takes the next link. Depth means the number of
times the "generate - fetch - updatedb" loop is run. Just take a look
at the output of the crawl. The result of running the loop is (should be)
the same as if you had followed each link to a depth of x!
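To make this concrete, here is a toy Python sketch of the loop. It is not
Nutch code: the link graph, the "crawl" function and the page names are
invented for illustration, and the real webdb and fetchlist handling are
more involved.

```python
# Toy model of the "generate - fetch - updatedb" loop (not Nutch code).
# LINKS is a made-up link graph: page -> outlinks found while parsing it.
LINKS = {
    "home": ["about", "products"],
    "about": ["team"],
    "products": ["item1", "item2"],
    "team": [],
    "item1": ["item1-details"],
    "item2": [],
    "item1-details": [],
}


def crawl(start_pages, depth):
    """Run the loop `depth` times and return the pages fetched per round."""
    webdb = set(start_pages)  # every URL known so far
    fetched = set()           # URLs that have already been fetched
    rounds = []
    for _ in range(depth):
        # generate: the fetchlist holds known URLs that are not fetched yet
        fetchlist = sorted(webdb - fetched)
        # fetch: "download" and parse each page, collecting its outlinks
        outlinks = [link for page in fetchlist for link in LINKS.get(page, [])]
        fetched.update(fetchlist)
        # updatedb: add the newly found links to the webdb
        webdb.update(outlinks)
        rounds.append(fetchlist)
    return rounds


for number, pages in enumerate(crawl(["home"], depth=3), start=1):
    print(number, pages)
```

In this sketch the first round fetches only "home", the second round
fetches the pages linked from it, and the third round fetches the pages
two links away. Each extra round goes one link deeper for all pages at
once.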
Regards
Michael
Paul Williams wrote:
Hi,
I'm fairly new to using Nutch, so this is probably a newbie question
(I've already looked in the mailing lists and can't see an answer).
I'm trying to do a web search (limited to around 10 sites at the moment),
but I'm unsure how to set the depth of searching. How is this done?
Cheers.
--
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/