Take a look at the Crawl-delay setting in the robots.txt file of the
website you are attempting to fetch. It may be what is slowing you down.
There is a setting, fetcher.max.crawl.delay, in your nutch-*.xml file that
controls this behavior. The default is 30 seconds, meaning Nutch will
skip pages whose robots.txt crawl delay is over 30 seconds. In robots.txt
the delay is given in seconds (internally Nutch works in milliseconds, so
30 seconds corresponds to 30000 ms). If that website sets a crawl delay of,
say, 20 seconds, Nutch will wait 20 seconds between each page request. If
this is the case and the site has, say, 10,000 pages, the fetch would take
around 2.3 days.
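
Roughly, the relevant pieces look like this (the Crawl-delay value and page
count are just the example numbers from above; the property override would
go in your nutch-site.xml):

  # robots.txt on the remote site -- Crawl-delay is given in seconds
  User-agent: *
  Crawl-delay: 20

  <!-- nutch-site.xml: skip pages whose robots.txt delay exceeds this (seconds) -->
  <property>
    <name>fetcher.max.crawl.delay</name>
    <value>30</value>
  </property>

  # rough fetch-time estimate for the example above:
  # 10,000 pages x 20 s/page = 200,000 s, or about 2.3 days
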
Dennis Kubes
cesar voulgaris wrote:
OK, thanks
On 2/13/07, cesar voulgaris <[EMAIL PROTECTED]> wrote:
hi, maybe someone who has had the same problem can help me:
I started a crawl, and at a certain depth the fetcher logs the urls
apparently correctly, but it has been running for two days!! It seems to
be fetching the same site (a big one, but not that big). What disturbs me is
that the segment directory is always the same size
(du -hs segmentdir); it only has crawl_generate as a subdir. Does nutch
have
a temporary dir where it stores the fetches until it
writes the other subdirs?... maybe it is hung up? It happened two times in
different crawls (I did several crawls, not too common)