There are a few different things that could be causing this.

One, there is a property called generate.max.per.host in the nutch-default.xml file. If this is set to a positive value instead of -1, it will limit the number of URLs generated for any single host, which matters here because all of your URLs are on one host.
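
As a rough sketch, an override for this would go in conf/nutch-site.xml and look something like the following. The description text is my own wording, not copied from nutch-default.xml, so check the shipped file for the exact semantics in your version.

```xml
<configuration>
  <property>
    <name>generate.max.per.host</name>
    <!-- -1 means no per-host limit when generating a fetch list -->
    <value>-1</value>
    <description>Illustrative override: do not cap the number of URLs
    generated for a single host.</description>
  </property>
</configuration>
```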

Two, have you set http.agent.name? If you hadn't, though, it probably wouldn't have fetched anything at all. The job would complete but the output would be 0, so this is unlikely to be your problem.
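
If it does turn out to be unset, a minimal entry in conf/nutch-site.xml might look like this. The agent name "MyTestCrawler" is only a placeholder I made up; use whatever identifies your crawler.

```xml
<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- Placeholder value, assumed for illustration only -->
    <value>MyTestCrawler</value>
  </property>
</configuration>
```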

Three, you could be maxing out your bandwidth, so that only about 1/10th of the URLs actually get through before timing out, or the site could be blocking most of the URLs you are trying to fetch through robots.txt. Look at the JobTracker admin screen for the fetch job and see how many errors are in each fetch task.

It could also be a url-filter problem with a bad regex filter.

My guess, from the info you have given, is that you are maxing out your bandwidth. That would cause the number fetched to fluctuate somewhat from run to run but stay in about the same range. What is your bandwidth for fetching, and what do you have mapred.map.tasks and fetcher.threads.fetch set to?

Dennis Kubes

John Mendenhall wrote:
Hello,

I am running nutch 0.9 currently.
I am running on 4 nodes, one is the
master, in addition to being a slave.

I have injected 100k urls into nutch.
All urls are on the same host.

I am running a generate/fetch/update
cycle with topN set at 100k.

However, after each cycle, it only
fetches between 2588 and 2914 urls
each time.  I have run this over 8
times, all with the same result.

I have tried using nutch fetch and
nutch fetch2.

My hypothesis is, this is due to all
urls being on same host (www.example.com/some/path).

Do I need to set the fetcher.threads.per.host
to something higher than the default of 2?

The fetcher.threads.per.host variable is just the number of threads (fetchers) that can fetch from a single host at a given time. If you own or run the domain, it is okay to crawl it faster. If not, the default politeness settings are best, so as not to overwhelm the server you are crawling.
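
If you do control the host and want to raise it, a sketch of the override in conf/nutch-site.xml would be something like the following. The value 4 is just an example I picked, not a recommendation.

```xml
<configuration>
  <property>
    <name>fetcher.threads.per.host</name>
    <!-- Example value only; raise this just for hosts you own or run -->
    <value>4</value>
  </property>
</configuration>
```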


Is there something in the logs I should
look for to determine the exact cause of
this problem?

Thank you in advance for any assistance
that can be provided.

If you need any additional information,
please let me know and I'll send it.

Thanks!

JohnM
