There are a few different things that could be causing this.
One, there is a property called generate.max.per.host in the
nutch-default.xml file. If this is set to a value other than -1, it
will limit the number of urls generated from that host.
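For example, you could override it in nutch-site.xml; -1 disables the per-host cap (the snippet below is only an illustrative sketch, the property name is from nutch-default.xml):

```xml
<!-- nutch-site.xml override (illustrative); -1 = no per-host limit -->
<property>
  <name>generate.max.per.host</name>
  <value>-1</value>
</property>
```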
Two, have you set the http.agent.name? If you haven't, it probably
wouldn't have fetched anything at all; the job would complete but the
output would be 0.
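Setting it is just one property in nutch-site.xml (the agent name "MyCrawler" below is only a placeholder, use your own):

```xml
<!-- nutch-site.xml (illustrative); an empty agent name means nothing fetches -->
<property>
  <name>http.agent.name</name>
  <value>MyCrawler</value>
</property>
```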
Three, you could be maxing out your bandwidth, so that only 1/10th of
the urls actually get through before timeout, or the site could be
blocking most of the urls you are trying to fetch via robots.txt. Look
at the JobTracker admin screen for the fetch job and see how many
errors are in each fetch task.
It could also be a url-filter problem, such as a bad regex in one of
the url-filter files.
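As a sanity check on the filters, it may be worth eyeballing conf/regex-urlfilter.txt (or crawl-urlfilter.txt if you use the crawl command); the accept rule below is only an illustrative sketch for a single-host crawl like yours:

```
# accept urls on the target host (illustrative; adjust to your site)
+^http://www\.example\.com/
# reject everything else
-.
```

A too-narrow `+` pattern, or an overly greedy `-` pattern above it, will silently drop most of your urls at generate time.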
My guess, from the info you have given, is that you are maxing out
your bandwidth. That would cause the number fetched to fluctuate a
little but stay about the same. What is your bandwidth for fetching,
and what do you have mapred.map.tasks and fetcher.threads.fetch set to?
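Both are set in nutch-site.xml (or hadoop-site.xml for the mapred one); the values below are purely illustrative and should be tuned to your bandwidth and cluster size:

```xml
<!-- nutch-site.xml sketch; illustrative values only -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>   <!-- total fetcher threads per fetch task -->
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>4</value>    <!-- map tasks across the 4-node cluster -->
</property>
```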
Dennis Kubes
John Mendenhall wrote:
Hello,
I am running nutch 0.9 currently.
I am running on 4 nodes, one is the
master, in addition to being a slave.
I have injected 100k urls into nutch.
All urls are on the same host.
I am running a generate/fetch/update
cycle with topN set at 100k.
However, after each cycle, it only
fetches between 2588 and 2914 urls
each time. I have run this over 8
times, all with the same result.
I have tried using nutch fetch and
nutch fetch2.
My hypothesis is that this is due to all
urls being on the same host (www.example.com/some/path).
Do I need to set the fetcher.threads.per.host
to something higher than the default of 2?
The fetcher.threads.per.host variable is just the number of threads
(fetchers) that can fetch from a single host at a given time. If you
own/run the domain it is okay to crawl it faster; if not, the default
politeness settings are best, so as not to overwhelm the server you
are crawling.
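If you do control the host, raising it is again a nutch-site.xml override (the value 10 below is only illustrative; keep the default on hosts you don't run):

```xml
<!-- nutch-site.xml sketch; only raise this on a host you own/run -->
<property>
  <name>fetcher.threads.per.host</name>
  <value>10</value>
</property>
```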
Is there something in the logs I should
look for to determine the exact cause of
this problem?
Thank you in advance for any assistance
that can be provided.
If you need any additional information,
please let me know and I'll send it.
Thanks!
JohnM