John Mendenhall wrote:
On Sat, 19 Jan 2008, Dennis Kubes wrote:
There are a few different things that could be causing this.
Thanks for the response!
One, there is a variable called generate.max.per.host in the
nutch-default.xml file. If this is set to a value other than -1, it
will limit the number of urls generated from that host.
Variable generate.max.per.host is set to -1.
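That is, the property stanza looks like this (a sketch of the XML as
it appears in our config; description text omitted):

  <property>
    <name>generate.max.per.host</name>
    <value>-1</value> <!-- -1 means no per-host limit on generated urls -->
  </property>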
Two, have you set the http.agent.name? If you didn't, it probably
wouldn't have fetched anything at all. The job would complete but the
output would be 0.
Variable http.agent.name is set. Nutch definitely
fetches documents. No problem there.
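For the record, it looks roughly like this (a sketch; the agent name
below is a placeholder, not our actual value):

  <property>
    <name>http.agent.name</name>
    <value>my-crawler</value> <!-- placeholder agent name -->
  </property>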
Three, you could be maxing out your bandwidth, so that only 1/10th of
the urls actually get through before timing out, or the site could be
blocking most of the urls you are trying to fetch through robots.txt.
Look at the JobTracker admin screen for the fetch job and see how many
errors there are in each fetch task.
We work with the site, and robots.txt is allowing us
through. It is definitely getting different pages
each time. We have 100000 urls in the crawldb.
It is only getting about 3% new pages each generate-
fetch-update cycle.
The most recent completed run had 97 map tasks and
17 reduce tasks, all completed fine, with 0 failures.
Check the number of errors in the fetcher tasks themselves. I
understand the tasks will complete, but the fetcher screen should show
the number of fetch errors. My guess is that this number is high.
Dennis
It could also be a url-filter problem with a bad regex filter.
I doubt this is the problem. Each cycle still lets new
urls in; the number just seems capped for each run.
My guess, from the info you have given, is that you are maxing out your
bandwidth. This would cause the number fetched to fluctuate a bit but
stay about the same. What is your bandwidth for fetching, and what do
you have mapred.map.tasks and fetcher.threads.fetch set to?
I will have to check on the bandwidth available
for fetching.
Variable mapred.map.tasks is set to 97.
Variable mapred.reduce.tasks is set to 17.
Variable fetcher.threads.fetch is set to 10.
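In property form that is roughly (a sketch; the exact config file
these live in may vary):

  <property>
    <name>mapred.map.tasks</name>
    <value>97</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>17</value>
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value>
  </property>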
Thanks again for any pointers you can provide.
JohnM
John Mendenhall wrote:
Hello,
I am running nutch 0.9 currently.
I am running on 4 nodes; one is the
master, in addition to being a slave.
I have injected 100k urls into nutch.
All urls are on the same host.
I am running a generate/fetch/update
cycle with topN set at 100k.
However, each cycle only fetches
between 2588 and 2914 urls. I have
run this over 8 times, all with the
same result.
I have tried using nutch fetch and
nutch fetch2.
My hypothesis is that this is due to all
urls being on the same host (www.example.com/some/path).
Do I need to set fetcher.threads.per.host
to something higher than the default of 2?
The fetcher.threads.per.host variable is just the number of threads
(fetchers) that can fetch from a single host at a given time. If you
own/run the domain, it is okay to crawl it faster; if not, the default
politeness settings are best, so as not to overwhelm the server you are
crawling.
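If you do run the host and want to crawl it harder, something like the
following in nutch-site.xml would raise it (a sketch; the value here is
only illustrative):

  <property>
    <name>fetcher.threads.per.host</name>
    <value>10</value> <!-- illustrative; the shipped default is 2 -->
  </property>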
Is there something in the logs I should
look for to determine the exact cause of
this problem?
Thank you in advance for any assistance
that can be provided.
If you need any additional information,
please let me know and I'll send it.
Thanks!
JohnM