John Mendenhall wrote:
On Sat, 19 Jan 2008, Dennis Kubes wrote:
There are a few different things that could be causing this.
Thanks for the response!
One, there is a variable called generate.max.per.host in the
nutch-default.xml file. If this is set to a value other than -1, it
will limit the number of urls generated from that host.
Variable generate.max.per.host is set to -1.
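That is, the property stanza looks like this (a sketch of the XML as
it appears in our config; description text omitted):

  <property>
    <name>generate.max.per.host</name>
    <value>-1</value> <!-- -1 means no per-host limit on generated urls -->
  </property>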
Two, have you set the http.agent.name? If you didn't, it probably
wouldn't have fetched anything at all. The job would complete but the
output would be 0.
Variable http.agent.name is set. Nutch definitely
fetches documents. No problem there.
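For the record, it looks roughly like this (a sketch; the agent name
below is a placeholder, not our actual value):

  <property>
    <name>http.agent.name</name>
    <value>my-crawler</value> <!-- placeholder agent name -->
  </property>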
Three, you could be maxing out your bandwidth, so that only 1/10th of
the urls actually get through before timing out, or the site could be
blocking most of the urls you are trying to fetch through robots.txt.
Look at the JobTracker admin screen for the fetch job and see how many
errors there are in each fetch task.
We work with the site, and robots.txt is allowing us
through. It is definitely getting different pages
each time. We have 100000 urls in the crawldb.
It is only getting about 3% new pages each generate-
fetch-update cycle.
The most recent completed run had 97 map tasks and
17 reduce tasks, all completed fine, with 0 failures.
Check the number of errors in the fetcher tasks themselves. I
understand the tasks will complete, but the fetcher screen should show
the number of fetch errors. My guess is that this number is high.
Dennis
It could also be a url-filter problem with a bad regex filter.
I doubt this is the problem. Each cycle still lets new
urls in; the number just seems capped for each run.
My guess, from the info you have given, is that you are maxing out your
bandwidth. This would cause the number fetched to fluctuate a bit but
stay about the same. What is your bandwidth for fetching, and what do
you have mapred.map.tasks and fetcher.threads.fetch set to?
I will have to check on the bandwidth available
for fetching.
Variable mapred.map.tasks is set to 97.
Variable mapred.reduce.tasks is set to 17.
Variable fetcher.threads.fetch is set to 10.
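In property form that is roughly (a sketch; the exact config file
these live in may vary):

  <property>
    <name>mapred.map.tasks</name>
    <value>97</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>17</value>
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value>
  </property>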
Thanks again for any pointers you can provide.
JohnM
John Mendenhall wrote:
Hello,
I am running nutch 0.9 currently.
I am running on 4 nodes; one is the
master, in addition to being a slave.
I have injected 100k urls into nutch.
All urls are on the same host.
I am running a generate/fetch/update
cycle with topN set at 100k.
However, each cycle only fetches
between 2588 and 2914 urls. I have
run this over 8 times, all with the
same result.
I have tried using nutch fetch and
nutch fetch2.
My hypothesis is that this is due to all
urls being on the same host (www.example.com/some/path).
Do I need to set fetcher.threads.per.host
to something higher than the default of 2?
The fetcher.threads.per.host variable is just the number of threads
(fetchers) that can fetch from a single host at a given time. If you
own/run the domain, it is okay to crawl it faster; if not, the default
politeness settings are best, so as not to overwhelm the server you are
crawling.
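If you do run the host and want to crawl it harder, something like the
following in nutch-site.xml would raise it (a sketch; the value here is
only illustrative):

  <property>
    <name>fetcher.threads.per.host</name>
    <value>10</value> <!-- illustrative; the shipped default is 2 -->
  </property>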
Is there something in the logs I should
look for to determine the exact cause of
this problem?
Thank you in advance for any assistance
that can be provided.
If you need any additional information,
please let me know and I'll send it.
Thanks!
JohnM