Hi,

as I expected, the error was sitting in front of my computer. :-(

I changed http.agent.name and added the new name to http.robots.agents. So far so good, but my mistake was that I did not put the new name in the first position. What finally tipped me off was the SEVERE error in the tasktracker log. After fixing this problem, everything works really fine!
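
In case someone else runs into this, the relevant part of my config (nutch-site.xml) now looks roughly like the sketch below - "MyCrawler" is only a placeholder for the real agent name:

  <property>
    <name>http.agent.name</name>
    <value>MyCrawler</value>
  </property>
  <property>
    <name>http.robots.agents</name>
    <!-- the agent name has to come first in this list, otherwise the
         fetcher logs the SEVERE error mentioned above -->
    <value>MyCrawler,*</value>
  </property>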

Lesson learned: if the developers log a SEVERE error, don't ignore it - fix it!

Regards

        Michael



Gal Nitzan wrote:

Hi Michael,

this question should be asked on the nutch-users list.

Take a look at the thread "So many Unfetched Pages using MapReduce".

G.

On Tue, 2006-01-31 at 15:52 +0100, Michael Nebel wrote:

Hi,

over the last few days I gave the mapred branch a try, and I was impressed!

But I still have a problem with incremental crawling. My setup: 4 boxes (1x namenode/jobtracker, 3x datanode/tasktracker). One round of "crawling" consists of the following steps:

- generate (I set a limit of "-topN 10000000")
- fetch
- update
- index
- invertlinks

For the first round, I injected a list of about 20,000 websites. When running Nutch, I expected the fetcher to be pretty busy and went for a coffee. OK, perhaps someone talked to my wife and decided I should not drink so much coffee, but I think I made a mistake somewhere: after 100 URLs the fetcher stopped working.

After some tweaking I got the installation to fetch about 10,000 pages, but this is still not what I expected. My first guess was the URL filter, but I can see the URLs in the tasktracker log. I searched the mailing list and got many ideas, but I only got more confused.

I think the following parameters influence the number of pages fetched (the values I selected are in brackets; a config excerpt follows after the list):

- mapred.map.tasks                      (100)
- mapred.reduce.tasks                   (3)
- mapred.task.timeout                   (3600000 [another question])
- mapred.tasktracker.tasks.maximum      (10)
- fetcher.threads.fetch                 (100)
- fetcher.server.delay                  (5.0)
- fetcher.threads.per.host              (10)
- generate.max.per.host                 (1000)
- http.content.limit                    (2000000)
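
For completeness, these are set as overrides in my config (nutch-site.xml); a short excerpt as a sketch, with three of the values from above:

  <property>
    <name>mapred.map.tasks</name>
    <value>100</value>
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>100</value>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>1000</value>
  </property>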

I don't like these parameters, but with them I got the most results so far. Looking at the jobtracker, each map task fetched between 70 and 100 pages. With 100 map tasks, that gives me ~8000 newly fetched pages in the end, which is close to the number the crawldb reports too.

Which parameter influences the number of pages ONE task fetches? From my observations, I would guess it's "fetcher.threads.fetch", but increasing this number further just blasts the load onto the tasktrackers. So there must be another problem.

Any help appreciated!

Regards

        Michael





--
Michael Nebel
http://www.nebel.de/
http://www.netluchs.de/


