Thank you for the help. That was exactly what I was looking for. Setting mapred.map.tasks and mapred.reduce.tasks reduced the load on the nodes, and the job now completes successfully.

Alexander
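[For later readers: a minimal hadoop-site.xml sketch of the two settings that resolved this. The values below are illustrative assumptions for a small 2-datanode cluster, not the numbers Alexander actually used.]

    <!-- sketch only: the values here are illustrative assumptions -->
    <property>
      <name>mapred.map.tasks</name>
      <value>4</value>
      <description>Hint for the approximate number of map tasks per job.</description>
    </property>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>2</value>
      <description>Number of reduce tasks per job.</description>
    </property>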
2008/11/26 Alexander Aristov <[EMAIL PROTECTED]>

> Jobs are started one by one, so only one job is running at any given
> moment.
>
> I had already put such params in the nutch-site.xml file, which is
> obviously the wrong place for them. My hadoop-site.xml file is empty,
> so the default values are used.
>
> I will update my config and re-launch the crawler.
>
> Alexander
>
> 2008/11/26 Dennis Kubes <[EMAIL PROTECTED]>
>
>> Alexander Aristov wrote:
>>
>>> I run Nutch on EC2 small servers; they have about 2 GB RAM. I use
>>> DFS. Yes, I meant tasks, not jobs. I just took the name from the
>>> job tracker web page.
>>
>> Just confirming. If it were starting a bunch of jobs, that would be
>> a much different error :)
>>
>>> Where should I add these params? In nutch-site.xml or
>>> hadoop-site.xml?
>>
>> Those should go in the hadoop-site.xml file.
>>
>>> My logs look like this:
>>>
>>> 08/11/25 02:41:53 INFO mapred.JobClient: map 73% reduce 22%
>>> 08/11/25 02:41:59 INFO mapred.JobClient: map 73% reduce 23%
>>> 08/11/25 02:42:06 INFO mapred.JobClient: map 73% reduce 24%
>>> 08/11/25 02:59:02 INFO mapred.JobClient: Task Id :
>>> attempt_200811250109_0014_m_000001_0, Status : FAILED
>>> Task attempt_200811250109_0014_m_000001_0 failed to report status
>>> for 603 seconds. Killing!
>>> 08/11/25 02:59:06 INFO mapred.JobClient: Task Id :
>>> attempt_200811250109_0014_m_000007_0, Status : FAILED
>>> Task attempt_200811250109_0014_m_000007_0 failed to report status
>>> for 604 seconds. Killing!
>>> 08/11/25 03:01:13 INFO mapred.JobClient: Task Id :
>>> attempt_200811250109_0014_m_000000_1, Status : FAILED
>>> Task attempt_200811250109_0014_m_000000_1 failed to report status
>>> for 604 seconds. Killing!
>>> 08/11/25 03:01:43 INFO mapred.JobClient: Task Id :
>>> attempt_200811250109_0014_m_000001_1, Status : FAILED
>>> Task attempt_200811250109_0014_m_000001_1 failed to report status
>>> for 600 seconds. Killing!
>>>
>>> ....
>>>
>>> Task attempt_200811250109_0014_m_000019_1 failed to report status
>>> for 602 seconds. Killing!
>>> 08/11/25 03:37:51 INFO mapred.JobClient: Task Id :
>>> attempt_200811250109_0014_m_000021_1, Status : FAILED
>>> Task attempt_200811250109_0014_m_000021_1 failed to report status
>>> for 600 seconds. Killing!
>>> java.io.IOException: Job failed!
>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1113)
>>>     at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:622)
>>>     at org.apache.nutch.segment.SegmentMerger.main(SegmentMerger.java:667)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>>>     at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>     at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>>
>> What is probably happening is that the servers are getting
>> overloaded, swapping too much, and not contacting the main namenode /
>> jobtracker in time. What number of tasks do you currently have? I
>> think the default is a maximum of 2.
>>
>> Dennis
>>
>>> Alexander
>>>
>>> 2008/11/26 Dennis Kubes <[EMAIL PROTECTED]>
>>>
>>>> The mapred.map.tasks and mapred.reduce.tasks settings define the
>>>> approximate number of tasks per job; the actual number also depends
>>>> heavily on the amount of data being processed. The
>>>> mapred.tasktracker.map.tasks.maximum and
>>>> mapred.tasktracker.reduce.tasks.maximum settings define the maximum
>>>> number of map and reduce tasks to run on a single tasktracker.
>>>>
>>>> When you say 20 jobs, I am assuming you mean tasks. Also, what type
>>>> of hardware are you running this on, what are your memory settings,
>>>> and are you running in local or DFS mode?
>>>>
>>>> Dennis
>>>>
>>>> Alexander Aristov wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Can someone suggest how to restrict the number of jobs Nutch
>>>>> launches in Hadoop when it starts the segment merger?
>>>>>
>>>>> When I run the generate, fetch, and updatedb tasks, Nutch starts
>>>>> about 6-10 MapReduce jobs (cluster of 2 datanodes); the actual
>>>>> value varies from task to task. But when the script starts merging
>>>>> segments, it launches about 20 jobs and the servers get overloaded
>>>>> and crash. The Nutch settings are mostly the defaults.
>>>>>
>>>>> How can I control the number of jobs?
>>>>>
>>>>> Best Regards,
>>>>> Alexander
>
> --
> Best Regards,
> Alexander Aristov

--
Best Regards,
Alexander Aristov
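[For reference, the per-tasktracker limits Dennis describes above also go in hadoop-site.xml. A minimal sketch, assuming one map slot and one reduce slot per tasktracker on a small EC2 node; the slot counts are illustrative, and the mapred.task.timeout entry is an added assumption, included because its 600000 ms default matches the ~600-second kills in the log.]

    <!-- sketch only: slot counts are illustrative assumptions -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>1</value>
      <description>Max map tasks run at once by a single tasktracker.</description>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>1</value>
      <description>Max reduce tasks run at once by a single tasktracker.</description>
    </property>
    <!-- assumption: raising the report timeout; the 600000 ms default is what
         produced the "failed to report status for 600 seconds. Killing!" lines -->
    <property>
      <name>mapred.task.timeout</name>
      <value>1200000</value>
    </property>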
