Jobs are started one by one, so only one job is running at any given moment. I had already put those params in place, but in the nutch-site.xml file, which is obviously the wrong place for them. My hadoop-site.xml file is empty, so the default values are used.
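Something along these lines is what I plan to put into hadoop-site.xml. Just a sketch - the values are only first guesses for 2 GB nodes, not tested settings:

<?xml version="1.0"?>
<configuration>
  <!-- Cap concurrent tasks per tasktracker; illustrative values. -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
  </property>
  <!-- Approximate task counts per job; illustrative values. -->
  <property>
    <name>mapred.map.tasks</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>2</value>
  </property>
</configuration>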
I will update my config and re-launch the crawler.

Alexander

2008/11/26 Dennis Kubes <[EMAIL PROTECTED]>

> Alexander Aristov wrote:
>
>> I run nutch on EC2 small servers; they have about 2 GB RAM. I use DFS.
>> Yes, I meant tasks, not jobs - I just took the name from the job tracker
>> web page.
>
> Just confirming. If it was starting a bunch of jobs that would be a much
> different error :)
>
>> Where should I add these params? In nutch-site.xml or hadoop-site.xml?
>
> Those should go in the hadoop-site.xml file.
>
>> My logs look like this:
>>
>> 08/11/25 02:41:53 INFO mapred.JobClient: map 73% reduce 22%
>> 08/11/25 02:41:59 INFO mapred.JobClient: map 73% reduce 23%
>> 08/11/25 02:42:06 INFO mapred.JobClient: map 73% reduce 24%
>> 08/11/25 02:59:02 INFO mapred.JobClient: Task Id :
>> attempt_200811250109_0014_m_000001_0, Status : FAILED
>> Task attempt_200811250109_0014_m_000001_0 failed to report status for 603
>> seconds. Killing!
>> 08/11/25 02:59:06 INFO mapred.JobClient: Task Id :
>> attempt_200811250109_0014_m_000007_0, Status : FAILED
>> Task attempt_200811250109_0014_m_000007_0 failed to report status for 604
>> seconds. Killing!
>> 08/11/25 03:01:13 INFO mapred.JobClient: Task Id :
>> attempt_200811250109_0014_m_000000_1, Status : FAILED
>> Task attempt_200811250109_0014_m_000000_1 failed to report status for 604
>> seconds. Killing!
>> 08/11/25 03:01:43 INFO mapred.JobClient: Task Id :
>> attempt_200811250109_0014_m_000001_1, Status : FAILED
>> Task attempt_200811250109_0014_m_000001_1 failed to report status for 600
>> seconds. Killing!
>>
>> ....
>>
>> Task attempt_200811250109_0014_m_000019_1 failed to report status for 602
>> seconds. Killing!
>> 08/11/25 03:37:51 INFO mapred.JobClient: Task Id :
>> attempt_200811250109_0014_m_000021_1, Status : FAILED
>> Task attempt_200811250109_0014_m_000021_1 failed to report status for 600
>> seconds. Killing!
>> java.io.IOException: Job failed!
>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1113)
>>         at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:622)
>>         at org.apache.nutch.segment.SegmentMerger.main(SegmentMerger.java:667)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>>         at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>         at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>
> What is probably happening is that the servers are getting overloaded,
> swapping too much, and not contacting the main namenode / jobtracker in
> time. What number of tasks do you currently have? I think the default is
> max 2.
>
> Dennis
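The 600-second threshold in the kill messages above is Hadoop's task timeout, mapred.task.timeout, which defaults to 600000 ms. If the merge tasks are genuinely slow rather than hung, that timeout could also be raised via a fragment inside the <configuration> element of hadoop-site.xml - a sketch, with an illustrative value:

<property>
  <name>mapred.task.timeout</name>
  <!-- Milliseconds a task may go without reporting progress before it
       is killed; default is 600000 (10 minutes). Value is illustrative. -->
  <value>1800000</value>
</property>

Lowering the per-tasktracker task maximums, as sketched earlier, remains the more direct fix for the overload itself.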
>> Alexander
>>
>> 2008/11/26 Dennis Kubes <[EMAIL PROTECTED]>
>>
>>> The mapred.map.tasks and mapred.reduce.tasks properties define the
>>> approximate number of tasks per job; the actual number also depends
>>> heavily on the amount of data being processed. The
>>> mapred.tasktracker.map.tasks.maximum and
>>> mapred.tasktracker.reduce.tasks.maximum properties define the maximum
>>> number of map and reduce tasks to run concurrently on a single
>>> tasktracker.
>>>
>>> When you say 20 jobs, I am assuming you mean tasks. Also, what type of
>>> hardware are you running this on, what are your memory settings, and
>>> are you running in local or DFS mode?
>>>
>>> Dennis
>>>
>>> Alexander Aristov wrote:
>>>
>>>> Hi all
>>>>
>>>> Can someone suggest how to restrict the number of jobs Nutch launches
>>>> in Hadoop when it starts the segment merger?
>>>>
>>>> When I run the generate, fetch, and updatedb tasks, Nutch starts about
>>>> 6-10 MapReduce jobs (on a cluster of 2 datanodes) - the actual value
>>>> varies from task to task - but when the script starts merging segments
>>>> it launches about 20 jobs, and the servers get overloaded and crash.
>>>> The Nutch settings are mostly the defaults.
>>>>
>>>> How can I control the number of jobs?
>>>>
>>>> Best Regards
>>>> Alexander

--
Best Regards
Alexander Aristov
