Thank you for the help. That was exactly what I was looking for. Setting mapred.map.tasks and mapred.reduce.tasks reduced the load on the nodes, and the job now completes successfully.

Alexander
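[For later readers: a minimal hadoop-site.xml sketch of the two settings that resolved this. The values below are illustrative assumptions for a small 2-datanode cluster, not the numbers Alexander actually used.]

    <!-- sketch only: the values here are illustrative assumptions -->
    <property>
      <name>mapred.map.tasks</name>
      <value>4</value>
      <description>Hint for the approximate number of map tasks per job.</description>
    </property>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>2</value>
      <description>Number of reduce tasks per job.</description>
    </property>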
2008/11/26 Alexander Aristov <[EMAIL PROTECTED]>

> Jobs are started one by one, so only one job is running at any given
> moment.
>
> I had already put such params in the nutch-site.xml file, which is
> obviously the wrong place for them. My hadoop-site.xml file is empty,
> so the default values are used.
>
> I will update my config and re-launch the crawler.
>
> Alexander
>
> 2008/11/26 Dennis Kubes <[EMAIL PROTECTED]>
>
>> Alexander Aristov wrote:
>>
>>> I run Nutch on EC2 small servers; they have about 2 GB RAM. I use
>>> DFS. Yes, I meant tasks, not jobs. I just took the name from the
>>> job tracker web page.
>>
>> Just confirming. If it were starting a bunch of jobs, that would be
>> a much different error :)
>>
>>> Where should I add these params? In nutch-site.xml or
>>> hadoop-site.xml?
>>
>> Those should go in the hadoop-site.xml file.
>>
>>> My logs look like this:
>>>
>>> 08/11/25 02:41:53 INFO mapred.JobClient: map 73% reduce 22%
>>> 08/11/25 02:41:59 INFO mapred.JobClient: map 73% reduce 23%
>>> 08/11/25 02:42:06 INFO mapred.JobClient: map 73% reduce 24%
>>> 08/11/25 02:59:02 INFO mapred.JobClient: Task Id :
>>> attempt_200811250109_0014_m_000001_0, Status : FAILED
>>> Task attempt_200811250109_0014_m_000001_0 failed to report status
>>> for 603 seconds. Killing!
>>> 08/11/25 02:59:06 INFO mapred.JobClient: Task Id :
>>> attempt_200811250109_0014_m_000007_0, Status : FAILED
>>> Task attempt_200811250109_0014_m_000007_0 failed to report status
>>> for 604 seconds. Killing!
>>> 08/11/25 03:01:13 INFO mapred.JobClient: Task Id :
>>> attempt_200811250109_0014_m_000000_1, Status : FAILED
>>> Task attempt_200811250109_0014_m_000000_1 failed to report status
>>> for 604 seconds. Killing!
>>> 08/11/25 03:01:43 INFO mapred.JobClient: Task Id :
>>> attempt_200811250109_0014_m_000001_1, Status : FAILED
>>> Task attempt_200811250109_0014_m_000001_1 failed to report status
>>> for 600 seconds. Killing!
>>>
>>> ....
>>>
>>> Task attempt_200811250109_0014_m_000019_1 failed to report status
>>> for 602 seconds. Killing!
>>> 08/11/25 03:37:51 INFO mapred.JobClient: Task Id :
>>> attempt_200811250109_0014_m_000021_1, Status : FAILED
>>> Task attempt_200811250109_0014_m_000021_1 failed to report status
>>> for 600 seconds. Killing!
>>> java.io.IOException: Job failed!
>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1113)
>>>     at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:622)
>>>     at org.apache.nutch.segment.SegmentMerger.main(SegmentMerger.java:667)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>>     at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>>>     at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>>     at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>>
>> What is probably happening is that the servers are getting
>> overloaded, swapping too much, and not contacting the main namenode /
>> jobtracker in time. What number of tasks do you currently have? I
>> think the default is a maximum of 2.
>>
>> Dennis
>>
>>> Alexander
>>>
>>> 2008/11/26 Dennis Kubes <[EMAIL PROTECTED]>
>>>
>>>> The mapred.map.tasks and mapred.reduce.tasks settings define the
>>>> approximate number of tasks per job; the actual number also depends
>>>> heavily on the amount of data being processed. The
>>>> mapred.tasktracker.map.tasks.maximum and
>>>> mapred.tasktracker.reduce.tasks.maximum settings define the maximum
>>>> number of map and reduce tasks to run on a single tasktracker.
>>>>
>>>> When you say 20 jobs, I am assuming you mean tasks. Also, what type
>>>> of hardware are you running this on, what are your memory settings,
>>>> and are you running in local or DFS mode?
>>>>
>>>> Dennis
>>>>
>>>> Alexander Aristov wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Can someone suggest how to restrict the number of jobs Nutch
>>>>> launches in Hadoop when it starts the segment merger?
>>>>>
>>>>> When I run the generate, fetch, and updatedb tasks, Nutch starts
>>>>> about 6-10 MapReduce jobs (cluster of 2 datanodes); the actual
>>>>> value varies from task to task. But when the script starts merging
>>>>> segments, it launches about 20 jobs and the servers get overloaded
>>>>> and crash. The Nutch settings are mostly the defaults.
>>>>>
>>>>> How can I control the number of jobs?
>>>>>
>>>>> Best Regards,
>>>>> Alexander
>
> --
> Best Regards,
> Alexander Aristov

--
Best Regards,
Alexander Aristov
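[For reference, the per-tasktracker limits Dennis describes above also go in hadoop-site.xml. A minimal sketch, assuming one map slot and one reduce slot per tasktracker on a small EC2 node; the slot counts are illustrative, and the mapred.task.timeout entry is an added assumption, included because its 600000 ms default matches the ~600-second kills in the log.]

    <!-- sketch only: slot counts are illustrative assumptions -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>1</value>
      <description>Max map tasks run at once by a single tasktracker.</description>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>1</value>
      <description>Max reduce tasks run at once by a single tasktracker.</description>
    </property>
    <!-- assumption: raising the report timeout; the 600000 ms default is what
         produced the "failed to report status for 600 seconds. Killing!" lines -->
    <property>
      <name>mapred.task.timeout</name>
      <value>1200000</value>
    </property>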
