Jobs are started one by one, so only one job is running at any given moment. I had already put those params in place, but in the nutch-site.xml file, which is obviously the wrong place for them. My hadoop-site.xml file is empty, so the default values are used.
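Something along these lines is what I plan to put into hadoop-site.xml. Just a sketch - the values are only first guesses for 2 GB nodes, not tested settings:

<?xml version="1.0"?>
<configuration>
  <!-- Cap concurrent tasks per tasktracker; illustrative values. -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
  </property>
  <!-- Approximate task counts per job; illustrative values. -->
  <property>
    <name>mapred.map.tasks</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>2</value>
  </property>
</configuration>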
I will update my config and re-launch the crawler.

Alexander

2008/11/26 Dennis Kubes <[EMAIL PROTECTED]>

> Alexander Aristov wrote:
>
>> I run nutch on EC2 small servers; they have about 2 GB RAM. I use DFS.
>> Yes, I meant tasks, not jobs - I just took the name from the job tracker
>> web page.
>
> Just confirming. If it was starting a bunch of jobs that would be a much
> different error :)
>
>> Where should I add these params? In nutch-site.xml or hadoop-site.xml?
>
> Those should go in the hadoop-site.xml file.
>
>> My logs look like this:
>>
>> 08/11/25 02:41:53 INFO mapred.JobClient: map 73% reduce 22%
>> 08/11/25 02:41:59 INFO mapred.JobClient: map 73% reduce 23%
>> 08/11/25 02:42:06 INFO mapred.JobClient: map 73% reduce 24%
>> 08/11/25 02:59:02 INFO mapred.JobClient: Task Id :
>> attempt_200811250109_0014_m_000001_0, Status : FAILED
>> Task attempt_200811250109_0014_m_000001_0 failed to report status for 603
>> seconds. Killing!
>> 08/11/25 02:59:06 INFO mapred.JobClient: Task Id :
>> attempt_200811250109_0014_m_000007_0, Status : FAILED
>> Task attempt_200811250109_0014_m_000007_0 failed to report status for 604
>> seconds. Killing!
>> 08/11/25 03:01:13 INFO mapred.JobClient: Task Id :
>> attempt_200811250109_0014_m_000000_1, Status : FAILED
>> Task attempt_200811250109_0014_m_000000_1 failed to report status for 604
>> seconds. Killing!
>> 08/11/25 03:01:43 INFO mapred.JobClient: Task Id :
>> attempt_200811250109_0014_m_000001_1, Status : FAILED
>> Task attempt_200811250109_0014_m_000001_1 failed to report status for 600
>> seconds. Killing!
>>
>> ....
>>
>> Task attempt_200811250109_0014_m_000019_1 failed to report status for 602
>> seconds. Killing!
>> 08/11/25 03:37:51 INFO mapred.JobClient: Task Id :
>> attempt_200811250109_0014_m_000021_1, Status : FAILED
>> Task attempt_200811250109_0014_m_000021_1 failed to report status for 600
>> seconds. Killing!
>> java.io.IOException: Job failed!
>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1113)
>>         at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:622)
>>         at org.apache.nutch.segment.SegmentMerger.main(SegmentMerger.java:667)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>>         at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>>         at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>
> What is probably happening is that the servers are getting overloaded,
> swapping too much, and not contacting the main namenode / jobtracker in
> time. What number of tasks do you currently have? I think the default is
> max 2.
>
> Dennis
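The 600-second threshold in the kill messages above is Hadoop's task timeout, mapred.task.timeout, which defaults to 600000 ms. If the merge tasks are genuinely slow rather than hung, that timeout could also be raised via a fragment inside the <configuration> element of hadoop-site.xml - a sketch, with an illustrative value:

<property>
  <name>mapred.task.timeout</name>
  <!-- Milliseconds a task may go without reporting progress before it
       is killed; default is 600000 (10 minutes). Value is illustrative. -->
  <value>1800000</value>
</property>

Lowering the per-tasktracker task maximums, as sketched earlier, remains the more direct fix for the overload itself.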
>> Alexander
>>
>> 2008/11/26 Dennis Kubes <[EMAIL PROTECTED]>
>>
>>> The mapred.map.tasks and mapred.reduce.tasks properties define the
>>> approximate number of tasks per job; the actual number also depends
>>> heavily on the amount of data being processed. The
>>> mapred.tasktracker.map.tasks.maximum and
>>> mapred.tasktracker.reduce.tasks.maximum properties define the maximum
>>> number of map and reduce tasks to run concurrently on a single
>>> tasktracker.
>>>
>>> When you say 20 jobs, I am assuming you mean tasks. Also, what type of
>>> hardware are you running this on, what are your memory settings, and
>>> are you running in local or DFS mode?
>>>
>>> Dennis
>>>
>>> Alexander Aristov wrote:
>>>
>>>> Hi all
>>>>
>>>> Can someone suggest how to restrict the number of jobs Nutch launches
>>>> in Hadoop when it starts the segment merger?
>>>>
>>>> When I run the generate, fetch, and updatedb tasks, Nutch starts about
>>>> 6-10 MapReduce jobs (on a cluster of 2 datanodes) - the actual value
>>>> varies from task to task - but when the script starts merging segments
>>>> it launches about 20 jobs, and the servers get overloaded and crash.
>>>> The Nutch settings are mostly the defaults.
>>>>
>>>> How can I control the number of jobs?
>>>>
>>>> Best Regards
>>>> Alexander

--
Best Regards
Alexander Aristov
