Hi Brian,

I tried the configuration changes you suggested, but they did not work for
me. (I am beginning to get the feeling that making a node function as both
master and slave is a bad idea!)

Could you run this experiment for me on your cluster?

Config: a 2-node cluster.
  Node 1: acts as both master and slave.
  Node 2: acts as slave only.
Input: a file of ~5 MB.

Run the wordcount example program with the command:

  bin/hadoop jar hadoop-0.12.0-examples.jar wordcount -m 4 input output
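To be concrete, here is the conf layout I have in mind for this experiment
(a sketch only -- the IP addresses are the ones from my own setup, so
please substitute yours; per your earlier advice, both nodes would get
identical copies of these files):

  conf/master (on both nodes):
    192.168.1.150

  conf/slaves (on both nodes):
    192.168.1.150
    192.168.1.201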
I really appreciate your help. Thanks in advance!

-gaurav


Brian Wedel-2 wrote:
>
> I am experimenting on a small cluster as well (4 machines) and I had
> success with the following configuration:
>
> - the configuration files on both the master and the slaves are the same
> - in the master/slave lists I only used the IP address (not localhost)
>   and omitted the user prefix, e.g. (hadoop@)
> - in the fs.default.name configuration variable use
>   hdfs://<host>:<port> (I don't know if this is necessary -- but it
>   seems you can specify other types of filesystems -- not sure which is
>   the default)
> - use the 0.12.0 release -- I was using 0.11.2 and was getting some odd
>   errors that disappeared when I upgraded
> - I don't run a datanode daemon on the same machine as the namenode --
>   this was a problem when I was trying the hadoop-streaming contributed
>   package for scripting. Not sure if it matters for the examples.
>
> This configuration worked for me.
> -Brian
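Brian, picking up your fs.default.name point inline: this is the form I
plan to try next. A sketch only -- I am assuming the hdfs:// prefix is
accepted with my existing host and port on 0.12.0:

  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.1.150:50000</value>
  </property>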
> On 3/7/07, Gaurav Agarwal <[EMAIL PROTECTED]> wrote:
>>
>> Hi Richard,
>>
>> I am facing this error very consistently. I tried another nightly build
>> (4 Mar) as well, but it gave the same exception.
>>
>> thanks,
>> gaurav
>>
>>
>> Richard Yang-3 wrote:
>> >
>> > Hi Gaurav,
>> >
>> > Does this error always happen?
>> > Our settings are similar.
>> > My logs contain some error messages about IOExceptions: not able to
>> > obtain certain blocks, not able to create a new block. Although the
>> > program hung some of the time, in most cases it completed with
>> > correct results.
>> > Btw, I am running the grep sample program on version 0.11.2.
>> >
>> > Best Regards
>> >
>> > Richard Yang
>> > [EMAIL PROTECTED]
>> > [EMAIL PROTECTED]
>> >
>> >
>> > -----Original Message-----
>> > From: Gaurav Agarwal [mailto:[EMAIL PROTECTED]
>> > Sent: Wednesday, March 07, 2007 12:22 AM
>> > To: [email protected]
>> > Subject: Hadoop 'wordcount' program hanging in the Reduce phase.
>> >
>> >
>> > Hi Everyone!
>> > I am a new user to Hadoop and am trying to set up a small cluster,
>> > but I am facing some issues doing that.
>> >
>> > I am trying to run the Hadoop 'wordcount' example program that comes
>> > bundled with it. I am able to run the program successfully on a
>> > single-node cluster (that is, using my local machine only). But when
>> > I try to run the same program on a cluster of two machines, the
>> > program hangs in the 'reduce' phase.
>> >
>> >
>> > Settings:
>> >
>> > Master Node: 192.168.1.150 (dennis-laptop)
>> > Slave Node: 192.168.1.201 (traal)
>> >
>> > The user account on both Master and Slave is named: hadoop
>> >
>> > Password-less ssh login to the Slave from the Master is working.
>> >
>> > JAVA_HOME is set appropriately in the hadoop-env.sh file on both
>> > Master and Slave.
>> >
>> > MASTER
>> >
>> > 1) conf/slaves
>> > localhost
>> > [EMAIL PROTECTED]
>> >
>> > 2) conf/master
>> > localhost
>> >
>> > 3) conf/hadoop-site.xml
>> > <?xml version="1.0"?>
>> > <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>> >
>> > <!-- Put site-specific property overrides in this file. -->
>> >
>> > <configuration>
>> >   <property>
>> >     <name>fs.default.name</name>
>> >     <value>192.168.1.150:50000</value>
>> >   </property>
>> >
>> >   <property>
>> >     <name>mapred.job.tracker</name>
>> >     <value>192.168.1.150:50001</value>
>> >   </property>
>> >
>> >   <property>
>> >     <name>dfs.replication</name>
>> >     <value>2</value>
>> >   </property>
>> > </configuration>
>> >
>> > SLAVE
>> >
>> > 1) conf/slaves
>> > localhost
>> >
>> > 2) conf/master
>> > [EMAIL PROTECTED]
>> >
>> > 3) conf/hadoop-site.xml
>> > <?xml version="1.0"?>
>> > <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>> >
>> > <!-- Put site-specific property overrides in this file. -->
>> >
>> > <configuration>
>> >   <property>
>> >     <name>fs.default.name</name>
>> >     <value>192.168.1.150:50000</value>
>> >   </property>
>> >
>> >   <property>
>> >     <name>mapred.job.tracker</name>
>> >     <value>192.168.1.150:50001</value>
>> >   </property>
>> >
>> >   <property>
>> >     <name>dfs.replication</name>
>> >     <value>2</value>
>> >   </property>
>> > </configuration>
>> >
>> >
>> > CONSOLE OUTPUT
>> >
>> > bin/hadoop jar hadoop-*-examples.jar wordcount -m 10 -r 2 input output
>> > 07/03/06 23:17:17 INFO mapred.InputFormatBase: Total input paths to process : 1
>> > 07/03/06 23:17:18 INFO mapred.JobClient: Running job: job_0001
>> > 07/03/06 23:17:19 INFO mapred.JobClient: map 0% reduce 0%
>> > 07/03/06 23:17:29 INFO mapred.JobClient: map 20% reduce 0%
>> > 07/03/06 23:17:30 INFO mapred.JobClient: map 40% reduce 0%
>> > 07/03/06 23:17:32 INFO mapred.JobClient: map 80% reduce 0%
>> > 07/03/06 23:17:33 INFO mapred.JobClient: map 100% reduce 0%
>> > 07/03/06 23:17:42 INFO mapred.JobClient: map 100% reduce 3%
>> > 07/03/06 23:17:43 INFO mapred.JobClient: map 100% reduce 5%
>> > 07/03/06 23:17:44 INFO mapred.JobClient: map 100% reduce 8%
>> > 07/03/06 23:17:52 INFO mapred.JobClient: map 100% reduce 10%
>> > 07/03/06 23:17:53 INFO mapred.JobClient: map 100% reduce 13%
>> > 07/03/06 23:18:03 INFO mapred.JobClient: map 100% reduce 16%
>> >
>> >
>> > The only exception I can see in the log files is in the 'TaskTracker'
>> > log file:
>> >
>> > 2007-03-06 23:17:32,214 INFO org.apache.hadoop.mapred.TaskRunner:
>> > task_0001_r_000000_0 Copying task_0001_m_000002_0 output from traal.
>> > 2007-03-06 23:17:32,221 INFO org.apache.hadoop.mapred.TaskRunner:
>> > task_0001_r_000000_0 Copying task_0001_m_000001_0 output from
>> > dennis-laptop.
>> > 2007-03-06 23:17:32,368 WARN org.apache.hadoop.mapred.TaskRunner:
>> > task_0001_r_000000_0 copy failed: task_0001_m_000002_0 from traal
>> > 2007-03-06 23:17:32,368 WARN org.apache.hadoop.mapred.TaskRunner:
>> > java.io.IOException: File
>> > /tmp/hadoop-hadoop/mapred/local/task_0001_r_000000_0/map_2.out-0 not created
>> >   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.copyOutput(ReduceTaskRunner.java:301)
>> >   at org.apache.hadoop.mapred.ReduceTaskRunner$MapOutputCopier.run(ReduceTaskRunner.java:262)
>> >
>> > 2007-03-06 23:17:32,369 WARN org.apache.hadoop.mapred.TaskRunner:
>> > task_0001_r_000000_0 adding host traal to penalty box, next contact
>> > in 99 seconds
>> >
>> > I am attaching the master log files just in case anyone wants to
>> > check them.
>> >
>> > Any help will be greatly appreciated!
>> >
>> > -gaurav
>> >
>> > http://www.nabble.com/file/7013/hadoop-hadoop-tasktracker-dennis-laptop.log
>> > hadoop-hadoop-tasktracker-dennis-laptop.log
>> > http://www.nabble.com/file/7012/hadoop-hadoop-jobtracker-dennis-laptop.log
>> > hadoop-hadoop-jobtracker-dennis-laptop.log
>> > http://www.nabble.com/file/7011/hadoop-hadoop-namenode-dennis-laptop.log
>> > hadoop-hadoop-namenode-dennis-laptop.log
>> > http://www.nabble.com/file/7010/hadoop-hadoop-datanode-dennis-laptop.log
>> > hadoop-hadoop-datanode-dennis-laptop.log
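P.S. One more thing I plan to double-check on my own cluster before the
next run -- this is only a guess on my part, not something the logs
confirm: since the reduce task on dennis-laptop has to fetch map output
from traal by hostname, both machines need to be able to resolve each
other's names. Something like the following in /etc/hosts on both
machines:

  192.168.1.150   dennis-laptop
  192.168.1.201   traal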
