Ok, here's my failure report:
I can't get more than two nodes working in the cluster. With just a
master and one slave, everything goes smoothly. However, as soon as I
add a third datanode (the master itself also acting as a datanode), I
keep getting this error while running the wordcount example, which I'm
using to test the setup:
09/07/20 12:51:45 INFO mapred.JobClient: map 100% reduce 17%
09/07/20 12:51:47 INFO mapred.JobClient: Task Id :
attempt_200907201251_0001_m_000004_0, Status : FAILED
Too many fetch-failures
09/07/20 12:51:48 WARN mapred.JobClient: Error reading task outputNo
route to host
While the map phase completes, the reduce task gets stuck at around 16%
every time. I have googled the error message and read some responses
from this list and other related forums, and it seems to point to a
firewall issue or to ports not being open; yet that is not my case: the
firewall has been disabled on every node, and connections between them
(in both directions) seem to be fine.
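For reference, this is roughly how I've been checking connectivity
between the nodes (a rough sketch; 54310/54311 are the ports from my
conf files below, and 50060 is assumed to be the default TaskTracker
HTTP port that the reducers use to fetch map output):

  # run from each node against every other node
  for h in hadoop-master hadoop-slave00 hadoop-slave01; do
      ping -c 1 "$h"
      for p in 54310 54311 50060; do
          nc -z -w 2 "$h" "$p" && echo "$h:$p open" || echo "$h:$p CLOSED"
      done
  done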
Here are the /etc/hosts files for each node:
(master)
127.0.0.1 localhost
127.0.1.1 mauroN-Linux
192.168.200.20 hadoop-master
192.168.200.90 hadoop-slave00
192.168.200.162 hadoop-slave01
(slave00)
127.0.0.1 localhost
127.0.1.1 tagore
192.168.200.20 hadoop-master
192.168.200.90 hadoop-slave00
192.168.200.162 hadoop-slave01
(slave01)
127.0.0.1 localhost
127.0.1.1 orco.3kh.net orco localhost.localdomain
192.168.200.20 hadoop-master
192.168.200.90 hadoop-slave00
192.168.200.162 hadoop-slave01
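In case it's relevant, this is how I check what each name resolves to
on every node (a quick sketch; getent just reflects the files above):

  hostname                      # the name the local daemons advertise
  getent hosts "$(hostname)"    # here this maps to the 127.0.1.1 entry above
  for h in hadoop-master hadoop-slave00 hadoop-slave01; do
      getent hosts "$h"         # should print the 192.168.200.x address on all three machines
  done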
And the .xml conf files, which are identical on every node (only the relevant lines shown):
(core-site.xml)
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/hadoop-datastore/hadoop-${user.name}</value>
<name>fs.default.name</name>
<value>hdfs://hadoop-master:54310/</value>
<final>true</final>
(mapred-site.xml)
<name>mapred.job.tracker</name>
<value>hdfs://hadoop-master:54311/</value>
<final>true</final>
<name>mapred.map.tasks</name>
<value>31</value>
<name>mapred.reduce.tasks</name>
<value>6</value>
(hdfs-site.xml)
<name>dfs.replication</name>
<value>3</value>
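For what it's worth, this is what I run from the master to check that
all three datanodes actually registered (standard commands; the install
path is only assumed from the hadoop.tmp.dir value above):

  cd /usr/local/hadoop
  bin/hadoop dfsadmin -report   # should list 3 live datanodes
  bin/hadoop fsck /             # average block replication should be close to 3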
I noticed that if I reduce mapred.reduce.tasks to 2 or 3, the error
does not pop up, but the job takes quite a long time to finish (longer
than a single machine takes for the same job). I have blacklisted ipv6
and enabled ip_forward on every node (sudo echo 1 >
/proc/sys/net/ipv4/ip_forward). Should anyone need info from the
datanode logs, I can post it. I'm running out of ideas... and in need
of enlightenment.
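In case it matters, this is a quick way to double-check both settings
on each node (the tee form because a plain "sudo echo 1 > ..." redirect
is performed by the unprivileged shell, not by root):

  echo 1 | sudo tee /proc/sys/net/ipv4/ip_forward    # not persistent across reboots
  sysctl net.ipv4.ip_forward                         # should print net.ipv4.ip_forward = 1
  cat /proc/net/if_inet6                             # missing or empty if ipv6 is really off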
On Thu, Jul 16, 2009 at 9:39 AM, nfantone<[email protected]> wrote:
> I really appreciate all your suggestions, but from where I am and
> considering the place I work at (a rather small office in Argentina)
> these things aren't that affordable (monetarily and bureaucratically
> speaking). That being said, I managed to get my hands on some more
> equipment and I may be able to set up a small cluster of three or four
> nodes - all running in a local network with Ubuntu. What I should
> learn now is exactly how to configure everything that is needed in order
> to create it, as I have virtually no idea of, nor experience with, this
> kind of task. Luckily, googling led me to some tutorials and documentation
> on the subject. I'll be following this guide for now:
>
> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
>
> I'll let you know what comes out of this (surely, something on the messy
> side of things). Any more suggestions/ideas are more than welcome. Many
> thanks, again.
>