Problem solved: the IP for the troublesome machine wasn't present in
the DNS. Thanks, anyway.
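
For the archives, in case anyone hits the same symptoms: the quickest
sanity check I know of is to confirm that every node resolves every
other node's hostname to its LAN address rather than loopback. A rough
sketch, using the hostnames from this thread (adapt as needed):

getent hosts hadoop-master hadoop-slave00 hadoop-slave01
host orco.3kh.net    # DNS only; should return 192.168.200.162

Here the missing DNS entry for that last name was the culprit.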

On Mon, Jul 20, 2009 at 3:58 PM, nfantone<[email protected]> wrote:
> Update: I tried running the cluster with two particular nodes, and I
> got the same errors. So, I'm thinking maybe it has something to do
> with the connection to that PC (hadoop-slave01, aka 'orco').
>
> Here's what the jobtracker log shows from the master:
>
> 2009-07-20 15:46:22,366 INFO org.apache.hadoop.mapred.JobInProgress:
> Failed fetch notification #1 for task
> attempt_200907201540_0001_m_000001_0
> 2009-07-20 15:46:28,113 INFO org.apache.hadoop.mapred.TaskInProgress:
> Error from attempt_200907201540_0001_r_000002_0: Shuffle Error:
> Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> 2009-07-20 15:46:28,114 INFO org.apache.hadoop.mapred.JobTracker:
> Adding task (cleanup)'attempt_200907201540_0001_r_000002_0' to tip
> task_200907201540_0001_r_000002, for tracker
> 'tracker_orco.3kh.net:localhost/127.0.0.1:59814'
> 2009-07-20 15:46:31,116 INFO org.apache.hadoop.mapred.TaskInProgress:
> Error from attempt_200907201540_0001_r_000000_0: Shuffle Error:
> Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>
> Why does it show 'orco.3kh.net:localhost'? I know that name is in
> /etc/hosts, but I didn't expect Hadoop to take into account any lines
> other than the ones specifying IPs for the masters and slaves. Is it
> attempting to connect to itself and failing?
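>
> (A side note for the archives: the ':localhost/127.0.0.1:59814' part is
> the TaskTracker's task-report address, which I believe is bound to
> loopback by default (mapred.task.tracker.report.address), so that
> suffix on its own is probably not the error. What matters is what the
> name 'orco' resolves to, on itself and on the other nodes. A rough
> check, using the names from this thread:
>
> hostname -f                  # the name the tracker registers with
> getent hosts orco.3kh.net    # on orco: 127.0.1.1, via its hosts file
> getent hosts hadoop-slave01  # should be 192.168.200.162 everywhere
>
> If the other nodes cannot resolve orco's name to a reachable address,
> the reducers cannot fetch map output from it, which matches the fetch
> failures above.)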
>
>
> On Mon, Jul 20, 2009 at 1:30 PM, nfantone<[email protected]> wrote:
>> Ok, here's my failure report:
>>
>> I can't get more than two nodes working in the cluster. With just a
>> master and a slave, everything seems to go smoothly. However, if I add
>> a third datanode (the master itself also acting as a datanode), I keep
>> getting this error while running the wordcount example, which I'm
>> using to test the setup:
>>
>> 09/07/20 12:51:45 INFO mapred.JobClient:  map 100% reduce 17%
>> 09/07/20 12:51:47 INFO mapred.JobClient: Task Id :
>> attempt_200907201251_0001_m_000004_0, Status : FAILED
>> Too many fetch-failures
>> 09/07/20 12:51:48 WARN mapred.JobClient: Error reading task outputNo
>> route to host
>>
>> While the map phase completes, the reduce tasks get stuck at around
>> 16% every time. I have googled the error message and read some
>> responses on this list and other related forums, and it usually points
>> to a firewall issue or to ports not being open; yet that is not my
>> case: the firewall has been disabled on every node and connections
>> between them (in both directions) seem to be fine.
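>>
>> A quick way to double-check that path (the reducers pull each map's
>> output over the TaskTracker's HTTP port, 50060 by default) is to probe
>> that port from every node against every other node -- just a sketch,
>> adjust the hostname and port if your setup differs:
>>
>> nc -z -w 3 hadoop-slave01 50060 && echo tasktracker reachable
>> curl -sS http://hadoop-slave01:50060/ > /dev/null && echo http ok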
>>
>> Here are the /etc/hosts files for each node:
>>
>> (master)
>> 127.0.0.1       localhost
>> 127.0.1.1       mauroN-Linux
>> 192.168.200.20  hadoop-master
>> 192.168.200.90  hadoop-slave00
>> 192.168.200.162 hadoop-slave01
>>
>> (slave00)
>> 127.0.0.1       localhost
>> 127.0.1.1       tagore
>> 192.168.200.20  hadoop-master
>> 192.168.200.90  hadoop-slave00
>> 192.168.200.162 hadoop-slave01
>>
>> (slave01)
>> 127.0.0.1       localhost
>> 127.0.1.1       orco.3kh.net orco localhost.localdomain
>> 192.168.200.20  hadoop-master
>> 192.168.200.90  hadoop-slave00
>> 192.168.200.162 hadoop-slave01
>>
>> And the .xml conf files, which are identical on every node (only the
>> relevant lines shown):
>>
>> (core-site.xml)
>> <name>hadoop.tmp.dir</name>
>> <value>/usr/local/hadoop/hadoop-datastore/hadoop-${user.name}</value>
>>
>> <name>fs.default.name</name>
>> <value>hdfs://hadoop-master:54310/</value>
>> <final>true</final>
>>
>> (mapred-site.xml)
>> <name>mapred.job.tracker</name>
>> <value>hdfs://hadoop-master:54311/</value>
>> <final>true</final>
>>
>> <name>mapred.map.tasks</name>
>> <value>31</value>
>>
>> <name>mapred.reduce.tasks</name>
>> <value>6</value>
>>
>> (hdfs-site.xml)
>> <name>dfs.replication</name>
>> <value>3</value>
>>
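>> (For completeness: each of those name/value pairs sits inside a
>> <property> element in the usual Hadoop *-site.xml layout. For example,
>> the fs.default.name entry in core-site.xml looks roughly like this:
>>
>> <configuration>
>>   <property>
>>     <name>fs.default.name</name>
>>     <value>hdfs://hadoop-master:54310/</value>
>>     <final>true</final>
>>   </property>
>> </configuration>
>>
>> The other entries follow the same pattern in their respective files.)
>>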
>> I noticed that if I reduce mapred.reduce.tasks to 2 or 3, the error
>> does not pop up, but the job takes quite a long time to finish (longer
>> than a single machine takes on its own). I have blacklisted IPv6 and
>> enabled ip_forward on every node (sudo echo
>> 1 > /proc/sys/net/ipv4/ip_forward). Should anyone need some info from
>> the datanode logs, I can post it. I'm running out of ideas... and in
>> need of enlightenment.
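>>
>> (One aside on that ip_forward line: with sudo, the redirection is
>> still performed by the unprivileged shell, so that exact command
>> usually fails unless run from a root shell. The equivalent via sysctl
>> -- same setting, just written differently -- would be:
>>
>> sudo sysctl -w net.ipv4.ip_forward=1
>>
>> though IP forwarding shouldn't really matter for nodes that all sit
>> on the same 192.168.200.x subnet, as these do.)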
>>
>> On Thu, Jul 16, 2009 at 9:39 AM, nfantone<[email protected]> wrote:
>>> I really appreciate all your suggestions, but from where I am and
>>> considering the place I work at (a rather small office in Argentina)
>>> these things aren't that affordable (monetarily and bureaucratically
>>> speaking). That being said, I managed to get my hands on some more
>>> equipment and I may be able to set up a small cluster of three or four
>>> nodes - all running on a local network with Ubuntu. What I need to
>>> learn now is exactly how to configure everything required to set it
>>> up, as I have virtually no idea of, nor experience with, this kind of
>>> task. Luckily, googling led me to some tutorials and documentation on
>>> the subject. I'll be following this guide for now:
>>>
>>> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
>>>
>>> I'll let you know what comes out of this (surely, something on the
>>> messy side of things). Any more suggestions/ideas are more than
>>> welcome. Many thanks again.
>>>
>>
>
