Problem solved: the IP for the troublesome machine wasn't present in the DNS. Thanks anyway.
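
A minimal sketch of the kind of check that catches this early: confirm that every cluster hostname actually resolves, and to the address you expect, before starting the daemons. The hostnames and IPs used here are assumed from the /etc/hosts files quoted further down in the thread.

# check_dns.py -- sketch: confirm each cluster hostname resolves and
# maps to the expected address (values assumed from the /etc/hosts
# files quoted later in this thread).
import socket

EXPECTED = {
    "hadoop-master": "192.168.200.20",
    "hadoop-slave00": "192.168.200.90",
    "hadoop-slave01": "192.168.200.162",
}

for host, expected_ip in EXPECTED.items():
    try:
        ip = socket.gethostbyname(host)   # forward lookup via DNS or /etc/hosts
    except socket.gaierror as err:
        print("%s: UNRESOLVED (%s)" % (host, err))
        continue
    status = "ok" if ip == expected_ip else "MISMATCH (got %s)" % ip
    print("%s -> %s [%s]" % (host, ip, status))
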
On Mon, Jul 20, 2009 at 3:58 PM, nfantone<[email protected]> wrote:
> Update: I tried running the cluster with two particular nodes, and I got
> the same errors. So, I'm thinking maybe it has something to do with the
> connection to that PC (hadoop-slave01, aka 'orco').
>
> Here's what the jobtracker log shows from the master:
>
> 2009-07-20 15:46:22,366 INFO org.apache.hadoop.mapred.JobInProgress:
> Failed fetch notification #1 for task attempt_200907201540_0001_m_000001_0
> 2009-07-20 15:46:28,113 INFO org.apache.hadoop.mapred.TaskInProgress:
> Error from attempt_200907201540_0001_r_000002_0: Shuffle Error:
> Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> 2009-07-20 15:46:28,114 INFO org.apache.hadoop.mapred.JobTracker:
> Adding task (cleanup)'attempt_200907201540_0001_r_000002_0' to tip
> task_200907201540_0001_r_000002, for tracker
> 'tracker_orco.3kh.net:localhost/127.0.0.1:59814'
> 2009-07-20 15:46:31,116 INFO org.apache.hadoop.mapred.TaskInProgress:
> Error from attempt_200907201540_0001_r_000000_0: Shuffle Error:
> Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>
> Why does it show 'orco.3kh.net:localhost'? I know it's in /etc/hosts, but
> I didn't expect it to take into account any lines apart from the ones
> specifying IPs for the master and slaves. Is it attempting to connect to
> itself and failing?
>
>
> On Mon, Jul 20, 2009 at 1:30 PM, nfantone<[email protected]> wrote:
>> Ok, here's my failure report:
>>
>> I can't get more than two nodes working in the cluster. With just a
>> master and a slave, everything seems to go smoothly. However, if I add a
>> third datanode (the master itself also being a datanode), I keep getting
>> this error while running the wordcount example, which I'm using to test
>> the setup:
>>
>> 09/07/20 12:51:45 INFO mapred.JobClient: map 100% reduce 17%
>> 09/07/20 12:51:47 INFO mapred.JobClient: Task Id :
>> attempt_200907201251_0001_m_000004_0, Status : FAILED
>> Too many fetch-failures
>> 09/07/20 12:51:48 WARN mapred.JobClient: Error reading task outputNo
>> route to host
>>
>> While the mapping completes, the reduce task gets stuck at around 16%
>> every time. I have googled the error message and read some responses
>> from this list and other related forums, and it seems to be a firewall
>> issue or something about ports not being open; yet that is not my case:
>> the firewall has been disabled on every node and connections between
>> them (to and from) seem to be fine.
>>
>> Here are my /etc/hosts files for each node:
>>
>> (master)
>> 127.0.0.1 localhost
>> 127.0.1.1 mauroN-Linux
>> 192.168.200.20 hadoop-master
>> 192.168.200.90 hadoop-slave00
>> 192.168.200.162 hadoop-slave01
>>
>> (slave00)
>> 127.0.0.1 localhost
>> 127.0.1.1 tagore
>> 192.168.200.20 hadoop-master
>> 192.168.200.90 hadoop-slave00
>> 192.168.200.162 hadoop-slave01
>>
>> (slave01)
>> 127.0.0.1 localhost
>> 127.0.1.1 orco.3kh.net orco localhost.localdomain
>> 192.168.200.20 hadoop-master
>> 192.168.200.90 hadoop-slave00
>> 192.168.200.162 hadoop-slave01
>>
>> And the .xml conf files, which are the same on every node (relevant
>> lines only):
>>
>> (core-site.xml)
>> <name>hadoop.tmp.dir</name>
>> <value>/usr/local/hadoop/hadoop-datastore/hadoop-${user.name}</value>
>>
>> <name>fs.default.name</name>
>> <value>hdfs://hadoop-master:54310/</value>
>> <final>true</final>
>>
>> (mapred-site.xml)
>> <name>mapred.job.tracker</name>
>> <value>hdfs://hadoop-master:54311/</value>
>> <final>true</final>
>>
>> <name>mapred.map.tasks</name>
>> <value>31</value>
>>
>> <name>mapred.reduce.tasks</name>
>> <value>6</value>
>>
>> (hdfs-site.xml)
>> <name>dfs.replication</name>
>> <value>3</value>
>>
>> I noticed that if I reduce mapred.reduce.tasks to 2 or 3, the error does
>> not pop up, but the job takes quite a long time to finish (longer than
>> it takes a single machine to finish it). I have blacklisted ipv6 and
>> enabled ip_forward on every node
>> (sudo echo 1 > /proc/sys/net/ipv4/ip_forward).
>> Should anyone need some info from the datanode logs, I can post it.
>> I'm running out of ideas... and in need of enlightenment.
>>
>> On Thu, Jul 16, 2009 at 9:39 AM, nfantone<[email protected]> wrote:
>>> I really appreciate all your suggestions, but from where I am and
>>> considering the place I work at (a rather small office in Argentina),
>>> these things aren't that affordable (monetarily and bureaucratically
>>> speaking). That being said, I managed to get my hands on some more
>>> equipment and I may be able to set up a small cluster of three or four
>>> nodes, all running Ubuntu on a local network. What I need to learn now
>>> is exactly how to configure everything needed to create it, as I have
>>> virtually no idea of, nor experience with, this kind of task. Luckily,
>>> googling led me to some tutorials and documentation on the subject.
>>> I'll be following this guide for now:
>>>
>>> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
>>>
>>> I'll let you know what comes out of this (surely, something on the
>>> messy side of things). Any more suggestions/ideas are more than
>>> welcome. Many thanks, again.
>>>
>>
>
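
On the 'tracker_orco.3kh.net:localhost/127.0.0.1' entry in the log above: a quick way to see what a node advertises about itself is to check what its own hostname resolves to. A minimal sketch, assuming Python is available on the nodes; if the node's own name resolves to a loopback address (which the 127.0.1.1 orco.3kh.net line in slave01's /etc/hosts suggests it may), reducers on other machines can end up trying to fetch its map output from localhost, which would match the shuffle/fetch failures reported above.

# self_resolve.py -- sketch: run on each node to see which address the
# node's own hostname resolves to.
import socket

name = socket.getfqdn()              # e.g. orco.3kh.net on slave01
addr = socket.gethostbyname(name)    # address that name maps to
print("%s resolves to %s" % (name, addr))
if addr.startswith("127."):
    print("WARNING: hostname maps to loopback; other nodes cannot reach this address")
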
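Since the "No route to host" message points at plain connectivity rather than Hadoop itself, a port-level check from each node can rule that out. A minimal sketch: ports 54310 and 54311 come from the core-site.xml and mapred-site.xml quoted above, while 50060 is assumed here to be the default TaskTracker HTTP port that reducers fetch map output from during the shuffle.

# port_check.py -- sketch: confirm the relevant ports answer from this node.
import socket

CHECKS = [
    ("hadoop-master", 54310),    # NameNode (fs.default.name)
    ("hadoop-master", 54311),    # JobTracker (mapred.job.tracker)
    ("hadoop-master", 50060),    # TaskTracker HTTP (shuffle) -- assumed default port
    ("hadoop-slave00", 50060),
    ("hadoop-slave01", 50060),
]

for host, port in CHECKS:
    try:
        sock = socket.create_connection((host, port), timeout=3)
        sock.close()
        print("%s:%d reachable" % (host, port))
    except (socket.error, socket.timeout) as err:
        print("%s:%d NOT reachable (%s)" % (host, port, err))
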
