After setting up the cluster with 6 computers (two of them QuadCore and
the rest DualCore, totaling 16 slave cores) and running a KMeansDriver
job with 32 reduce tasks and ~80 spawned map tasks, it's STILL awfully
slow.

./bin/hadoop jar ~/mahout-core-0.2.jar \
    org.apache.mahout.clustering.kmeans.KMeansDriver \
    -i input/user.data -c init -o output -r 32 -d 0.001 -k 200

Using a pretty small dataset of 62MB, it took more than a whole day to
complete. The DataNode and JobTracker logs don't show any visible
errors, either. Would you mind sharing any advice that could help me
tune this up, given my settings?
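
For reference, the next thing I plan to try (purely a guess on my part)
is cutting -r down to something closer to the number of slave cores,
e.g.:

./bin/hadoop jar ~/mahout-core-0.2.jar \
    org.apache.mahout.clustering.kmeans.KMeansDriver \
    -i input/user.data -c init -o output -r 8 -d 0.001 -k 200

My thinking is that 32 reducers on 16 cores, over only 62MB of data,
mostly adds per-task startup and shuffle overhead, but I may well be
wrong about that.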



On Tue, Jul 21, 2009 at 9:05 AM, nfantone<[email protected]> wrote:
> Problem solved: the IP for the troublesome machine wasn't present in
> the DNS. Thanks, anyways.
>
> On Mon, Jul 20, 2009 at 3:58 PM, nfantone<[email protected]> wrote:
>> Update: I tried running the cluster with two particular nodes, and I
>> got the same errors. So, I'm thinking maybe it has something to do
>> with the connection to that PC (hadoop-slave01, aka 'orco').
>>
>> Here's what the jobtracker log shows from the master:
>>
>> 2009-07-20 15:46:22,366 INFO org.apache.hadoop.mapred.JobInProgress:
>> Failed fetch notification #1 for task
>> attempt_200907201540_0001_m_000001_0
>> 2009-07-20 15:46:28,113 INFO org.apache.hadoop.mapred.TaskInProgress:
>> Error from attempt_200907201540_0001_r_000002_0: Shuffle Error:
>> Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 2009-07-20 15:46:28,114 INFO org.apache.hadoop.mapred.JobTracker:
>> Adding task (cleanup)'attempt_200907201540_0001_r_000002_0' to tip
>> task_200907201540_0001_r_000002, for tracker
>> 'tracker_orco.3kh.net:localhost/127.0.0.1:59814'
>> 2009-07-20 15:46:31,116 INFO org.apache.hadoop.mapred.TaskInProgress:
>> Error from attempt_200907201540_0001_r_000000_0: Shuffle Error:
>> Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>
>> Why does it show 'orco.3kh.net:localhost'? I know that entry is in
>> /etc/hosts, but I didn't expect it to take into account any lines other
>> than the ones mapping IPs to the master and slave names. Is it
>> attempting to connect to itself and failing?
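>>
>> What I might try next, purely as a guess, is editing slave01's
>> /etc/hosts so that the FQDN resolves to the real interface instead of
>> the 127.0.1.1 alias, something along the lines of:
>>
>> 127.0.0.1       localhost
>> 192.168.200.162 orco.3kh.net orco hadoop-slave01
>> 192.168.200.20  hadoop-master
>> 192.168.200.90  hadoop-slave00
>>
>> i.e. dropping the 127.0.1.1 line entirely, so the TaskTracker stops
>> advertising itself as localhost/127.0.0.1 to the other nodes.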
>>
>>
>> On Mon, Jul 20, 2009 at 1:30 PM, nfantone<[email protected]> wrote:
>>> Ok, here's my failure report:
>>>
>>> I can't get more than two nodes working in the cluster. With just a
>>> master and a slave, everything seems to go smoothly. However, if I add
>>> a third datanode (the master itself also acting as a datanode), I keep
>>> getting this error while running the wordcount example, which I'm
>>> using to test the setup:
>>>
>>> 09/07/20 12:51:45 INFO mapred.JobClient:  map 100% reduce 17%
>>> 09/07/20 12:51:47 INFO mapred.JobClient: Task Id :
>>> attempt_200907201251_0001_m_000004_0, Status : FAILED
>>> Too many fetch-failures
>>> 09/07/20 12:51:48 WARN mapred.JobClient: Error reading task outputNo
>>> route to host
>>>
>>> While the mapping completes, the reduce task gets stuck at around 16%
>>> every time. I have googled the error message and read some responses
>>> from this list and other related forums; it usually turns out to be a
>>> firewall issue or ports not being open. That's not my case, though: the
>>> firewall is disabled on every node, and connectivity between them (in
>>> both directions) seems fine.
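>>>
>>> One more check I still want to run, in case it's the shuffle HTTP port
>>> specifically (reducers fetch map output from each TaskTracker over
>>> HTTP, port 50060 by default, if I understand correctly): from every
>>> node, something like
>>>
>>> nc -vz hadoop-master 50060
>>> nc -vz hadoop-slave00 50060
>>> nc -vz hadoop-slave01 50060
>>>
>>> should show whether that port is actually reachable in every direction.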
>>>
>>> Here's my /etc/hosts files for each node:
>>>
>>>  (master)
>>> 127.0.0.1       localhost
>>> 127.0.1.1       mauroN-Linux
>>> 192.168.200.20  hadoop-master
>>> 192.168.200.90  hadoop-slave00
>>> 192.168.200.162 hadoop-slave01
>>>
>>> (slave00)
>>> 127.0.0.1       localhost
>>> 127.0.1.1       tagore
>>> 192.168.200.20  hadoop-master
>>> 192.168.200.90  hadoop-slave00
>>> 192.168.200.162 hadoop-slave01
>>>
>>> (slave01)
>>> 127.0.0.1       localhost
>>> 127.0.1.1       orco.3kh.net orco localhost.localdomain
>>> 192.168.200.20  hadoop-master
>>> 192.168.200.90  hadoop-slave00
>>> 192.168.200.162 hadoop-slave01
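>>>
>>> To see which of these entries actually wins on each box, I suppose
>>> something like this on every node would do as a sanity check:
>>>
>>> hostname -f
>>> getent hosts $(hostname -f)
>>> getent hosts hadoop-master hadoop-slave00 hadoop-slave01
>>>
>>> If a machine's own FQDN resolves to 127.0.1.1 anywhere, that loopback
>>> alias is probably what the daemons end up advertising.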
>>>
>>> And .xml conf files, which are the same for each node (just relevant lines):
>>>
>>> (core-site.xml)
>>> <name>hadoop.tmp.dir</name>
>>> <value>/usr/local/hadoop/hadoop-datastore/hadoop-${user.name}</value>
>>>
>>> <name>fs.default.name</name>
>>> <value>hdfs://hadoop-master:54310/</value>
>>> <final>true</final>
>>>
>>> (mapred-site.xml)
>>> <name>mapred.job.tracker</name>
>>> <value>hdfs://hadoop-master:54311/</value>
>>> <final>true</final>
>>>
>>> <name>mapred.map.tasks</name>
>>> <value>31</value>
>>>
>>> <name>mapred.reduce.tasks</name>
>>> <value>6</value>
>>>
>>> (hdfs-site.xml)
>>> <name>dfs.replication</name>
>>> <value>3</value>
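>>>
>>> (For completeness: in the actual files each of those name/value pairs
>>> is wrapped in a <property> element inside <configuration>; I trimmed
>>> them above for brevity. The replication setting, for instance, reads
>>> roughly:
>>>
>>> <configuration>
>>>   <property>
>>>     <name>dfs.replication</name>
>>>     <value>3</value>
>>>   </property>
>>> </configuration>
>>>
>>> and the other files follow the same pattern.)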
>>>
>>> I noticed that if I reduce mapred.reduce.tasks to 2 or 3, the error
>>> does not pop up, but the job takes quite a long time to finish (longer
>>> than it takes a single machine). I have blacklisted IPv6 and enabled
>>> ip_forward on every node (sudo echo 1 > /proc/sys/net/ipv4/ip_forward).
>>> Should anyone need some info from the datanode logs, I can post it. I'm
>>> running out of ideas... and in need of enlightenment.
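>>>
>>> (Side note: I should double-check the ip_forward bit, since with "sudo
>>> echo 1 > /proc/sys/net/ipv4/ip_forward" the redirection runs as my
>>> regular user rather than root and may silently fail; something like
>>>
>>> echo 1 | sudo tee /proc/sys/net/ipv4/ip_forward
>>> # or, equivalently:
>>> sudo sysctl -w net.ipv4.ip_forward=1
>>>
>>> should actually apply it. Then again, with all nodes on the same
>>> subnet, I doubt ip_forward matters here at all.)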
>>>
>>> On Thu, Jul 16, 2009 at 9:39 AM, nfantone<[email protected]> wrote:
>>>> I really appreciate all your suggestions, but where I am and
>>>> considering the place I work at (a rather small office in Argentina),
>>>> these things aren't that affordable (monetarily or bureaucratically
>>>> speaking). That said, I managed to get my hands on some more equipment
>>>> and may be able to set up a small cluster of three or four nodes, all
>>>> running Ubuntu on a local network. What I need to learn now is exactly
>>>> how to configure everything required to set it up, as I have virtually
>>>> no idea about, nor experience with, this kind of task. Luckily,
>>>> googling led me to some tutorials and documentation on the subject.
>>>> I'll be following this guide for now:
>>>>
>>>> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
>>>>
>>>> I'll let you know what comes out of this (surely, something on the
>>>> messy side of things). Any more suggestions/ideas are more than
>>>> welcome. Many thanks, again.
>>>>
>>>
>>
>
