And I just got another idea. The Vanilla plugin has data-locality disabled by default (I don't remember exactly why). I am not sure how the mapper will behave without data-locality info, i.e. whether it will understand that it should always take data from its own node. So it is worth trying to enable it; a config sketch follows, and the doc link is right after it.
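A minimal sketch of what enabling it could look like, written from memory of the data-locality doc linked below; the option names (enable_data_locality, compute_topology_file, swift_topology_file) and the topology file format should be double-checked against that page for your Savanna version, and the host/rack layout here is an illustrative placeholder:

# savanna.conf
[DEFAULT]
enable_data_locality=True
compute_topology_file=etc/savanna/compute.topology
swift_topology_file=etc/savanna/swift.topology

# etc/savanna/compute.topology: one "<compute host> <topology path>"
# pair per line; compute-host-1/2 are placeholder hypervisor names.
compute-host-1 /rack1
compute-host-2 /rack1

With that enabled, Savanna passes the topology to Hadoop, so the jobtracker can prefer scheduling a map task on the node (or at least the rack) that holds its input block, which should also answer the question about mappers reading from their own datanode.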
Here is the doc:
http://savanna.readthedocs.org/en/0.3/userdoc/features.html#data-locality

Dmitry

2013/12/18 Dmitry Mescheryakov <[email protected]>

> Marc,
>
> I believe we haven't faced this problem so far. Have you tested the
> network connection between the nodes for stability and throughput? Maybe
> the errors are caused by network oversaturation.
>
> Though the errors point at the network, it might be worth checking with
> the Hadoop community whether such exceptions could be caused by something
> other than a network malfunction.
>
> Dmitry
>
>
> 2013/12/18 Marc Solanas Tarre -X (msolanas - AAP3 INC at Cisco) <[email protected]>
>
>> Hi,
>>
>> I asked this question on Launchpad
>> (https://answers.launchpad.net/savanna/+question/240969), but I thought
>> it might reach more people if I use the list.
>>
>> My setup is:
>>
>> Ubuntu 12.04
>> OpenStack Havana with the Vanilla plugin
>>
>> I have deployed a cluster with the following node groups:
>>
>> 1 x master:
>>
>> - Uses 1 Cinder volume: 2TB
>> - namenode
>> - secondarynamenode
>> - oozie
>> - datanode
>> - jobtracker
>> - tasktracker
>>
>> 2 x slaves:
>>
>> - Uses 1 Cinder volume: 2TB
>> - datanode
>> - tasktracker
>>
>> Both node groups used the following flavor:
>>
>> VCPUs: 32
>> RAM: 250000
>> Root disk: 300GB
>> Ephemeral: 300GB
>> Swap: 0
>>
>> They also use the default Ubuntu Hadoop Vanilla image downloadable from
>> https://savanna.readthedocs.org/en/latest/userdoc/vanilla_plugin.html
>>
>> The /etc/hosts file on all nodes is:
>>
>> 127.0.0.1 localhost
>> 10.0.0.2 test-master2T-001.novalocal test-master2T-001
>> 10.0.0.3 test-slave2T-001.novalocal test-slave2T-001
>> 10.0.0.4 test-slave2T-002.novalocal test-slave2T-002
>>
>> Without changing any of the default configuration, the cluster boots
>> correctly.
>>
>> The problem is that when I run a job (for example, teragen 100GB), many
>> map tasks fail and have to be retried, which increases the job time. They
>> seem to fail randomly, on one slave or the other, depending on the run.
>>
>> Checking the logs of the datanodes on the slaves, I can see this error:
>>
>> WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
>> java.net.ConnectException: Call to test-master2T-001/10.0.0.2:8020 failed
>> on connection exception: java.net.ConnectException: Connection refused
>>
>> Full error: http://pastebin.com/DDp39yqt
>>
>> The log of the datanode on the master gives this error:
>>
>> WARN org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskError:
>> exception: java.net.SocketException: Original Exception :
>> java.io.IOException: Connection reset by peer
>>
>> Full error: http://pastebin.com/NXYXELQX
>>
>> I have tried changing hadoop.tmp.dir to point to the 2TB Cinder volume
>> (/volumes/disk1/lib/hadoop/hdfs/tmp), but nothing changed.
>>
>> Thank you in advance.
>>
>> Marc
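On the "Connection refused" to 10.0.0.2:8020 in the quoted logs: before digging further into the Hadoop config, it may be worth confirming that the namenode RPC port stays reachable from each slave while a job is running, which is also a cheap way to act on the network suggestion quoted above. A minimal probe, as a sketch: the host comes from the /etc/hosts excerpt, 8020 is the port from the error, and 50010 is assumed to be the default datanode transfer port.

#!/usr/bin/env python
# Probe the ports from the errors in this thread; run it from each slave
# (and on the master itself) in a loop while the job executes.
import socket

TARGETS = [
    ("test-master2T-001", 8020),   # namenode RPC, the "Connection refused" target
    ("test-master2T-001", 50010),  # datanode data transfer (assumed default port)
]

for host, port in TARGETS:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(5)
    try:
        s.connect((host, port))
        print("%s:%d reachable" % (host, port))
    except socket.error as e:
        print("%s:%d FAILED: %s" % (host, port, e))
    finally:
        s.close()

If the probe fails only intermittently under load, that points at the network (oversaturation, as suggested above) or at the namenode process itself going down, rather than at the cluster configuration.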
_______________________________________________
Mailing list: http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
Post to     : [email protected]
Unsubscribe : http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
