I am attaching the config files I was using for these runs to this email. I am not sure whether something in them is causing this lack of data locality in Hadoop.
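As a next step, I am planning to try one of the locality-aware schedulers Arun suggests below. If I am reading the 0.20 documentation correctly (please correct me if the property or class name is wrong), switching the JobTracker to the contrib FairScheduler would be roughly the following in mapred-site.xml, with the fairscheduler jar from contrib/ on the JobTracker's classpath:

<!-- Sketch only: my reading of how to select a non-default scheduler. -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>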
Thanks,
Virajith

On Tue, Jul 12, 2011 at 3:36 PM, Virajith Jalaparti <virajit...@gmail.com> wrote:

> I am using a replication factor of 1 since I don't want to incur the overhead of replication, and I am not much worried about reliability.
>
> I am just using the default Hadoop scheduler (FIFO, I think!). In the case of a single rack, rack-locality doesn't really have any meaning: obviously everything will run in the same rack. I am concerned about data-local maps. I assumed that Hadoop would do a much better job of ensuring data-local maps, but that doesn't seem to be the case here.
>
> -Virajith
>
> On Tue, Jul 12, 2011 at 3:30 PM, Arun C Murthy <a...@hortonworks.com> wrote:
>
>> Why are you running with a replication factor of 1?
>>
>> Also, it depends on the scheduler you are using. The CapacityScheduler in 0.20.203 (not 0.20.2) has much better locality for jobs; similarly with the FairScheduler.
>>
>> IAC, running on a single rack with replication of 1 implies rack-locality for all tasks which, in most cases, is good enough.
>>
>> Arun
>>
>> On Jul 12, 2011, at 5:45 AM, Virajith Jalaparti wrote:
>>
>> > Hi,
>> >
>> > I was trying to run the Sort example in Hadoop-0.20.2 over 200GB of input data on a 20-node cluster. HDFS is configured to use a 128MB block size (so 1600 maps are created) and a replication factor of 1. All 20 nodes are also HDFS datanodes. I was using a bandwidth of 50Mbps between each pair of nodes (configured using Linux "tc"). I see that around 90% of the map tasks are reading data over the network, i.e. most of the map tasks are not being scheduled on the nodes where the data they process is located.
>> > My understanding was that Hadoop tries to schedule as many data-local maps as possible, but in this situation that does not seem to happen. Any idea why this is happening? And is there a way to configure Hadoop to ensure the maximum possible node locality?
>> > Any help regarding this is very much appreciated.
>> >
>> > Thanks,
>> > Virajith
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property><name>fs.default.name</name><value>hdfs://10.1.1.2:9000</value></property> <property><name>hadoop.tmp.dir</name><value>/hadoop/mapred/,/mnt/local/mapred/</value></property> </configuration>
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>dfs.secondary.http.address</name><value>10.1.1.2:50090</value></property> <property><name>dfs.datanode.address</name><value>10.1.1.3:50010</value></property> <property><name>dfs.datanode.http.address</name><value>10.1.1.3:50075</value></property> <property><name>dfs.datanode.ipc.address</name><value>10.1.1.3:50020</value></property> <property><name>dfs.http.address</name><value>10.1.1.2:50070</value></property> <property><name>dfs.data.dir</name><value>/mnt/local/hdfs/data</value></property> <property> <name>dfs.name.dir</name> <value>/mnt/local/hdfs/name</value> </property> <property> <name>dfs.replication</name> <value>1</value> </property> <property> <name>dfs.block.size</name> <value>134217728</value> </property> </configuration>
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property><name>mapred.job.tracker.http.address</name><value>10.1.1.2:50030</value></property> <property><name>mapred.task.tracker.http.address</name><value>10.1.1.3:50060</value></property> <property><name>slave.host.name</name><value>10.1.1.3</value></property> <property><name>mapred.job.tracker</name><value>10.1.1.2:55000</value></property> <property><name>mapred.system.dir</name><value>/hadoop/mapred,/mnt/local/mapred/system</value></property> <property><name>mapred.local.dir</name><value>/hadoop/mapred,/mnt/local/mapred/local</value></property> <property><name>mapred.temp.dir</name><value>/hadoop/mapred,/mnt/local/mapred/temp</value></property> <property><name>mapred.tasktracker.map.tasks.maximum</name><value>1000</value></property> <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>1000</value></property> <property><name>mapred.reduce.slowstart.completed.maps</name><value> 1</value></property> <property> <name>mapred.queue.names</name> <value>default</value> </property> <property><name>mapred.acls.enabled</name> <value>false</value> </property> <property> <name>mapred.reduce.parallel.copies</name> <value>20</value> </property> <property> <name>mapred.map.child.java.opts</name> <value>-Xmx512M</value> </property> <property> <name>mapred.reduce.child.java.opts</name> <value>-Xmx512M</value> </property> <property> <name>mapred.tasktracker.map.tasks.maximum</name> <value>8</value> </property> <property><name>mapred.tasktracker.reduce.tasks.maximum</name><value>3</value></property> </configuration>