AFAIK there is no way to disable this "feature". It is an optimization: it happens because, in your case, the node generating the data is also a datanode, so the first replica of every block is written locally.
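To make the effect concrete, here is a rough simulation of the default placement behavior described above (an illustration only, not the actual BlockPlacementPolicyDefault code): the first replica of each block lands on the writing node, and the remaining replicas are spread over the other datanodes.

```python
import random

def place_blocks(nodes, writer, n_blocks, replication=3, seed=0):
    """Count blocks per node when every write originates on `writer`."""
    rng = random.Random(seed)
    usage = {n: 0 for n in nodes}
    for _ in range(n_blocks):
        usage[writer] += 1                       # replica 1: always local
        others = [n for n in nodes if n != writer]
        for n in rng.sample(others, replication - 1):
            usage[n] += 1                        # replicas 2..r: elsewhere
    return usage

# 5 datanodes, 10GB written in 64MB blocks (160 blocks), all from node "a":
usage = place_blocks(list("abcde"), "a", n_blocks=160)
print(usage)  # "a" holds all 160 blocks; b-e share the remaining 320
```

The writer ends up with one full copy of everything it wrote (about twice the per-node share of the other datanodes), which matches the skew reported below.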
Raj

>________________________________
> From: Stijn De Weirdt <stijn.dewei...@ugent.be>
>To: common-user@hadoop.apache.org
>Sent: Monday, April 2, 2012 12:18 PM
>Subject: Re: data distribution in HDFS
>
>thanks serge.
>
>is there a way to disable this "feature" (i.e. always placing the first
>block on the local node)?
>and is this because the local node is a datanode? or is there always a
>"local node" with data transfers?
>
>many thanks,
>
>stijn
>
>> The local node is the node you are copying data from,
>> e.g. if you are using the -copyFromLocal option.
>>
>> Regards
>> Serge
>>
>> On 4/2/12 11:53 AM, "Stijn De Weirdt" <stijn.dewei...@ugent.be> wrote:
>>
>>> hi raj,
>>>
>>> what is a "local node"? is it relative to the tasks that are started?
>>>
>>> stijn
>>>
>>> On 04/02/2012 07:28 PM, Raj Vishwanathan wrote:
>>>> Stijn,
>>>>
>>>> The first block of the data is always stored on the local node.
>>>> Assuming you had a replication factor of 3, the node that generates
>>>> the data will get about 10GB of data and the other 20GB will be
>>>> distributed among the other nodes.
>>>>
>>>> Raj
>>>>
>>>>> ________________________________
>>>>> From: Stijn De Weirdt <stijn.dewei...@ugent.be>
>>>>> To: common-user@hadoop.apache.org
>>>>> Sent: Monday, April 2, 2012 9:54 AM
>>>>> Subject: data distribution in HDFS
>>>>>
>>>>> hi all,
>>>>>
>>>>> i've just started to play around with hdfs+mapred. i'm currently
>>>>> running teragen/sort/validate to see if i understand it all.
>>>>>
>>>>> the test setup involves 5 nodes that are all tasktracker and datanode,
>>>>> and one of those nodes is also jobtracker and namenode on top of that
>>>>> (i.e. this one node runs both the namenode process and a datanode
>>>>> process).
>>>>>
>>>>> when i do the teragen run, the data is not distributed equally over
>>>>> all nodes: the node that is also namenode gets a bigger portion of
>>>>> all the data (as seen by df on the nodes and by dfsadmin -report).
>>>>> i also get this distribution when i run the TestDFSIO write test (50
>>>>> files of 1GB).
>>>>>
>>>>> i use the basic command line teragen $((100*1000*1000))
>>>>> /benchmarks/teragen, so i expect 100M * 0.1kB = 10GB of data. (if i
>>>>> add up the volumes in use by hdfs, it's actually quite a bit more.)
>>>>> the 4 plain datanodes are using 4.2-4.8GB each, and the data+namenode
>>>>> has 9.4GB in use, so this one datanode counts as roughly two nodes.
>>>>>
>>>>> when i do ls on the filesystem, i see that teragen created 250MB
>>>>> files; the current hdfs blocksize is 64MB.
>>>>>
>>>>> is there a reason why one datanode is preferred over the others?
>>>>> it is annoying since the terasort output behaves the same, and i
>>>>> can't use the full hdfs space for testing that way. also, since more
>>>>> IO goes to this one node, the performance isn't really balanced.
>>>>>
>>>>> many thanks,
>>>>>
>>>>> stijn
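Raj's arithmetic in the quoted thread can be checked with a quick back-of-the-envelope calculation (assuming, as he does, a replication factor of 3 and the first replica always landing on the writing node):

```python
# teragen $((100*1000*1000)) writes 100M records of 100 bytes each.
rows = 100 * 1000 * 1000
row_bytes = 100
logical_gb = rows * row_bytes / 1e9      # 10GB of user data

replication = 3
total_gb = logical_gb * replication      # 30GB actually stored in HDFS

writer_gb = logical_gb                   # one full copy stays local
other_gb = (total_gb - writer_gb) / 4    # rest spread over 4 datanodes

print(logical_gb, total_gb, writer_gb, other_gb)  # 10.0 30.0 10.0 5.0
```

That predicts roughly 10GB on the writing node and 5GB on each of the other four, which is close to the 9.4GB and 4.2-4.8GB observed in the thread.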