The local node is the node you are copying the data from. For example, if you use the -copyFromLocal option, the machine where you run the command is the local node; if that machine is also a datanode, HDFS writes the first replica of each block there.
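A quick way to see this in practice (a sketch; the file name and path below are just examples, not from this thread): copy a file up from one of the datanodes and ask fsck where the replicas landed.

    # Upload a file from this machine (the "local node").
    hadoop fs -copyFromLocal sample.dat /benchmarks/sample.dat

    # fsck lists, per block, the datanodes holding each replica. If this is
    # run on a datanode, the first replica of every block should be on it.
    hadoop fsck /benchmarks/sample.dat -files -blocks -locations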
Regards,
Serge

On 4/2/12 11:53 AM, "Stijn De Weirdt" <stijn.dewei...@ugent.be> wrote:

> hi raj,
>
> what is a "local node"? is it relative to the tasks that are started?
>
> stijn
>
> On 04/02/2012 07:28 PM, Raj Vishwanathan wrote:
>> Stijn,
>>
>> The first block of the data is always stored on the local node.
>> Assuming you have a replication factor of 3, the node that generates
>> the data will get about 10GB of data and the other 20GB will be
>> distributed among the other nodes.
>>
>> Raj
>>
>>> ________________________________
>>> From: Stijn De Weirdt <stijn.dewei...@ugent.be>
>>> To: common-user@hadoop.apache.org
>>> Sent: Monday, April 2, 2012 9:54 AM
>>> Subject: data distribution in HDFS
>>>
>>> hi all,
>>>
>>> i've just started to play around with hdfs+mapred. i'm currently
>>> playing with teragen/sort/validate to see if i understand it all.
>>>
>>> the test setup involves 5 nodes that are all tasktracker and datanode,
>>> and one of those nodes is also jobtracker and namenode on top of that
>>> (i.e. this one node runs the namenode process as well as a datanode
>>> process).
>>>
>>> when i do the teragen run, the data is not distributed equally over
>>> all nodes. the node that is also the namenode gets a bigger portion of
>>> the data (as seen by df on the nodes and by dfsadmin -report).
>>> i also get this distribution when i run the TestDFSIO write test (50
>>> files of 1GB).
>>>
>>> i use the basic command line teragen $((100*1000*1000))
>>> /benchmarks/teragen, so i expect 100M * 0.1kB = 10GB of data. (if i add
>>> up the volumes in use by hdfs, it's actually quite a bit more.)
>>> 4 datanodes are using 4.2-4.8GB each, and the data+namenode has 9.4GB
>>> in use. so this one datanode holds about as much as 2 nodes.
>>>
>>> when i do ls on the filesystem, i see that teragen created 250MB
>>> files; the current hdfs blocksize is 64MB.
>>>
>>> is there a reason why one datanode is preferred over the others?
>>> it is annoying since the terasort output behaves the same, and i can't
>>> use the full hdfs space for testing that way. also, since more IO goes
>>> to this one node, the performance isn't really balanced.
>>>
>>> many thanks,
>>>
>>> stijn
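As a practical follow-up (a sketch, not advice from the original thread): if the skew caused by local-replica placement gets in the way of benchmarking, one option is to submit the write job from a machine that is not a datanode, so the first replica is placed on a randomly chosen node instead. Another is to run the HDFS balancer after the data is generated:

    # Move blocks from over-utilized to under-utilized datanodes until every
    # node is within the given percentage of the cluster-average utilization.
    hadoop balancer -threshold 5

Note the balancer only evens out existing blocks; as long as the generating tasks run on the node that is also a datanode, new writes will keep favoring it.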