The local node is the node you are copying the data from. For example, if you use the -copyFromLocal option, the machine where you run the command is the local node; if that machine is also a datanode, HDFS writes the first replica of each block there.
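A quick way to see this in practice (a sketch; the file name and path below are just examples, not from this thread): copy a file up from one of the datanodes and ask fsck where the replicas landed.

    # Upload a file from this machine (the "local node").
    hadoop fs -copyFromLocal sample.dat /benchmarks/sample.dat

    # fsck lists, per block, the datanodes holding each replica. If this is
    # run on a datanode, the first replica of every block should be on it.
    hadoop fsck /benchmarks/sample.dat -files -blocks -locations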
Regards,
Serge

On 4/2/12 11:53 AM, "Stijn De Weirdt" <stijn.dewei...@ugent.be> wrote:

> hi raj,
>
> what is a "local node"? is it relative to the tasks that are started?
>
> stijn
>
> On 04/02/2012 07:28 PM, Raj Vishwanathan wrote:
>> Stijn,
>>
>> The first block of the data is always stored on the local node.
>> Assuming you have a replication factor of 3, the node that generates
>> the data will get about 10GB of data and the other 20GB will be
>> distributed among the other nodes.
>>
>> Raj
>>
>>> ________________________________
>>> From: Stijn De Weirdt <stijn.dewei...@ugent.be>
>>> To: common-user@hadoop.apache.org
>>> Sent: Monday, April 2, 2012 9:54 AM
>>> Subject: data distribution in HDFS
>>>
>>> hi all,
>>>
>>> i've just started to play around with hdfs+mapred. i'm currently
>>> playing with teragen/sort/validate to see if i understand it all.
>>>
>>> the test setup involves 5 nodes that are all tasktracker and datanode,
>>> and one of those nodes is also jobtracker and namenode on top of that
>>> (i.e. this one node runs the namenode process as well as a datanode
>>> process).
>>>
>>> when i do the teragen run, the data is not distributed equally over
>>> all nodes. the node that is also the namenode gets a bigger portion of
>>> the data (as seen by df on the nodes and by dfsadmin -report).
>>> i also get this distribution when i run the TestDFSIO write test (50
>>> files of 1GB).
>>>
>>> i use the basic command line teragen $((100*1000*1000))
>>> /benchmarks/teragen, so i expect 100M * 0.1kB = 10GB of data. (if i add
>>> up the volumes in use by hdfs, it's actually quite a bit more.)
>>> 4 datanodes are using 4.2-4.8GB each, and the data+namenode has 9.4GB
>>> in use. so this one datanode holds about as much as 2 nodes.
>>>
>>> when i do ls on the filesystem, i see that teragen created 250MB
>>> files; the current hdfs blocksize is 64MB.
>>>
>>> is there a reason why one datanode is preferred over the others?
>>> it is annoying since the terasort output behaves the same, and i can't
>>> use the full hdfs space for testing that way. also, since more IO goes
>>> to this one node, the performance isn't really balanced.
>>>
>>> many thanks,
>>>
>>> stijn
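As a practical follow-up (a sketch, not advice from the original thread): if the skew caused by local-replica placement gets in the way of benchmarking, one option is to submit the write job from a machine that is not a datanode, so the first replica is placed on a randomly chosen node instead. Another is to run the HDFS balancer after the data is generated:

    # Move blocks from over-utilized to under-utilized datanodes until every
    # node is within the given percentage of the cluster-average utilization.
    hadoop balancer -threshold 5

Note the balancer only evens out existing blocks; as long as the generating tasks run on the node that is also a datanode, new writes will keep favoring it.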