AFAIK there is no way to disable this "feature". It is an optimization: it happens because, in your case, the node generating the data is also a datanode, so the first replica of every block is written locally.
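To make the effect concrete, here is a rough simulation of the default placement behavior described above (an illustration only, not the actual BlockPlacementPolicyDefault code): the first replica of each block lands on the writing node, and the remaining replicas are spread over the other datanodes.

```python
import random

def place_blocks(nodes, writer, n_blocks, replication=3, seed=0):
    """Count blocks per node when every write originates on `writer`."""
    rng = random.Random(seed)
    usage = {n: 0 for n in nodes}
    for _ in range(n_blocks):
        usage[writer] += 1                       # replica 1: always local
        others = [n for n in nodes if n != writer]
        for n in rng.sample(others, replication - 1):
            usage[n] += 1                        # replicas 2..r: elsewhere
    return usage

# 5 datanodes, 10GB written in 64MB blocks (160 blocks), all from node "a":
usage = place_blocks(list("abcde"), "a", n_blocks=160)
print(usage)  # "a" holds all 160 blocks; b-e share the remaining 320
```

The writer ends up with one full copy of everything it wrote (about twice the per-node share of the other datanodes), which matches the skew reported below.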
Raj

>________________________________
> From: Stijn De Weirdt <stijn.dewei...@ugent.be>
>To: common-user@hadoop.apache.org
>Sent: Monday, April 2, 2012 12:18 PM
>Subject: Re: data distribution in HDFS
>
>thanks serge.
>
>is there a way to disable this "feature" (i.e. always placing the first
>block on the local node)?
>and is this because the local node is a datanode? or is there always a
>"local node" with data transfers?
>
>many thanks,
>
>stijn
>
>> The local node is the node you are copying data from,
>> e.g. if you are using the -copyFromLocal option.
>>
>> Regards
>> Serge
>>
>> On 4/2/12 11:53 AM, "Stijn De Weirdt" <stijn.dewei...@ugent.be> wrote:
>>
>>> hi raj,
>>>
>>> what is a "local node"? is it relative to the tasks that are started?
>>>
>>> stijn
>>>
>>> On 04/02/2012 07:28 PM, Raj Vishwanathan wrote:
>>>> Stijn,
>>>>
>>>> The first block of the data is always stored on the local node.
>>>> Assuming you had a replication factor of 3, the node that generates
>>>> the data will get about 10GB of data and the other 20GB will be
>>>> distributed among the other nodes.
>>>>
>>>> Raj
>>>>
>>>>> ________________________________
>>>>> From: Stijn De Weirdt <stijn.dewei...@ugent.be>
>>>>> To: common-user@hadoop.apache.org
>>>>> Sent: Monday, April 2, 2012 9:54 AM
>>>>> Subject: data distribution in HDFS
>>>>>
>>>>> hi all,
>>>>>
>>>>> i've just started to play around with hdfs+mapred. i'm currently
>>>>> running teragen/sort/validate to see if i understand it all.
>>>>>
>>>>> the test setup involves 5 nodes that are all tasktracker and datanode,
>>>>> and one of those nodes is also jobtracker and namenode on top of that
>>>>> (i.e. this one node runs both the namenode process and a datanode
>>>>> process).
>>>>>
>>>>> when i do the teragen run, the data is not distributed equally over
>>>>> all nodes: the node that is also namenode gets a bigger portion of
>>>>> all the data (as seen by df on the nodes and by dfsadmin -report).
>>>>> i also get this distribution when i run the TestDFSIO write test (50
>>>>> files of 1GB).
>>>>>
>>>>> i use the basic command line teragen $((100*1000*1000))
>>>>> /benchmarks/teragen, so i expect 100M * 0.1kB = 10GB of data. (if i
>>>>> add up the volumes in use by hdfs, it's actually quite a bit more.)
>>>>> the 4 plain datanodes are using 4.2-4.8GB each, and the data+namenode
>>>>> has 9.4GB in use, so this one datanode counts as roughly two nodes.
>>>>>
>>>>> when i do ls on the filesystem, i see that teragen created 250MB
>>>>> files; the current hdfs blocksize is 64MB.
>>>>>
>>>>> is there a reason why one datanode is preferred over the others?
>>>>> it is annoying since the terasort output behaves the same, and i
>>>>> can't use the full hdfs space for testing that way. also, since more
>>>>> IO goes to this one node, the performance isn't really balanced.
>>>>>
>>>>> many thanks,
>>>>>
>>>>> stijn
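Raj's arithmetic in the quoted thread can be checked with a quick back-of-the-envelope calculation (assuming, as he does, a replication factor of 3 and the first replica always landing on the writing node):

```python
# teragen $((100*1000*1000)) writes 100M records of 100 bytes each.
rows = 100 * 1000 * 1000
row_bytes = 100
logical_gb = rows * row_bytes / 1e9      # 10GB of user data

replication = 3
total_gb = logical_gb * replication      # 30GB actually stored in HDFS

writer_gb = logical_gb                   # one full copy stays local
other_gb = (total_gb - writer_gb) / 4    # rest spread over 4 datanodes

print(logical_gb, total_gb, writer_gb, other_gb)  # 10.0 30.0 10.0 5.0
```

That predicts roughly 10GB on the writing node and 5GB on each of the other four, which is close to the 9.4GB and 4.2-4.8GB observed in the thread.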