When you write to HDFS from a machine that is running a datanode process, the first copy of the data is *always* written to that local datanode. This is an optimization for the MapReduce framework (it keeps data local to the writer). The lesson here is that you should *never* use a datanode machine to load your data. Always do it from a machine outside the grid.
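For example, a put run from a client/edge node that is not part of the cluster will get its blocks spread across the datanodes. Roughly like this (the local path and destination path are just placeholders; the namenode address is the same one used in the thread below):

    # run from a machine outside the cluster; paths here are placeholders
    hadoop fs -put /local/path/bigfile.dat hdfs://master:9000/user/hadoop/bigfile.dat

Because the client is not itself a datanode, the namenode picks a datanode for the first replica of each block, so the file ends up distributed across the cluster instead of piling up on one machine.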
Additionally, you can run hadoop fsck <filename> -files -blocks -locations to see where those blocks have been written (there is a concrete example at the end of this message).

On Jul 13, 2010, at 9:45 AM, Nathan Grice wrote:

> To test the block distribution, run the same put command from the NameNode
> and then again from the DataNode, and check the HDFS filesystem after both
> commands. In my case, a 2GB file was distributed mostly evenly across the
> datanodes when put was run on the NameNode, and ended up only on that
> DataNode when put was run from the DataNode.
>
> On Tue, Jul 13, 2010 at 9:32 AM, C.V.Krishnakumar <[email protected]> wrote:
>
>> Hi,
>> I am a newbie. I am curious to know how you discovered that all the
>> blocks are written to the datanode's HDFS? I thought the replication by
>> the namenode was transparent. Am I missing something?
>> Thanks,
>> Krishna
>>
>> On Jul 12, 2010, at 4:21 PM, Nathan Grice wrote:
>>
>>> We are trying to load data into HDFS from one of the slaves, and when
>>> the put command is run from a slave (datanode), all of the blocks are
>>> written to that datanode's HDFS and not distributed to the other nodes
>>> in the cluster. It does not seem to matter what destination format we
>>> use (/filename vs hdfs://master:9000/filename); it always behaves the
>>> same.
>>> Conversely, running the same command from the namenode distributes the
>>> file across the datanodes.
>>>
>>> Is there something I am missing?
>>>
>>> -Nathan
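For reference, the fsck check mentioned above would look roughly like this (the path is only a placeholder for whatever file you loaded):

    # prints each block of the file plus the datanodes holding its replicas
    hadoop fsck /user/hadoop/bigfile.dat -files -blocks -locations

If every block location points at the same datanode, you are seeing the local-write behaviour described at the top of this message.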
