I recommend that you test your rack identification script, and test it under load. We encountered similar, seemingly random placement of files by HDFS and tracked the cause to this script. I hope this helps.
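To make that suggestion concrete, here is a minimal sketch of a rack script and a small concurrency check, written in Python. It assumes the usual contract for the script named by `topology.script.file.name`: Hadoop passes one or more addresses as arguments and reads one rack path per argument from standard output. The address table, the `/default-rack` fallback, and the names `resolve_racks` and `stress` are hypothetical, for illustration only. This is not the script we ran.

```python
#!/usr/bin/env python3
"""Sketch of a rack topology script plus a simple load check (hypothetical names)."""
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Hypothetical address-to-rack table; replace with the real cluster layout.
RACK_TABLE = {
    "10.0.1.11": "/rack1",
    "10.0.1.12": "/rack1",
    "10.0.1.13": "/rack1",
    "10.0.2.21": "/rack2",
    "10.0.2.22": "/rack2",
    "10.0.2.23": "/rack2",
}
DEFAULT_RACK = "/default-rack"  # assumed fallback for unknown addresses


def resolve_racks(hosts, table=RACK_TABLE):
    """Return one rack path per input host, in the same order."""
    return [table.get(host, DEFAULT_RACK) for host in hosts]


def stress(script_path, hosts, runs=200, workers=16):
    """Run a Python rack script many times in parallel.

    Returns the outputs that differ from what the table says,
    so an empty list means every run agreed.
    """
    expected = resolve_racks(hosts)

    def one_run(_):
        done = subprocess.run(
            [sys.executable, script_path, *hosts],
            capture_output=True, text=True, timeout=10,
        )
        return done.stdout.split()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(one_run, range(runs)))
    return [result for result in results if result != expected]


if __name__ == "__main__":
    # Print the rack paths for the addresses given on the command line.
    print(" ".join(resolve_racks(sys.argv[1:])))
```

If you save this as, say, `rack.py`, then `stress("rack.py", ["10.0.1.11", "10.0.2.21"])` should return an empty list. Any non-empty result means the script returned a missing, reordered, or wrong rack in at least one concurrent run, which is the kind of failure that can make placement look random.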
Sent from the desk of an overwhelmed engineer

-----Original message-----
From: Giovanni Marzulli <giovanni.marzu...@ba.infn.it>
To: hdfs-user@hadoop.apache.org
Sent: Fri, Mar 16, 2012 09:29:54 EDT
Subject: Re: Questions about HDFS's placement policy

On 15/03/2012 00:14, Suresh Srinivas wrote:
> See my comments inline:
>
> On Wed, Mar 14, 2012 at 9:24 AM, Giovanni Marzulli
> <giovanni.marzu...@ba.infn.it> wrote:
>
>     Hello,
>
>     I'm trying HDFS on a small test cluster and I need to clarify some
>     doubts about hadoop behaviour.
>
>     Some details of my cluster:
>     Hadoop version: 0.20.2
>     I have two racks (rack1, rack2). Three datanodes for every rack.
>     Replication factor is set to 3.
>
>     "HDFS’s placement policy is to put one replica on one node in the
>     local rack, another on a node in a different (remote) rack, and
>     the last on a different node in the same remote rack."
>     Instead, I noticed that sometimes, a few blocks of files are
>     stored as follows: two replicas in the local rack and a replica in
>     a different rack. Are there exceptions that cause different
>     behaviour than default placement policy?
>
>
> Your description of replica placement is correct. However a node
> chosen based on this placement may not be a good target, due to the
> traffic on the node, remaining space etc. See
> BlockPlacementPolicyDefault#isGoodTarget(). Given the small cluster
> size, you may be seeing different behavior based on load of individual
> nodes, racks etc.
>
>     Likewise, at times some blocks are read from nodes in the remote
>     rack instead of nodes in the local rack. Why does it happen?
>
>
> This is surprising. Not s