On 15/03/2012 00:14, Suresh Srinivas wrote:
See my comments inline:
On Wed, Mar 14, 2012 at 9:24 AM, Giovanni Marzulli
<giovanni.marzu...@ba.infn.it> wrote:
Hello,
I'm trying out HDFS on a small test cluster and I need to clear up
some doubts about Hadoop's behaviour.
Some details of my cluster:
Hadoop version: 0.20.2
I have two racks (rack1, rack2), with three datanodes in each rack.
Replication factor is set to 3.
"HDFS’s placement policy is to put one replica on one node in the
local rack, another on a node in a different (remote) rack, and
the last on a different node in the same remote rack."
Instead, I noticed that sometimes a few blocks of a file are
stored as follows: two replicas in the local rack and one replica in
a different rack. Are there exceptions that cause behaviour
different from the default placement policy?
Your description of replica placement is correct. However, a node
chosen based on this placement may not be a good target, due to the
traffic on the node, its remaining space, etc. See
BlockPlacementPolicyDefault#isGoodTarget(). Given the small cluster
size, you may be seeing different behavior based on the load of
individual nodes, racks, etc.
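
To illustrate how a rejected target can lead to the "two local, one
remote" layout, here is a minimal sketch in Python. It is a simplified
model, not Hadoop's code: the function choose_targets, the nodes
mapping and the is_good_target callback are hypothetical names, and the
fallback order is an assumption based on the description above.

```python
import random

def choose_targets(writer, nodes, is_good_target, rng=random):
    """Pick up to three replica targets for a block written from `writer`.

    nodes: dict mapping node name -> rack name (hypothetical layout).
    is_good_target: callable(node) -> bool, a stand-in for the load and
    free-space checks; not Hadoop's real signature.
    """
    chosen = []

    def pick(candidates):
        # Keep only candidates that are unused and pass the target check.
        pool = [n for n in candidates if n not in chosen and is_good_target(n)]
        if not pool:
            return None
        node = rng.choice(sorted(pool))
        chosen.append(node)
        return node

    def same_rack(rack):
        return [n for n in nodes if nodes[n] == rack]

    def other_racks(rack):
        return [n for n in nodes if nodes[n] != rack]

    # Replica 1: the writer itself, else its rack, else any node.
    first = pick([writer]) or pick(same_rack(nodes[writer])) or pick(nodes)
    if first is None:
        return chosen

    # Replica 2: a node on a different rack, else fall back to the same rack.
    second = pick(other_racks(nodes[first])) or pick(same_rack(nodes[first]))
    if second is None:
        return chosen

    # Replica 3: same rack as replica 2 when racks differ; any node if that fails.
    if nodes[first] == nodes[second]:
        pick(other_racks(nodes[first])) or pick(nodes)
    else:
        pick(same_rack(nodes[second])) or pick(nodes)
    return chosen
```

With three nodes per rack, if two of the remote rack's nodes are
rejected by the check, the third replica falls back to the writer's
rack, which gives two replicas locally and one remotely.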
Likewise, at times some blocks are read from nodes in the remote
rack instead of nodes in the local rack. Why does it happen?
This is surprising. Not sure if the topology is correctly configured.
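
For reference, in Hadoop 0.20 the rack of each node usually comes from
a script named by the topology.script.file.name property. Hadoop calls
it with one or more host names or IP addresses as arguments and reads
one rack path per argument from standard output. A minimal sketch of
such a script follows; the IP addresses and the RACKS table are made up
for illustration, and mapping unknown hosts to /default-rack is a
choice of this sketch.

```python
#!/usr/bin/env python
import sys

# Hypothetical host-to-rack table; replace with the real addresses.
RACKS = {
    "10.0.1.11": "/rack1",
    "10.0.1.12": "/rack1",
    "10.0.1.13": "/rack1",
    "10.0.2.11": "/rack2",
    "10.0.2.12": "/rack2",
    "10.0.2.13": "/rack2",
}
DEFAULT_RACK = "/default-rack"

def resolve(hosts):
    """Return one rack path per host, in the same order as the input."""
    return [RACKS.get(host, DEFAULT_RACK) for host in hosts]

def main(argv):
    # One rack path per argument, separated by spaces.
    print(" ".join(resolve(argv)))

if __name__ == "__main__":
    main(sys.argv[1:])
```

If a datanode registers under a name or address that the table does
not contain, it lands in the default rack, and rack-aware reads and
writes will then look wrong for that node.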
Another thing: if I have two datacenters with two racks in each of
them (so a hierarchical network topology), where are the two remote
replicas stored? Does Hadoop consider the hierarchy and store
one replica in the local datacenter and two replicas in the other
datacenter? Or are the two replicas stored in a totally random rack?
Hadoop clusters are not spread across datacenters.
When I mentioned datacenters, it was just an example. Let me rephrase the question.
If I have this network topology:
/rackA/rack1
/rackA/rack2
/rackB/rack3
/rackB/rack4
and I write a file from a node in rack2 (rackA), the first replica
will be stored in rack2; where will the other two replicas be stored?
In rackA, in rackB, or in a random rack? In other words, what is the
placement policy in a hierarchical network topology?
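
To make the question concrete, here is a small Python sketch of the
tree distance between nodes in such a topology: the number of hops
from each node up to their closest common ancestor. The node names
(n1, n2, n3) are hypothetical. The is_remote_rack helper encodes an
assumption, not a confirmed answer: that the policy compares full
location paths, so any rack other than /rackA/rack2 counts as remote.

```python
def distance(a, b):
    """Hops from node a and node b up to their closest common ancestor."""
    pa = a.strip("/").split("/")
    pb = b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)

def rack_of(node_path):
    """Everything before the last path component, e.g. /rackA/rack2."""
    return node_path.rsplit("/", 1)[0]

def is_remote_rack(writer, candidate):
    # Assumption: racks are compared by full path, with no notion of
    # "nearby" racks under the same parent.
    return rack_of(writer) != rack_of(candidate)

writer = "/rackA/rack2/n1"
print(distance(writer, "/rackA/rack2/n2"))  # same rack
print(distance(writer, "/rackA/rack1/n3"))  # sibling rack under rackA
print(distance(writer, "/rackB/rack3/n4"))  # rack under rackB
```

Under that assumption, rack1, rack3 and rack4 are all equally valid
"remote" racks for the second replica, even though their distances
from the writer differ.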
Regards,
Suresh