[ https://issues.apache.org/jira/browse/HDFS-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886993#action_12886993 ]
Joydeep Sen Sarma commented on HDFS-1094:
-----------------------------------------

:-) - of course I meant a node group = a fixed set of nodes, and of course replication is done across any three nodes within a node group (the node group being much larger than 3, of course). I understand there is a subtext here: we do want the replicas to span racks, and the writer writes one replica locally and one more within its own rack. That is all consistent with this scheme, though - as long as the node group spans multiple racks and two of the three replicas come from the writer's rack, we are good (see the placement sketch below).

I don't see how this is all that different from the protocol you listed, where you are choosing nodes randomly subject to some constraints. The main difference is that I am suggesting minimizing the number of node groups. Your protocol implicitly has a finite number of node groups, but I am trying to argue that it is much, much larger than it needs to be. The math is pretty obvious (though tedious); a rough version is sketched below. The reduction in loss probability comes from the small node group, but it is offset by the larger number of node groups. If one doesn't minimize the number of node groups, we start losing the benefits of this scheme.

Good point about racks with different numbers of nodes - let me think about it.

> Intelligent block placement policy to decrease probability of block loss
> -------------------------------------------------------------------------
>
>                 Key: HDFS-1094
>                 URL: https://issues.apache.org/jira/browse/HDFS-1094
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: name-node
>            Reporter: dhruba borthakur
>            Assignee: Rodrigo Schmidt
>         Attachments: prob.pdf, prob.pdf
>
> The current HDFS implementation specifies that the first replica is local and
> the other two replicas are on any two random nodes on a random remote rack.
> This means that if any three datanodes die together, then there is a
> non-trivial probability of losing at least one block in the cluster. This
> JIRA is to discuss if there is a better algorithm that can lower the
> probability of losing a block.
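For concreteness, here is a minimal, hypothetical Java sketch of the placement constraint described in the comment above: datanodes are statically partitioned into fixed, disjoint node groups that each span at least two racks, and all three replicas of a block are chosen from the writer's group (one on the writer, a second in the writer's rack, a third on a different rack). The DataNode record, class name, and group layout are illustrative assumptions only; this is not the actual HDFS BlockPlacementPolicy API.

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/**
 * Toy sketch of the node-group placement constraint (illustrative only; this is
 * not the HDFS BlockPlacementPolicy API). Datanodes are statically partitioned
 * into fixed, disjoint node groups that each span at least two racks, and all
 * three replicas of a block are chosen from the writer's group: one on the
 * writer, a second in the writer's rack, a third on a different rack.
 */
class NodeGroupPlacementSketch {

    /** Hypothetical datanode descriptor: a node id and its rack. */
    record DataNode(String id, String rack) {}

    private final List<List<DataNode>> groups;            // fixed, disjoint node groups
    private final Map<String, Integer> groupOfNode = new HashMap<>();
    private final Random random = new Random();

    /** Each group is assumed to span >= 2 racks, with >= 2 nodes in every rack it spans. */
    NodeGroupPlacementSketch(List<List<DataNode>> groups) {
        this.groups = groups;
        for (int g = 0; g < groups.size(); g++) {
            for (DataNode dn : groups.get(g)) {
                groupOfNode.put(dn.id(), g);
            }
        }
    }

    /** Choose three replica targets for a block whose writer runs on {@code writer}. */
    List<DataNode> chooseTargets(DataNode writer) {
        List<DataNode> group = groups.get(groupOfNode.get(writer.id()));

        // Split the writer's group into same-rack and remote-rack candidates.
        List<DataNode> sameRack = new ArrayList<>();
        List<DataNode> otherRacks = new ArrayList<>();
        for (DataNode dn : group) {
            if (dn.id().equals(writer.id())) {
                continue;
            }
            (dn.rack().equals(writer.rack()) ? sameRack : otherRacks).add(dn);
        }

        // Replica 1: local to the writer.
        // Replica 2: a random node in the writer's rack, still inside the group.
        // Replica 3: a random node on a different rack, still inside the group.
        List<DataNode> targets = new ArrayList<>();
        targets.add(writer);
        targets.add(sameRack.get(random.nextInt(sameRack.size())));
        targets.add(otherRacks.get(random.nextInt(otherRacks.size())));
        return targets;
    }
}
{code}

With N datanodes split into G = N/s disjoint groups of size s, every block's replica set is confined to one of at most G * C(s,3) node triples, which is what the probability sketch below quantifies.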
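And a back-of-envelope version of the "obvious but tedious" math, under assumptions not stated in the comment (N datanodes, replication factor 3, disjoint node groups of size s, G = N/s groups, B blocks, and three simultaneous failures chosen uniformly at random):

{noformat}
% Back-of-envelope sketch only; N, s, G and B are assumed symbols, not taken
% from the JIRA comment. A block is lost only if all three of its replicas sit
% on the three failed nodes.

% Replicas confined to G = N/s disjoint node groups of size s each:
\[
  \Pr[\text{some block lost} \mid 3\ \text{failures}]
    \;\le\; \Pr[\text{all 3 failed nodes fall in one group}]
    \;=\; \frac{G\binom{s}{3}}{\binom{N}{3}}
    \;\approx\; \frac{s^{2}}{N^{2}} .
\]

% More generally, the loss probability scales with the number of distinct
% 3-node replica sets actually occupied; the disjoint-group scheme caps that
% count at G * C(s,3), whereas schemes with many (overlapping) node groups let
% it grow toward B:
\[
  \Pr[\text{some block lost} \mid 3\ \text{failures}]
    \;\approx\; \frac{\#\{\text{occupied replica triples}\}}{\binom{N}{3}},
  \qquad
  \#\{\text{occupied replica triples}\} \;\le\; \min\!\Bigl(B,\ G\binom{s}{3}\Bigr).
\]
{noformat}

The first bound is where the small group size helps; the second line is the offset the comment describes - a placement scheme that implicitly uses many more node groups lets the number of occupied replica triples grow toward B, eroding the gain, which is why minimizing the number of node groups matters.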