[ https://issues.apache.org/jira/browse/HDFS-1094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886976#action_12886976 ]
Joydeep Sen Sarma commented on HDFS-1094:
-----------------------------------------

> - Pick random rack r2 that is within R racks from r
> - Pick random machine m2 in r2 that is within window [i, (i+M-1)%racksize]

a few points:

- dangerous to choose physically contiguous racks for node groups (because of correlated failures in consecutive racks). may make things a lot worse.

- if rack numbering is based on some arithmetic (so that logically contiguous is not physically contiguous) - then one has to reason about what happens when a new rack is added (i think it's ok to leave existing replicated data as is - but it's worth talking about this case. what would the rebalancer do here?)

- easy to reduce the overlap between node-groups (and thereby decrease loss probability): instead of [i, (i+M-1)%racksize] - choose [(i / (r/M))*r/M, (i / (r/M))*r/M + M-1] // fixed offset groups of M nodes each in a rack. (a small sketch contrasting the sliding-window and fixed-offset schemes is appended at the end of this message.)

i glanced through the Ceph algorithm (http://www.ssrc.ucsc.edu/Papers/weil-sc06.pdf) - it doesn't try to do what's described here. the number of placement groups is not controlled (and that is not an objective of the algo.).

a close analog of the problem here is seen in RAID arrays. when choosing to do parity + mirroring to tolerate multiple disk failures - one can choose to mirror RAID-parity groups or apply parity over RAID mirrored groups (1+5 vs. 5+1). it turns out 5+1 is a lot better from a data loss probability perspective. the reasoning and math are similar: both are susceptible to data loss on 4-disk failures - but in 5+1 the 4-disk failures have to be contained within 2 of the 2-disk mirrored pairs. this is combinatorially much harder than the 1+5 case - where the 4 disk failures only have to cause 2 failures in each of the two N/2-disk groups. (the second sketch at the end of this message counts the losing failure patterns for both layouts.)

> Intelligent block placement policy to decrease probability of block loss
> ------------------------------------------------------------------------
>
>                 Key: HDFS-1094
>                 URL: https://issues.apache.org/jira/browse/HDFS-1094
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: name-node
>            Reporter: dhruba borthakur
>            Assignee: Rodrigo Schmidt
>         Attachments: prob.pdf, prob.pdf
>
> The current HDFS implementation specifies that the first replica is local and
> the other two replicas are on any two random nodes on a random remote rack.
> This means that if any three datanodes die together, then there is a
> non-trivial probability of losing at least one block in the cluster. This
> JIRA is to discuss if there is a better algorithm that can lower the
> probability of losing a block.
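a quick sketch of the node-group point above (not HDFS code - just counting distinct groups). one assumption on my part: i'm reading "fixed offset groups of M nodes" as start = (i / M) * M with integer division; the expression above is written in terms of r and M, but the intent is the same - every node in a rack maps to one of racksize/M disjoint groups instead of starting its own sliding window:

{code:java}
import java.util.HashSet;
import java.util.Set;

public class NodeGroupCount {
  public static void main(String[] args) {
    int racksize = 20; // hypothetical nodes per rack
    int M = 5;         // nodes per group

    Set<String> sliding = new HashSet<>();
    Set<Integer> fixed = new HashSet<>();
    for (int i = 0; i < racksize; i++) {
      // sliding window [i, (i+M-1)%racksize]: every node starts its own group
      StringBuilder w = new StringBuilder();
      for (int k = 0; k < M; k++) {
        w.append((i + k) % racksize).append(',');
      }
      sliding.add(w.toString());

      // fixed offset: node i belongs to the group starting at (i / M) * M
      fixed.add((i / M) * M);
    }
    System.out.println("sliding windows: " + sliding.size() + " distinct groups"); // 20
    System.out.println("fixed offsets:   " + fixed.size() + " distinct groups");   // 4
  }
}
{code}

fewer distinct groups means fewer distinct replica sets, which is what lowers the probability that some small set of simultaneous node failures wipes out all copies of a block.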
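and a toy enumeration for the RAID analogy (again not HDFS code, and a disk-level approximation: for the mirrored parity groups i count data as lost whenever each copy has >= 2 failed disks, ignoring stripe alignment; N = 12 is an arbitrary choice):

{code:java}
public class RaidLossCount {
  static final int N = 12; // toy array of 12 disks

  // "5+1" as used above: parity over 2-disk mirrored pairs.
  // pair p = disks {2p, 2p+1}; data lost iff 2 pairs fail completely.
  static boolean lossParityOverMirrors(int a, int b, int c, int d) {
    int[] perPair = new int[N / 2];
    perPair[a / 2]++; perPair[b / 2]++; perPair[c / 2]++; perPair[d / 2]++;
    int fullPairs = 0;
    for (int x : perPair) if (x == 2) fullPairs++;
    return fullPairs >= 2;
  }

  // "1+5" as used above: two mirrored RAID-5 copies of N/2 disks each.
  // data lost iff at least 2 failures land in each copy.
  static boolean lossMirroredParityGroups(int a, int b, int c, int d) {
    int inFirstCopy = 0;
    for (int x : new int[] {a, b, c, d}) if (x < N / 2) inFirstCopy++;
    return inFirstCopy >= 2 && (4 - inFirstCopy) >= 2;
  }

  public static void main(String[] args) {
    int total = 0, pairLoss = 0, groupLoss = 0;
    for (int a = 0; a < N; a++)
      for (int b = a + 1; b < N; b++)
        for (int c = b + 1; c < N; c++)
          for (int d = c + 1; d < N; d++) {
            total++;
            if (lossParityOverMirrors(a, b, c, d)) pairLoss++;
            if (lossMirroredParityGroups(a, b, c, d)) groupLoss++;
          }
    // expected: 15 of the 495 four-disk failure sets lose data under 5+1,
    // versus 225 of 495 under 1+5 - a 15x difference.
    System.out.println(total + " four-disk failure sets");
    System.out.println("5+1 loses data in: " + pairLoss);
    System.out.println("1+5 loses data in: " + groupLoss);
  }
}
{code}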