[ https://issues.apache.org/jira/browse/HDFS-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829860#action_12829860 ]
Konstantin Shvachko commented on HDFS-898:
------------------------------------------

h2. Approach #2

Here is another idea, initiated by Rob Chansler, which will potentially let us free all positive longs for future block id generation. Let's set 8 or fewer high bits of existing block ids to 1. There is a high probability that there will be no id collisions in the resulting set.

More formally: we have a collection of blocks {b~i~}, and we build a new collection {c~i~ = Mask | b~i~}, where Mask = ff00 0000 0000 0000. Collection {b~i~} does not have repetitions, and there is a very good chance that collection {c~i~} does not have them either. The intuition here is that if you have randomly distributed points in n-dimensional space and project them into an (n-1)-dimensional subspace, say along one of the axes, the probability that two points will be projected onto the same image is low.

I am attaching a document which derives a formula for the probability. The formula was independently confirmed by Nicholas. The bottom line is this:
- The probability of getting a collision when setting the 8 highest bits to 1 is < 0.03065
- The probability of getting a collision when setting only one high bit to 1 is < 0.0001221

From a practical point of view it is enough to set at least the single highest bit. Then we'll free the entire segment of positive longs for new block id generation.

I implemented a block id analyzing routine, which combines the two approaches described in the issue. The routine reads the image using OIV and first tries to reset bits, starting from the 8 highest and going down to the single highest bit whenever resetting more bits leads to collisions. If the routine fails to reset any bits collision-free, it falls back to the initial approach, that is, it just finds the biggest free gap in the existing block ids. If the bit reset approach succeeds, then block replicas need to be renamed on data-nodes. This will be done via an upgrade.
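The bit-reset routine and the probability bound above can be sketched roughly as follows. This is a minimal illustration, not the actual OIV-based tool: the class and method names are hypothetical, the ids are passed in as a plain array rather than read from an fsimage, and the birthday-style formula in collisionBound is my own approximation, which reproduces the quoted probabilities if one assumes roughly 2^26 (about 67 million) existing blocks.

```java
import java.util.HashSet;
import java.util.Set;

public class BlockIdMasker {

    /**
     * Try masks from the 8 highest bits down to a single highest bit,
     * returning the widest mask whose projection c_i = mask | b_i is
     * collision-free, or 0 if not even one bit can be set safely.
     * (The real routine reads block ids from the image via OIV; here
     * they are simply passed in as an array.)
     */
    static long findCollisionFreeMask(long[] blockIds) {
        for (int bits = 8; bits >= 1; bits--) {
            // e.g. bits = 8 gives 0xFF00000000000000, bits = 1 gives 0x8000000000000000
            long mask = ((1L << bits) - 1) << (64 - bits);
            Set<Long> projected = new HashSet<>(blockIds.length * 2);
            boolean collision = false;
            for (long id : blockIds) {
                if (!projected.add(mask | id)) { // add() is false on a duplicate
                    collision = true;
                    break;
                }
            }
            if (!collision) {
                return mask;
            }
        }
        return 0L; // no safe mask: fall back to finding the biggest free gap
    }

    /**
     * Birthday-style collision bound (my approximation, not the attached
     * derivation) for setting the top `bits` bits of n distinct ids: a pair
     * collides iff the two ids differ only in those bits, so
     * lambda = n(n-1)/2 * (2^bits - 1) / 2^64 and P is about 1 - exp(-lambda).
     */
    static double collisionBound(long n, int bits) {
        double lambda = (double) n * (n - 1) / 2.0
                * ((1L << bits) - 1) / Math.pow(2.0, 64);
        return 1.0 - Math.exp(-lambda);
    }
}
```

Under the assumed n = 2^26, collisionBound(n, 8) comes out near 0.0306 and collisionBound(n, 1) near 0.000122, consistent with the bounds quoted above.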
During the upgrade, data-nodes hardlink replicas into the new directory (automatically handled by the upgrade framework), and then rename the newly created links to reflect the new block ids (the part specific to this upgrade). In order to avoid any mishaps I also propose to assign a new generation stamp to all renamed blocks, so that data-nodes that missed the upgrade will not report old blocks when they wake up. Strictly speaking the latter is impossible, because data-nodes cannot join the cluster until they upgrade, but we don't want to take any chances.

I tested the routine on 6 images so far, containing from 150,000 to 40,000,000 blocks. All of them successfully tolerated the reset of 8 bits without ANY collisions.

I would like to ask the community for some help. If people could run the tool on their images and post some stats, or just send the results to me, I'll take care of summarizing them.

> Sequential generation of block ids
> ----------------------------------
>
>                 Key: HDFS-898
>                 URL: https://issues.apache.org/jira/browse/HDFS-898
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: name-node
>    Affects Versions: 0.20.1
>            Reporter: Konstantin Shvachko
>            Assignee: Konstantin Shvachko
>             Fix For: 0.22.0
>
>         Attachments: HighBitProjection.pdf
>
>
> This is a proposal to replace random generation of block ids with a
> sequential generator in order to avoid block id reuse in the future.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.