[ https://issues.apache.org/jira/browse/HDFS-898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829860#action_12829860 ]

Konstantin Shvachko commented on HDFS-898:
------------------------------------------

h2. Approach #2

Here is another idea initiated by Rob Chansler, which will potentially let us 
free all positive longs for future block id generation.
Let's set 8 or fewer of the high bits of existing block ids to 1. There is a 
high probability that there will be no id collisions in the resulting set.
More formally, we have a collection of blocks <b ~i~>, and we build a new 
collection <c ~i~ = Mask | b ~i~>, where Mask = ff00 0000 0000 0000. Collection 
<b ~i~> does not have repetitions, and there is a very good chance that 
collection <c ~i~> does not have them either.
The intuition here is that if you have randomly distributed points in 
n-dimensional space and project them into an (n-1)-dimensional sub-space, say 
along one of the axes, the probability that two points will be projected onto 
the same image is low.
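
To make the projection concrete, here is a minimal sketch in Java of the mask 
and the collision check. The class and method names are made up for 
illustration; the actual analysis tool described below works off an OIV image, 
not an in-memory array.

{code:java}
import java.util.HashSet;
import java.util.Set;

public class BlockIdProjection {
  // Mask setting the 8 highest bits to 1, as described above.
  static final long MASK_8_BITS = 0xff00000000000000L;
  // Mask setting only the single highest (sign) bit to 1.
  static final long MASK_1_BIT  = 0x8000000000000000L;

  /**
   * Projects every block id b_i to c_i = mask | b_i and reports
   * whether the projected collection is still collision-free.
   */
  static boolean isCollisionFree(long[] blockIds, long mask) {
    Set<Long> projected = new HashSet<Long>(blockIds.length * 2);
    for (long b : blockIds) {
      if (!projected.add(mask | b)) {
        return false;   // two ids mapped to the same image
      }
    }
    return true;
  }
}
{code}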

I am attaching a document which derives a formula for the probability. The 
formula was independently confirmed by Nicholas. The bottom line is this:
- The probability of getting a collision when setting the first 8 bits to 1 is 
< 0.03065
- The probability of getting a collision when setting only one high bit to 1 is 
< 0.0001221
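
For intuition only (the exact derivation is in the attached document, and the 
numbers above come from it, not from this estimate): for two distinct, 
uniformly random 64-bit ids, the probability that they become equal after the 
k highest bits are forced to 1 is (2^k - 1) / (2^64 - 1). By a union bound, 
the probability of any collision among n blocks is therefore at most 
n*(n-1)/2 * (2^k - 1) / (2^64 - 1), which stays small even for tens of 
millions of blocks, in line with the bounds quoted above.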

From a practical point of view it is enough to set just the single highest 
bit. Then we free the entire segment of positive longs for new block id 
generation.
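
A minimal sketch of what a sequential generator over the freed positive range 
could look like, purely for illustration (the class name and the starting 
point are assumptions, not a proposed patch):

{code:java}
import java.util.concurrent.atomic.AtomicLong;

/** Hands out block ids sequentially from the positive long range. */
public class SequentialBlockIdGenerator {
  // After the rename all existing ids have the highest bit set, i.e.
  // they are negative, so positive longs are guaranteed to be fresh.
  private final AtomicLong lastId = new AtomicLong(0);

  public long nextBlockId() {
    return lastId.incrementAndGet();   // 1, 2, 3, ...
  }
}
{code}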

I implemented a block id analyzing routine, which combines the two approaches 
described in this issue. The routine reads the image using OIV and first tries 
to reset bits, starting from the 8 highest bits and backing off toward the 
single highest bit whenever resetting more bits leads to collisions. If the 
routine fails to reset any bits collision-free, then it uses the initial 
approach, that is, it just finds the biggest free gap in the existing block ids.
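
This is not the routine itself, just a sketch of the selection logic it 
follows as described above. OIV parsing and the free-gap fallback are omitted, 
and the helper names are invented; the sort-based check is one way to keep 
memory modest for large images.

{code:java}
import java.util.Arrays;

public class BlockIdMaskSelector {

  /**
   * Tries masks from the 8 highest bits down to the single highest bit
   * and returns the first mask that keeps the projected ids
   * collision-free, or 0 if none does (the caller then falls back to
   * the largest-free-gap approach).
   */
  static long chooseMask(long[] blockIds) {
    for (int bits = 8; bits >= 1; bits--) {
      long mask = -1L << (64 - bits);   // 'bits' highest bits set to 1
      if (isCollisionFree(blockIds, mask)) {
        return mask;
      }
    }
    return 0L;
  }

  static boolean isCollisionFree(long[] blockIds, long mask) {
    long[] projected = new long[blockIds.length];
    for (int i = 0; i < blockIds.length; i++) {
      projected[i] = mask | blockIds[i];
    }
    Arrays.sort(projected);
    for (int i = 1; i < projected.length; i++) {
      if (projected[i] == projected[i - 1]) {
        return false;                   // duplicate after projection
      }
    }
    return true;
  }
}
{code}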

If the bit reset approach succeeds, then block replicas need to be renamed on 
data-nodes. This will be done via an upgrade: during the upgrade data-nodes 
hard-link replicas into the new directory (handled automatically by the 
upgrade framework), and then rename the newly created links to reflect the new 
block ids (the part specific to this upgrade).
In order to avoid any mishaps I also propose to assign a new generation stamp 
to all renamed blocks, so that data-nodes that missed the upgrade will not 
report old blocks when they wake up. The latter is impossible anyway, because 
data-nodes cannot join the cluster until they upgrade, but we don't want to 
take any chances.
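
For illustration, a minimal sketch of the rename step on the data-node side, 
assuming the usual blk_<id> and blk_<id>_<genstamp>.meta replica file names. 
The method signature is an assumption of this sketch, and the hard-linking 
itself is done by the upgrade framework before this step runs.

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReplicaRenamer {

  /**
   * Renames the already hard-linked replica files in the upgraded
   * storage directory so that they reflect the new block id and the
   * new generation stamp.
   */
  static void renameReplica(Path curDir, long oldId, long oldGS,
                            long newId, long newGS) throws IOException {
    // blk_<oldId>  ->  blk_<newId>
    Files.move(curDir.resolve("blk_" + oldId),
               curDir.resolve("blk_" + newId));
    // blk_<oldId>_<oldGS>.meta  ->  blk_<newId>_<newGS>.meta
    Files.move(curDir.resolve("blk_" + oldId + "_" + oldGS + ".meta"),
               curDir.resolve("blk_" + newId + "_" + newGS + ".meta"));
  }
}
{code}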

I tested the routine on 6 images so far, containing from 150,000 to 40,000,000 
blocks. All of them successfully tolerated the reset of 8 bits without ANY 
collisions.

I would like to ask the community for some help. If people could run the tool 
on their images and post some stats, or just send the results to me, I'll take 
care of summarizing them.


> Sequential generation of block ids
> ----------------------------------
>
>                 Key: HDFS-898
>                 URL: https://issues.apache.org/jira/browse/HDFS-898
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: name-node
>    Affects Versions: 0.20.1
>            Reporter: Konstantin Shvachko
>            Assignee: Konstantin Shvachko
>             Fix For: 0.22.0
>
>         Attachments: HighBitProjection.pdf
>
>
> This is a proposal to replace random generation of block ids with a 
> sequential generator in order to avoid block id reuse in the future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
