Re: [jira] Commented: (HADOOP-146) potential conflict in block id's, leading to data corruption

Eric Baldeschwieler Wed, 19 Apr 2006 09:14:35 -0700

we might want to do something generic to support future migrations, where you boot a new version of the FS next to the old one and then just move files across. If we can manage to use hardlinks to migrate data block files, it should go fast...
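
A minimal sketch of the hardlink idea, assuming the old and new data directories live on the same filesystem (hard links cannot cross filesystems) and a Java runtime that provides java.nio.file (Java 7 or later). The class name HardlinkMigration and the method linkBlocks are hypothetical and not part of Hadoop.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

class HardlinkMigration {
    /**
     * Hard-links every regular file in oldDir into newDir under the same
     * file name and returns the number of links created. No block data is
     * copied; each link is just a second directory entry for the same file.
     */
    static int linkBlocks(Path oldDir, Path newDir) throws IOException {
        Files.createDirectories(newDir);
        int linked = 0;
        try (DirectoryStream<Path> blocks = Files.newDirectoryStream(oldDir)) {
            for (Path block : blocks) {
                if (!Files.isRegularFile(block)) {
                    continue; // skip subdirectories and special files
                }
                // createLink(link, existing): the new entry goes first
                Files.createLink(newDir.resolve(block.getFileName()), block);
                linked++;
            }
        }
        return linked;
    }
}
```

Since only directory entries are created, the cost should scale with the number of block files rather than their size, which is why this ought to go fast.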


On Apr 18, 2006, at 4:39 PM, Doug Cutting (JIRA) wrote:

[ http://issues.apache.org/jira/browse/HADOOP-146?page=comments#action_12375011 ]
Doug Cutting commented on HADOOP-146:
-------------------------------------
I'd vote for sequential allocation. It will take a *really* long time to cycle through all ids. Migration should not be expensive, since it just requires renaming block files, not copying them. The high-watermark block id can be logged with the block->name table.
Here's one way to migrate: initially the high-water-mark id is zero. So all blocks in the name table are out-of-range, and hence need renaming. Renaming can be handled like other block work: the namenode can give datanodes rename commands. While a block is being renamed it must be kept in side tables, so that, e.g., requests to read files whose blocks are partially renamed can still be handled.
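The migration described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class BlockRenamer and all of its methods are hypothetical, sequential ids are assumed to start at 1, and the sketch ignores the unlikely case of a legacy random id that happens to fall inside the already-issued sequential range.

```java
import java.util.HashMap;
import java.util.Map;

class BlockRenamer {
    // Highest sequential id issued so far; zero means none issued yet.
    private long highWaterMark = 0;
    // Side table: old id -> new id for renames that are still in flight.
    private final Map<Long, Long> renaming = new HashMap<>();

    /** An id is in range only if 0 < id <= highWaterMark. */
    synchronized boolean needsRename(long id) {
        return id <= 0 || id > highWaterMark;
    }

    /** Reserves the next sequential id and records the pending rename. */
    synchronized long beginRename(long oldId) {
        long newId = ++highWaterMark;
        renaming.put(oldId, newId);
        return newId;
    }

    /**
     * Drops the side-table entry. The caller is assumed to have already
     * switched the name table to the new id.
     */
    synchronized void finishRename(long oldId) {
        renaming.remove(oldId);
    }

    /** Ids under which a reader may find the block during a rename. */
    synchronized long[] candidateIds(long id) {
        Long newId = renaming.get(id);
        return newId == null ? new long[] {id} : new long[] {id, newId};
    }
}
```

With the high-water mark at zero, needsRename returns true for every existing block, which matches the starting condition of the migration. While a rename is in flight, a read request can try both ids returned by candidateIds.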
potential conflict in block id's, leading to data corruption
------------------------------------------------------------

         Key: HADOOP-146
         URL: http://issues.apache.org/jira/browse/HADOOP-146
     Project: Hadoop
        Type: Bug
  Components: dfs
    Versions: 0.1.0, 0.1.1
    Reporter: Yoram Arnon
    Assignee: Konstantin Shvachko
     Fix For: 0.3
currently, block id's are generated randomly, and are not tested for collisions with existing id's.
while ids are 64 bits, given enough time and a large enough FS, collisions are expected.
when a collision occurs, a random subset of blocks with that id will be removed as extra replicas, and the contents of that portion of the containing file are one random version of the block.
to solve this one could check for id collision when creating a new block, getting a new id in case of conflict. This approach requires the name node to keep track of all existing block id's (rather than just the ones who have reported in), and to identify old versions of a block id as invalid (in case a data node dies, a file is deleted, then a block id is reused for a new file).
Alternatively, one could simply use sequential block id's. Here the downsides are:
1. migration from an existing file system is hard, requiring compaction of the entire FS
2. once you cycle through 64 bits of id's (quite a few years at full blast), you're in trouble again (or run occasional/background compaction)
3. you must never lose the high watermark block id.
    synchronized Block allocateBlock(UTF8 src) {
        Block b = new Block();
        FileUnderConstruction v = (FileUnderConstruction) pendingCreates.get(src);
        v.add(b);
        pendingCreateBlocks.add(b);
        return b;
    }

    static Random r = new Random();

    /**
     */
    public Block() {
        this.blkid = r.nextLong();
        this.len = 0;
    }
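
For contrast with the random r.nextLong() id above, sequential allocation might look roughly like this. It is a sketch, not the eventual fix: the class SequentialBlockIdAllocator is hypothetical, and it assumes the caller persists the high-water mark (for example alongside the block->name table) and passes the recovered value back in on restart.

```java
class SequentialBlockIdAllocator {
    // Highest id handed out so far; this value must never be lost.
    private long highWaterMark;

    /** recovered is the last persisted high-water mark, or 0 for a new FS. */
    SequentialBlockIdAllocator(long recovered) {
        this.highWaterMark = recovered;
    }

    /** Returns the next unused id. */
    synchronized long nextId() {
        // addExact throws ArithmeticException on overflow instead of
        // silently wrapping around to ids that are already in use.
        highWaterMark = Math.addExact(highWaterMark, 1L);
        return highWaterMark;
    }

    /** The value that has to be logged durably. */
    synchronized long highWaterMark() {
        return highWaterMark;
    }
}
```

As a rough sense of scale, at a hypothetical rate of one million allocations per second the positive half of the 64-bit space (2^63 ids) would last on the order of 290,000 years.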
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
