----- Original Message -----
| Yes, but the undo side of things worries me... it is very easy to get
| tied in knots doing that. The question is what is "damage it can't
| recover from"? this is a bit vague and doesn't really explain what is
| going on here.
|
| I don't yet understand why we'd need to run through each inodes metadata
| tree more than once in this case,
|
| Steve.
Hi,

One thing to bear in mind is that the fsck blockmap is supposed to represent the correct state of all blocks in the on-disk bitmap. The job of pass1 is to build the blockmap, which starts out entirely "free"; as the metadata is traversed, the blocks are filled in with the appropriate types. The job of pass5 is to synchronize the on-disk bitmap to the blockmap, so we must ensure that the blockmap is accurate at ALL times after pass1. One of the primary checks pass1 does is to make sure that a block is "free" in the blockmap before changing its designation; otherwise it's a duplicate block reference that must be resolved in pass1b.

Here's an example. Suppose you have a file with di_height == 2, that is, two levels of indirection, and suppose the dinode is laid out something like this:

dinode      indirect    data
------      --------    ------
0x1000 ---> 0x1001 ---> 0x1002
                   ---> 0x1003
                        ...
                   ---> 0x1010
       ---> 0x1011 ---> 0x1012
                   ---> 0x1013
                        ...
                   ---> 0x1020
       ---> 0x1021 ---> 0x1022
                   ---> 0x1023
                   ---> 0x7777777777777777777
                   ---> 0x1025
                        ...
                   ---> 0x1030

Now let's further suppose that this file was supposed to be deleted, and many of its blocks were in fact reused by a newer, valid dinode, but somehow the bitmap was corrupted into saying this dinode is still alive (a dinode, not free or unlinked). For the sake of argument, say that second dinode appears later in the bitmap, so pass1 gets to the corrupt dinode 0x1000 before it gets to the valid dinode that correctly references the blocks.

As pass1 traverses the metadata tree, it builds an array of lists, one for each height; each item in a list corresponds to a metadata block. Pass1 then walks that array, marking down in its blockmap that block 0x1000 is a dinode and that blocks 0x1001, 0x1011, and 0x1021 are metadata blocks. Then it processes the data block pointers within the metadata blocks, marking 0x1002, 0x1003, and so on, all the way up to 0x1023, as "data" blocks. When it hits the block 0x7777777777777777777, it determines that the pointer is out of range for the device, and therefore the file has an unrecoverable data block error. At this point, it doesn't make sense to continue marking 0x1025 and beyond as referenced data blocks, because that would only make matters worse.

Now we've got a problem: before we knew 0x1000 was corrupt, we marked all its references in the blockmap, and we can't just delete the corrupt dinode because most of its blocks are in use by that other dinode.

One strategy is to keep the blocks previously marked "data" and "meta" as they are in the blockmap, mark the dinode itself as "invalid dinode" in the blockmap, and move along. Later, when we get to the other, valid dinode, we'll see potentially tens of thousands of duplicate references. Assuming we have enough memory to record all those references, and enough time to resolve them, they can all be checked in pass1b and resolved properly, because we marked the first dinode as "invalid" (we favor the valid reference). The problems with this strategy are that (1) it takes lots of time and memory to record and resolve all those duplicate references, and (2) when fsck gets to pass5, the blocks that AREN'T referenced elsewhere are still set to "data" in the blockmap, so pass5 will set the bitmap accordingly. A subsequent run of fsck.gfs2 will then determine that no valid dinode references those data blocks, and it will complain about blocks improperly marked as "data" that should, in fact, be "free". This is bad behavior: a second run of fsck.gfs2 should come up clean.
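To make the pass1 marking rule concrete before getting to the fix, here is a minimal, self-contained sketch in C. It is NOT the actual fsck.gfs2 code; every name in it (mark_block, blockmap, BLK_FREE, FS_BLOCKS, and so on) is invented for illustration, and a 16-digit constant stands in for the oversized pointer above, which doesn't fit in 64 bits:

#include <stdio.h>
#include <stdint.h>

enum blk_state { BLK_FREE, BLK_DINODE, BLK_META, BLK_DATA };

#define FS_BLOCKS 0x2000U                   /* pretend device size, in blocks */
static enum blk_state blockmap[FS_BLOCKS];  /* starts out entirely "free" */
static unsigned dup_refs;                   /* stand-in for pass1b's duplicate list */

/* Designate one block in the blockmap, enforcing the two checks
 * described above: the pointer must be in range for the device, and
 * the block must currently be "free" in the blockmap; otherwise it's
 * a duplicate reference that must be remembered for pass1b. */
static int mark_block(uint64_t blk, enum blk_state state)
{
    if (blk >= FS_BLOCKS) {
        printf("0x%llx is out of range: unrecoverable block error\n",
               (unsigned long long)blk);
        return -1;              /* caller must stop marking this dinode */
    }
    if (blockmap[blk] != BLK_FREE) {
        dup_refs++;             /* remember it; resolve later in pass1b */
        return 0;
    }
    blockmap[blk] = state;
    return 0;
}

int main(void)
{
    /* The corrupt dinode from the example above (abbreviated). */
    mark_block(0x1000, BLK_DINODE);
    mark_block(0x1001, BLK_META);           /* first indirect block */
    mark_block(0x1002, BLK_DATA);
    mark_block(0x1003, BLK_DATA);

    /* The garbage pointer, trimmed to fit in 64 bits. */
    if (mark_block(0x7777777777777777ULL, BLK_DATA) < 0)
        printf("stop: dinode 0x1000's earlier marks must now be dealt with\n");

    /* Later, the valid dinode references 0x1002 again... */
    mark_block(0x1002, BLK_DATA);
    printf("%u duplicate reference(s) recorded for pass1b\n", dup_refs);
    return 0;
}

The -1 return is exactly the point where the "what do we do with the marks we already made?" question arises.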
So to prevent this from happening, pass1, upon discovering the out-of-range block, makes an effort to "undo" its blockmap designations: it traverses the dinode's metadata tree once more, but this time sets the blocks back to "free". Well, not quite all of them, because if the invalid dinode referenced blocks that were encountered previously, pass1 will have recorded them as duplicate references, so it has to "undo" that designation as well, instead of freeing blocks that an earlier reference legitimately owns (see the sketch below my signature). That second traversal is why pass1 may need to run through a dinode's metadata tree more than once.

Another alternative is to pre-check every block for all possible types of corruption. This involves making two passes through the metadata: the first pass verifies that every block is valid and that there are absolutely no problems with the data or metadata; the second pass marks all the blocks in the blockmap with their appropriate types.

Yes, it gets very sticky, very messy. That's why it's taken so long to get it right.

Regards,

Bob Peterson
Red Hat File Systems
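P.S. Here is the promised sketch of the undo walk. Again, this is NOT the real fsck.gfs2 implementation; the structures and names (blk_ref, heights, undo_dinode, drop_duplicate_record) are made up purely to illustrate the one subtlety described above: a block that was recorded as a duplicate must NOT be set back to "free", because its current blockmap designation belongs to another reference.

#include <stdio.h>
#include <stdint.h>

enum blk_state { BLK_FREE, BLK_DINODE, BLK_META, BLK_DATA };

#define FS_BLOCKS 0x2000U
static enum blk_state blockmap[FS_BLOCKS];

struct blk_ref {            /* one designation made for this dinode */
    uint64_t addr;
    int was_duplicate;      /* mark_block() found it already in use */
    struct blk_ref *next;
};

static void drop_duplicate_record(uint64_t addr)
{
    /* Stub: a real tool would remove the pass1b bookkeeping here. */
    printf("dropping duplicate record for 0x%llx\n",
           (unsigned long long)addr);
}

/* Re-walk the per-height lists pass1 built for the bad dinode and
 * reverse each designation.  heights[] holds one list per level of the
 * metadata tree, the same structure pass1 built on the first pass. */
static void undo_dinode(struct blk_ref *heights[], int di_height)
{
    for (int h = 0; h < di_height; h++)
        for (struct blk_ref *r = heights[h]; r; r = r->next) {
            if (r->was_duplicate)
                drop_duplicate_record(r->addr); /* other reference keeps it */
            else
                blockmap[r->addr] = BLK_FREE;   /* nobody else owns it */
        }
}

int main(void)
{
    /* Two designations to unwind: an ordinary data block, and one that
     * collided with a block some other dinode had already claimed. */
    struct blk_ref b = { 0x1003, 1, NULL };     /* duplicate reference */
    struct blk_ref a = { 0x1002, 0, &b };       /* ordinary data block */
    struct blk_ref *heights[1] = { &a };

    blockmap[0x1002] = BLK_DATA;                /* what pass1 had marked */
    blockmap[0x1003] = BLK_DATA;                /* owned by the other dinode */
    undo_dinode(heights, 1);
    printf("0x1002 is now %s; 0x1003 is %s\n",
           blockmap[0x1002] == BLK_FREE ? "free" : "still marked",
           blockmap[0x1003] == BLK_FREE ? "free" : "still marked");
    return 0;
}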