Re: [Cluster-devel] [gfs2-utils PATCH 24/47] fsck.gfs2: Rework the undo functions

2013-05-16 Thread Steven Whitehouse
Hi,

This sounds to me like we are doing things in the wrong order. We
shouldn't need to undo things that have been done, otherwise we'll just
land up in a tangle,

Steve.

On Tue, 2013-05-14 at 11:21 -0500, Bob Peterson wrote:
 In pass1, fsck.gfs2 traverses the metadata tree, processing each dinode and
 marking which blocks are used by that dinode. If a dinode is found
 to have unrecoverable errors, it does a bunch of work to undo the
 things it did. This is especially important for the processing of
 duplicate block references. Suppose dinode X references block 1234.
 Later in pass1, suppose a different dinode, Y, also references block
 1234. This is flagged as a duplicate block reference. Still later,
 suppose pass1 determines dinode Y is bad. Now it has to undo the
 work it did. It needs to properly unmark the data and metadata
 blocks it marked as no longer free so that valid references that
 follow aren't flagged as duplicate references. At the same time,
 it needs to unflag block 1234 as a duplicate reference as well, so
 that dinode X's original reference is still considered valid.
 
 Before this patch, fsck.gfs2 was trying to traverse the entire
 metadata tree for the bad dinode, trying to undo the designations.
 That becomes a huge problem if the damage was discovered in the
 middle of the metadata, in which case it may never have flagged any
 of the data blocks as in use as data in its blockmap. Undoing the
 designations sometimes resulted in blocks being improperly marked as
 free when they were, in fact, referenced by other valid dinodes.
 
 For example, suppose corrupt dinode Y references metadata blocks
 1234, 1235, and 1236. Now suppose a serious problem is found as part
 of its processing of block 1234, and so it stops its metadata tree
 traversal there. Metadata blocks 1235 and 1236 are still listed as
 metadata for the bad dinode, but if we traverse the entire tree,
 those two blocks may be improperly processed. If another dinode
 actually uses blocks 1235 or 1236, the improper undo processing
 of those two blocks can screw up the valid references.
 
 This patch reworks the undo functions so that the undo functions
 don't get called on the entire metadata and data of the defective
 dinode. Instead, only the metadata and data blocks queued onto the
 metadata list are processed. This should ensure that the undo
 functions only operate on blocks that were processed in the first
 place.
 
 rhbz#902920
 ---
  gfs2/fsck/metawalk.c | 109 ++--
  gfs2/fsck/metawalk.h |   4 ++
  gfs2/fsck/pass1.c    | 172 ---
  3 files changed, 135 insertions(+), 150 deletions(-)
 
 diff --git a/gfs2/fsck/metawalk.c b/gfs2/fsck/metawalk.c
 index d1b12f1..b9d9f89 100644
 --- a/gfs2/fsck/metawalk.c
 +++ b/gfs2/fsck/metawalk.c
 @@ -1259,7 +1259,7 @@ static int build_and_check_metalist(struct gfs2_inode *ip, osi_list_t *mlp,
 		if (err < 0) {
 			stack;
 			error = err;
 -			goto fail;
 +			return error;
 		}
 		if (err > 0) {
 			if (!error)
 @@ -1278,14 +1278,11 @@ static int build_and_check_metalist(struct gfs2_inode *ip, osi_list_t *mlp,
 			}
 			if (!nbh)
 				nbh = bread(ip->i_sbd, block);
 -			osi_list_add(&nbh->b_altlist, cur_list);
 +			osi_list_add_prev(&nbh->b_altlist, cur_list);
 		} /* for all data on the indirect block */
 		} /* for blocks at that height */
 	} /* for height */
 -	return error;
 -fail:
 -	free_metalist(ip, mlp);
 -	return error;
 +	return 0;
  }
  
  /**
 @@ -1331,6 +1328,27 @@ static int check_data(struct gfs2_inode *ip, struct metawalk_fxns *pass,
 	return error;
  }
  
 +static int undo_check_data(struct gfs2_inode *ip, struct metawalk_fxns *pass,
 +			   uint64_t *ptr_start, char *ptr_end)
 +{
 +	int rc = 0;
 +	uint64_t block, *ptr;
 +
 +	/* If there isn't much pointer corruption check the pointers */
 +	for (ptr = ptr_start; (char *)ptr < ptr_end && !fsck_abort; ptr++) {
 +		if (!*ptr)
 +			continue;
 +
 +		if (skip_this_pass || fsck_abort)
 +			return 1;
 +		block = be64_to_cpu(*ptr);
 +		rc = pass->undo_check_data(ip, block, pass->private);
 +		if (rc < 0)
 +			return rc;
 +	}
 +	return 0;
 +}
 +
 +
  static int hdr_size(struct gfs2_buffer_head *bh, int height)
  {
 	if (height > 1) {
 @@ -1363,6 +1381,7 @@ int check_metatree(struct gfs2_inode *ip, struct metawalk_fxns *pass)
 	int  i, 

Re: [Cluster-devel] [gfs2-utils PATCH 24/47] fsck.gfs2: Rework the undo functions

2013-05-16 Thread Bob Peterson
- Original Message -
| Hi,
| 
| This sounds to me like we are doing things in the wrong order. We
| shouldn't need to undo things that have been done, otherwise we'll just
| land up in a tangle,
| 
| Steve.

Hi,

Pass1's job is to traverse the metadata tree of every dinode, marking
which blocks are metadata, which are data, which are ext. attributes, etc.
With its current design, it runs through that tree once (for each dinode),
marking the blocks as it goes in its blockmap. If it encounters damage it
can't recover from, it has to undo those designations, otherwise you
end up in situations where a severely damaged dinode causes a lot of
collateral damage because it references blocks that are in use by a
newer, healthier dinode with valid references.

The alternative is to run through each dinode's metadata tree twice:
Once to determine its general health, and a second time to remember the
blocks it used in the blockmap. This obviously would be a lot slower.
The slowness would affect every dinode, healthy or damaged, whereas the
current method only takes extra time for damaged dinodes.

This ability to undo blockmap designations is not new to fsck.gfs2.
It's been doing that for many releases. Recent patches just restructured
it a bit to make better decisions and only affect pass1.

Regards,

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] [gfs2-utils PATCH 24/47] fsck.gfs2: Rework the undo functions

2013-05-16 Thread Steven Whitehouse
Hi,

On Thu, 2013-05-16 at 11:02 -0400, Bob Peterson wrote:
 - Original Message -
 | Yes, but the undo side of things worries me... it is very easy to get
 | tied in knots doing that. The question is what is damage it can't
 | recover from? this is a bit vague and doesn't really explain what is
 | going on here.
 | 
 | I don't yet understand why we'd need to run through each inodes metadata
 | tree more than once in this case,
 | 
 | Steve.
 
 Hi,
 
 One thing to bear in mind is that the fsck blockmap is supposed to
 represent the correct state of all blocks in the on-disk bitmap.
 The job of pass1 is to build the blockmap, which starts out entirely free.
 As the metadata is traversed, the blocks are filled in with the appropriate
 type.
 
Yes, but this cannot be true, as we don't actually always know the true
correct state of the blocks, so sometimes, blocks will need to be marked
as unknown until more evidence is available, as I think your example
shows.

 The job of pass5 is to synchronize the on-disk bitmap to the blockmap.
 
 So we must ensure that the blockmap is accurate at ALL times after pass1.
 One of the primary checks pass1 does is to make sure that a block is free
 in the blockmap before changing its designation, otherwise it's a duplicate
 block reference that must be resolved in pass1b.
 
 Here's an example: Suppose you have a file with di_height==2, two levels of
 indirection. Suppose the dinode is laid out something like this:
 
   dinode     indirect    data
   ------     --------    ----
   0x1000 --> 0x1001 --> 0x1002
                     --> 0x1003
                         ...
                     --> 0x1010
          --> 0x1011 --> 0x1012
                     --> 0x1013
                         ...
                     --> 0x1020
          --> 0x1021 --> 0x1022
                     --> 0x1023
                     --> 0x777
                     --> 0x1025
                         ...
                     --> 0x1030
 
 Now let's further suppose that this file was supposed to be deleted,
 and many of its blocks were in fact reused by a newer, valid dinode,
 but somehow, the bitmap was corrupted into saying this dinode is still
 alive (a dinode, not free or unlinked).
 
 For the sake of argument, say that second dinode appears later in the
 bitmap, so pass1 gets to corrupt dinode 0x1000 before it gets to the
 valid dinode that correctly references the blocks.
 
 As it traverses the metadata tree, it builds an array of lists, one for
 each height. Each item in the linked list corresponds to a metadata block.
 So pass1 traverses the array, marks down in its blockmap that block 0x1000 is
 dinode, blocks 0x1001, 0x1011, and 0x1021 are metadata blocks. Then it 
 processes the data block pointers within the metadata blocks, marking
  0x1002, 0x1003, all the way up to 0x1023 as data blocks.
 When it hits the block 0x777, it determines that's out
 of range for the device, and therefore the data file has an unrecoverable
 data block error.
 
 At this point, it doesn't make sense to continue marking 0x1025 and beyond
 as referenced data blocks, because that will only make matters worse.
 
I'm not convinced. In that case you have a single reference to a single
out of range block. All that we need to do there is to reset the pointer
to 0 (unallocated) and that's it. There is no need to stop processing the
remainder of the block unless there is some reason to believe that this
was not just a one-off issue.
 
 Now we've got a problem: Before we knew 0x1000 was corrupt, we marked all
 its references in the blockmap. We can't just delete the corrupt dinode
 because most of its blocks are in-use by that other dinode.
 
I think the confusion seems to stem from doing things in the wrong
order. What we should be doing is verifying the tree structure of the
filesystem first, and then after that is done, updating the bitmaps to
match, in case there is a mismatch between the actual fs structure and
the bitmaps.

Also, we can use the initial state of the bitmaps in order to find more
objects to look at (i.e. in case of unlinked inodes) and also use
pointers in the filesystem in the same way (in case of directory entries
which point to non-inode blocks) for example. But neither of those
things requires undoing anything so far as I can see.

So what we ought to have is something like this:

 - Start looking at bitmaps to find initial set of things to check
 - Start checking inodes, adding additional blocks to list of things to
check as we go
 - Once finished, check bitmaps against list to ensure consistency
against the tree

Obviously that is rather simplified since there are a few extras we need
to deal with some corner cases, but that should be the core of it. If
each rgrp is checked as we go, then it should be possible to do with
just one pass through the fs.

And I know that we will not be able to do that immediately, but 

[Cluster-devel] qdisk - memcpy incorrect(?)

2013-05-16 Thread Neale Ferguson
Hi,
 In diskRawRead in disk.c there is the following code:

	readret = posix_memalign((void **)&alignedBuf, disk->d_pagesz,
				 disk->d_blksz);
	if (readret > 0) {
		return -1;
	}

	io_state(STATE_READ);
	readret = read(disk->d_fd, alignedBuf, readlen);
	io_state(STATE_NONE);
	if (readret > 0) {
		if (readret > len) {
			memcpy(alignedBuf, buf, len);
			readret = len;
		} else {
			memcpy(alignedBuf, buf, readret);
		}
	}

	free(alignedBuf);

free(alignedBuf);

The memcpy() calls above have their src/dst operands swapped. We read into
alignedBuf and are supposed to copy to buf. I’m not sure why qdiskd works
sometimes and not others.

--- cluster-3.0.12.1/cman/qdisk/disk.c	2013/05/16 16:45:49	1.1
+++ cluster-3.0.12.1/cman/qdisk/disk.c	2013/05/16 16:46:29
@@ -430,14 +430,14 @@
 	io_state(STATE_READ);
 	readret = read(disk->d_fd, alignedBuf, readlen);
 	io_state(STATE_NONE);
 	if (readret > 0) {
 		if (readret > len) {
-			memcpy(alignedBuf, buf, len);
+			memcpy(buf, alignedBuf, len);
 			readret = len;
 		} else {
-			memcpy(alignedBuf, buf, readret);
+			memcpy(buf, alignedBuf, readret);
 		}
 	}

 free(alignedBuf);
 if (readret != len) {

Neale


Re: [Cluster-devel] qdisk - memcpy incorrect(?)

2013-05-16 Thread Fabio M. Di Nitto
This is already fixed in more recent releases. See commit:

8edb0d0eb31d94b8a3ba81f6d5b4c398accc950d

Your patch also misses another incorrect memcpy in diskRawWrite.

Fabio
