Re: [Cluster-devel] [PATCH 0/2] gfs2: improvements to recovery and withdraw process (v2)
Hi,

On 21/11/18 18:52, Bob Peterson wrote:
> Hi,
>
> This is a second draft of a two-patch set to fix some of the nasty
> journal recovery problems I've found lately.
>
> The original post from 08 November had horribly bad and inaccurate
> comments, as Steve Whitehouse and Andreas Gruenbacher pointed out.
> This version is hopefully better and more accurately describes what
> the patches do and how they work. Also, I fixed a superblock flag that
> was improperly declared as a glock flag.
>
> Other than the renamed and re-valued superblock flag, the code remains
> unchanged from the previous version. It probably needs a bit more
> testing, but it seems to work well.
> ---
> The problems have to do with file system corruption caused when
> recovery replays a journal after the resource group blocks have been
> unlocked by the recovery process. In other words, when no cluster node
> takes responsibility to replay the journal of a withdrawing node, then
> it gets replayed later on, after the block contents have been changed.
>
> The first patch prevents gfs2 from attempting recovery if the file
> system is withdrawn or has journal IO errors. Trying to recover your
> own journal from either of these unstable conditions is dangerous and
> likely to corrupt the file system.
>
> The second patch is more extensive. When a node withdraws from a file
> system it signals all other nodes with the file system mounted to
> perform recovery on its journal, since it cannot safely recover its
> own journal. This is accomplished by a new non-disk callback glop used
> exclusively by the "live" glock, which sets up an lvb in the glock to
> indicate which journal(s) need to be recovered.
>
> Regards,
>
> Bob Peterson
> ---
> Bob Peterson (2):
>   gfs2: Ignore recovery attempts if gfs2 has io error or is withdrawn
>   gfs2: initiate journal recovery as soon as a node withdraws
>
>  fs/gfs2/glock.c    |  5 ++-
>  fs/gfs2/glops.c    | 47 +++
>  fs/gfs2/incore.h   |  3 ++
>  fs/gfs2/lock_dlm.c | 95 ++
>  fs/gfs2/log.c      | 62 --
>  fs/gfs2/super.c    |  5 ++-
>  fs/gfs2/super.h    |  1 +
>  fs/gfs2/util.c     | 84
>  fs/gfs2/util.h     | 13 +++
>  9 files changed, 282 insertions(+), 33 deletions(-)

Yes, that looks a bit cleaner now,

Steve.
[Cluster-devel] [PATCH 0/2] gfs2: improvements to recovery and withdraw process (v2)
Hi,

This is a second draft of a two-patch set to fix some of the nasty
journal recovery problems I've found lately.

The original post from 08 November had horribly bad and inaccurate
comments, as Steve Whitehouse and Andreas Gruenbacher pointed out.
This version is hopefully better and more accurately describes what
the patches do and how they work. Also, I fixed a superblock flag that
was improperly declared as a glock flag.

Other than the renamed and re-valued superblock flag, the code remains
unchanged from the previous version. It probably needs a bit more
testing, but it seems to work well.
---
The problems have to do with file system corruption caused when
recovery replays a journal after the resource group blocks have been
unlocked by the recovery process. In other words, when no cluster node
takes responsibility to replay the journal of a withdrawing node, then
it gets replayed later on, after the block contents have been changed.

The first patch prevents gfs2 from attempting recovery if the file
system is withdrawn or has journal IO errors. Trying to recover your
own journal from either of these unstable conditions is dangerous and
likely to corrupt the file system.

The second patch is more extensive. When a node withdraws from a file
system it signals all other nodes with the file system mounted to
perform recovery on its journal, since it cannot safely recover its
own journal. This is accomplished by a new non-disk callback glop used
exclusively by the "live" glock, which sets up an lvb in the glock to
indicate which journal(s) need to be recovered.

Regards,

Bob Peterson
---
Bob Peterson (2):
  gfs2: Ignore recovery attempts if gfs2 has io error or is withdrawn
  gfs2: initiate journal recovery as soon as a node withdraws

 fs/gfs2/glock.c    |  5 ++-
 fs/gfs2/glops.c    | 47 +++
 fs/gfs2/incore.h   |  3 ++
 fs/gfs2/lock_dlm.c | 95 ++
 fs/gfs2/log.c      | 62 --
 fs/gfs2/super.c    |  5 ++-
 fs/gfs2/super.h    |  1 +
 fs/gfs2/util.c     | 84
 fs/gfs2/util.h     | 13 +++
 9 files changed, 282 insertions(+), 33 deletions(-)

-- 
2.19.1
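
For readers following the thread, the two ideas in the cover letter can be modelled roughly in user-space C. This is only a sketch under stated assumptions, not the code from the patches: every name below (the sk_ prefix, the flag names, the helper functions) is hypothetical, and the 64-bit bitmap standing in for the lvb is an illustrative layout, since the cover letter does not say how the lvb encodes the journals. The first helper mirrors patch 1 (no recovery while withdrawn or after a journal IO error); the other two mirror patch 2 (the withdrawing node marks its journal, and a healthy peer picks it up).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state flags; these are NOT the real gfs2 superblock flags. */
enum {
    SK_WITHDRAWN        = 1u << 0, /* this node has withdrawn from the fs */
    SK_JOURNAL_IO_ERROR = 1u << 1, /* an IO error was seen on the journal */
};

/* Assumed limit so that one bit per journal fits in a 64-bit value. */
#define SK_MAX_JOURNALS 64u

struct sk_fs {
    unsigned int flags;
};

/*
 * Patch 1 idea: a node in an unstable state must not replay a journal,
 * because doing so is likely to corrupt the file system.
 */
static bool sk_may_recover(const struct sk_fs *fs)
{
    return (fs->flags & (SK_WITHDRAWN | SK_JOURNAL_IO_ERROR)) == 0;
}

/*
 * Patch 2 idea, withdrawing side: record in a shared value block which
 * journal needs recovery. The bitmap layout is an assumption.
 */
static void sk_request_recovery(uint64_t *lvb, unsigned int jid)
{
    assert(jid < SK_MAX_JOURNALS);
    *lvb |= UINT64_C(1) << jid;
}

/*
 * Patch 2 idea, peer side: claim the lowest-numbered journal marked for
 * recovery. Returns the journal id, or -1 if there is nothing to do or
 * this node is itself not allowed to recover.
 */
static int sk_next_journal(const struct sk_fs *fs, uint64_t *lvb)
{
    if (!sk_may_recover(fs))
        return -1;

    for (unsigned int jid = 0; jid < SK_MAX_JOURNALS; jid++) {
        uint64_t bit = UINT64_C(1) << jid;

        if (*lvb & bit) {
            *lvb &= ~bit; /* clear the request once it is claimed */
            return (int)jid;
        }
    }
    return -1;
}
```

In this model a withdrawn node calling sk_next_journal() always gets -1 and leaves the bitmap untouched, so the request stays visible until a healthy node claims it. The real patches do this through the "live" glock and its callback; the sketch leaves out all locking and cluster messaging.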