Re: [Cluster-devel] [PATCH V15 14/18] block: enable multipage bvecs
On Wed, Feb 27, 2019 at 08:47:09PM +, Jon Hunter wrote: > > On 21/02/2019 08:42, Marek Szyprowski wrote: > > Dear All, > > > > On 2019-02-15 12:13, Ming Lei wrote: > >> This patch pulls the trigger for multi-page bvecs. > >> > >> Reviewed-by: Omar Sandoval > >> Signed-off-by: Ming Lei > > > > Since Linux next-20190218 I've observed problems with block layer on one > > of my test devices (Odroid U3 with EXT4 rootfs on SD card). Bisecting > > this issue led me to this change. This is also the first linux-next > > release with this change merged. The issue is fully reproducible and can > > be observed in the following kernel log: > > > > sdhci: Secure Digital Host Controller Interface driver > > sdhci: Copyright(c) Pierre Ossman > > s3c-sdhci 1253.sdhci: clock source 2: mmc_busclk.2 (1 Hz) > > s3c-sdhci 1253.sdhci: Got CD GPIO > > mmc0: SDHCI controller on samsung-hsmmc [1253.sdhci] using ADMA > > mmc0: new high speed SDHC card at address > > mmcblk0: mmc0: SL16G 14.8 GiB > I have also noticed some failures when writing to an eMMC device on one > of our Tegra boards. We have a simple eMMC write/read test and it is > currently failing because the data written does not match the source. > > I did not seem the same crash as reported here, however, in our case the > rootfs is NFS mounted and so probably would not. However, the bisect > points to this commit and reverting on top of -next fixes the issues. It is sdhci, probably related with max segment size, could you test the following patch: https://marc.info/?l=linux-mmc=155128334122951=2 Thanks, Ming
[Cluster-devel] [PATCH 11/15] gfs2: Do log_flush in gfs2_ail_empty_gl even if ail list is empty
Before this patch, if gfs2_ail_empty_gl saw there was nothing on the ail list, it would return and not flush the log. The problem is that there could still be a revoke for the rgrp sitting on the sd_log_le_revoke list that's been recently taken off the ail list. But that revoke still needs to be written, and the rgrp_go_inval still needs to call log_flush_wait to ensure the revokes are all properly written to the journal before we relinquish control of the glock to another node. If we give the glock to another node before we have this knowledge, the node might crash and its journal replayed, in which case the missing revoke would allow the journal replay to replay the rgrp over top of the rgrp we already gave to another node, thus overwriting its changes and corrupting the file system. This patch makes gfs2_ail_empty_gl still call gfs2_log_flush rather than returning. Signed-off-by: Bob Peterson --- fs/gfs2/glops.c | 20 +++- fs/gfs2/log.c | 2 +- fs/gfs2/log.h | 1 + 3 files changed, 21 insertions(+), 2 deletions(-) diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c index 2cdd6351e017..fdcb5809995f 100644 --- a/fs/gfs2/glops.c +++ b/fs/gfs2/glops.c @@ -94,8 +94,25 @@ static void gfs2_ail_empty_gl(struct gfs2_glock *gl) INIT_LIST_HEAD(_databuf); tr.tr_revokes = atomic_read(>gl_ail_count); - if (!tr.tr_revokes) + if (!tr.tr_revokes) { + /** +* We have nothing on the ail, but there could be revokes on +* the sdp revoke queue, in which case, we still want to flush +* the log and wait for it to finish. +* +* If the sdp revoke list is empty too, we might still have an +* io outstanding for writing revokes, so we should wait for +* it before returning. +* +* If none of these conditions are true, our revokes are all +* flushed and we can return. +*/ + if (!list_empty(>sd_log_le_revoke)) + goto flush; + else if (atomic_read(>sd_log_in_flight)) + log_flush_wait(sdp); return; + } /* A shortened, inline version of gfs2_trans_begin() * tr->alloced is not set since the transaction structure is @@ -110,6 +127,7 @@ static void gfs2_ail_empty_gl(struct gfs2_glock *gl) __gfs2_ail_flush(gl, 0, tr.tr_revokes); gfs2_trans_end(sdp); +flush: gfs2_log_flush(sdp, NULL, GFS2_LOG_HEAD_FLUSH_NORMAL | GFS2_LFC_AIL_EMPTY_GL); } diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c index 69077f92d703..0def6343e618 100644 --- a/fs/gfs2/log.c +++ b/fs/gfs2/log.c @@ -513,7 +513,7 @@ static void log_pull_tail(struct gfs2_sbd *sdp, unsigned int new_tail) } -static void log_flush_wait(struct gfs2_sbd *sdp) +void log_flush_wait(struct gfs2_sbd *sdp) { DEFINE_WAIT(wait); diff --git a/fs/gfs2/log.h b/fs/gfs2/log.h index 1bc9bd444b28..bd2d08d0f21c 100644 --- a/fs/gfs2/log.h +++ b/fs/gfs2/log.h @@ -75,6 +75,7 @@ extern void gfs2_log_flush(struct gfs2_sbd *sdp, struct gfs2_glock *gl, u32 type); extern void gfs2_log_commit(struct gfs2_sbd *sdp, struct gfs2_trans *trans); extern void gfs2_ail1_flush(struct gfs2_sbd *sdp, struct writeback_control *wbc); +extern void log_flush_wait(struct gfs2_sbd *sdp); extern void gfs2_log_shutdown(struct gfs2_sbd *sdp); extern int gfs2_logd(void *data); -- 2.20.1
[Cluster-devel] [PATCH 02/15] gfs2: Introduce concept of a pending withdraw
File system withdraws can be delayed when inconsistencies are discovered when we cannot withdraw immediately, for example, when critical spin_locks are held. But delaying the withdraw can cause gfs2 to ignore the error and keep running for a short period of time. For example, an rgrp glock may be dequeued and demoted while there are still buffers that haven't been properly revoked, due to io errors writing to the journal. This patch introduces a new concept of a delayed withdraw, which means an inconsistency has been discovered and we need to withdraw at the earliest possible opportunity. In these cases, we aren't quite withdrawn yet, but we still need to not dequeue glocks and other critical things. If we dequeue the glocks and the withdraw results in our journal being replayed, the replay could overwrite data that's been modified by a different node that acquired the glock in the meantime. Signed-off-by: Bob Peterson --- fs/gfs2/aops.c | 4 ++-- fs/gfs2/file.c | 2 +- fs/gfs2/glock.c | 7 +++ fs/gfs2/glops.c | 2 +- fs/gfs2/incore.h | 1 + fs/gfs2/log.c| 20 fs/gfs2/meta_io.c| 6 +++--- fs/gfs2/ops_fstype.c | 3 +-- fs/gfs2/quota.c | 2 +- fs/gfs2/super.c | 6 +++--- fs/gfs2/sys.c| 2 +- fs/gfs2/util.c | 1 + fs/gfs2/util.h | 8 13 files changed, 34 insertions(+), 30 deletions(-) diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c index 05dd78f4b2b3..0d3cde8a61cd 100644 --- a/fs/gfs2/aops.c +++ b/fs/gfs2/aops.c @@ -521,7 +521,7 @@ static int __gfs2_readpage(void *file, struct page *page) error = mpage_readpage(page, gfs2_block_map); } - if (unlikely(test_bit(SDF_SHUTDOWN, >sd_flags))) + if (unlikely(withdrawn(sdp))) return -EIO; return error; @@ -638,7 +638,7 @@ static int gfs2_readpages(struct file *file, struct address_space *mapping, gfs2_glock_dq(); out_uninit: gfs2_holder_uninit(); - if (unlikely(test_bit(SDF_SHUTDOWN, >sd_flags))) + if (unlikely(withdrawn(sdp))) ret = -EIO; return ret; } diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c index a2dea5bc0427..4046f6ac7f13 100644 --- a/fs/gfs2/file.c +++ b/fs/gfs2/file.c @@ -1169,7 +1169,7 @@ static int gfs2_lock(struct file *file, int cmd, struct file_lock *fl) cmd = F_SETLK; fl->fl_type = F_UNLCK; } - if (unlikely(test_bit(SDF_SHUTDOWN, >sd_flags))) { + if (unlikely(withdrawn(sdp))) { if (fl->fl_type == F_UNLCK) locks_lock_file_wait(file, fl); return -EIO; diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c index f66773c71bcd..c6d6e478f5e3 100644 --- a/fs/gfs2/glock.c +++ b/fs/gfs2/glock.c @@ -542,7 +542,7 @@ __acquires(>gl_lockref.lock) unsigned int lck_flags = (unsigned int)(gh ? gh->gh_flags : 0); int ret; - if (unlikely(test_bit(SDF_SHUTDOWN, >sd_flags)) && + if (unlikely(withdrawn(sdp)) && target != LM_ST_UNLOCKED) return; lck_flags &= (LM_FLAG_TRY | LM_FLAG_TRY_1CB | LM_FLAG_NOEXP | @@ -579,8 +579,7 @@ __acquires(>gl_lockref.lock) } else if (ret) { fs_err(sdp, "lm_lock ret %d\n", ret); - GLOCK_BUG_ON(gl, !test_bit(SDF_SHUTDOWN, - >sd_flags)); + GLOCK_BUG_ON(gl, !withdrawn(sdp)); } } else { /* lock_nolock */ finish_xmote(gl, target); @@ -1092,7 +1091,7 @@ int gfs2_glock_nq(struct gfs2_holder *gh) struct gfs2_sbd *sdp = gl->gl_name.ln_sbd; int error = 0; - if (unlikely(test_bit(SDF_SHUTDOWN, >sd_flags))) + if (unlikely(withdrawn(sdp))) return -EIO; if (test_bit(GLF_LRU, >gl_flags)) diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c index f15b4c57c4bd..9c86c8004ba7 100644 --- a/fs/gfs2/glops.c +++ b/fs/gfs2/glops.c @@ -539,7 +539,7 @@ static int freeze_go_xmote_bh(struct gfs2_glock *gl, struct gfs2_holder *gh) gfs2_consist(sdp); /* Initialize some head of the log stuff */ - if (!test_bit(SDF_SHUTDOWN, >sd_flags)) { + if (!withdrawn(sdp)) { sdp->sd_log_sequence = head.lh_sequence + 1; gfs2_log_pointers_init(sdp, head.lh_blkno); } diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h index 76336b592030..c6984265807f 100644 --- a/fs/gfs2/incore.h +++ b/fs/gfs2/incore.h @@ -620,6 +620,7 @@ enum { SDF_RORECOVERY = 7, /* read only recovery */ SDF_SKIP_DLM_UNLOCK = 8, SDF_FORCE_AIL_FLUSH = 9, + SDF_PENDING_WITHDRAW= 10, /* Will withdraw eventually */ }; enum gfs2_freeze_state { diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c index
[Cluster-devel] [PATCH 08/15] gfs2: Force withdraw to replay journals and wait for it to finish
When a node withdraws from a file system, it often leaves its journal in an incomplete state. This is especially true when the withdraw is caused by io errors writing to the journal. Before this patch, a withdraw would try to write a "shutdown" record to the journal, tell dlm it's done with the file system, and none of the other nodes know about the problem. Later, when the problem is fixed and the withdrawn node is rebooted, it would then discover that its own journal was incomplete, and replay it. However, replaying it at this point is almost guaranteed to introduce corruption because the other nodes are likely to have used affected resource groups that appeared in the journal since the time of the withdraw. Replaying the journal later will overwrite any changes made, and not through any fault of dlm, which was instructed during the withdraw to release those resources. This patch makes file system withdraws seen by the entire cluster. Withdrawing nodes dequeue their journal glock to allow recovery. The remaining nodes check all the journals to see if they are clean or in need of replay. They try to replay dirty journals, but only the journals of withdrawn nodes will be "not busy" and therefore available for replay. Until the journal replay is complete, no i/o related glocks may be given out, to ensure that the replay does not cause the aforementioned corruption: We cannot allow any journal replay to overwrite blocks associated with a glock once it is held. The glocks not affected by a withdraw are permitted to be passed around as normal during a withdraw. A new glops flag, called GLOF_OK_AT_WITHDRAW, indicates glocks that may be passed around freely while a withdraw is taking place. One such glock is the "live" glock which is now used to signal when a withdraw occurs. When a withdraw occurs, the node signals its withdraw by dequeueing the "live" glock and trying to enqueue it in EX mode, thus forcing the other nodes to all see a demote request, by way of a "1CB" (one callback) try lock. The "live" glock is not granted in EX; the callback is only just used to indicate a withdraw has occurred. Note that all nodes in the cluster must wait for the recovering node to finish replaying the withdrawing node's journal before continuing. To this end, it checks that the journals are clean multiple times in a retry loop. Signed-off-by: Bob Peterson --- fs/gfs2/glock.c | 30 +++-- fs/gfs2/glock.h | 1 + fs/gfs2/glops.c | 52 + fs/gfs2/incore.h | 5 +++ fs/gfs2/lock_dlm.c | 32 + fs/gfs2/meta_io.c| 2 +- fs/gfs2/ops_fstype.c | 4 +- fs/gfs2/super.c | 24 +- fs/gfs2/super.h | 1 + fs/gfs2/util.c | 105 ++- 10 files changed, 240 insertions(+), 16 deletions(-) diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c index 25923b9e18ef..ba61bba46785 100644 --- a/fs/gfs2/glock.c +++ b/fs/gfs2/glock.c @@ -242,7 +242,8 @@ static void __gfs2_glock_put(struct gfs2_glock *gl) gfs2_glock_remove_from_lru(gl); spin_unlock(>gl_lockref.lock); GLOCK_BUG_ON(gl, !list_empty(>gl_holders)); - GLOCK_BUG_ON(gl, mapping && mapping->nrpages); + GLOCK_BUG_ON(gl, mapping && mapping->nrpages && +!test_bit(SDF_SHUTDOWN, >sd_flags)); trace_gfs2_glock_put(gl); sdp->sd_lockstruct.ls_ops->lm_put_lock(gl); } @@ -544,6 +545,7 @@ __acquires(>gl_lockref.lock) if (unlikely(withdrawn(sdp)) && !(glops->go_flags & GLOF_OK_AT_WITHDRAW) && + (gh && !(LM_FLAG_NOEXP & gh->gh_flags)) && target != LM_ST_UNLOCKED) return; lck_flags &= (LM_FLAG_TRY | LM_FLAG_TRY_1CB | LM_FLAG_NOEXP | @@ -1092,8 +1094,8 @@ int gfs2_glock_nq(struct gfs2_holder *gh) struct gfs2_sbd *sdp = gl->gl_name.ln_sbd; int error = 0; - if (unlikely(withdrawn(sdp)) && - !(gl->gl_ops->go_flags & GLOF_OK_AT_WITHDRAW)) + if (unlikely(withdrawn(sdp) && !(LM_FLAG_NOEXP & gh->gh_flags) && +!(gl->gl_ops->go_flags & GLOF_OK_AT_WITHDRAW))) return -EIO; if (test_bit(GLF_LRU, >gl_flags)) @@ -1137,11 +1139,28 @@ int gfs2_glock_poll(struct gfs2_holder *gh) void gfs2_glock_dq(struct gfs2_holder *gh) { struct gfs2_glock *gl = gh->gh_gl; + struct gfs2_sbd *sdp = gl->gl_name.ln_sbd; const struct gfs2_glock_operations *glops = gl->gl_ops; unsigned delay = 0; int fast_path = 0; spin_lock(>gl_lockref.lock); + /** +* If we're in the process of file system withdraw, we cannot just +* dequeue any glocks until our journal is recovered, lest we +* introduce file system corruption. We need two exceptions to this +* rule: We need to allow unlocking of nondisk glocks and the glock +* for our own journal that needs recovery. +*/ + if
[Cluster-devel] [PATCH 07/15] gfs2: Don't write log headers after file system withdraw
Before this patch, when a node withdrew a gfs2 file system, it wrote a (clean) unmount log header. That's wrong. You don't want to write anything to the journal once you're withdrawn because that's acknowledging that the transaction is complete and the journal is in good shape, neither of which may be a valid assumption when the file system is withdrawn. This is especially true if the withdraw was caused due to io errors writing to the journal in the first place. The best course of action is to leave the journal "as is" until it may be safely replayed during journal recovery, regardless of whether it's done by this node or another. Signed-off-by: Bob Peterson --- fs/gfs2/log.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c index bbcf232b3081..69077f92d703 100644 --- a/fs/gfs2/log.c +++ b/fs/gfs2/log.c @@ -679,12 +679,16 @@ void gfs2_write_log_header(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd, { struct gfs2_log_header *lh; u32 hash, crc; - struct page *page = mempool_alloc(gfs2_page_pool, GFP_NOIO); + struct page *page; struct gfs2_statfs_change_host *l_sc = >sd_statfs_local; struct timespec64 tv; struct super_block *sb = sdp->sd_vfs; u64 addr; + if (withdrawn(sdp)) + goto out; + + page = mempool_alloc(gfs2_page_pool, GFP_NOIO); lh = page_address(page); clear_page(lh); @@ -731,6 +735,7 @@ void gfs2_write_log_header(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd, gfs2_log_write(sdp, page, sb->s_blocksize, 0, addr); gfs2_log_submit_bio(>sd_log_bio, REQ_OP_WRITE | op_flags); +out: log_flush_wait(sdp); } -- 2.20.1
[Cluster-devel] [PATCH 09/15] gfs2: Add verbose option to check_journal_clean
Before this patch, function check_journal_clean would give messages related to journal recovery. That's fine for mount time, but when a node withdraws and forces replay that way, we don't want all those distracting and misleading messages. This patch adds a new parameter to make those messages optional. Signed-off-by: Bob Peterson --- fs/gfs2/ops_fstype.c | 2 +- fs/gfs2/util.c | 23 --- fs/gfs2/util.h | 4 +++- 3 files changed, 20 insertions(+), 9 deletions(-) diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c index 205900cddfe4..50b336536136 100644 --- a/fs/gfs2/ops_fstype.c +++ b/fs/gfs2/ops_fstype.c @@ -684,7 +684,7 @@ static int init_journal(struct gfs2_sbd *sdp, int undo) struct gfs2_jdesc *jd = gfs2_jdesc_find(sdp, x); if (sdp->sd_args.ar_spectator) { - error = check_journal_clean(sdp, jd); + error = check_journal_clean(sdp, jd, true); if (error) goto fail_jinode_gh; continue; diff --git a/fs/gfs2/util.c b/fs/gfs2/util.c index 2f140eb16778..1324ef927aa0 100644 --- a/fs/gfs2/util.c +++ b/fs/gfs2/util.c @@ -48,7 +48,8 @@ void gfs2_assert_i(struct gfs2_sbd *sdp) * * Returns: 0 if the journal is clean or locked, else an error */ -int check_journal_clean(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd) +int check_journal_clean(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd, + bool verbose) { int error; struct gfs2_holder j_gh; @@ -59,23 +60,31 @@ int check_journal_clean(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd) error = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_NOEXP | GL_EXACT | GL_NOCACHE, _gh); if (error) { - fs_err(sdp, "Error locking journal for spectator mount.\n"); + if (verbose) + fs_err(sdp, "Error %d locking journal for spectator " + "mount.\n", error); return -EPERM; } error = gfs2_jdesc_check(jd); if (error) { - fs_err(sdp, "Error checking journal for spectator mount.\n"); + if (verbose) + fs_err(sdp, "Error checking journal for spectator " + "mount.\n"); goto out_unlock; } error = gfs2_find_jhead(jd, ); if (error) { - fs_err(sdp, "Error parsing journal for spectator mount.\n"); + if (verbose) + fs_err(sdp, "Error parsing journal for spectator " + "mount.\n"); goto out_unlock; } if (!(head.lh_flags & GFS2_LOG_HEAD_UNMOUNT)) { error = -EPERM; - fs_err(sdp, "jid=%u: Journal is dirty, so the first mounter " - "must not be a spectator.\n", jd->jd_jid); + if (verbose) + fs_err(sdp, "jid=%u: Journal is dirty, so the first " + "mounter must not be a spectator.\n", + jd->jd_jid); } out_unlock: @@ -163,7 +172,7 @@ static void signal_our_withdraw(struct gfs2_sbd *sdp) /* Now wait until recovery complete. */ for (tries = 0; tries < 10; tries++) { - ret = check_journal_clean(sdp, sdp->sd_jdesc); + ret = check_journal_clean(sdp, sdp->sd_jdesc, false); if (!ret) break; msleep(HZ); diff --git a/fs/gfs2/util.h b/fs/gfs2/util.h index fd72dfd592ab..036c7cfd856d 100644 --- a/fs/gfs2/util.h +++ b/fs/gfs2/util.h @@ -131,7 +131,9 @@ static inline void gfs2_metatype_set(struct buffer_head *bh, u16 type, int gfs2_io_error_i(struct gfs2_sbd *sdp, const char *function, char *file, unsigned int line); -int check_journal_clean(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd); + +extern int check_journal_clean(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd, + bool verbose); #define gfs2_io_error(sdp) \ gfs2_io_error_i((sdp), __func__, __FILE__, __LINE__); -- 2.20.1
[Cluster-devel] [PATCH 15/15] gfs2: log which portion of the journal is replayed
When a journal is replayed, gfs2 logs a message similar to: jid=X: Replaying journal... This patch adds the tail and block number so that the range of the replayed block is also printed. These values will match the values shown if the journal is dumped with gfs2_edit -p journalX. The resulting output looks something like this: jid=1: Replaying journal...0x28b7 to 0x2beb This will allow us to better debug file system corruption problems. Signed-off-by: Bob Peterson --- fs/gfs2/recovery.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/gfs2/recovery.c b/fs/gfs2/recovery.c index 7389e445a7a7..1012b79364a0 100644 --- a/fs/gfs2/recovery.c +++ b/fs/gfs2/recovery.c @@ -389,7 +389,8 @@ void gfs2_recover_func(struct work_struct *work) } t_tlck = ktime_get(); - fs_info(sdp, "jid=%u: Replaying journal...\n", jd->jd_jid); + fs_info(sdp, "jid=%u: Replaying journal...0x%x to 0x%x\n", + jd->jd_jid, head.lh_tail, head.lh_blkno); for (pass = 0; pass < 2; pass++) { lops_before_scan(jd, , pass); -- 2.20.1
[Cluster-devel] [PATCH 10/15] gfs2: Check for log write errors before telling dlm to unlock
Before this patch, function do_xmote just assumed all the writes submitted to the journal were finished and successful, and it called the go_unlock function to release the dlm lock. But if they're not, and a revoke failed to make its way to the journal, a journal replay on another node will cause corruption if we let the go_inval function continue and tell dlm to release the glock to another node. This patch adds a couple assert_withdraws in do_xmote after the calls to go_sync and go_inval. The asserts should cause another node to replay the journal before continuing, thus protecting rgrp and dinode glocks and maintaining the integrity of the metadata. Signed-off-by: Bob Peterson --- fs/gfs2/glock.c | 4 1 file changed, 4 insertions(+) diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c index ba61bba46785..afb336b65abd 100644 --- a/fs/gfs2/glock.c +++ b/fs/gfs2/glock.c @@ -566,8 +566,12 @@ __acquires(>gl_lockref.lock) spin_unlock(>gl_lockref.lock); if (glops->go_sync) glops->go_sync(gl); + gfs2_assert_withdraw(sdp, atomic_read(>sd_log_errors) == 0); if (test_bit(GLF_INVALIDATE_IN_PROGRESS, >gl_flags)) glops->go_inval(gl, target == LM_ST_DEFERRED ? 0 : DIO_METADATA); + + if (!gfs2_assert_withdraw(sdp, atomic_read(>sd_log_errors) == 0)) + gfs2_assert_withdraw(sdp, !atomic_read(>gl_ail_count)); clear_bit(GLF_INVALIDATE_IN_PROGRESS, >gl_flags); gfs2_glock_hold(gl); -- 2.20.1
[Cluster-devel] [PATCH 14/15] gfs2: Warn when a journal replay overwrites a rgrp with buffers
This patch adds some instrumentation in gfs2's journal replay that indicates when we're about to overwrite a rgrp for which we already have a valid buffer_head. Signed-off-by: Bob Peterson --- fs/gfs2/lops.c | 22 -- 1 file changed, 20 insertions(+), 2 deletions(-) diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c index 0ff32e96c9f5..2594cd5bfdd3 100644 --- a/fs/gfs2/lops.c +++ b/fs/gfs2/lops.c @@ -743,9 +743,27 @@ static int buf_lo_scan_elements(struct gfs2_jdesc *jd, unsigned int start, if (gfs2_meta_check(sdp, bh_ip)) error = -EIO; - else + else { + struct gfs2_meta_header *mh = + (struct gfs2_meta_header *)bh_ip->b_data; + + if (mh->mh_type == cpu_to_be32(GFS2_METATYPE_RG)) { + struct gfs2_rgrpd *rgd; + + rgd = gfs2_blk2rgrpd(sdp, blkno, false); + if (rgd && rgd->rd_addr == blkno && + rgd->rd_bits && rgd->rd_bits->bi_bh) { + fs_info(sdp, "Replaying 0x%llx but we " + "already have a bh!\n", + (unsigned long long)blkno); + fs_info(sdp, "busy:%d, pinned:%d\n", + buffer_busy(rgd->rd_bits->bi_bh) ? 1 : 0, + buffer_pinned(rgd->rd_bits->bi_bh)); + gfs2_dump_glock(NULL, rgd->rd_gl); + } + } mark_buffer_dirty(bh_ip); - + } brelse(bh_log); brelse(bh_ip); -- 2.20.1
[Cluster-devel] [PATCH 06/15] gfs2: Make secondary withdrawers wait for first withdrawer
Before this patch, if a process encountered an error and decided to withdraw, if another process was already in the process of withdrawing, the secondary withdraw would be silently ignored, which set it free to proceed with its processing, unlock any locks, etc. That's correct behavior if the original withdrawer encounters further errors down the road. However, second withdrawers need to wait for the first withdrawer to finish its withdraw before proceeding. If we don't wait we could end up assuming everything is alright, unlock glocks and telling other nodes they can have the glock, despite the fact that a withdraw is still ongoing and may require a journal replay before any locks are released. For example, if an rgrp glock is freed by a process that didn't wait for the withdraw, a journal replay could introduce file system corruption by replaying a rgrp block that has already been granted to another node. This patch makes secondary withdrawers wait until the primary withdrawer is finished with its processing before proceeding. Signed-off-by: Bob Peterson --- fs/gfs2/incore.h | 3 +++ fs/gfs2/util.c | 9 - 2 files changed, 11 insertions(+), 1 deletion(-) diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h index d84430eceb6c..ef0a88fcd0ca 100644 --- a/fs/gfs2/incore.h +++ b/fs/gfs2/incore.h @@ -622,6 +622,7 @@ enum { SDF_SKIP_DLM_UNLOCK = 8, SDF_FORCE_AIL_FLUSH = 9, SDF_PENDING_WITHDRAW= 10, /* Will withdraw eventually */ + SDF_WITHDRAW_COMPLETE = 11, /* Withdraw is complete */ }; enum gfs2_freeze_state { @@ -832,6 +833,8 @@ struct gfs2_sbd { wait_queue_head_t sd_log_flush_wait; int sd_log_error; /* First log error */ atomic_t sd_log_errors; /* Count of log errors */ + atomic_t sd_withdrawer; + wait_queue_head_t sd_withdraw_wait; atomic_t sd_reserving_log; wait_queue_head_t sd_reserving_log_wait; diff --git a/fs/gfs2/util.c b/fs/gfs2/util.c index dc00747d35f3..eb92c85ad9bb 100644 --- a/fs/gfs2/util.c +++ b/fs/gfs2/util.c @@ -89,9 +89,15 @@ int gfs2_lm_withdraw(struct gfs2_sbd *sdp, const char *fmt, ...) struct va_format vaf; if (sdp->sd_args.ar_errors == GFS2_ERRORS_WITHDRAW && - test_and_set_bit(SDF_SHUTDOWN, >sd_flags)) + test_and_set_bit(SDF_SHUTDOWN, >sd_flags)) { + fs_warn(sdp, "Waiting for process %d to withdraw.\n", + atomic_read(>sd_withdrawer)); + wait_on_bit(>sd_flags, SDF_WITHDRAW_COMPLETE, + TASK_UNINTERRUPTIBLE); return 0; + } + atomic_set(>sd_withdrawer, pid_nr(task_pid(current))); clear_bit(SDF_PENDING_WITHDRAW, >sd_flags); if (fmt) { va_start(args, fmt); @@ -120,6 +126,7 @@ int gfs2_lm_withdraw(struct gfs2_sbd *sdp, const char *fmt, ...) set_bit(SDF_SKIP_DLM_UNLOCK, >sd_flags); fs_err(sdp, "withdrawn\n"); dump_stack(); + wake_up_bit(>sd_flags, SDF_WITHDRAW_COMPLETE); } if (sdp->sd_args.ar_errors == GFS2_ERRORS_PANIC) -- 2.20.1
[Cluster-devel] [PATCH 12/15] gfs2: If the journal isn't live ignore log flushes
This patch adds a check to function gfs2_log_flush: if the journal is no longer live, the flush is ignored. Signed-off-by: Bob Peterson --- fs/gfs2/log.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c index 0def6343e618..8199b235790f 100644 --- a/fs/gfs2/log.c +++ b/fs/gfs2/log.c @@ -782,6 +782,9 @@ void gfs2_log_flush(struct gfs2_sbd *sdp, struct gfs2_glock *gl, u32 flags) struct gfs2_trans *tr; enum gfs2_freeze_state state = atomic_read(>sd_freeze_state); + if (!test_bit(SDF_JOURNAL_LIVE, >sd_flags)) + return; + down_write(>sd_log_flush_lock); /* Log might have been flushed while we waited for the flush lock */ -- 2.20.1
[Cluster-devel] [PATCH 13/15] gfs2: Issue revokes more intelligently
Before this patch, function gfs2_write_revokes would call gfs2_ail1_empty, then traverse the sd_ail1_list looking for transactions that had bds which were no longer queued to a glock. And if it found some, it would try to issue revokes for them, up to a predetermined maximum. There were two problems with how it did this. First was the fact that gfs2_ail1_empty moves transactions which have nothing remaining on the ail1 list from the sd_ail1_list to the sd_ail2_list, thus making its traversal of sd_ail1_list miss them completely, and therefore, never issue revokes for them. Second was the fact that there were three traversals (or partial traversals) of the sd_ail1_list, each of which took and then released the sd_ail_lock lock. (First inside gfs2_ail1_empty, second to determine if there are any revokes to be issued, and third to actually issue them. All this taking and releasing of the sd_ail_lock meant other processes could modify the lists and the conditions in which we're working. This patch simplies the whole process by adding a new parameter to function gfs2_ail1_empty, max_revokes. For normal calls, this is passed in as 0, meaning we don't want to issue any revokes. For function gfs2_write_revokes, we pass in the maximum number of revokes we can, thus allowing gfs2_ail1_empty to add the revokes where needed. This simplies the code, allows for a single holding of the sd_ail_lock, and allows gfs2_ail1_empty to add revokes for all the necessary bd items without missing any. Signed-off-by: Bob Peterson --- fs/gfs2/log.c | 61 +++ 1 file changed, 22 insertions(+), 39 deletions(-) diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c index 8199b235790f..042ebf701382 100644 --- a/fs/gfs2/log.c +++ b/fs/gfs2/log.c @@ -190,11 +190,13 @@ static void gfs2_ail1_start(struct gfs2_sbd *sdp) /** * gfs2_ail1_empty_one - Check whether or not a trans in the AIL has been synced * @sdp: the filesystem - * @ai: the AIL entry + * @tr: the transaction + * @max_revokes: If nonzero, issue revokes for the bd items for written buffers * */ -static void gfs2_ail1_empty_one(struct gfs2_sbd *sdp, struct gfs2_trans *tr) +static void gfs2_ail1_empty_one(struct gfs2_sbd *sdp, struct gfs2_trans *tr, + int *max_revokes) { struct gfs2_bufdata *bd, *s; struct buffer_head *bh; @@ -211,18 +213,28 @@ static void gfs2_ail1_empty_one(struct gfs2_sbd *sdp, struct gfs2_trans *tr) gfs2_io_error_bh(sdp, bh); set_bit(SDF_PENDING_WITHDRAW, >sd_flags); } - list_move(>bd_ail_st_list, >tr_ail2_list); + /* If we have space for revokes and the bd is no longer on any + buf list, we can just add a revoke for it here and avoid + having to put it on the ail2 list, where it would need to + be revoked later. */ + if (*max_revokes && list_empty(>bd_list)) { + gfs2_add_revoke(sdp, bd); + (*max_revokes)--; + } else { + list_move(>bd_ail_st_list, >tr_ail2_list); + } } } /** * gfs2_ail1_empty - Try to empty the ail1 lists * @sdp: The superblock + * @max_revokes: If non-zero, add revokes where appropriate * * Tries to empty the ail1 lists, starting with the oldest first */ -static int gfs2_ail1_empty(struct gfs2_sbd *sdp) +static int gfs2_ail1_empty(struct gfs2_sbd *sdp, int max_revokes) { struct gfs2_trans *tr, *s; int oldest_tr = 1; @@ -230,7 +242,7 @@ static int gfs2_ail1_empty(struct gfs2_sbd *sdp) spin_lock(>sd_ail_lock); list_for_each_entry_safe_reverse(tr, s, >sd_ail1_list, tr_list) { - gfs2_ail1_empty_one(sdp, tr); + gfs2_ail1_empty_one(sdp, tr, _revokes); if (list_empty(>tr_ail1_list) && oldest_tr) list_move(>tr_list, >sd_ail2_list); else @@ -610,25 +622,9 @@ void gfs2_add_revoke(struct gfs2_sbd *sdp, struct gfs2_bufdata *bd) void gfs2_write_revokes(struct gfs2_sbd *sdp) { - struct gfs2_trans *tr; - struct gfs2_bufdata *bd, *tmp; - int have_revokes = 0; int max_revokes = (sdp->sd_sb.sb_bsize - sizeof(struct gfs2_log_descriptor)) / sizeof(u64); - gfs2_ail1_empty(sdp); - spin_lock(>sd_ail_lock); - list_for_each_entry_reverse(tr, >sd_ail1_list, tr_list) { - list_for_each_entry(bd, >tr_ail2_list, bd_ail_st_list) { - if (list_empty(>bd_list)) { - have_revokes = 1; - goto done; - } - } - } -done: - spin_unlock(>sd_ail_lock); - if (have_revokes == 0) - return; + gfs2_log_lock(sdp); while (sdp->sd_log_num_revoke > max_revokes)
[Cluster-devel] [PATCH 01/15] gfs2: log error reform
Before this patch, gfs2 kept track of journal io errors in two places (sd_log_error and the SDF_AIL1_IO_ERROR flag in sd_flags. This patch consolidates the two by eliminating the SDF_AIL1_IO_ERROR flag in favor of an atomic count of journal errors, sd_log_errors. When the first io error occurs and its value is incremented to 1, sd_log_error is set. Thus, sd_log_error reflects the first error we encountered writing to the journal. In future patches, we will take advantage of this by checking a single value (sd_log_errors) rather than having to check both the flag and the sd_log_error when reacting to io errors. Signed-off-by: Bob Peterson --- fs/gfs2/incore.h | 4 ++-- fs/gfs2/log.c| 7 --- fs/gfs2/lops.c | 7 +-- fs/gfs2/ops_fstype.c | 1 + fs/gfs2/quota.c | 6 -- 5 files changed, 16 insertions(+), 9 deletions(-) diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h index cdf07b408f54..76336b592030 100644 --- a/fs/gfs2/incore.h +++ b/fs/gfs2/incore.h @@ -620,7 +620,6 @@ enum { SDF_RORECOVERY = 7, /* read only recovery */ SDF_SKIP_DLM_UNLOCK = 8, SDF_FORCE_AIL_FLUSH = 9, - SDF_AIL1_IO_ERROR = 10, }; enum gfs2_freeze_state { @@ -829,7 +828,8 @@ struct gfs2_sbd { atomic_t sd_log_in_flight; struct bio *sd_log_bio; wait_queue_head_t sd_log_flush_wait; - int sd_log_error; + int sd_log_error; /* First log error */ + atomic_t sd_log_errors; /* Count of log errors */ atomic_t sd_reserving_log; wait_queue_head_t sd_reserving_log_wait; diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c index 5bfaf381921a..2393fa758115 100644 --- a/fs/gfs2/log.c +++ b/fs/gfs2/log.c @@ -109,8 +109,8 @@ __acquires(>sd_ail_lock) if (!buffer_busy(bh)) { if (!buffer_uptodate(bh) && - !test_and_set_bit(SDF_AIL1_IO_ERROR, - >sd_flags)) { + atomic_add_return(1, >sd_log_errors) == 1) { + sdp->sd_log_error = -EIO; gfs2_io_error_bh(sdp, bh); *withdraw = true; } @@ -209,7 +209,8 @@ static void gfs2_ail1_empty_one(struct gfs2_sbd *sdp, struct gfs2_trans *tr, if (buffer_busy(bh)) continue; if (!buffer_uptodate(bh) && - !test_and_set_bit(SDF_AIL1_IO_ERROR, >sd_flags)) { + atomic_add_return(1, >sd_log_errors) == 1) { + sdp->sd_log_error = -EIO; gfs2_io_error_bh(sdp, bh); *withdraw = true; } diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c index 94dcab655bc0..0ff32e96c9f5 100644 --- a/fs/gfs2/lops.c +++ b/fs/gfs2/lops.c @@ -210,8 +210,11 @@ static void gfs2_end_log_write(struct bio *bio) int i; if (bio->bi_status) { - fs_err(sdp, "Error %d writing to journal, jid=%u\n", - bio->bi_status, sdp->sd_jdesc->jd_jid); + if (atomic_add_return(1, >sd_log_errors) == 1) { + sdp->sd_log_error = bio->bi_status; + fs_err(sdp, "Error %d writing to journal, jid=%u\n", + bio->bi_status, sdp->sd_jdesc->jd_jid); + } wake_up(>sd_logd_waitq); } diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c index 1179763f6370..ec503afc07bb 100644 --- a/fs/gfs2/ops_fstype.c +++ b/fs/gfs2/ops_fstype.c @@ -130,6 +130,7 @@ static struct gfs2_sbd *init_sbd(struct super_block *sb) init_rwsem(>sd_log_flush_lock); atomic_set(>sd_log_in_flight, 0); + atomic_set(>sd_log_errors, 0); atomic_set(>sd_reserving_log, 0); init_waitqueue_head(>sd_reserving_log_wait); init_waitqueue_head(>sd_log_flush_wait); diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c index 2ae5a109eea7..009172ef4dfe 100644 --- a/fs/gfs2/quota.c +++ b/fs/gfs2/quota.c @@ -1479,8 +1479,10 @@ static void quotad_error(struct gfs2_sbd *sdp, const char *msg, int error) if (error == 0 || error == -EROFS) return; if (!test_bit(SDF_SHUTDOWN, >sd_flags)) { - fs_err(sdp, "gfs2_quotad: %s error %d\n", msg, error); - sdp->sd_log_error = error; + if (atomic_add_return(1, >sd_log_errors) == 1) { + sdp->sd_log_error = error; + fs_err(sdp, "gfs2_quotad: %s error %d\n", msg, error); + } wake_up(>sd_logd_waitq); } } -- 2.20.1
[Cluster-devel] [PATCH 03/15] gfs2: Ignore recovery attempts if gfs2 has io error or is withdrawn
This patch addresses various problems with gfs2/dlm recovery. For example, suppose a node with a bunch of gfs2 mounts suddenly reboots due to kernel panic, and dlm determines it should perform recovery. DLM does so from a pseudo-state machine calling various callbacks into lock_dlm to perform a sequence of steps. It uses generation numbers and recover bits in dlm "control" lock lvbs. Now suppose another node tries to recover the failed node's journal, but in so doing, encounters an IO error or withdraws due to unforeseen circumstances, such as an hba driver failure. In these cases, the recovery would eventually bail out, but it would still update its generation number in the lvb. The other nodes would all see the newer generation number and think they don't need to do recovery because the generation number is newer than the last one they saw, and therefore someone else has already taken care of it. If the file system has an io error or is withdrawn, it cannot safely replay any journals (its own or others) but someone else still needs to do it. Therefore we don't want it messing with the journal recovery generation numbers: the local generation numbers eventually get put into the lvb generation numbers to be seen by all nodes. This patch adds checks to many of the callbacks used by dlm in its recovery state machine so that the functions are ignored and skipped if an io error has occurred or if the file system was withdraw. Signed-off-by: Bob Peterson --- fs/gfs2/lock_dlm.c | 36 fs/gfs2/util.c | 2 +- 2 files changed, 37 insertions(+), 1 deletion(-) diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c index 31df26ed7854..8b94f34c5c0f 100644 --- a/fs/gfs2/lock_dlm.c +++ b/fs/gfs2/lock_dlm.c @@ -1081,6 +1081,14 @@ static void gdlm_recover_prep(void *arg) struct gfs2_sbd *sdp = arg; struct lm_lockstruct *ls = >sd_lockstruct; + if (atomic_read(>sd_log_errors)) { + fs_err(sdp, "recover_prep ignored due to io error.\n"); + return; + } + if (withdrawn(sdp)) { + fs_err(sdp, "recover_prep ignored due to withdraw.\n"); + return; + } spin_lock(>ls_recover_spin); ls->ls_recover_block = ls->ls_recover_start; set_bit(DFL_DLM_RECOVERY, >ls_recover_flags); @@ -1103,6 +,16 @@ static void gdlm_recover_slot(void *arg, struct dlm_slot *slot) struct lm_lockstruct *ls = >sd_lockstruct; int jid = slot->slot - 1; + if (atomic_read(>sd_log_errors)) { + fs_err(sdp, "recover_slot jid %d ignored due to io error.\n", + jid); + return; + } + if (withdrawn(sdp)) { + fs_err(sdp, "recover_slot jid %d ignored due to withdraw.\n", + jid); + return; + } spin_lock(>ls_recover_spin); if (ls->ls_recover_size < jid + 1) { fs_err(sdp, "recover_slot jid %d gen %u short size %d\n", @@ -1127,6 +1145,14 @@ static void gdlm_recover_done(void *arg, struct dlm_slot *slots, int num_slots, struct gfs2_sbd *sdp = arg; struct lm_lockstruct *ls = >sd_lockstruct; + if (atomic_read(>sd_log_errors)) { + fs_err(sdp, "recover_done ignored due to io error.\n"); + return; + } + if (withdrawn(sdp)) { + fs_err(sdp, "recover_done ignored due to withdraw.\n"); + return; + } /* ensure the ls jid arrays are large enough */ set_recover_size(sdp, slots, num_slots); @@ -1154,6 +1180,16 @@ static void gdlm_recovery_result(struct gfs2_sbd *sdp, unsigned int jid, { struct lm_lockstruct *ls = >sd_lockstruct; + if (atomic_read(>sd_log_errors)) { + fs_err(sdp, "recovery_result jid %d ignored due to io error.\n", + jid); + return; + } + if (withdrawn(sdp)) { + fs_err(sdp, "recovery_result jid %d ignored due to withdraw.\n", + jid); + return; + } if (test_bit(DFL_NO_DLM_OPS, >ls_recover_flags)) return; diff --git a/fs/gfs2/util.c b/fs/gfs2/util.c index 717aef772c60..ca6de80b5e8b 100644 --- a/fs/gfs2/util.c +++ b/fs/gfs2/util.c @@ -260,7 +260,7 @@ void gfs2_io_error_bh_i(struct gfs2_sbd *sdp, struct buffer_head *bh, const char *function, char *file, unsigned int line, bool withdraw) { - if (!test_bit(SDF_SHUTDOWN, >sd_flags)) + if (!withdrawn(sdp)) fs_err(sdp, "fatal: I/O error\n" " block = %llu\n" -- 2.20.1
[Cluster-devel] [PATCH 00/15] GFS2: Withdraw corruption patches [V2]
Hi, This is a revision to the patch set I sent on 13 February 2019. These won't make this merge window, obviously, because that's almost upon us. This version fixes some glaring mistakes and problems of the first set. As before, this may not be the final version, but I wanted to put it out for review anyway. Among changes from the original are: 1. I fixed some really stupid mistakes of the original patch set. 2. I found and fixed several additional problems not covered by the first patch set. 3. I broke up the patch "Force withdraw to replay journals and wait for it to finish" into more reasonaly sized pieces. It's still complex, but not nearly as bad as the original. 4. I included some of the instrumentation I've used to detect file system corruption. It makes sense to include them in mainline, I think. 5. I still need to figure out what to do about Dave Teigland's observation regarding the patch "dlm: recover slot regardless of whether we still have a connection". The patch is omitted in this set until I figure out a reasonable course of action. This version is much more stable. I've still been able to break it, given enough pressure, but I think that's an additional bug. I'll continue to chase it, and will post further patches, if necessary. These patches address a bunch of problems related to journal replay overwriting valid gfs2 metadata due to io errors, withdraws and such. These seem to fix several metadata corruption problems I've been able to reliably recreate lately with multi-node multi-file system recovery tests. Bob Peterson (15): gfs2: log error reform gfs2: Introduce concept of a pending withdraw gfs2: Ignore recovery attempts if gfs2 has io error or is withdrawn gfs2: move check_journal_clean to util.c for future use gfs2: Allow some glocks to be used during withdraw gfs2: Make secondary withdrawers wait for first withdrawer gfs2: Don't write log headers after file system withdraw gfs2: Force withdraw to replay journals and wait for it to finish gfs2: Add verbose option to check_journal_clean gfs2: Check for log write errors before telling dlm to unlock gfs2: Do log_flush in gfs2_ail_empty_gl even if ail list is empty gfs2: If the journal isn't live ignore log flushes gfs2: Issue revokes more intelligently gfs2: Warn when a journal replay overwrites a rgrp with buffers gfs2: log which portion of the journal is replayed fs/gfs2/aops.c | 4 +- fs/gfs2/file.c | 2 +- fs/gfs2/glock.c | 39 -- fs/gfs2/glock.h | 1 + fs/gfs2/glops.c | 82 - fs/gfs2/incore.h | 14 +++- fs/gfs2/lock_dlm.c | 68 + fs/gfs2/log.c| 94 +++- fs/gfs2/log.h| 1 + fs/gfs2/lops.c | 29 +++- fs/gfs2/meta_io.c| 6 +- fs/gfs2/ops_fstype.c | 52 ++--- fs/gfs2/quota.c | 8 +- fs/gfs2/recovery.c | 3 +- fs/gfs2/super.c | 30 fs/gfs2/super.h | 1 + fs/gfs2/sys.c| 2 +- fs/gfs2/util.c | 171 ++- fs/gfs2/util.h | 11 +++ 19 files changed, 477 insertions(+), 141 deletions(-) -- 2.20.1
[Cluster-devel] [PATCH 05/15] gfs2: Allow some glocks to be used during withdraw
Before this patch, when a file system was withdrawn, all further attempts to enqueue or promote glocks were rejected and returned -EIO. This is only important for media-backed glocks like inode and rgrp glocks. All other glocks may be safely used because there is no potential for metadata corruption. This patch allows some glocks to be used even after the file system is withdrawn. This is accomplished with a new glops flag, GLOF_OK_AT_WITHDRAW. This facilitates future patches that enhance fs withdraw. Signed-off-by: Bob Peterson --- fs/gfs2/glock.c | 4 +++- fs/gfs2/glops.c | 8 ++-- fs/gfs2/incore.h | 1 + 3 files changed, 10 insertions(+), 3 deletions(-) diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c index c6d6e478f5e3..25923b9e18ef 100644 --- a/fs/gfs2/glock.c +++ b/fs/gfs2/glock.c @@ -543,6 +543,7 @@ __acquires(>gl_lockref.lock) int ret; if (unlikely(withdrawn(sdp)) && + !(glops->go_flags & GLOF_OK_AT_WITHDRAW) && target != LM_ST_UNLOCKED) return; lck_flags &= (LM_FLAG_TRY | LM_FLAG_TRY_1CB | LM_FLAG_NOEXP | @@ -1091,7 +1092,8 @@ int gfs2_glock_nq(struct gfs2_holder *gh) struct gfs2_sbd *sdp = gl->gl_name.ln_sbd; int error = 0; - if (unlikely(withdrawn(sdp))) + if (unlikely(withdrawn(sdp)) && + !(gl->gl_ops->go_flags & GLOF_OK_AT_WITHDRAW)) return -EIO; if (test_bit(GLF_LRU, >gl_flags)) diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c index 9c86c8004ba7..4ba0f21aa312 100644 --- a/fs/gfs2/glops.c +++ b/fs/gfs2/glops.c @@ -583,6 +583,7 @@ static void iopen_go_callback(struct gfs2_glock *gl, bool remote) const struct gfs2_glock_operations gfs2_meta_glops = { .go_type = LM_TYPE_META, + .go_flags = GLOF_OK_AT_WITHDRAW, }; const struct gfs2_glock_operations gfs2_inode_glops = { @@ -610,6 +611,7 @@ const struct gfs2_glock_operations gfs2_freeze_glops = { .go_xmote_bh = freeze_go_xmote_bh, .go_demote_ok = freeze_go_demote_ok, .go_type = LM_TYPE_NONDISK, + .go_flags = GLOF_OK_AT_WITHDRAW, }; const struct gfs2_glock_operations gfs2_iopen_glops = { @@ -620,20 +622,22 @@ const struct gfs2_glock_operations gfs2_iopen_glops = { const struct gfs2_glock_operations gfs2_flock_glops = { .go_type = LM_TYPE_FLOCK, - .go_flags = GLOF_LRU, + .go_flags = GLOF_LRU | GLOF_OK_AT_WITHDRAW, }; const struct gfs2_glock_operations gfs2_nondisk_glops = { .go_type = LM_TYPE_NONDISK, + .go_flags = GLOF_OK_AT_WITHDRAW, }; const struct gfs2_glock_operations gfs2_quota_glops = { .go_type = LM_TYPE_QUOTA, - .go_flags = GLOF_LVB | GLOF_LRU, + .go_flags = GLOF_LVB | GLOF_LRU | GLOF_OK_AT_WITHDRAW, }; const struct gfs2_glock_operations gfs2_journal_glops = { .go_type = LM_TYPE_JOURNAL, + .go_flags = GLOF_OK_AT_WITHDRAW, }; const struct gfs2_glock_operations *gfs2_glops_list[] = { diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h index c6984265807f..d84430eceb6c 100644 --- a/fs/gfs2/incore.h +++ b/fs/gfs2/incore.h @@ -250,6 +250,7 @@ struct gfs2_glock_operations { #define GLOF_ASPACE 1 #define GLOF_LVB2 #define GLOF_LRU4 +#define GLOF_OK_AT_WITHDRAW 8 }; enum { -- 2.20.1
[Cluster-devel] [PATCH 04/15] gfs2: move check_journal_clean to util.c for future use
Before this patch function check_journal_clean was in ops_fstype.c. This patch moves it to util.c so we can make use of it elsewhere in a future patch. Signed-off-by: Bob Peterson --- fs/gfs2/ops_fstype.c | 42 - fs/gfs2/util.c | 45 fs/gfs2/util.h | 1 + 3 files changed, 46 insertions(+), 42 deletions(-) diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c index 868bfbff1845..881446df7891 100644 --- a/fs/gfs2/ops_fstype.c +++ b/fs/gfs2/ops_fstype.c @@ -592,48 +592,6 @@ static int gfs2_jindex_hold(struct gfs2_sbd *sdp, struct gfs2_holder *ji_gh) return error; } -/** - * check_journal_clean - Make sure a journal is clean for a spectator mount - * @sdp: The GFS2 superblock - * @jd: The journal descriptor - * - * Returns: 0 if the journal is clean or locked, else an error - */ -static int check_journal_clean(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd) -{ - int error; - struct gfs2_holder j_gh; - struct gfs2_log_header_host head; - struct gfs2_inode *ip; - - ip = GFS2_I(jd->jd_inode); - error = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_NOEXP | - GL_EXACT | GL_NOCACHE, _gh); - if (error) { - fs_err(sdp, "Error locking journal for spectator mount.\n"); - return -EPERM; - } - error = gfs2_jdesc_check(jd); - if (error) { - fs_err(sdp, "Error checking journal for spectator mount.\n"); - goto out_unlock; - } - error = gfs2_find_jhead(jd, ); - if (error) { - fs_err(sdp, "Error parsing journal for spectator mount.\n"); - goto out_unlock; - } - if (!(head.lh_flags & GFS2_LOG_HEAD_UNMOUNT)) { - error = -EPERM; - fs_err(sdp, "jid=%u: Journal is dirty, so the first mounter " - "must not be a spectator.\n", jd->jd_jid); - } - -out_unlock: - gfs2_glock_dq_uninit(_gh); - return error; -} - static int init_journal(struct gfs2_sbd *sdp, int undo) { struct inode *master = d_inode(sdp->sd_master_dir); diff --git a/fs/gfs2/util.c b/fs/gfs2/util.c index ca6de80b5e8b..dc00747d35f3 100644 --- a/fs/gfs2/util.c +++ b/fs/gfs2/util.c @@ -19,7 +19,10 @@ #include "gfs2.h" #include "incore.h" #include "glock.h" +#include "lops.h" +#include "recovery.h" #include "rgrp.h" +#include "super.h" #include "util.h" struct kmem_cache *gfs2_glock_cachep __read_mostly; @@ -36,6 +39,48 @@ void gfs2_assert_i(struct gfs2_sbd *sdp) fs_emerg(sdp, "fatal assertion failed\n"); } +/** + * check_journal_clean - Make sure a journal is clean for a spectator mount + * @sdp: The GFS2 superblock + * @jd: The journal descriptor + * + * Returns: 0 if the journal is clean or locked, else an error + */ +int check_journal_clean(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd) +{ + int error; + struct gfs2_holder j_gh; + struct gfs2_log_header_host head; + struct gfs2_inode *ip; + + ip = GFS2_I(jd->jd_inode); + error = gfs2_glock_nq_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_NOEXP | + GL_EXACT | GL_NOCACHE, _gh); + if (error) { + fs_err(sdp, "Error locking journal for spectator mount.\n"); + return -EPERM; + } + error = gfs2_jdesc_check(jd); + if (error) { + fs_err(sdp, "Error checking journal for spectator mount.\n"); + goto out_unlock; + } + error = gfs2_find_jhead(jd, ); + if (error) { + fs_err(sdp, "Error parsing journal for spectator mount.\n"); + goto out_unlock; + } + if (!(head.lh_flags & GFS2_LOG_HEAD_UNMOUNT)) { + error = -EPERM; + fs_err(sdp, "jid=%u: Journal is dirty, so the first mounter " + "must not be a spectator.\n", jd->jd_jid); + } + +out_unlock: + gfs2_glock_dq_uninit(_gh); + return error; +} + int gfs2_lm_withdraw(struct gfs2_sbd *sdp, const char *fmt, ...) { struct lm_lockstruct *ls = >sd_lockstruct; diff --git a/fs/gfs2/util.h b/fs/gfs2/util.h index 16e087da3bd3..fd72dfd592ab 100644 --- a/fs/gfs2/util.h +++ b/fs/gfs2/util.h @@ -131,6 +131,7 @@ static inline void gfs2_metatype_set(struct buffer_head *bh, u16 type, int gfs2_io_error_i(struct gfs2_sbd *sdp, const char *function, char *file, unsigned int line); +int check_journal_clean(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd); #define gfs2_io_error(sdp) \ gfs2_io_error_i((sdp), __func__, __FILE__, __LINE__); -- 2.20.1
Re: [Cluster-devel] iomap_dio_rw: lockdep_assert_held(>i_rwsem)
Hi Christoph, I've tried to get your feedback on the lockdep_assert_held in iomap_dio_rw in January and didn't hear back from you. Could you please have a look? On Tue, 22 Jan 2019 at 22:26, Andreas Gruenbacher wrote: > > Hi Christoph, > > there's an assertion that the inode rwsem is taken in iomap_dio_rw > which currently triggers on gfs2 via gfs2_file_read_iter -> > gfs2_file_direct_read -> iomap_dio_rw in fs/gfs2/file.c. > > We're not using inode_[un]lock_shared on that code path because gfs2 > uses its own cluster-wide locking already. Is there a reason why the > inode needs to be locked locally anyway? If not, I'd like to avoid > adding unnecessary lock taking in gfs2_file_direct_read just to > silence iomap_dio_rw. Thanks, Andreas
Re: [Cluster-devel] GFS2, lockdep and deadlocks on 4.19
Edwin, On Wed, 27 Feb 2019 at 17:11, Edwin Török wrote: > I was trying to use lockdep to track down a GFS2 deadlock in kernel 4.19.19. > I was not able to reproduce the deadlock with CONFIG_PROVE_LOCKING, however I > got quite a lot of warnings about GFS2 code calling functions without holding > prerequisite locks. > I would hope that these are false positives (GFS2 would hold a glock, and not > the local inode lock), but would be nice if someone could confirm that. > Should the glocks also take the local inode locks to prevent these warnings, > and avoid generic Linux code from interfering with the inodes? I've stumbled upon those two warnings as well. Still need to analyze what's going on exactly. > LOCKDEP issue #1 > > > /** > * inode_to_wb - determine the wb of an inode > * @inode: inode of interest > * > * Returns the wb @inode is currently associated with. The caller must be > * holding either @inode->i_lock, the i_pages lock, or the > * associated wb's list_lock. > */ > static inline struct bdi_writeback *inode_to_wb(const struct inode *inode) > { > #ifdef CONFIG_LOCKDEP > WARN_ON_ONCE(debug_locks && > (!lockdep_is_held(>i_lock) && > !lockdep_is_held(>i_mapping->i_pages.xa_lock) && > !lockdep_is_held(>i_wb->list_lock))); The first one seems to be about accessing internal inodes via generic functions which are not reachable from the outside, so we hopefully shouldn't need the locking. > Feb 25 22:14:57 localhost kernel: [ 378.925268] WARNING: CPU: 6 PID: 15102 > at ./include/linux/backing-dev.h:342 account_page_dirtied+0x296/0x340 > LOCKDEP issue #2: > - > > size_t > iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, > const struct iomap_ops *ops, iomap_dio_end_io_t end_io) > { > struct address_space *mapping = iocb->ki_filp->f_mapping; > struct inode *inode = file_inode(iocb->ki_filp); > size_t count = iov_iter_count(iter); > loff_t pos = iocb->ki_pos, start = pos; > loff_t end = iocb->ki_pos + count - 1, ret = 0; > unsigned int flags = IOMAP_DIRECT; > struct blk_plug plug; > struct iomap_dio *dio; > > lockdep_assert_held(>i_rwsem) I've asked Christoph Hellwig about this one a month ago; he must have overlooked my message: https://www.redhat.com/archives/cluster-devel/2019-January/msg00090.html Making the warning go away just by adding the locking would be trivial, but let's see what Christoph says. Thanks, Andreas