Re: [Cluster-devel] [syzbot] [gfs2?] WARNING: suspicious RCU usage in gfs2_permission

2023-10-30 Thread Andreas Gruenbacher
Al,

On Wed, Oct 25, 2023 at 5:29 AM Al Viro  wrote:
> On Fri, Oct 20, 2023 at 12:10:38AM -0700, syzbot wrote:
> > syzbot has bisected this issue to:
> >
> > commit 0abd1557e21c617bd13fc18f7725fc6363c05913
> > Author: Al Viro 
> > Date:   Mon Oct 2 02:33:44 2023 +
> >
> > gfs2: fix an oops in gfs2_permission
> >
> > bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=10b21c3368
> > start commit:   2dac75696c6d Add linux-next specific files for 20231018
> > git tree:   linux-next
> > final oops: https://syzkaller.appspot.com/x/report.txt?x=12b21c3368
> > console output: https://syzkaller.appspot.com/x/log.txt?x=14b21c3368
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=6f8545e1ef7a2b66
> > dashboard link: https://syzkaller.appspot.com/bug?extid=3e5130844b0c0e2b4948
> > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=101c8d0968
> > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=11a0747568
> >
> > Reported-by: syzbot+3e5130844b0c0e2b4...@syzkaller.appspotmail.com
> > Fixes: 0abd1557e21c ("gfs2: fix an oops in gfs2_permission")
> >
> > For information about bisection process see: https://goo.gl/tpsmEJ#bisection
>
> Complaints about rcu_dereference() outside of rcu_read_lock().
>
> We could replace that line with
> 	if (mask & MAY_NOT_BLOCK)
> 		gl = rcu_dereference(ip->i_gl);
> 	else
> 		gl = ip->i_gl;
> or by any equivalent way to tell lockdep it ought to STFU.

the following should do then, right?

gl = rcu_dereference_check(ip->i_gl, !(mask & MAY_NOT_BLOCK));
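For illustration, here is a tiny userspace model of what that condition buys us (the lockdep decision is reduced to a boolean; the MAY_NOT_BLOCK value matches include/linux/fs.h, but treat the rest as a sketch, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

#define MAY_NOT_BLOCK 0x00000080  /* as in include/linux/fs.h */

/*
 * Simplified model of rcu_dereference_check(p, c): lockdep stays quiet
 * when we are inside an RCU read-side critical section, OR when the
 * extra condition c holds.  Here c is !(mask & MAY_NOT_BLOCK): a
 * blocking caller holds a reference on the inode, so ip->i_gl cannot
 * go away under it.
 */
static bool deref_allowed(bool in_rcu_read_section, int mask)
{
    bool c = !(mask & MAY_NOT_BLOCK);
    return in_rcu_read_section || c;
}
```

An RCU-walk caller (MAY_NOT_BLOCK set) is only legal under rcu_read_lock(); every other caller passes the check unconditionally.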

> BTW, the amount of rcu_dereference_protected(..., true) is somewhat 
> depressing...
>
> Probably need to turn
> ip->i_gl = NULL;
> in the end of gfs2_evict_inode() into rcu_assign_pointer(ip->i_gl, NULL);

That's what commit 0abd1557e21c6 does already so there's nothing to fix, right?

> and transpose it with the previous line -
> gfs2_glock_put_eventually(ip->i_gl);
>
> I don't think it really matters in this case, though - destruction of the
> object it used to point to is delayed in all cases.  Matter of taste (and
> lockdep false positives)...

I don't understand. What would lockdep complain about here?

Thanks,
Andreas



[Cluster-devel] [GIT PULL] gfs2 fixes

2023-09-05 Thread Andreas Gruenbacher
Hi Linus,

please consider pulling the following gfs2 fixes (*) into the current merge
window.

(*) Technically, this updates the address of the shared gfs2 and dlm mailing
list, so this affects dlm as well; I've coordinated this change with David
Teigland.

Thanks,
Andreas

The following changes since commit 02aee814d37c563e24b73bcd0f9cb608fbd403d4:

  Merge tag 'gfs2-v6.4-fixes' of 
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 (2023-08-08 
09:27:08 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git 
tags/gfs2-v6.5-rc5-fixes

for you to fetch changes up to 2938fd750e8b73a6dac4d9339fb6f7f1cd624a2d:

  MAINTAINERS: Update dlm mailing list (2023-09-05 17:43:07 +0200)


gfs2 fixes

- Fix a glock state (non-)transition bug when a dlm request times out
  and is canceled, and we have locking requests that can now be granted
  immediately.

- Various fixes and cleanups in how the logd and quotad daemons are
  woken up and terminated.

- Fix several bugs in the quota data reference counting and shrinking.
  Free quota data objects synchronously in put_super() instead of
  letting call_rcu() run wild.

- Make sure not to deallocate quota data during a withdraw; rather, defer
  quota data deallocation to put_super().  Withdraws can happen in
  contexts in which callers on the stack are holding quota data references.

- Many minor quota fixes and cleanups by Bob.

- Update the mailing list address for gfs2 and dlm.  (It's the same
  list for both and we are moving it to g...@lists.linux.dev.)

- Various other minor cleanups.


Andreas Gruenbacher (24):
  gfs2: Use mapping->gfp_mask for metadata inodes
  gfs: Don't use GFP_NOFS in gfs2_unstuff_dinode
  gfs2: do_promote cleanup
  gfs2: Remove LM_FLAG_PRIORITY flag
  gfs2: Switch to wait_event in gfs2_logd
  gfs2: low-memory forced flush fixes
  gfs2: Fix logd wakeup on I/O error
  gfs2: journal flush threshold fixes and cleanup
  gfs2: Rename sd_{ glock => kill }_wait
  gfs2: Rename SDF_DEACTIVATING to SDF_KILL
  gfs2: Fix wrong quota shrinker return value
  gfs2: Use gfs2_qd_dispose in gfs2_quota_cleanup
  gfs2: Factor out duplicate quota data disposal code
  gfs2: No more quota complaints after withdraw
  gfs2: Fix initial quota data refcount
  gfs2: Free quota data objects synchronously
  gfs2: Stop using gfs2_make_fs_ro for withdraw
  gfs2: Fix asynchronous thread destruction
  gfs2: Switch to wait_event in gfs2_quotad
  gfs2: Sanitize kthread stopping
  gfs2: Fix withdraw race
  gfs2: Rename "gfs_recovery" workqueue to "gfs2_recovery"
  gfs2: Rename "freeze_workqueue" to "gfs2_freeze"
  gfs2: Add device name to gfs2_logd and gfs2_quotad

Andrew Price (2):
  MAINTAINERS: Update gfs2 mailing list
  MAINTAINERS: Update dlm mailing list

Bob Peterson (24):
  gfs2: conversion deadlock do_promote bypass
  gfs2: Use qd_sbd more consequently
  gfs2: Introduce new quota=quiet mount option
  gfs2: remove dead code for quota writes
  gfs2: Pass sdp to gfs2_adjust_quota
  gfs2: pass sdp in to gfs2_write_disk_quota
  gfs2: pass sdp to gfs2_write_buf_to_page
  gfs2: remove unneeded variable done
  gfs2: remove unneeded pg_oflow variable
  gfs2: Simplify function need_sync
  gfs2: Don't try to sync non-changes
  gfs2: improvements to sysfs status
  gfs2: move qdsb_put and reduce redundancy
  gfs2: Small gfs2_quota_lock cleanup
  gfs2: Remove useless err set
  gfs2: Set qd_sync_gen in do_sync
  gfs2: use constant for array size
  gfs2: Remove quota allocation info from quota file
  gfs2: introduce qd_bh_get_or_undo
  gfs2: Simplify qd2offset
  gfs2: simplify slot_get
  gfs2: Remove useless assignment
  gfs2: check for no eligible quota changes
  gfs2: change qd_slot_count to qd_slot_ref

Minjie Du (1):
  gfs2: increase usage of folio_next_index() helper

 Documentation/filesystems/gfs2-glocks.rst |   3 +-
 MAINTAINERS   |   4 +-
 fs/gfs2/aops.c|   7 +-
 fs/gfs2/bmap.c|   2 +-
 fs/gfs2/glock.c   |  47 ++--
 fs/gfs2/glock.h   |   9 -
 fs/gfs2/glops.c   |   2 +-
 fs/gfs2/incore.h  |   7 +-
 fs/gfs2/inode.c   |  14 +-
 fs/gfs2/lock_dlm.c|   5 -
 fs/gfs2/log.c |  69 +++---
 fs/gfs2/lops.c|   7 +-
 fs/gfs2/main.c|  10 +-
 fs/gfs2/ops_fstype.c  |  42 ++--
 fs/gfs2/quota.c  

Re: [Cluster-devel] [PATCH] gfs2: Fix uaf for qda in gfs2_quota_sync

2023-08-24 Thread Andreas Gruenbacher
I've pushed the fixes for this bug to for-next.

https://lore.kernel.org/cluster-devel/20230824211948.3242607-1-agrue...@redhat.com/

Thanks,
Andreas



[Cluster-devel] [PATCH 1/9] gfs2: Use qd_sbd more consequently

2023-08-24 Thread Andreas Gruenbacher
From: Bob Peterson 

Before this patch many of the functions in quota.c got their superblock
pointer, sdp, from the quota_data's glock pointer. That's silly because
the qd already has its own pointer to the superblock (qd_sbd).

This patch changes references to use that instead, eliminating a level
of indirection.
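The indirection being removed can be pictured with reduced stand-in structs (only the fields involved; the real kernel definitions have many more members):

```c
#include <assert.h>

/* Reduced stand-ins for the kernel structures.  What matters for this
 * patch: both paths end at the same superblock. */
struct gfs2_sbd { int sd_dummy; };
struct lm_lockname { struct gfs2_sbd *ln_sbd; };
struct gfs2_glock { struct lm_lockname gl_name; };
struct gfs2_quota_data {
    struct gfs2_glock *qd_gl;
    struct gfs2_sbd *qd_sbd;   /* direct pointer to the superblock */
};

/* Before the patch: reach the superblock through the glock. */
static struct gfs2_sbd *sbd_via_glock(const struct gfs2_quota_data *qd)
{
    return qd->qd_gl->gl_name.ln_sbd;
}

/* After the patch: one load, no detour through the glock. */
static struct gfs2_sbd *sbd_direct(const struct gfs2_quota_data *qd)
{
    return qd->qd_sbd;
}

/* Test helper: build a qd whose two paths must agree. */
static int paths_agree(void)
{
    static struct gfs2_sbd sdp;
    static struct gfs2_glock gl;
    static struct gfs2_quota_data qd;
    gl.gl_name.ln_sbd = &sdp;
    qd.qd_gl = &gl;
    qd.qd_sbd = &sdp;
    return sbd_via_glock(&qd) == sbd_direct(&qd);
}
```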

Signed-off-by: Bob Peterson 
Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/quota.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index 96d41ee034d7..48b9fbffe260 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -119,7 +119,7 @@ static void gfs2_qd_dispose(struct list_head *list)
 
while (!list_empty(list)) {
qd = list_first_entry(list, struct gfs2_quota_data, qd_lru);
-   sdp = qd->qd_gl->gl_name.ln_sbd;
+   sdp = qd->qd_sbd;
 
	list_del(&qd->qd_lru);
 
@@ -302,7 +302,7 @@ static int qd_get(struct gfs2_sbd *sdp, struct kqid qid,
 
 static void qd_hold(struct gfs2_quota_data *qd)
 {
-   struct gfs2_sbd *sdp = qd->qd_gl->gl_name.ln_sbd;
+   struct gfs2_sbd *sdp = qd->qd_sbd;
	gfs2_assert(sdp, !__lockref_is_dead(&qd->qd_lockref));
	lockref_get(&qd->qd_lockref);
 }
@@ -367,7 +367,7 @@ static void slot_put(struct gfs2_quota_data *qd)
 
 static int bh_get(struct gfs2_quota_data *qd)
 {
-   struct gfs2_sbd *sdp = qd->qd_gl->gl_name.ln_sbd;
+   struct gfs2_sbd *sdp = qd->qd_sbd;
struct inode *inode = sdp->sd_qc_inode;
struct gfs2_inode *ip = GFS2_I(inode);
unsigned int block, offset;
@@ -421,7 +421,7 @@ static int bh_get(struct gfs2_quota_data *qd)
 
 static void bh_put(struct gfs2_quota_data *qd)
 {
-   struct gfs2_sbd *sdp = qd->qd_gl->gl_name.ln_sbd;
+   struct gfs2_sbd *sdp = qd->qd_sbd;
 
	mutex_lock(&sdp->sd_quota_mutex);
gfs2_assert(sdp, qd->qd_bh_count);
@@ -489,8 +489,7 @@ static int qd_fish(struct gfs2_sbd *sdp, struct gfs2_quota_data **qdp)
 
 static void qd_unlock(struct gfs2_quota_data *qd)
 {
-   gfs2_assert_warn(qd->qd_gl->gl_name.ln_sbd,
-		 test_bit(QDF_LOCKED, &qd->qd_flags));
+	gfs2_assert_warn(qd->qd_sbd, test_bit(QDF_LOCKED, &qd->qd_flags));
	clear_bit(QDF_LOCKED, &qd->qd_flags);
bh_put(qd);
slot_put(qd);
@@ -666,7 +665,7 @@ static int sort_qd(const void *a, const void *b)
 
 static void do_qc(struct gfs2_quota_data *qd, s64 change, int qc_type)
 {
-   struct gfs2_sbd *sdp = qd->qd_gl->gl_name.ln_sbd;
+   struct gfs2_sbd *sdp = qd->qd_sbd;
struct gfs2_inode *ip = GFS2_I(sdp->sd_qc_inode);
struct gfs2_quota_change *qc = qd->qd_bh_qc;
s64 x;
@@ -881,7 +880,7 @@ static int gfs2_adjust_quota(struct gfs2_inode *ip, loff_t loc,
 
 static int do_sync(unsigned int num_qd, struct gfs2_quota_data **qda)
 {
-   struct gfs2_sbd *sdp = (*qda)->qd_gl->gl_name.ln_sbd;
+   struct gfs2_sbd *sdp = (*qda)->qd_sbd;
struct gfs2_inode *ip = GFS2_I(sdp->sd_quota_inode);
struct gfs2_alloc_parms ap = { .aflags = 0, };
unsigned int data_blocks, ind_blocks;
@@ -1009,11 +1008,12 @@ static int update_qd(struct gfs2_sbd *sdp, struct gfs2_quota_data *qd)
 static int do_glock(struct gfs2_quota_data *qd, int force_refresh,
struct gfs2_holder *q_gh)
 {
-   struct gfs2_sbd *sdp = qd->qd_gl->gl_name.ln_sbd;
+   struct gfs2_sbd *sdp = qd->qd_sbd;
struct gfs2_inode *ip = GFS2_I(sdp->sd_quota_inode);
struct gfs2_holder i_gh;
int error;
 
+   gfs2_assert_warn(sdp, sdp == qd->qd_gl->gl_name.ln_sbd);
 restart:
error = gfs2_glock_nq_init(qd->qd_gl, LM_ST_SHARED, 0, q_gh);
if (error)
@@ -1091,7 +1091,7 @@ int gfs2_quota_lock(struct gfs2_inode *ip, kuid_t uid, kgid_t gid)
 
 static int need_sync(struct gfs2_quota_data *qd)
 {
-   struct gfs2_sbd *sdp = qd->qd_gl->gl_name.ln_sbd;
+   struct gfs2_sbd *sdp = qd->qd_sbd;
	struct gfs2_tune *gt = &sdp->sd_tune;
s64 value;
unsigned int num, den;
@@ -1178,7 +1178,7 @@ void gfs2_quota_unlock(struct gfs2_inode *ip)
 
 static int print_message(struct gfs2_quota_data *qd, char *type)
 {
-   struct gfs2_sbd *sdp = qd->qd_gl->gl_name.ln_sbd;
+   struct gfs2_sbd *sdp = qd->qd_sbd;
 
fs_info(sdp, "quota %s for %s %u\n",
type,
-- 
2.40.1



[Cluster-devel] [PATCH 9/9] gfs2: Fix quota data refcount after cleanup

2023-08-24 Thread Andreas Gruenbacher
[   81.372851][ T5532] CPU: 1 PID: 5532 Comm: syz-executor.0 Not tainted 6.2.0-rc1-syzkaller-dirty #0
[   81.382080][ T5532] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/12/2023
[   81.392343][ T5532] Call Trace:
[   81.395654][ T5532]  
[   81.398603][ T5532]  dump_stack_lvl+0x1b1/0x290
[   81.418421][ T5532]  gfs2_assert_warn_i+0x19a/0x2e0
[   81.423480][ T5532]  gfs2_quota_cleanup+0x4c6/0x6b0
[   81.428611][ T5532]  gfs2_make_fs_ro+0x517/0x610
[   81.457802][ T5532]  gfs2_withdraw+0x609/0x1540
[   81.481452][ T5532]  gfs2_inode_refresh+0xb2d/0xf60
[   81.506658][ T5532]  gfs2_instantiate+0x15e/0x220
[   81.511504][ T5532]  gfs2_glock_wait+0x1d9/0x2a0
[   81.516352][ T5532]  do_sync+0x485/0xc80
[   81.554943][ T5532]  gfs2_quota_sync+0x3da/0x8b0
[   81.559738][ T5532]  gfs2_sync_fs+0x49/0xb0
[   81.564063][ T5532]  sync_filesystem+0xe8/0x220
[   81.568740][ T5532]  generic_shutdown_super+0x6b/0x310
[   81.574112][ T5532]  kill_block_super+0x79/0xd0
[   81.578779][ T5532]  deactivate_locked_super+0xa7/0xf0
[   81.584064][ T5532]  cleanup_mnt+0x494/0x520
[   81.593753][ T5532]  task_work_run+0x243/0x300
[   81.608837][ T5532]  exit_to_user_mode_loop+0x124/0x150
[   81.614232][ T5532]  exit_to_user_mode_prepare+0xb2/0x140
[   81.619820][ T5532]  syscall_exit_to_user_mode+0x26/0x60
[   81.625287][ T5532]  do_syscall_64+0x49/0xb0
[   81.629710][ T5532]  entry_SYSCALL_64_after_hwframe+0x63/0xcd

In this backtrace, gfs2_quota_sync() takes quota data references and
then calls do_sync().  Function do_sync() encounters filesystem
corruption and withdraws the filesystem, which (among other things) calls
gfs2_quota_cleanup().  Function gfs2_quota_cleanup() wrongly assumes
that nobody is holding any quota data references anymore, and destroys
all quota data objects.  When gfs2_quota_sync() then resumes and
dereferences the quota data objects it is holding, those objects are no
longer there.

Fix that by changing gfs2_quota_cleanup() to only destroy quota data
objects that are no longer in use.  This also means that qd_put() needs
to check if quotas are still active, and destroy quota data objects
itself otherwise.  (We check for the presence of the SDF_JOURNAL_LIVE
flag to decide whether quotas are still active or not.)
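The corrected lifetime rule can be sketched as a small userspace model (plain int in place of the kernel's lockref; not the actual code):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of the fixed qd_put() decision: dropping the last reference
 * either parks the object on the LRU for reuse (quotas active, i.e.
 * SDF_JOURNAL_LIVE still set) or disposes of it immediately (after
 * withdraw).  gfs2_quota_cleanup() then only destroys objects whose
 * refcount has already reached zero.
 */
enum qd_fate { QD_STILL_HELD, QD_ON_LRU, QD_DISPOSED };

static enum qd_fate qd_put_model(int *refcount, bool journal_live)
{
    if (--*refcount > 0)
        return QD_STILL_HELD;
    return journal_live ? QD_ON_LRU : QD_DISPOSED;
}

/* Drive an object held by `holders` through `puts` puts. */
static enum qd_fate fate_after(int holders, int puts, bool journal_live)
{
    enum qd_fate fate = QD_STILL_HELD;
    while (puts--)
        fate = qd_put_model(&holders, journal_live);
    return fate;
}
```

In the syzbot trace above, the bug was effectively cleanup disposing of objects whose model refcount was still positive.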

Thanks to Edward Adam Davis  for the initial patches.

Link: https://lore.kernel.org/all/2b5e2405f14e8...@google.com
Reported-by: syzbot+3f6a670108ce43356...@syzkaller.appspotmail.com
Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/ops_fstype.c |  1 +
 fs/gfs2/quota.c  | 41 ++---
 fs/gfs2/quota.h  |  1 +
 fs/gfs2/super.c  |  1 +
 4 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 6ea295cee463..4035f61c294d 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -1590,6 +1590,7 @@ static int gfs2_reconfigure(struct fs_context *fc)
if ((sb->s_flags ^ fc->sb_flags) & SB_RDONLY) {
if (fc->sb_flags & SB_RDONLY) {
gfs2_make_fs_ro(sdp);
+   gfs2_quota_wait_cleanup(sdp);
} else {
error = gfs2_make_fs_rw(sdp);
if (error)
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index 97fdf64148ba..232211359f58 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -109,7 +109,11 @@ static inline void spin_unlock_bucket(unsigned int hash)
 static void gfs2_qd_dealloc(struct rcu_head *rcu)
 {
	struct gfs2_quota_data *qd = container_of(rcu, struct gfs2_quota_data, qd_rcu);
+   struct gfs2_sbd *sdp = qd->qd_sbd;
+
kmem_cache_free(gfs2_quotad_cachep, qd);
+	if (atomic_dec_and_test(&sdp->sd_quota_count))
+		wake_up(&sdp->sd_kill_wait);
 }
 
 static void gfs2_qd_dispose(struct gfs2_quota_data *qd)
@@ -143,7 +147,6 @@ static void gfs2_qd_list_dispose(struct list_head *list)
	list_del(&qd->qd_lru);
 
gfs2_qd_dispose(qd);
-	atomic_dec(&sdp->sd_quota_count);
}
 }
 
@@ -317,13 +320,24 @@ static void qd_hold(struct gfs2_quota_data *qd)
 
 static void qd_put(struct gfs2_quota_data *qd)
 {
+   struct gfs2_sbd *sdp;
+
	if (lockref_put_or_lock(&qd->qd_lockref))
return;
 
+	BUG_ON(__lockref_is_dead(&qd->qd_lockref));
+	sdp = qd->qd_sbd;
+	if (unlikely(!test_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags))) {
+		lockref_mark_dead(&qd->qd_lockref);
+		spin_unlock(&qd->qd_lockref.lock);
+
+   gfs2_qd_dispose(qd);
+   return;
+   }
+
qd->qd_lockref.count = 0;
	list_lru_add(&gfs2_qd_lru, &qd->qd_lru);
	spin_unlock(&qd->qd_lockref.lock);
-
 }
 
 static int slot_get(struct gfs2_quota_data *qd)
@@ -1466,20 +1480,41 @@ void gfs2_quota_cleanup(struct gfs2_sbd *sdp)
struct gfs2_quota_data *qd;
LIST_HEAD(dispose);
 
+   BUG_ON(

[Cluster-devel] [PATCH 7/9] gfs2: No more quota complaints after withdraw

2023-08-24 Thread Andreas Gruenbacher
Once a filesystem is withdrawn, don't complain about quota changes
that can't be synced to the main quota file anymore.

Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/quota.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index 01fae6b030e9..fccdb22980e8 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -124,9 +124,11 @@ static void gfs2_qd_dispose(struct gfs2_quota_data *qd)
	hlist_bl_del_rcu(&qd->qd_hlist);
spin_unlock_bucket(qd->qd_hash);
 
-   gfs2_assert_warn(sdp, !qd->qd_change);
-   gfs2_assert_warn(sdp, !qd->qd_slot_count);
-   gfs2_assert_warn(sdp, !qd->qd_bh_count);
+   if (!gfs2_withdrawn(sdp)) {
+   gfs2_assert_warn(sdp, !qd->qd_change);
+   gfs2_assert_warn(sdp, !qd->qd_slot_count);
+   gfs2_assert_warn(sdp, !qd->qd_bh_count);
+   }
 
gfs2_glock_put(qd->qd_gl);
	call_rcu(&qd->qd_rcu, gfs2_qd_dealloc);
-- 
2.40.1



[Cluster-devel] [PATCH 3/9] gfs2: Rename SDF_DEACTIVATING to SDF_KILL

2023-08-24 Thread Andreas Gruenbacher
Rename the SDF_DEACTIVATING flag to SDF_KILL to make it more obvious
that this relates to the kill_sb filesystem operation.

Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/glock.c  | 4 ++--
 fs/gfs2/glops.c  | 2 +-
 fs/gfs2/incore.h | 2 +-
 fs/gfs2/ops_fstype.c | 4 ++--
 fs/gfs2/super.c  | 2 +-
 fs/gfs2/sys.c| 2 +-
 6 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index e242745cf40f..9cbf8d98489a 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -1022,7 +1022,7 @@ static void delete_work_func(struct work_struct *work)
 * step entirely.
 */
if (gfs2_try_evict(gl)) {
-		if (test_bit(SDF_DEACTIVATING, &sdp->sd_flags))
+		if (test_bit(SDF_KILL, &sdp->sd_flags))
goto out;
if (gfs2_queue_verify_evict(gl))
return;
@@ -1035,7 +1035,7 @@ static void delete_work_func(struct work_struct *work)
GFS2_BLKST_UNLINKED);
if (IS_ERR(inode)) {
if (PTR_ERR(inode) == -EAGAIN &&
-		    !test_bit(SDF_DEACTIVATING, &sdp->sd_flags) &&
+		    !test_bit(SDF_KILL, &sdp->sd_flags) &&
gfs2_queue_verify_evict(gl))
return;
} else {
diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index 54319328b16b..7d6cca467fa1 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -637,7 +637,7 @@ static void iopen_go_callback(struct gfs2_glock *gl, bool 
remote)
struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
 
if (!remote || sb_rdonly(sdp->sd_vfs) ||
-	    test_bit(SDF_DEACTIVATING, &sdp->sd_flags))
+	    test_bit(SDF_KILL, &sdp->sd_flags))
return;
 
if (gl->gl_demote_state == LM_ST_UNLOCKED &&
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 7abb43bb8df0..ab857431dfa4 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -606,7 +606,7 @@ enum {
SDF_REMOTE_WITHDRAW = 13, /* Performing remote recovery */
SDF_WITHDRAW_RECOVERY   = 14, /* Wait for journal recovery when we are
 withdrawing */
-   SDF_DEACTIVATING= 15,
+   SDF_KILL= 15,
SDF_EVICTING= 16,
SDF_FROZEN  = 17,
 };
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 19d19491c0d1..6ea295cee463 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -1786,9 +1786,9 @@ static void gfs2_kill_sb(struct super_block *sb)
/*
 * Flush and then drain the delete workqueue here (via
 * destroy_workqueue()) to ensure that any delete work that
-* may be running will also see the SDF_DEACTIVATING flag.
+* may be running will also see the SDF_KILL flag.
 */
-	set_bit(SDF_DEACTIVATING, &sdp->sd_flags);
+	set_bit(SDF_KILL, &sdp->sd_flags);
gfs2_flush_delete_work(sdp);
destroy_workqueue(sdp->sd_delete_wq);
 
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 9f4d5d6549ee..e0dceef3c9cc 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -546,7 +546,7 @@ void gfs2_make_fs_ro(struct gfs2_sbd *sdp)
 {
	int log_write_allowed = test_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags);
 
-	if (!test_bit(SDF_DEACTIVATING, &sdp->sd_flags))
+	if (!test_bit(SDF_KILL, &sdp->sd_flags))
gfs2_flush_delete_work(sdp);
 
if (!log_write_allowed && current == sdp->sd_quotad_process)
diff --git a/fs/gfs2/sys.c b/fs/gfs2/sys.c
index 2dfbe2f188dd..a85fe7b92911 100644
--- a/fs/gfs2/sys.c
+++ b/fs/gfs2/sys.c
@@ -118,7 +118,7 @@ static ssize_t status_show(struct gfs2_sbd *sdp, char *buf)
		     test_bit(SDF_WITHDRAW_IN_PROG, &f),
		     test_bit(SDF_REMOTE_WITHDRAW, &f),
		     test_bit(SDF_WITHDRAW_RECOVERY, &f),
-		     test_bit(SDF_DEACTIVATING, &f),
+		     test_bit(SDF_KILL, &f),
 sdp->sd_log_error,
		     rwsem_is_locked(&sdp->sd_log_flush_lock),
 sdp->sd_log_num_revoke,
-- 
2.40.1



[Cluster-devel] [PATCH 4/9] gfs2: Fix wrong quota shrinker return value

2023-08-24 Thread Andreas Gruenbacher
Function gfs2_qd_isolate must only return LRU_REMOVED when removing the
item from the lru list; otherwise, the shrinker's accounting of the
number of items on the list will be wrong.
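The corrected return-value logic boils down to this small model (enum values are local to the sketch, not the kernel's enum lru_status):

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the fix: only report "removed" when the item was actually
 * taken off the LRU and moved to the dispose list; in every other case
 * the item stays put and the walker must not count it as removed. */
enum lru_status_m { LRU_SKIP_M, LRU_REMOVED_M };

static enum lru_status_m isolate_model(bool trylock_ok, int refcount)
{
    if (!trylock_ok)
        return LRU_SKIP_M;        /* couldn't even look at the item */
    if (refcount == 0)
        return LRU_REMOVED_M;     /* dead: moved to the dispose list */
    return LRU_SKIP_M;            /* busy: left on the LRU */
}
```

The pre-fix code returned the "removed" status even on the busy path, which is what made the list's item count drift.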

Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/quota.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index 48b9fbffe260..f58072efafc9 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -149,18 +149,22 @@ static enum lru_status gfs2_qd_isolate(struct list_head *item,
struct list_lru_one *lru, spinlock_t *lru_lock, void *arg)
 {
struct list_head *dispose = arg;
-	struct gfs2_quota_data *qd = list_entry(item, struct gfs2_quota_data, qd_lru);
+   struct gfs2_quota_data *qd =
+   list_entry(item, struct gfs2_quota_data, qd_lru);
+   enum lru_status status;
 
	if (!spin_trylock(&qd->qd_lockref.lock))
return LRU_SKIP;
 
+   status = LRU_SKIP;
if (qd->qd_lockref.count == 0) {
		lockref_mark_dead(&qd->qd_lockref);
		list_lru_isolate_move(lru, &qd->qd_lru, dispose);
+   status = LRU_REMOVED;
}
 
	spin_unlock(&qd->qd_lockref.lock);
-   return LRU_REMOVED;
+   return status;
 }
 
 static unsigned long gfs2_qd_shrink_scan(struct shrinker *shrink,
-- 
2.40.1



[Cluster-devel] [PATCH 8/9] gfs2: Fix initial quota data refcount

2023-08-24 Thread Andreas Gruenbacher
Fix the refcount of quota data objects created directly by
gfs2_quota_init(): those are placed into the in-memory quota "database"
for eventual syncing to the main quota file, but they are not actively
held and should thus have an initial refcount of 0.
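As a model of the corrected initialization (plain int instead of the kernel's lockref, qd_get()'s insertion path reduced to one increment):

```c
#include <assert.h>

/* qd_alloc() now starts objects at refcount 0; the first reference is
 * only taken when qd_get() actually installs the object and hands it
 * to a caller.  Objects created by gfs2_quota_init() for the in-memory
 * database alone therefore stay at 0. */
struct qd_m { int refcount; };

static struct qd_m qd_alloc_m(void)
{
    return (struct qd_m){ .refcount = 0 };
}

/* The qd_get() path that inserts a freshly allocated object. */
static int refcount_after_get(void)
{
    struct qd_m qd = qd_alloc_m();
    qd.refcount++;   /* models new_qd->qd_lockref.count++ in the patch */
    return qd.refcount;
}
```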

Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/quota.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index fccdb22980e8..97fdf64148ba 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -228,7 +228,7 @@ static struct gfs2_quota_data *qd_alloc(unsigned hash, struct gfs2_sbd *sdp, struct kqid qid)
return NULL;
 
qd->qd_sbd = sdp;
-   qd->qd_lockref.count = 1;
+   qd->qd_lockref.count = 0;
	spin_lock_init(&qd->qd_lockref.lock);
qd->qd_id = qid;
qd->qd_slot = -1;
@@ -290,6 +290,7 @@ static int qd_get(struct gfs2_sbd *sdp, struct kqid qid,
spin_lock_bucket(hash);
*qdp = qd = gfs2_qd_search_bucket(hash, sdp, qid);
if (qd == NULL) {
+   new_qd->qd_lockref.count++;
*qdp = new_qd;
		list_add(&new_qd->qd_list, &sdp->sd_quota_list);
		hlist_bl_add_head_rcu(&new_qd->qd_hlist, &qd_hash_table[hash]);
-- 
2.40.1



[Cluster-devel] [PATCH 2/9] gfs2: Rename sd_{ glock => kill }_wait

2023-08-24 Thread Andreas Gruenbacher
Rename sd_glock_wait to sd_kill_wait: we'll use it for other things
related to "killing" a filesystem on unmount soon (kill_sb).

Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/glock.c  | 6 +++---
 fs/gfs2/incore.h | 2 +-
 fs/gfs2/ops_fstype.c | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 675bfec77706..e242745cf40f 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -176,7 +176,7 @@ void gfs2_glock_free(struct gfs2_glock *gl)
wake_up_glock(gl);
	call_rcu(&gl->gl_rcu, gfs2_glock_dealloc);
	if (atomic_dec_and_test(&sdp->sd_glock_disposal))
-		wake_up(&sdp->sd_glock_wait);
+		wake_up(&sdp->sd_kill_wait);
 }
 
 /**
@@ -1231,7 +1231,7 @@ int gfs2_glock_get(struct gfs2_sbd *sdp, u64 number,
 out_free:
	gfs2_glock_dealloc(&gl->gl_rcu);
	if (atomic_dec_and_test(&sdp->sd_glock_disposal))
-		wake_up(&sdp->sd_glock_wait);
+		wake_up(&sdp->sd_kill_wait);
 
 out:
return ret;
@@ -2188,7 +2188,7 @@ void gfs2_gl_hash_clear(struct gfs2_sbd *sdp)
flush_workqueue(glock_workqueue);
glock_hash_walk(clear_glock, sdp);
flush_workqueue(glock_workqueue);
-   wait_event_timeout(sdp->sd_glock_wait,
+   wait_event_timeout(sdp->sd_kill_wait,
			   atomic_read(&sdp->sd_glock_disposal) == 0,
   HZ * 600);
glock_hash_walk(dump_glock_func, sdp);
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 04f2d78e8658..7abb43bb8df0 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -716,7 +716,7 @@ struct gfs2_sbd {
struct gfs2_glock *sd_rename_gl;
struct gfs2_glock *sd_freeze_gl;
struct work_struct sd_freeze_work;
-   wait_queue_head_t sd_glock_wait;
+   wait_queue_head_t sd_kill_wait;
wait_queue_head_t sd_async_glock_wait;
atomic_t sd_glock_disposal;
struct completion sd_locking_init;
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 8a27957dbfee..19d19491c0d1 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -87,7 +87,7 @@ static struct gfs2_sbd *init_sbd(struct super_block *sb)
	set_bit(SDF_NOJOURNALID, &sdp->sd_flags);
	gfs2_tune_init(&sdp->sd_tune);
 
-	init_waitqueue_head(&sdp->sd_glock_wait);
+	init_waitqueue_head(&sdp->sd_kill_wait);
	init_waitqueue_head(&sdp->sd_async_glock_wait);
	atomic_set(&sdp->sd_glock_disposal, 0);
	init_completion(&sdp->sd_locking_init);
-- 
2.40.1



[Cluster-devel] [PATCH 5/9] gfs2: Use gfs2_qd_dispose in gfs2_quota_cleanup

2023-08-24 Thread Andreas Gruenbacher
Change gfs2_quota_cleanup() to move the quota data objects to dispose of
on a dispose list and call gfs2_qd_dispose() on that list, like
gfs2_qd_shrink_scan() does, instead of disposing of the quota data
objects directly.

This may look a bit pointless by itself, but it will make more sense in
combination with a fix that follows.

Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/quota.c | 26 --
 1 file changed, 4 insertions(+), 22 deletions(-)

diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index f58072efafc9..976bf1097706 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -1459,35 +1459,17 @@ int gfs2_quota_init(struct gfs2_sbd *sdp)
 
 void gfs2_quota_cleanup(struct gfs2_sbd *sdp)
 {
-	struct list_head *head = &sdp->sd_quota_list;
struct gfs2_quota_data *qd;
+   LIST_HEAD(dispose);
 
	spin_lock(&qd_lock);
-   while (!list_empty(head)) {
-   qd = list_last_entry(head, struct gfs2_quota_data, qd_list);
-
-		list_del(&qd->qd_list);
-
-   /* Also remove if this qd exists in the reclaim list */
+	list_for_each_entry(qd, &sdp->sd_quota_list, qd_list) {
		list_lru_del(&gfs2_qd_lru, &qd->qd_lru);
-		atomic_dec(&sdp->sd_quota_count);
-		spin_unlock(&qd_lock);
-
-   spin_lock_bucket(qd->qd_hash);
-		hlist_bl_del_rcu(&qd->qd_hlist);
-   spin_unlock_bucket(qd->qd_hash);
-
-   gfs2_assert_warn(sdp, !qd->qd_change);
-   gfs2_assert_warn(sdp, !qd->qd_slot_count);
-   gfs2_assert_warn(sdp, !qd->qd_bh_count);
-
-   gfs2_glock_put(qd->qd_gl);
-		call_rcu(&qd->qd_rcu, gfs2_qd_dealloc);
-
-		spin_lock(&qd_lock);
+		list_add(&qd->qd_lru, &dispose);
}
	spin_unlock(&qd_lock);
 
+	gfs2_qd_dispose(&dispose);
	gfs2_assert_warn(sdp, !atomic_read(&sdp->sd_quota_count));
 
kvfree(sdp->sd_quota_bitmap);
-- 
2.40.1



[Cluster-devel] [PATCH 6/9] gfs2: Factor out duplicate quota data disposal code

2023-08-24 Thread Andreas Gruenbacher
Rename gfs2_qd_dispose() to gfs2_qd_dispose_list().  Move some code
duplicated in gfs2_qd_dispose_list() and gfs2_quota_cleanup() into a
new gfs2_qd_dispose() function.

Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/quota.c | 47 ---
 1 file changed, 24 insertions(+), 23 deletions(-)

diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index 976bf1097706..01fae6b030e9 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -112,35 +112,36 @@ static void gfs2_qd_dealloc(struct rcu_head *rcu)
kmem_cache_free(gfs2_quotad_cachep, qd);
 }
 
-static void gfs2_qd_dispose(struct list_head *list)
+static void gfs2_qd_dispose(struct gfs2_quota_data *qd)
 {
-   struct gfs2_quota_data *qd;
-   struct gfs2_sbd *sdp;
+   struct gfs2_sbd *sdp = qd->qd_sbd;
 
-   while (!list_empty(list)) {
-   qd = list_first_entry(list, struct gfs2_quota_data, qd_lru);
-   sdp = qd->qd_sbd;
+	spin_lock(&qd_lock);
+	list_del(&qd->qd_list);
+	spin_unlock(&qd_lock);
 
-		list_del(&qd->qd_lru);
+   spin_lock_bucket(qd->qd_hash);
+	hlist_bl_del_rcu(&qd->qd_hlist);
+   spin_unlock_bucket(qd->qd_hash);
 
-   /* Free from the filesystem-specific list */
-		spin_lock(&qd_lock);
-		list_del(&qd->qd_list);
-		spin_unlock(&qd_lock);
+   gfs2_assert_warn(sdp, !qd->qd_change);
+   gfs2_assert_warn(sdp, !qd->qd_slot_count);
+   gfs2_assert_warn(sdp, !qd->qd_bh_count);
 
-   spin_lock_bucket(qd->qd_hash);
-		hlist_bl_del_rcu(&qd->qd_hlist);
-   spin_unlock_bucket(qd->qd_hash);
+   gfs2_glock_put(qd->qd_gl);
+	call_rcu(&qd->qd_rcu, gfs2_qd_dealloc);
+}
 
-   gfs2_assert_warn(sdp, !qd->qd_change);
-   gfs2_assert_warn(sdp, !qd->qd_slot_count);
-   gfs2_assert_warn(sdp, !qd->qd_bh_count);
+static void gfs2_qd_list_dispose(struct list_head *list)
+{
+   struct gfs2_quota_data *qd;
 
-   gfs2_glock_put(qd->qd_gl);
-		atomic_dec(&sdp->sd_quota_count);
+   while (!list_empty(list)) {
+   qd = list_first_entry(list, struct gfs2_quota_data, qd_lru);
+		list_del(&qd->qd_lru);
 
-   /* Delete it from the common reclaim list */
-		call_rcu(&qd->qd_rcu, gfs2_qd_dealloc);
+   gfs2_qd_dispose(qd);
+		atomic_dec(&sdp->sd_quota_count);
}
 }
 
@@ -179,7 +180,7 @@ static unsigned long gfs2_qd_shrink_scan(struct shrinker *shrink,
	freed = list_lru_shrink_walk(&gfs2_qd_lru, sc,
				     gfs2_qd_isolate, &dispose);
 
-	gfs2_qd_dispose(&dispose);
+	gfs2_qd_list_dispose(&dispose);
 
return freed;
 }
@@ -1469,7 +1470,7 @@ void gfs2_quota_cleanup(struct gfs2_sbd *sdp)
}
	spin_unlock(&qd_lock);
 
-	gfs2_qd_dispose(&dispose);
+	gfs2_qd_list_dispose(&dispose);
	gfs2_assert_warn(sdp, !atomic_read(&sdp->sd_quota_count));
 
kvfree(sdp->sd_quota_bitmap);
-- 
2.40.1



[Cluster-devel] [PATCH 0/9] gfs2: quota cleanups and fixes

2023-08-24 Thread Andreas Gruenbacher
Bob and I have been looking into the following syzbot-reported quota
bug:

https://lore.kernel.org/all/2b5e2405f14e8...@google.com

The following patches fix that bug.  I've put those patches onto our
for-next branch; they will be submitted upstream in the upcoming merge
window.

https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git/log/?h=for-next

Thanks,
Andreas

Andreas Gruenbacher (8):
  gfs2: Rename sd_{ glock => kill }_wait
  gfs2: Rename SDF_DEACTIVATING to SDF_KILL
  gfs2: Fix wrong quota shrinker return value
  gfs2: Use gfs2_qd_dispose in gfs2_quota_cleanup
  gfs2: Factor out duplicate quota data disposal code
  gfs2: No more quota complaints after withdraw
  gfs2: Fix initial quota data refcount
  gfs2: Fix quota data refcount after cleanup

Bob Peterson (1):
  gfs2: Use qd_sbd more consequently

 fs/gfs2/glock.c  |  10 ++--
 fs/gfs2/glops.c  |   2 +-
 fs/gfs2/incore.h |   4 +-
 fs/gfs2/ops_fstype.c |   7 ++-
 fs/gfs2/quota.c  | 139 +--
 fs/gfs2/quota.h  |   1 +
 fs/gfs2/super.c  |   3 +-
 fs/gfs2/sys.c|   2 +-
 8 files changed, 98 insertions(+), 70 deletions(-)

-- 
2.40.1



[Cluster-devel] [PATCH 1/4] gfs2: Switch to wait_event in gfs2_logd

2023-08-24 Thread Andreas Gruenbacher
In gfs2_logd(), switch from an open-coded wait loop to
wait_event_interruptible_timeout().

Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/log.c | 17 +
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index aa568796207c..d3da259820e3 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -1301,7 +1301,6 @@ int gfs2_logd(void *data)
 {
struct gfs2_sbd *sdp = data;
unsigned long t = 1;
-   DEFINE_WAIT(wait);
 
while (!kthread_should_stop()) {
 
@@ -1338,17 +1337,11 @@ int gfs2_logd(void *data)
 
try_to_freeze();
 
-   do {
-			prepare_to_wait(&sdp->sd_logd_waitq, &wait,
-					TASK_INTERRUPTIBLE);
-   if (!gfs2_ail_flush_reqd(sdp) &&
-   !gfs2_jrnl_flush_reqd(sdp) &&
-   !kthread_should_stop())
-   t = schedule_timeout(t);
-   } while(t && !gfs2_ail_flush_reqd(sdp) &&
-   !gfs2_jrnl_flush_reqd(sdp) &&
-   !kthread_should_stop());
-		finish_wait(&sdp->sd_logd_waitq, &wait);
+   t = wait_event_interruptible_timeout(sdp->sd_logd_waitq,
+   gfs2_ail_flush_reqd(sdp) ||
+   gfs2_jrnl_flush_reqd(sdp) ||
+   kthread_should_stop(),
+   t);
}
 
return 0;
-- 
2.40.1



[Cluster-devel] [PATCH 3/4] gfs2: Fix logd wakeup on I/O error

2023-08-24 Thread Andreas Gruenbacher
When quotad detects an I/O error, it sets sd_log_error and then it wakes
up logd to withdraw the filesystem.  However, logd doesn't wake up when
sd_log_error is set.  Fix that.
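The wake-up condition after the fix can be written as a pure predicate (a simplification: the real condition also checks SDF_FORCE_AIL_FLUSH; sd_log_error carries a negative errno once an I/O error was seen):

```c
#include <assert.h>
#include <stdbool.h>

/* Model of logd's wait predicate after the fix: a nonzero sd_log_error
 * now also ends the wait, so logd can run the withdraw path. */
static bool logd_should_wake(bool ail_flush_reqd, bool jrnl_flush_reqd,
                             int sd_log_error, bool should_stop)
{
    return ail_flush_reqd || jrnl_flush_reqd ||
           sd_log_error != 0 || should_stop;
}
```

Before the fix, an error set by quotad left logd asleep until one of the other conditions happened to fire.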

Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/log.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index aaca22f2aa2d..abe4397dc59b 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -1340,6 +1340,7 @@ int gfs2_logd(void *data)
test_bit(SDF_FORCE_AIL_FLUSH, &sdp->sd_flags) ||
gfs2_ail_flush_reqd(sdp) ||
gfs2_jrnl_flush_reqd(sdp) ||
+   sdp->sd_log_error ||
kthread_should_stop(),
t);
}
-- 
2.40.1
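The bug class fixed above generalizes: a wait_event-style sleeper only stays awake if the wakeup *cause* appears in its condition; a wake_up() for state the condition never tests is effectively lost, and the sleeper goes straight back to sleep. A small pure-C sketch of the pre- and post-patch conditions (struct and field names invented for illustration):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for the pieces of struct gfs2_sbd that the logd
 * wait condition reads. */
struct logd_state {
    bool ail_flush_reqd;
    bool jrnl_flush_reqd;
    bool log_error;     /* set by quotad on I/O error */
    bool should_stop;
};

/* Pre-patch condition: does not look at log_error, so a wakeup sent after
 * setting it is lost. */
static bool wakeup_cond_old(const struct logd_state *s)
{
    return s->ail_flush_reqd || s->jrnl_flush_reqd || s->should_stop;
}

/* Post-patch condition: a pending I/O error also keeps logd awake. */
static bool wakeup_cond_new(const struct logd_state *s)
{
    return s->ail_flush_reqd || s->jrnl_flush_reqd || s->log_error ||
           s->should_stop;
}
```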



[Cluster-devel] [PATCH 4/4] gfs2: journal flush threshold fixes and cleanup

2023-08-24 Thread Andreas Gruenbacher
Commit f07b35202148 ("GFS2: Made logd daemon take into account log
demand") changed gfs2_ail_flush_reqd() and gfs2_jrnl_flush_reqd() to
take sd_log_blks_needed into account, but the checks in
gfs2_log_commit() were not updated correspondingly.

Once that is fixed, gfs2_jrnl_flush_reqd() and gfs2_ail_flush_reqd() can
be used in gfs2_log_commit().  Make those two helpers available to
gfs2_log_commit() by defining them above gfs2_log_commit().

Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/log.c | 34 --
 1 file changed, 16 insertions(+), 18 deletions(-)

diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index abe4397dc59b..addf4ce0bedd 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -1227,6 +1227,21 @@ static void log_refund(struct gfs2_sbd *sdp, struct gfs2_trans *tr)
gfs2_log_unlock(sdp);
 }
 
+static inline int gfs2_jrnl_flush_reqd(struct gfs2_sbd *sdp)
+{
+   return atomic_read(&sdp->sd_log_pinned) +
+  atomic_read(&sdp->sd_log_blks_needed) >=
+  atomic_read(&sdp->sd_log_thresh1);
+}
+
+static inline int gfs2_ail_flush_reqd(struct gfs2_sbd *sdp)
+{
+   return sdp->sd_jdesc->jd_blocks -
+  atomic_read(&sdp->sd_log_blks_free) +
+  atomic_read(&sdp->sd_log_blks_needed) >=
+  atomic_read(&sdp->sd_log_thresh2);
+}
+
 /**
  * gfs2_log_commit - Commit a transaction to the log
  * @sdp: the filesystem
@@ -1246,9 +1261,7 @@ void gfs2_log_commit(struct gfs2_sbd *sdp, struct gfs2_trans *tr)
 {
log_refund(sdp, tr);
 
-   if (atomic_read(&sdp->sd_log_pinned) > atomic_read(&sdp->sd_log_thresh1) ||
-   ((sdp->sd_jdesc->jd_blocks - atomic_read(&sdp->sd_log_blks_free)) >
-   atomic_read(&sdp->sd_log_thresh2)))
+   if (gfs2_ail_flush_reqd(sdp) || gfs2_jrnl_flush_reqd(sdp))
wake_up(&sdp->sd_logd_waitq);
 }
 
@@ -1271,21 +1284,6 @@ static void gfs2_log_shutdown(struct gfs2_sbd *sdp)
gfs2_assert_warn(sdp, list_empty(&sdp->sd_ail2_list));
 }
 
-static inline int gfs2_jrnl_flush_reqd(struct gfs2_sbd *sdp)
-{
-   return (atomic_read(&sdp->sd_log_pinned) +
-   atomic_read(&sdp->sd_log_blks_needed) >=
-   atomic_read(&sdp->sd_log_thresh1));
-}
-
-static inline int gfs2_ail_flush_reqd(struct gfs2_sbd *sdp)
-{
-   unsigned int used_blocks = sdp->sd_jdesc->jd_blocks - atomic_read(&sdp->sd_log_blks_free);
-
-   return used_blocks + atomic_read(&sdp->sd_log_blks_needed) >=
-   atomic_read(&sdp->sd_log_thresh2);
-}
-
 /**
  * gfs2_logd - Update log tail as Active Items get flushed to in-place blocks
  * @data: Pointer to GFS2 superblock
-- 
2.40.1
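For reference, the two threshold checks the patch above consolidates, rewritten as pure functions over the counters they read (an illustrative model, not kernel code; parameter names mirror the sd_log_* fields). The point of the fix is visible in the arithmetic: both helpers include blks_needed, which the old open-coded checks in gfs2_log_commit() omitted:

```c
#include <assert.h>

/* Journal flush is required once pinned blocks plus reserved-but-unwritten
 * blocks reach the first threshold. */
static int jrnl_flush_reqd(int pinned, int blks_needed, int thresh1)
{
    return pinned + blks_needed >= thresh1;
}

/* AIL flush is required once used journal blocks plus reserved blocks
 * reach the second threshold. */
static int ail_flush_reqd(int jd_blocks, int blks_free, int blks_needed,
                          int thresh2)
{
    int used_blocks = jd_blocks - blks_free;
    return used_blocks + blks_needed >= thresh2;
}
```

With blks_needed in the sum, a commit that merely *reserves* many blocks can push either check over its threshold and wake logd before the journal actually fills.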



[Cluster-devel] [PATCH 0/4] gfs2: logd cleanups on for-next

2023-08-24 Thread Andreas Gruenbacher
The following four patches related to gfs2's logd daemon are currently
queued up on our for-next branch, to be submitted upstream in the
upcoming merge window.

https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git/log/?h=for-next

Thanks,
Andreas

Andreas Gruenbacher (4):
  gfs2: Switch to wait_event in gfs2_logd
  gfs2: low-memory forced flush fixes
  gfs2: Fix logd wakeup on I/O error
  gfs2: journal flush threshold fixes and cleanup

 fs/gfs2/aops.c |  4 +++-
 fs/gfs2/log.c  | 60 ++
 2 files changed, 29 insertions(+), 35 deletions(-)

-- 
2.40.1



[Cluster-devel] [PATCH 2/4] gfs2: low-memory forced flush fixes

2023-08-24 Thread Andreas Gruenbacher
Function gfs2_ail_flush_reqd checks the SDF_FORCE_AIL_FLUSH flag to
determine if an AIL flush should be forced in low-memory situations.
However, it also immediately clears the flag, and when called repeatedly
as in function gfs2_logd, the flag will be lost.  Fix that by pulling
the SDF_FORCE_AIL_FLUSH flag check out of gfs2_ail_flush_reqd.

In addition, in gfs2_writepages, logd needs to be woken up after setting
the SDF_FORCE_AIL_FLUSH flag.

Fixes: b066a4eebd4f ("gfs2: forcibly flush ail to relieve memory pressure")
Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/aops.c | 4 +++-
 fs/gfs2/log.c  | 8 
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 5f02542370c4..d15a10a18962 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -189,8 +189,10 @@ static int gfs2_writepages(struct address_space *mapping,
 * pages held in the ail that it can't find.
 */
ret = iomap_writepages(mapping, wbc, &wpc, &gfs2_writeback_ops);
-   if (ret == 0)
+   if (ret == 0) {
set_bit(SDF_FORCE_AIL_FLUSH, &sdp->sd_flags);
+   wake_up(&sdp->sd_logd_waitq);
+   }
return ret;
 }
 
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index d3da259820e3..aaca22f2aa2d 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -1282,9 +1282,6 @@ static inline int gfs2_ail_flush_reqd(struct gfs2_sbd *sdp)
 {
unsigned int used_blocks = sdp->sd_jdesc->jd_blocks - atomic_read(&sdp->sd_log_blks_free);
 
-   if (test_and_clear_bit(SDF_FORCE_AIL_FLUSH, &sdp->sd_flags))
-   return 1;
-
return used_blocks + atomic_read(&sdp->sd_log_blks_needed) >=
atomic_read(&sdp->sd_log_thresh2);
 }
@@ -1325,7 +1322,9 @@ int gfs2_logd(void *data)
  GFS2_LFC_LOGD_JFLUSH_REQD);
}
 
-   if (gfs2_ail_flush_reqd(sdp)) {
+   if (test_bit(SDF_FORCE_AIL_FLUSH, &sdp->sd_flags) ||
+   gfs2_ail_flush_reqd(sdp)) {
+   clear_bit(SDF_FORCE_AIL_FLUSH, &sdp->sd_flags);
gfs2_ail1_start(sdp);
gfs2_ail1_wait(sdp);
gfs2_ail1_empty(sdp, 0);
@@ -1338,6 +1337,7 @@ int gfs2_logd(void *data)
try_to_freeze();
 
t = wait_event_interruptible_timeout(sdp->sd_logd_waitq,
+   test_bit(SDF_FORCE_AIL_FLUSH, &sdp->sd_flags) ||
gfs2_ail_flush_reqd(sdp) ||
gfs2_jrnl_flush_reqd(sdp) ||
kthread_should_stop(),
-- 
2.40.1
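The core of the bug fixed above: a predicate with a destructive side effect (test_and_clear_bit) consumes the flag whenever the condition is merely *evaluated*, for example inside the wait loop, so the later call that would actually act on it sees the flag gone. A minimal sketch of pre- and post-patch behaviour (plain C; the boolean stands in for SDF_FORCE_AIL_FLUSH):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the SDF_FORCE_AIL_FLUSH bit. */
static bool force_flag;

/* Pre-patch: the predicate clears the flag as a side effect, so the first
 * evaluation consumes it. */
static bool ail_flush_reqd_old(void)
{
    if (force_flag) {
        force_flag = false;     /* test_and_clear_bit() */
        return true;
    }
    return false;               /* other threshold checks elided */
}

/* Post-patch: the predicate only tests; the caller clears the flag at the
 * point where it actually starts the AIL flush. */
static bool ail_flush_reqd_new(void)
{
    return force_flag;
}
```

With the old form, the wait-condition evaluation in gfs2_logd() would eat the flag and the subsequent "should I flush?" check would return false; with the new form the flag survives until the flush path explicitly clears it.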



Re: [Cluster-devel] [PATCH] gfs2: Fix uaf for qda in gfs2_quota_sync

2023-08-23 Thread Andreas Gruenbacher
On Tue, Aug 22, 2023 at 9:32 PM Bob Peterson  wrote:
> On 1/26/23 11:10 PM, eada...@sina.com wrote:
> > From: Edward Adam Davis 
> >
> > [   81.372851][ T5532] CPU: 1 PID: 5532 Comm: syz-executor.0 Not tainted 
> > 6.2.0-rc1-syzkaller-dirty #0
> > [   81.382080][ T5532] Hardware name: Google Google Compute Engine/Google 
> > Compute Engine, BIOS Google 01/12/2023
> > [   81.392343][ T5532] Call Trace:
> > [   81.395654][ T5532]  <TASK>
> > [   81.398603][ T5532]  dump_stack_lvl+0x1b1/0x290
> > [   81.418421][ T5532]  gfs2_assert_warn_i+0x19a/0x2e0
> > [   81.423480][ T5532]  gfs2_quota_cleanup+0x4c6/0x6b0
> > [   81.428611][ T5532]  gfs2_make_fs_ro+0x517/0x610
> > [   81.457802][ T5532]  gfs2_withdraw+0x609/0x1540
> > [   81.481452][ T5532]  gfs2_inode_refresh+0xb2d/0xf60
> > [   81.506658][ T5532]  gfs2_instantiate+0x15e/0x220
> > [   81.511504][ T5532]  gfs2_glock_wait+0x1d9/0x2a0
> > [   81.516352][ T5532]  do_sync+0x485/0xc80
> > [   81.554943][ T5532]  gfs2_quota_sync+0x3da/0x8b0
> > [   81.559738][ T5532]  gfs2_sync_fs+0x49/0xb0
> > [   81.564063][ T5532]  sync_filesystem+0xe8/0x220
> > [   81.568740][ T5532]  generic_shutdown_super+0x6b/0x310
> > [   81.574112][ T5532]  kill_block_super+0x79/0xd0
> > [   81.578779][ T5532]  deactivate_locked_super+0xa7/0xf0
> > [   81.584064][ T5532]  cleanup_mnt+0x494/0x520
> > [   81.593753][ T5532]  task_work_run+0x243/0x300
> > [   81.608837][ T5532]  exit_to_user_mode_loop+0x124/0x150
> > [   81.614232][ T5532]  exit_to_user_mode_prepare+0xb2/0x140
> > [   81.619820][ T5532]  syscall_exit_to_user_mode+0x26/0x60
> > [   81.625287][ T5532]  do_syscall_64+0x49/0xb0
> > [   81.629710][ T5532]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > [   81.636292][ T5532] RIP: 0033:0x7efdd688d517
> > [   81.640728][ T5532] Code: ff ff ff f7 d8 64 89 01 48 83 c8 ff c3 66 0f 
> > 1f 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 
> > 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> > [   81.660550][ T5532] RSP: 002b:7fff34520ce8 EFLAGS: 0246 
> > ORIG_RAX: 00a6
> > [   81.669413][ T5532] RAX:  RBX:  RCX: 
> > 7efdd688d517
> > [   81.677403][ T5532] RDX: 7fff34520db9 RSI: 000a RDI: 
> > 7fff34520db0
> > [   81.685388][ T5532] RBP: 7fff34520db0 R08:  R09: 
> > 7fff34520b80
> > [   81.695973][ T5532] R10: 55ca38b3 R11: 0246 R12: 
> > 7efdd68e6b24
> > [   81.704152][ T5532] R13: 7fff34521e70 R14: 55ca3810 R15: 
> > 7fff34521eb0
> > [   81.712868][ T5532]  </TASK>
> >
> > The function "gfs2_quota_cleanup()" may be called in the function 
> > "do_sync()",
> > This will cause the qda obtained in the function "qd_check_sync" to be 
> > released, resulting in the occurrence of uaf.
> > In order to avoid this uaf, we can increase the judgment of 
> > "sdp->sd_quota_bitmap" released in the function
> > "gfs2_quota_cleanup" to confirm that "sdp->sd_quota_list" has been released.
> >
> > Link: https://lore.kernel.org/all/2b5e2405f14e8...@google.com
> > Reported-and-tested-by: 
> > syzbot+3f6a670108ce43356...@syzkaller.appspotmail.com
> > Signed-off-by: Edward Adam Davis 
> > ---
> >   fs/gfs2/quota.c | 3 +++
> >   1 file changed, 3 insertions(+)
> >
> > diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
> > index 1ed1722..4cf66bd 100644
> > --- a/fs/gfs2/quota.c
> > +++ b/fs/gfs2/quota.c
> > @@ -1321,6 +1321,9 @@ int gfs2_quota_sync(struct super_block *sb, int type)
> >   qda[x]->qd_sync_gen =
> >   sdp->sd_quota_sync_gen;
> >
> > + if (!sdp->sd_quota_bitmap)
> > + break;
> > +
> >   for (x = 0; x < num_qd; x++)
> >   qd_unlock(qda[x]);
> >   }
>
> Hi Edward,
>
> Can you try to recreate this problem on a newer version of the gfs2 code?

The problem is that gfs2_quota_sync() is holding references to
gfs2_quota_data objects when the filesystem decides to withdraw. The
withdraw code calls gfs2_quota_cleanup(), which wipes away the
gfs2_quota_data objects that gfs2_quota_sync() is still referencing.
So that reference counting needs to be fixed. Alternatively, if we
could delay the withdraw until the end of gfs2_quota_sync(), we would
be fine as well, but we don't have that kind of mechanism.
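A minimal sketch of the reference-counting rule being argued for here: cleanup marks an object dead and drops its own reference, but the actual free happens only when the last holder drops theirs (illustrative userspace C; qd_sketch merely stands in for gfs2_quota_data, and the helper names are invented):

```c
#include <assert.h>
#include <stdlib.h>

/* Toy refcounted object: "dead" records that cleanup ran while other
 * references were still outstanding. */
struct qd_sketch {
    int refcount;
    int dead;
};

static void qd_get(struct qd_sketch *qd) { qd->refcount++; }

/* Returns 1 if this put freed the object. */
static int qd_put(struct qd_sketch *qd)
{
    if (--qd->refcount == 0 && qd->dead) {
        free(qd);
        return 1;   /* freed by the last reference holder */
    }
    return 0;
}

/* What a safe gfs2_quota_cleanup() would do: mark the object dead and
 * drop the list's reference; freeing waits for the last holder. */
static int qd_cleanup(struct qd_sketch *qd)
{
    qd->dead = 1;
    return qd_put(qd);
}
```

Under this scheme, a withdraw that races with gfs2_quota_sync() cannot free objects the sync path still references; the sync path's final put performs the free instead.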

> In the linux-gfs2 repository I've got a branch called "bobquota" that
> has a bunch of patches related to quota syncing. I don't know if these
> will fix your problem, but it's worth trying.

That branch doesn't change the reference counting. It doesn't address this bug.

> The thing is, the qda array should have been populated by previous
> calls, like qd_fish and such, and should be okay to release by
> quota_cleanup.
>
> I can tell you this:
>
> In the call trace above, function do_sync tried to lock an inode glock,
> which tried to 

[Cluster-devel] [PATCH 2/3] gfs2: Remove LM_FLAG_PRIORITY flag

2023-08-09 Thread Andreas Gruenbacher
The last user of this flag was removed in commit b77b4a4815a9 ("gfs2:
Rework freeze / thaw logic").

Signed-off-by: Andreas Gruenbacher 
---
 Documentation/filesystems/gfs2-glocks.rst |  3 +--
 fs/gfs2/glock.c   | 22 ++
 fs/gfs2/glock.h   |  9 -
 fs/gfs2/lock_dlm.c|  5 -
 4 files changed, 7 insertions(+), 32 deletions(-)

diff --git a/Documentation/filesystems/gfs2-glocks.rst b/Documentation/filesystems/gfs2-glocks.rst
index d14f230f0b12..93a690b9bcf2 100644
--- a/Documentation/filesystems/gfs2-glocks.rst
+++ b/Documentation/filesystems/gfs2-glocks.rst
@@ -20,8 +20,7 @@ The gl_holders list contains all the queued lock requests (not
 just the holders) associated with the glock. If there are any
 held locks, then they will be contiguous entries at the head
 of the list. Locks are granted in strictly the order that they
-are queued, except for those marked LM_FLAG_PRIORITY which are
-used only during recovery, and even then only for journal locks.
+are queued.
 
 There are three lock states that users of the glock layer can request,
 namely shared (SH), deferred (DF) and exclusive (EX). Those translate
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index d0c82b721f90..0a3222651869 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -591,8 +591,7 @@ static void finish_xmote(struct gfs2_glock *gl, unsigned int ret)
if (gh && !test_bit(GLF_DEMOTE_IN_PROGRESS, &gl->gl_flags)) {
/* move to back of queue and try next entry */
if (ret & LM_OUT_CANCELED) {
-   if ((gh->gh_flags & LM_FLAG_PRIORITY) == 0)
-   list_move_tail(&gh->gh_list, &gl->gl_holders);
+   list_move_tail(&gh->gh_list, &gl->gl_holders);
gh = find_first_waiter(gl);
gl->gl_target = gh->gh_state;
goto retry;
@@ -679,8 +678,7 @@ __acquires(&gl->gl_lockref.lock)
gh && !(gh->gh_flags & LM_FLAG_NOEXP))
goto skip_inval;
 
-   lck_flags &= (LM_FLAG_TRY | LM_FLAG_TRY_1CB | LM_FLAG_NOEXP |
- LM_FLAG_PRIORITY);
+   lck_flags &= (LM_FLAG_TRY | LM_FLAG_TRY_1CB | LM_FLAG_NOEXP);
GLOCK_BUG_ON(gl, gl->gl_state == target);
GLOCK_BUG_ON(gl, gl->gl_state == gl->gl_target);
if ((target == LM_ST_UNLOCKED || target == LM_ST_DEFERRED) &&
@@ -1515,27 +1513,21 @@ __acquires(&gl->gl_lockref.lock)
}
if (test_bit(HIF_HOLDER, &gh->gh_iflags))
continue;
-   if (unlikely((gh->gh_flags & LM_FLAG_PRIORITY) && !insert_pt))
-   insert_pt = &gh->gh_list;
}
trace_gfs2_glock_queue(gh, 1);
gfs2_glstats_inc(gl, GFS2_LKS_QCOUNT);
gfs2_sbstats_inc(gl, GFS2_LKS_QCOUNT);
if (likely(insert_pt == NULL)) {
list_add_tail(&gh->gh_list, &gl->gl_holders);
-   if (unlikely(gh->gh_flags & LM_FLAG_PRIORITY))
-   goto do_cancel;
return;
}
list_add_tail(&gh->gh_list, insert_pt);
 do_cancel:
gh = list_first_entry(&gl->gl_holders, struct gfs2_holder, gh_list);
-   if (!(gh->gh_flags & LM_FLAG_PRIORITY)) {
-   spin_unlock(&gl->gl_lockref.lock);
-   if (sdp->sd_lockstruct.ls_ops->lm_cancel)
-   sdp->sd_lockstruct.ls_ops->lm_cancel(gl);
-   spin_lock(&gl->gl_lockref.lock);
-   }
+   spin_unlock(&gl->gl_lockref.lock);
+   if (sdp->sd_lockstruct.ls_ops->lm_cancel)
+   sdp->sd_lockstruct.ls_ops->lm_cancel(gl);
+   spin_lock(&gl->gl_lockref.lock);
return;
 
 trap_recursive:
@@ -2227,8 +2219,6 @@ static const char *hflags2str(char *buf, u16 flags, unsigned long iflags)
*p++ = 'e';
if (flags & LM_FLAG_ANY)
*p++ = 'A';
-   if (flags & LM_FLAG_PRIORITY)
-   *p++ = 'p';
if (flags & LM_FLAG_NODE_SCOPE)
*p++ = 'n';
if (flags & GL_ASYNC)
diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
index 1f1ba92c15a8..c8685ca7d2a2 100644
--- a/fs/gfs2/glock.h
+++ b/fs/gfs2/glock.h
@@ -68,14 +68,6 @@ enum {
  * also be granted in SHARED.  The preferred state is whichever is compatible
  * with other granted locks, or the specified state if no other locks exist.
  *
- * LM_FLAG_PRIORITY
- * Override fairness considerations.  Suppose a lock is held in a shared state
- * and there is a pending request for the deferred state.  A shared lock
- * request with the priority flag would be allowed to bypass the deferred
- * request and directly join the other shared lock.  A shared lock request
- 

[Cluster-devel] [PATCH 1/3] gfs2: do_promote cleanup

2023-08-09 Thread Andreas Gruenbacher
Change function do_promote to return true on success, and false
otherwise.

Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/glock.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 1438e7465e30..d0c82b721f90 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -468,10 +468,10 @@ int gfs2_instantiate(struct gfs2_holder *gh)
  * do_promote - promote as many requests as possible on the current queue
  * @gl: The glock
  * 
- * Returns: 1 if there is a blocked holder at the head of the list
+ * Returns true on success (i.e., progress was made or there are no waiters).
  */
 
-static int do_promote(struct gfs2_glock *gl)
+static bool do_promote(struct gfs2_glock *gl)
 {
struct gfs2_holder *gh, *current_gh;
 
@@ -484,10 +484,10 @@ static int do_promote(struct gfs2_glock *gl)
 * If we get here, it means we may not grant this
 * holder for some reason. If this holder is at the
 * head of the list, it means we have a blocked holder
-* at the head, so return 1.
+* at the head, so return false.
 */
if (list_is_first(&gh->gh_list, &gl->gl_holders))
-   return 1;
+   return false;
do_error(gl, 0);
break;
}
@@ -497,7 +497,7 @@ static int do_promote(struct gfs2_glock *gl)
if (!current_gh)
current_gh = gh;
}
-   return 0;
+   return true;
 }
 
 /**
@@ -834,7 +834,7 @@ __acquires(&gl->gl_lockref.lock)
} else {
if (test_bit(GLF_DEMOTE, &gl->gl_flags))
gfs2_demote_wake(gl);
-   if (do_promote(gl) == 0)
+   if (do_promote(gl))
goto out_unlock;
gh = find_first_waiter(gl);
gl->gl_target = gh->gh_state;
-- 
2.40.1



[Cluster-devel] [PATCH 3/3] gfs2: conversion deadlock do_promote bypass

2023-08-09 Thread Andreas Gruenbacher
From: Bob Peterson 

Consider the following case:
1. A glock is held in shared mode.
2. A process requests the glock in exclusive mode (rename).
3. Before the lock is granted, more processes (read / ls) request the
   glock in shared mode again.
4. gfs2 sends a request to dlm for the lock in exclusive mode because
   that holder is at the head of the queue.
5. Somehow the dlm request gets canceled, so dlm sends us back a
   response with state == LM_ST_SHARED and LM_OUT_CANCELED.  So at that
   point, the glock is still held in shared mode.
6. finish_xmote gets called to process the response from dlm. It detects
   that the glock is not in the requested mode and no demote is in
   progress, so it moves the canceled holder to the tail of the queue
   and finds the new holder at the head of the queue.  That holder is
   requesting the glock in shared mode.
7. finish_xmote calls do_xmote to transition the glock into shared mode,
   but the glock is already in shared mode and so do_xmote complains
   about that with:
GLOCK_BUG_ON(gl, gl->gl_state == gl->gl_target);

Instead, in finish_xmote, after moving the canceled holder to the tail
of the queue, check if any new holders can be granted.  Only call
do_xmote to repeat the dlm request if the holder at the head of the
queue is requesting the glock in a mode that is incompatible with the
mode the glock is currently held in.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/glock.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 0a3222651869..6e8200d96879 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -594,6 +594,8 @@ static void finish_xmote(struct gfs2_glock *gl, unsigned int ret)
list_move_tail(&gh->gh_list, &gl->gl_holders);
gh = find_first_waiter(gl);
gl->gl_target = gh->gh_state;
+   if (do_promote(gl))
+   goto out;
goto retry;
}
/* Some error or failed "try lock" - report it */
-- 
2.40.1
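The essence of the fix above is a compatibility test: after requeuing the canceled holder, only go back to dlm if the new head waiter's mode cannot be granted under the state the glock is already held in. A toy model of that test (the compatibility rules here are deliberately simplified, not gfs2's exact holder-sharing rules, and the state numbering is illustrative):

```c
#include <assert.h>

/* Glock states involved in the scenario above (subset). */
enum gl_state { ST_UNLOCKED, ST_SHARED, ST_DEFERRED, ST_EXCLUSIVE };

/* Simplified rule: anything can be granted on an unlocked glock; a waiter
 * matching the currently held sharable mode (SH with SH, DF with DF) can
 * be granted directly by do_promote(); exclusive never shares.  When this
 * returns true, no dlm round-trip is needed, which is what avoids the
 * GLOCK_BUG_ON(gl, gl->gl_state == gl->gl_target) in do_xmote(). */
static int modes_compatible(enum gl_state held, enum gl_state want)
{
    if (held == ST_UNLOCKED)
        return 1;
    return held == want && held != ST_EXCLUSIVE;
}
```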



Re: [Cluster-devel] [PATCH] gfs2: conversion deadlock do_promote bypass

2023-08-09 Thread Andreas Gruenbacher
Hi Bob,

On Wed, Jul 26, 2023 at 8:36 PM Bob Peterson  wrote:
> I know the description is vague or hard to grasp, but it's hard to be
> succinct for this problem.

as discussed off-list, this one needed a bit more work. I've just
posted the updated version.

Thanks,
Andreas



[Cluster-devel] [GIT PULL] gfs2 fixes

2023-08-08 Thread Andreas Gruenbacher
Hi Linus,

please consider pulling the following two gfs2 fixes.

Thanks,
Andreas

The following changes since commit 94c76955e86a5a4f16a1d690b66dcc268c156e6a:

  Merge tag 'gfs2-v6.4-rc5-fixes' of 
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 (2023-07-04 
11:45:16 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git 
tags/gfs2-v6.4-fixes

for you to fetch changes up to 0be8432166a61abc537e1247e530f4b85970b56b:

  gfs2: Don't use filemap_splice_read (2023-08-07 18:42:04 +0200)


gfs2 fixes

- Fix a freeze consistency check in gfs2_trans_add_meta().

- Don't use filemap_splice_read as it can cause deadlocks on gfs2.


Andreas Gruenbacher (1):
  gfs2: Fix freeze consistency check in gfs2_trans_add_meta

Bob Peterson (1):
  gfs2: Don't use filemap_splice_read

 fs/gfs2/file.c  |  4 ++--
 fs/gfs2/trans.c | 14 ++
 2 files changed, 12 insertions(+), 6 deletions(-)



[Cluster-devel] [PATCH] gfs2: Don't use filemap_splice_read

2023-08-04 Thread Andreas Gruenbacher
From: Bob Peterson 

Starting with patch 2cb1e08985, gfs2 started using the new function
filemap_splice_read rather than the old (and subsequently deleted)
function generic_file_splice_read.

filemap_splice_read works by taking references to a number of folios in
the page cache and splicing those folios into a pipe.  The folios are
then read from the pipe and the folio references are dropped.  This can
take an arbitrary amount of time.  We cannot allow that in gfs2 because
those folio references will pin the inode glock to the node and prevent
it from being demoted, which can lead to cluster-wide deadlocks.

Instead, use copy_splice_read.

(In addition, the old generic_file_splice_read called into ->read_iter,
which called gfs2_file_read_iter, which took the inode glock during the
operation.  The new filemap_splice_read interface does not take the
inode glock anymore.  This is fixable, but it still wouldn't prevent
cluster-wide deadlocks.)

Fixes: 2cb1e08985e3 ("splice: Use filemap_splice_read() instead of 
generic_file_splice_read()")
Signed-off-by: Bob Peterson 
Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/file.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 1bf3c4453516..b43fa8b8fc05 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -1578,7 +1578,7 @@ const struct file_operations gfs2_file_fops = {
.fsync  = gfs2_fsync,
.lock   = gfs2_lock,
.flock  = gfs2_flock,
-   .splice_read= filemap_splice_read,
+   .splice_read= copy_splice_read,
.splice_write   = gfs2_file_splice_write,
.setlease   = simple_nosetlease,
.fallocate  = gfs2_fallocate,
@@ -1609,7 +1609,7 @@ const struct file_operations gfs2_file_fops_nolock = {
.open   = gfs2_open,
.release= gfs2_release,
.fsync  = gfs2_fsync,
-   .splice_read= filemap_splice_read,
+   .splice_read= copy_splice_read,
.splice_write   = gfs2_file_splice_write,
.setlease   = generic_setlease,
.fallocate  = gfs2_fallocate,
-- 
2.40.1



[Cluster-devel] [PATCH] gfs2: Fix freeze consistency check in gfs2_trans_add_meta

2023-08-04 Thread Andreas Gruenbacher
Function gfs2_trans_add_meta() checks for the SDF_FROZEN flag to make
sure that no buffers are added to a transaction while the filesystem is
frozen.  With the recent freeze/thaw rework, the SDF_FROZEN flag is
cleared after thaw_super() is called, which is sufficient for
serializing freeze/thaw.

However, other filesystem operations started after thaw_super() may now
be calling gfs2_trans_add_meta() before the SDF_FROZEN flag is cleared,
which will trigger the SDF_FROZEN check in gfs2_trans_add_meta().  Fix
that by checking the s_writers.frozen state instead.

In addition, make sure not to call gfs2_assert_withdraw() with the
sd_log_lock spin lock held.

Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/trans.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/gfs2/trans.c b/fs/gfs2/trans.c
index ec1631257978..6a58880668f7 100644
--- a/fs/gfs2/trans.c
+++ b/fs/gfs2/trans.c
@@ -230,9 +230,11 @@ void gfs2_trans_add_meta(struct gfs2_glock *gl, struct buffer_head *bh)
 {
 
struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
+   struct super_block *sb = sdp->sd_vfs;
struct gfs2_bufdata *bd;
struct gfs2_meta_header *mh;
struct gfs2_trans *tr = current->journal_info;
+   bool withdraw = false;
 
lock_buffer(bh);
if (buffer_pinned(bh)) {
@@ -266,9 +268,10 @@ void gfs2_trans_add_meta(struct gfs2_glock *gl, struct buffer_head *bh)
   (unsigned long long)bd->bd_bh->b_blocknr);
BUG();
}
-   if (unlikely(test_bit(SDF_FROZEN, &sdp->sd_flags))) {
+   if (unlikely(sb->s_writers.frozen == SB_FREEZE_COMPLETE)) {
fs_info(sdp, "GFS2:adding buf while frozen\n");
-   gfs2_assert_withdraw(sdp, 0);
+   withdraw = true;
+   goto out_unlock;
}
if (unlikely(gfs2_withdrawn(sdp))) {
fs_info(sdp, "GFS2:adding buf while withdrawn! 0x%llx\n",
@@ -281,6 +284,8 @@ void gfs2_trans_add_meta(struct gfs2_glock *gl, struct buffer_head *bh)
tr->tr_num_buf_new++;
 out_unlock:
gfs2_log_unlock(sdp);
+   if (withdraw)
+   gfs2_assert_withdraw(sdp, 0);
 out:
unlock_buffer(bh);
 }
-- 
2.40.1
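The second part of the patch above is a common locking pattern worth spelling out: record a failure while holding a spinlock, but act on it (here, withdrawing the filesystem) only after dropping the lock, since the failure path may sleep or take other locks. A userspace sketch of that shape (all names invented; the boolean stands in for holding sd_log_lock):

```c
#include <assert.h>
#include <stdbool.h>

static bool lock_held;
static int withdraw_calls;

static void log_lock(void)   { lock_held = true; }
static void log_unlock(void) { lock_held = false; }

/* Stand-in for gfs2_assert_withdraw(): must not run under the lock. */
static void do_withdraw(void)
{
    assert(!lock_held);     /* the invariant the patch restores */
    withdraw_calls++;
}

/* Shape of the patched gfs2_trans_add_meta() error path: note the failure
 * under the lock, but defer acting on it until after unlocking. */
static void add_meta_sketch(bool frozen)
{
    bool withdraw = false;

    log_lock();
    if (frozen)
        withdraw = true;    /* was: gfs2_assert_withdraw() right here */
    log_unlock();
    if (withdraw)
        do_withdraw();
}
```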



Re: [Cluster-devel] [bug report] gfs2: Use mapping->gfp_mask for metadata inodes

2023-08-02 Thread Andreas Gruenbacher
On Wed, Aug 2, 2023 at 9:35 AM Dan Carpenter  wrote:
> Hello Andreas Gruenbacher,
>
> The patch 8f18190e3173: "gfs2: Use mapping->gfp_mask for metadata
> inodes" from Jul 26, 2023 (linux-next), leads to the following Smatch
> static checker warning:
>
> fs/gfs2/inode.c:286 gfs2_lookup_simple()
> error: 'inode' dereferencing possible ERR_PTR()
>
> fs/gfs2/inode.c
> 268 struct inode *gfs2_lookup_simple(struct inode *dip, const char *name)
> 269 {
> 270 struct qstr qstr;
> 271 struct inode *inode;
> 272 gfs2_str2qstr(&qstr, name);
> 273 inode = gfs2_lookupi(dip, &qstr, 1);
> 274 /* gfs2_lookupi has inconsistent callers: vfs
> 275  * related routines expect NULL for no entry found,
> 276  * gfs2_lookup_simple callers expect ENOENT
> 277  * and do not check for NULL.
>
> This comment is ancient.  I think how gfs2_lookupi() works is that if
> there is an error it returns an error code, but if the file does not
> exist it returns NULL.  (This is just based on vague assumptions about
> how mixed error pointer/NULL functions work).
>
> 278  */
> 279 if (inode == NULL)
> 280 return ERR_PTR(-ENOENT);
> 281
> 282 /*
> 283  * Must not call back into the filesystem when allocating
> 284  * pages in the metadata inode's address space.
> 285  */
> --> 286 mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
>  
> error pointer dereference

Ah, right, thanks for pointing this out. I've pushed a fix.

Andreas

> 287
> 288 return inode;
> 289 }
>
> regards,
> dan carpenter



[Cluster-devel] [PATCH 2/2] gfs2: Don't use GFP_NOFS in gfs2_unstuff_dinode

2023-07-31 Thread Andreas Gruenbacher
Revert the rest of commit 220cca2a4f58 ("GFS2: Change truncate page
allocation to be GFP_NOFS"):

In gfs2_unstuff_dinode(), there is no need to carry out the page cache
allocation under GFP_NOFS because inodes on the "regular" filesystem are
never un-inlined under memory pressure, so switch back from
find_or_create_page() to grab_cache_page() here as well.

Inodes on the "metadata" filesystem can theoretically be un-inlined
under memory pressure, but any page cache allocations in that context
would happen in GFP_NOFS context because those inodes have
inode->i_mapping->gfp_mask set to GFP_NOFS (see the previous patch).

Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/bmap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 8d611fbcf0bd..c2f0ed76a2b6 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -161,7 +161,7 @@ int gfs2_unstuff_dinode(struct gfs2_inode *ip)
int error;
 
down_write(&ip->i_rw_mutex);
-   page = find_or_create_page(inode->i_mapping, 0, GFP_NOFS);
+   page = grab_cache_page(inode->i_mapping, 0);
error = -ENOMEM;
if (!page)
goto out;
-- 
2.40.1



[Cluster-devel] [PATCH 1/2] gfs2: Use mapping->gfp_mask for metadata inodes

2023-07-31 Thread Andreas Gruenbacher
Set mapping->gfp mask to GFP_NOFS for all metadata inodes so that
allocating pages in the address space of those inodes won't call back
into the filesystem.  This allows to switch back from
find_or_create_page() to grab_cache_page() in two places.

Partially reverts commit 220cca2a4f58 ("GFS2: Change truncate page
allocation to be GFP_NOFS").

Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/inode.c | 10 --
 fs/gfs2/lops.c  |  7 +++
 fs/gfs2/quota.c |  2 +-
 3 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 17c994a0c0d0..85ff065ac90c 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -278,8 +278,14 @@ struct inode *gfs2_lookup_simple(struct inode *dip, const char *name)
 */
if (inode == NULL)
return ERR_PTR(-ENOENT);
-   else
-   return inode;
+
+   /*
+* Must not call back into the filesystem when allocating
+* pages in the metadata inode's address space.
+*/
+   mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
+
+   return inode;
 }
 
 
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 251322b01631..483f69807062 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -456,7 +456,7 @@ static bool gfs2_jhead_pg_srch(struct gfs2_jdesc *jd,
  * Find the folio with 'index' in the journal's mapping. Search the folio for
  * the journal head if requested (cleanup == false). Release refs on the
  * folio so the page cache can reclaim it. We grabbed a
- * reference on this folio twice, first when we did a find_or_create_page()
+ * reference on this folio twice, first when we did a grab_cache_page()
  * to obtain the folio to add it to the bio and second when we do a
 * filemap_get_folio() here to get the folio to wait on while I/O on it is being
  * completed.
@@ -481,7 +481,7 @@ static void gfs2_jhead_process_page(struct gfs2_jdesc *jd, 
unsigned long index,
if (!*done)
*done = gfs2_jhead_pg_srch(jd, head, >page);
 
-   /* filemap_get_folio() and the earlier find_or_create_page() */
+   /* filemap_get_folio() and the earlier grab_cache_page() */
folio_put_refs(folio, 2);
 }
 
@@ -535,8 +535,7 @@ int gfs2_find_jhead(struct gfs2_jdesc *jd, struct gfs2_log_header_host *head,
 
for (; block < je->lblock + je->blocks; block++, dblock++) {
if (!page) {
-   page = find_or_create_page(mapping,
-   block >> shift, GFP_NOFS);
+   page = grab_cache_page(mapping, block >> shift);
if (!page) {
ret = -ENOMEM;
done = true;
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index 704192b73605..96d41ee034d7 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -724,7 +724,7 @@ static int gfs2_write_buf_to_page(struct gfs2_inode *ip, unsigned long index,
blk = index << (PAGE_SHIFT - sdp->sd_sb.sb_bsize_shift);
boff = off % bsize;
 
-   page = find_or_create_page(mapping, index, GFP_NOFS);
+   page = grab_cache_page(mapping, index);
if (!page)
return -ENOMEM;
if (!page_has_buffers(page))
-- 
2.40.1



Re: [Cluster-devel] [PATCHv4 v6.5-rc2 3/3] fs: dlm: fix F_CANCELLK to cancel pending request

2023-07-24 Thread Andreas Gruenbacher
On Fri, Jul 21, 2023 at 8:55 PM Alexander Aring  wrote:
> Hi,
>
> On Fri, Jul 21, 2023 at 12:25 PM Andreas Gruenbacher
>  wrote:
> >
> > On Thu, Jul 20, 2023 at 2:22 PM Alexander Aring  wrote:
> > > This patch fixes the current handling of F_CANCELLK by not just doing a
> > > unlock as we need to try to cancel a lock at first. A unlock makes sense
> > > on a non-blocking lock request but if it's a blocking lock request we
> > > need to cancel the request until it's not granted yet. This patch is 
> > > fixing
> > > this behaviour by first try to cancel a lock request and if it's failed
> > > it's unlocking the lock which seems to be granted.
> >
> > I don't like how this is implemented, but the problem already starts
> > in commit 57e2c2f2d94c ("fs: dlm: fix mismatch of plock results from
> > userspace"). That commit relies on how dlm_controld is implemented
> > internally; it requires dlm_controld to keep all replies to
> > non-blocking requests in perfect order. The correctness of that
> > approach is more difficult to argue for than it should be, the code
> > might break in non-obvious ways, and it limits the kinds of things
> > dlm_controld will be able to do in the future. And this change relies
> > on the same flawed design.
> >
>
> I agree it is not the best design and I mentioned it in my previous
> patch series versions [0]:
>
> TODO we should move to a instance reference from request and reply and
> not go over lock states, but it seems going over lock states and get
> the instance does not make any problems (at least there were no issues
> found yet) but it's much cleaner to not think about all possible
> special cases and states to instance has no 1:1 mapping anymore.

I guess by /go over lock states/, you mean what dev_write() does --
compare the request and response fields to match requests with
responses as well as possible, based on the information that's in
struct dlm_plock_info today.

> > Instead, when noticing that we don't have a way to uniquely identify
> > requests, such a way should just have been introduced. The CANCEL
> > request would then carry the same unique identifier as the original
> > locking request, dlm_controld would reply either to the original
> > locking request or to the cancel request, and the kernel would have no
> > problem matching the reply to the request.
> >
> > But without unique request identifiers, we now end up with those
> > redundant -ENOENT replies to CANCEL requests and with those
> > essentially duplicate entries for the same request on recv_list. Not
> > great.
> >
>
> There is no reference to lock request instances shared between kernel
> and user space yet. Instead, requests are matched by checking whether
> they describe the same lock request. The whole mechanism works like
> this, and that is one reason why we recently had problems with it. Two
> lock requests are "the same" in the sense that they are granted at the
> same time. So far we have not experienced any problems with that... but
> as mentioned in [0], you need to think about all potential cases.

I have no idea what you're talking about.

Let me point out one thing though: when dlm_controld replies to
multiple pending locking requests "in one go", without having to wait
on any unlocks to happen, this doesn't mean that those requests are
all "the same" at all. The requests may be for the same "actor" on the
kernel side, but they don't have to be. As such, it does make sense to
treat each of those requests independently.

> NOTE: even F_CANCELLK does not give you a unique reference to the
> original locking request; it works by matching the fields of struct
> file_lock... The same problem already exists there. The struct file_lock
> pointer could be used, but struct file_lock does not represent a lock
> request; that concept does not exist in the VFS posix lock API.

It seems to me that struct file_lock objects persist across the entire
locking request, both for the synchronous locking requests of gfs2 and
for the asynchronous locking requests of lockd. In other words, when
lockd cancels a locking request, it seems to use the same struct
file_lock object it used for making the original request (see where
lockd calls vfs_cancel_lock()). This means that the address of that
object would be a suitable locking request identifier. And while we
don't have a huge amount of space left in struct dlm_plock_info, we do
have two 32-bit fields in the version[] array that we could use for
communicating that identifier. It would of course be better if we
didn't need that much space, but I don't see a realistic alternative
if we want to cleanly fix this whole mess.
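In sketch form, the idea is just to split the pointer value across two spare 32-bit fields and recombine it on the way back. The helper names below are hypothetical and this is not the actual struct dlm_plock_info layout; it only illustrates the packing:

```c
#include <stdint.h>

/* Hypothetical sketch (not the actual dlm UAPI): pack a 64-bit request
 * identifier -- e.g. the address of the struct file_lock -- into two
 * 32-bit slots, the way it could fit into the two spare version[]
 * fields of struct dlm_plock_info mentioned above. */
static inline void pack_req_id(uint64_t id, uint32_t *hi, uint32_t *lo)
{
	*hi = (uint32_t)(id >> 32);
	*lo = (uint32_t)id;
}

/* Recombine the two 32-bit halves into the original identifier. */
static inline uint64_t unpack_req_id(uint32_t hi, uint32_t lo)
{
	return ((uint64_t)hi << 32) | lo;
}
```

A CANCEL request carrying the same packed identifier as the original request would then match it unambiguously, without comparing lock ranges at all.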

Thanks,
Andreas

> - Alex
>
> [0] https://listman.redhat.com/archives/cluster-devel/2023-July/024477.html
>



Re: [Cluster-devel] [PATCHv4 v6.5-rc2 3/3] fs: dlm: fix F_CANCELLK to cancel pending request

2023-07-21 Thread Andreas Gruenbacher
On Thu, Jul 20, 2023 at 2:22 PM Alexander Aring  wrote:
> This patch fixes the current handling of F_CANCELLK by not just doing an
> unlock, as we need to try to cancel the lock first. An unlock makes sense
> for a non-blocking lock request, but for a blocking lock request we need
> to cancel the request while it has not been granted yet. This patch fixes
> this behaviour by first trying to cancel the lock request and, if that
> fails, unlocking the lock, which appears to have been granted.

I don't like how this is implemented, but the problem already starts
in commit 57e2c2f2d94c ("fs: dlm: fix mismatch of plock results from
userspace"). That commit relies on how dlm_controld is implemented
internally; it requires dlm_controld to keep all replies to
non-blocking requests in perfect order. The correctness of that
approach is more difficult to argue for than it should be, the code
might break in non-obvious ways, and it limits the kinds of things
dlm_controld will be able to do in the future. And this change relies
on the same flawed design.

Instead, when noticing that we don't have a way to uniquely identify
requests, such a way should just have been introduced. The CANCEL
request would then carry the same unique identifier as the original
locking request, dlm_controld would reply either to the original
locking request or to the cancel request, and the kernel would have no
problem matching the reply to the request.

But without unique request identifiers, we now end up with those
redundant -ENOENT replies to CANCEL requests and with those
essentially duplicate entries for the same request on recv_list. Not
great.

Andreas

> Note: currently the nfs locking handling was disabled by commit
> 40595cdc93ed ("nfs: block notification on fs with its own ->lock").
> However, DLM was never updated to reflect this change. Future
> patches will try to fix lockd lock requests for DLM. This patch is
> currently assuming the upstream DLM lockd handling is correct.
> ---
>  fs/dlm/plock.c| 103 +-
>  fs/gfs2/file.c|   9 ++--
>  fs/ocfs2/stack_user.c |  13 ++---
>  include/linux/dlm_plock.h |   2 +
>  4 files changed, 98 insertions(+), 29 deletions(-)
>
> diff --git a/fs/dlm/plock.c b/fs/dlm/plock.c
> index a8ffa0760913..943d9f8e5564 100644
> --- a/fs/dlm/plock.c
> +++ b/fs/dlm/plock.c
> @@ -42,6 +42,27 @@ static inline void set_version(struct dlm_plock_info *info)
> info->version[2] = DLM_PLOCK_VERSION_PATCH;
>  }
>
> +static struct plock_op *plock_lookup_waiter(const struct dlm_plock_info *info)
> +{
> +   struct plock_op *op = NULL, *iter;
> +
> +   list_for_each_entry(iter, &recv_list, list) {
> +   if (iter->info.fsid == info->fsid &&
> +   iter->info.number == info->number &&
> +   iter->info.owner == info->owner &&
> +   iter->info.pid == info->pid &&
> +   iter->info.start == info->start &&
> +   iter->info.end == info->end &&
> +   iter->info.ex == info->ex &&
> +   iter->info.wait) {
> +   op = iter;
> +   break;
> +   }
> +   }
> +
> +   return op;
> +}
> +
>  static int check_version(struct dlm_plock_info *info)
>  {
> if ((DLM_PLOCK_VERSION_MAJOR != info->version[0]) ||
> @@ -334,6 +355,74 @@ int dlm_posix_unlock(dlm_lockspace_t *lockspace, u64 
> number, struct file *file,
>  }
>  EXPORT_SYMBOL_GPL(dlm_posix_unlock);
>
> +/*
> + * NOTE: This implementation can only handle async lock requests the
> + * way nfs makes them. It cannot handle cancellation of a pending lock
> + * request sitting in wait_event(); for now, nfs is the only local
> + * kernel user.
> + */
> +int dlm_posix_cancel(dlm_lockspace_t *lockspace, u64 number, struct file *file,
> +struct file_lock *fl)
> +{
> +   struct dlm_plock_info info;
> +   struct plock_op *op;
> +   struct dlm_ls *ls;
> +   int rv;
> +
> +   /* this only works for async requests for now, and nfs is the only
> +* kernel user right now.
> +*/
> +   if (WARN_ON_ONCE(!fl->fl_lmops || !fl->fl_lmops->lm_grant))
> +   return -EOPNOTSUPP;
> +
> +   ls = dlm_find_lockspace_local(lockspace);
> +   if (!ls)
> +   return -EINVAL;
> +
> +   memset(&info, 0, sizeof(info));
> +   info.pid = fl->fl_pid;
> +   info.ex = (fl->fl_type == F_WRLCK);
> +   info.fsid = ls->ls_global_id;
> +   dlm_put_lockspace(ls);
> +   info.number = number;
> +   info.start = fl->fl_start;
> +   info.end = fl->fl_end;
> +   info.owner = (__u64)fl->fl_pid;
> +
> +   rv = do_lock_cancel(&info);
> +   switch (rv) {
> +   case 0:
> +   spin_lock(&ops_lock);
> +   /* lock request to cancel must be on recv_list because
> +* do_lock_cancel() synchronizes it.
> +*/
> +

Re: [Cluster-devel] [PATCH v1] gfs2: increase usage of folio_next_index() helper

2023-07-17 Thread Andreas Gruenbacher
Minjie,

On Mon, Jul 17, 2023 at 9:20 AM Minjie Du  wrote:
> Simplify code pattern of 'folio->index + folio_nr_pages(folio)' by using
> the existing helper folio_next_index().
>
> Signed-off-by: Minjie Du 
> ---
>  fs/gfs2/aops.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
> index ae49256b7..5f0254237 100644
> --- a/fs/gfs2/aops.c
> +++ b/fs/gfs2/aops.c
> @@ -272,8 +272,7 @@ static int gfs2_write_jdata_batch(struct address_space 
> *mapping,
>  * not be suitable for data integrity
>  * writeout).
>  */
> -   *done_index = folio->index +
> -   folio_nr_pages(folio);
> +   *done_index = folio_next_index(folio);
> ret = 1;
> break;
> }
> --
> 2.39.0
>

Applied, thanks.

Email addresses like '"open list:GFS2 FILE SYSTEM"
' and 'open list
' are not helpful though, so could you
please avoid using addresses like that in the future?

Thanks,
Andreas



Re: [Cluster-devel] [PATCH v6.5-rc1 1/2] fs: dlm: introduce DLM_PLOCK_FL_NO_REPLY flag

2023-07-14 Thread Andreas Gruenbacher
On Thu, Jul 13, 2023 at 4:40 PM Alexander Aring  wrote:
> This patch introduces a new flag, DLM_PLOCK_FL_NO_REPLY, for the case that
> a dlm plock operation should not send a reply back. Currently this is kind
> of handled via DLM_PLOCK_FL_CLOSE, but DLM_PLOCK_FL_CLOSE carries the
> additional meaning that all waiters for specific nodeid/owner values
> should be removed by doing an unlock operation. In case of an error in the
> dlm user space software, e.g. dlm_controld, we get a reply with an error
> back. Such a reply cannot be matched because there is no op to match in
> recv_list. We now filter on DLM_PLOCK_FL_NO_REPLY when an error comes back
> as a reply. Newer dlm_controld versions will never send a result back when
> DLM_PLOCK_FL_NO_REPLY is set; this filter is a workaround to handle
> older dlm_controld versions.

I don't think this makes sense. If dlm_controld understands a
particular request, it already knows whether or not that request
should receive a reply. On the other hand, if dlm_controld doesn't
understand a particular request, it should communicate that fact back
to the kernel so that the kernel will know. The kernel knows which
kinds of requests should and shouldn't receive replies, so when it is
sent a reply it doesn't expect, it knows that dlm_controld didn't
understand the request and is either outdated or plain broken. The
kernel doesn't need to pipe a flag through dlm_controld for figuring
that out.

Thanks,
Andreas

> Fixes: 901025d2f319 ("dlm: make plock operation killable")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Alexander Aring 
> ---
>  fs/dlm/plock.c | 23 +++
>  include/uapi/linux/dlm_plock.h |  1 +
>  2 files changed, 20 insertions(+), 4 deletions(-)
>
> diff --git a/fs/dlm/plock.c b/fs/dlm/plock.c
> index 70a4752ed913..7fe9f4b922d3 100644
> --- a/fs/dlm/plock.c
> +++ b/fs/dlm/plock.c
> @@ -96,7 +96,7 @@ static void do_unlock_close(const struct dlm_plock_info 
> *info)
> op->info.end= OFFSET_MAX;
> op->info.owner  = info->owner;
>
> -   op->info.flags |= DLM_PLOCK_FL_CLOSE;
> +   op->info.flags |= (DLM_PLOCK_FL_CLOSE | DLM_PLOCK_FL_NO_REPLY);
> send_op(op);
>  }
>
> @@ -293,7 +293,7 @@ int dlm_posix_unlock(dlm_lockspace_t *lockspace, u64 
> number, struct file *file,
> op->info.owner  = (__u64)(long) fl->fl_owner;
>
> if (fl->fl_flags & FL_CLOSE) {
> -   op->info.flags |= DLM_PLOCK_FL_CLOSE;
> +   op->info.flags |= (DLM_PLOCK_FL_CLOSE | DLM_PLOCK_FL_NO_REPLY);
> send_op(op);
> rv = 0;
> goto out;
> @@ -392,7 +392,7 @@ static ssize_t dev_read(struct file *file, char __user 
> *u, size_t count,
> spin_lock(_lock);
> if (!list_empty(&send_list)) {
> op = list_first_entry(&send_list, struct plock_op, list);
> -   if (op->info.flags & DLM_PLOCK_FL_CLOSE)
> +   if (op->info.flags & DLM_PLOCK_FL_NO_REPLY)
> list_del(&op->list);
> else
> list_move_tail(&op->list, &recv_list);
> @@ -407,7 +407,7 @@ static ssize_t dev_read(struct file *file, char __user 
> *u, size_t count,
>that were generated by the vfs cleaning up for a close
>(the process did not make an unlock call). */
>
> -   if (op->info.flags & DLM_PLOCK_FL_CLOSE)
> +   if (op->info.flags & DLM_PLOCK_FL_NO_REPLY)
> dlm_release_plock_op(op);
>
> if (copy_to_user(u, &info, sizeof(info)))
> @@ -433,6 +433,21 @@ static ssize_t dev_write(struct file *file, const char 
> __user *u, size_t count,
> if (check_version(&info))
> return -EINVAL;
>
> +   /* Some old dlm user space software will send replies back,
> +* even if DLM_PLOCK_FL_NO_REPLY is set (because the flag is
> +* new), e.g. if an error occurs. We can't match them in
> +* recv_list because they were never part of it. We filter them
> +* here; new dlm user space software will filter them in user space.
> +*
> +* In future this handling can be removed.
> +*/
> +   if (info.flags & DLM_PLOCK_FL_NO_REPLY) {
> +   pr_info("Received unexpected reply from op %d, "
> +   "please update DLM user space software!\n",
> +   info.optype);
> +   return count;
> +   }
> +
> /*
>  * The results for waiting ops (SETLKW) can be returned in any
>  * order, so match all fields to find the op.  The results for
> diff --git a/include/uapi/linux/dlm_plock.h b/include/uapi/linux/dlm_plock.h
> index 63b6c1fd9169..8dfa272c929a 100644
> --- a/include/uapi/linux/dlm_plock.h
> +++ b/include/uapi/linux/dlm_plock.h
> @@ -25,6 +25,7 @@ enum {
>  };
>
>  #define DLM_PLOCK_FL_CLOSE 1
> +#define DLM_PLOCK_FL_NO_REPLY 2
>
>  struct dlm_plock_info {
> __u32 version[3];
> --
> 2.31.1
>



Re: [Cluster-devel] [PATCH] gfs2: fix timestamp handling on quota inodes

2023-07-13 Thread Andreas Gruenbacher
Jeff and Christian,

On Thu, Jul 13, 2023 at 3:52 PM Jeff Layton  wrote:
> While these aren't generally visible from userland, it's best to be
> consistent with timestamp handling. When adjusting the quota, update the
> mtime and ctime like we would with a write operation on any other inode,
> and avoid updating the atime which should only be done for reads.
>
> Signed-off-by: Jeff Layton 
> ---
>  fs/gfs2/quota.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> Christian,
>
> Would you mind picking this into the vfs.ctime branch, assuming the GFS2
> maintainers ack it? Andreas and I had discussed this privately, and I
> think it makes sense as part of that series.

Yes, please.

> Thanks,
> Jeff
>
> diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
> index 704192b73605..aa5fd06d47bc 100644
> --- a/fs/gfs2/quota.c
> +++ b/fs/gfs2/quota.c
> @@ -871,7 +871,7 @@ static int gfs2_adjust_quota(struct gfs2_inode *ip, 
> loff_t loc,
> size = loc + sizeof(struct gfs2_quota);
> if (size > inode->i_size)
> i_size_write(inode, size);
> -   inode->i_mtime = inode->i_atime = current_time(inode);
> +   inode->i_mtime = inode_set_ctime_current(inode);
> mark_inode_dirty(inode);
>     set_bit(QDF_REFRESH, &qd->qd_flags);
> }
> --
> 2.41.0
>

Reviewed-by: Andreas Gruenbacher 

Thanks,
Andreas



Re: [Cluster-devel] gfs2 write bandwidth regression on 6.4-rc3 compare to 5.15.y

2023-07-10 Thread Andreas Gruenbacher
Hi Wang Yugui,

On Sun, May 28, 2023 at 5:53 PM Wang Yugui  wrote:
> Hi,
>
> > Hi,
> >
> > gfs2 write bandwidth regression on 6.4-rc3 compared to 5.15.y.
> >
> > we added  linux-xfs@ and linux-fsdevel@ because some related problem[1]
> > and related patches[2].
> >
> > we compared 6.4-rc3(rather than 6.1.y) to 5.15.y because some related 
> > patches[2]
> > work only for 6.4 now.
> >
> > [1] 
> > https://lore.kernel.org/linux-xfs/20230508172406.1cf3.40950...@e16-tech.com/
> > [2] 
> > https://lore.kernel.org/linux-xfs/20230520163603.1794256-1-wi...@infradead.org/
> >
> >
> > test case:
> > 1) PCIe3 SSD *4 with LVM
> > 2) gfs2 lock_nolock
> > gfs2 attr(T) GFS2_AF_ORLOV
> ># chattr +T /mnt/test
> > 3) fio
> > fio --name=global --rw=write -bs=1024Ki -size=32Gi -runtime=30 -iodepth 1
> > -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 -numjobs=1 \
> >   -name write-bandwidth-1 -filename=/mnt/test/sub1/1.txt \
> >   -name write-bandwidth-2 -filename=/mnt/test/sub2/1.txt \
> >   -name write-bandwidth-3 -filename=/mnt/test/sub3/1.txt \
> >   -name write-bandwidth-4 -filename=/mnt/test/sub4/1.txt
> > 4) patches[2] are applied to 6.4-rc3.
> >
> >
> > 5.15.y result
> >   fio WRITE: bw=5139MiB/s (5389MB/s),
> > 6.4-rc3 result
> >   fio  WRITE: bw=2599MiB/s (2725MB/s)
>
> more test result:
>
> 5.17.0  WRITE: bw=4988MiB/s (5231MB/s)
> 5.18.0  WRITE: bw=5165MiB/s (5416MB/s)
> 5.19.0  WRITE: bw=5511MiB/s (5779MB/s)
> 6.0.5   WRITE: bw=3055MiB/s (3203MB/s), WRITE: bw=3225MiB/s (3382MB/s)
> 6.1.30  WRITE: bw=2579MiB/s (2705MB/s)
>
> so this regression happened in some code introduced in 6.0,
> and maybe there is some minor regression in 6.1 too?

thanks for this bug report. Bob has noticed a similar looking
performance regression recently, and it turned out that commit
e1fa9ea85ce8 ("gfs2: Stop using glock holder auto-demotion for now")
inadvertently caused buffered writes to fall back to writing single
pages instead of multiple pages at once. That patch was added in
v5.18, so it doesn't perfectly align with the regression history
you're reporting, but maybe there's something else going on that we're
not aware of.

In any case, the regression introduced by commit e1fa9ea85ce8 should
be fixed by commit c8ed1b359312 ("gfs2: Fix duplicate
should_fault_in_pages() call"), which ended up in v6.5-rc1.

Could you please check where we end up with that fix?

Thank you very much,
Andreas



Re: [Cluster-devel] [PATCH 7/9] gfs2: update ctime when quota is updated

2023-07-05 Thread Andreas Gruenbacher
On Mon, Jun 12, 2023 at 12:36 PM Jeff Layton  wrote:
> On Fri, 2023-06-09 at 18:44 +0200, Andreas Gruenbacher wrote:
> > Jeff,
> >
> > On Fri, Jun 9, 2023 at 2:50 PM Jeff Layton  wrote:
> > > Signed-off-by: Jeff Layton 
> > > ---
> > >  fs/gfs2/quota.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
> > > index 1ed17226d9ed..6d283e071b90 100644
> > > --- a/fs/gfs2/quota.c
> > > +++ b/fs/gfs2/quota.c
> > > @@ -869,7 +869,7 @@ static int gfs2_adjust_quota(struct gfs2_inode *ip, 
> > > loff_t loc,
> > > size = loc + sizeof(struct gfs2_quota);
> > > if (size > inode->i_size)
> > > i_size_write(inode, size);
> > > -   inode->i_mtime = inode->i_atime = current_time(inode);
> > > +   inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
> >
> > I don't think we need to worry about the ctime of the quota inode as
> > that inode is internal to the filesystem only.
> >
>
> Thanks Andreas.  I'll plan to drop this patch from the series for now.
>
> Does updating the mtime and atime here serve any purpose, or should
> those also be removed? If you plan to keep the a/mtime updates then I'd
> still suggest updating the ctime for consistency's sake. It shouldn't
> cost anything extra to do so since you're dirtying the inode below
> anyway.

Yes, good point actually, we should keep things consistent for simplicity.

Would you add this back in if you do another posting?

Thanks,
Andreas

> Thanks!
>
> > > mark_inode_dirty(inode);
> > > set_bit(QDF_REFRESH, &qd->qd_flags);
> > > }
> > > --
> > > 2.40.1
> > >
> >
> > Thanks,
> > Andreas
> >
>
> --
> Jeff Layton 
>



[Cluster-devel] [GIT PULL] gfs2 fixes

2023-07-04 Thread Andreas Gruenbacher
Hi Linus,

please consider pulling the following gfs2 fixes.

Thanks a lot,
Andreas

The following changes since commit 0bdd0f0bf17c5aac16f348ee4b1ebf23d1ec1649:

  Merge tag 'gfs2-v6.4-rc4-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 (2023-06-06 05:49:06 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git tags/gfs2-v6.4-rc5-fixes

for you to fetch changes up to 432928c9377959684c748a9bc6553ed2d3c2ea4f:

  gfs2: Add quota_change type (2023-07-03 22:30:48 +0200)


gfs2 fixes

- Move the freeze/thaw logic from glock callback context to process /
  worker thread context to prevent deadlocks.

- Fix a quota reference counting bug in do_qc().

- Carry on deallocating inodes even when gfs2_rindex_update() fails.

- Retry filesystem-internal reads when they are interrupted by a signal.

- Eliminate kmap_atomic() in favor of kmap_local_page() /
  memcpy_{from,to}_page().

- Get rid of noop_direct_IO.

- And a few more minor fixes and cleanups.


Andreas Gruenbacher (12):
  gfs2: retry interrupted internal reads
  gfs2: Rename remaining "transaction" glock references
  gfs2: Rename the {freeze,thaw}_super callbacks
  gfs2: Rename gfs2_freeze_lock{ => _shared }
  gfs2: Reconfiguring frozen filesystem already rejected
  gfs2: Rename SDF_{FS_FROZEN => FREEZE_INITIATOR}
  gfs2: Rework freeze / thaw logic
  gfs2: Replace sd_freeze_state with SDF_FROZEN flag
  gfs2: gfs2_freeze_lock_shared cleanup
  gfs: Get rid of unnecessary locking in inode_go_dump
  gfs2: Convert remaining kmap_atomic calls to kmap_local_page
  gfs2: Use memcpy_{from,to}_page where appropriate

Bob Peterson (8):
  gfs2: simplify gdlm_put_lock with out_free label
  gfs2: fix minor comment typos
  gfs2: ignore rindex_update failure in dinode_dealloc
  gfs2: Fix gfs2_qa_get imbalance in gfs2_quota_hold
  gfs2: Update rl_unlinked before releasing rgrp lock
  gfs2: Don't remember delete unless it's successful
  gfs2: Fix duplicate should_fault_in_pages() call
  gfs2: Add quota_change type

Christoph Hellwig (1):
  gfs2: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method

Deepak R Varma (1):
  gfs2: Replace deprecated kmap_atomic with kmap_local_page

Tuo Li (1):
  gfs2: Fix possible data races in gfs2_show_options()

 fs/gfs2/aops.c   |  19 +++--
 fs/gfs2/bmap.c   |   4 +-
 fs/gfs2/file.c   |   5 +-
 fs/gfs2/glock.c  |   4 +-
 fs/gfs2/glops.c  |  69 ++---
 fs/gfs2/incore.h |  12 +--
 fs/gfs2/lock_dlm.c   |  23 +++---
 fs/gfs2/log.c|  11 +--
 fs/gfs2/lops.c   |  21 +++--
 fs/gfs2/ops_fstype.c |  15 +---
 fs/gfs2/quota.c  |  26 ---
 fs/gfs2/recovery.c   |  28 +++
 fs/gfs2/rgrp.c   |   2 +-
 fs/gfs2/super.c  | 215 ---
 fs/gfs2/super.h  |   1 +
 fs/gfs2/sys.c|   4 +-
 fs/gfs2/trans.c  |   3 +-
 fs/gfs2/util.c   |  49 
 fs/gfs2/util.h   |   3 +-
 19 files changed, 277 insertions(+), 237 deletions(-)



Re: [Cluster-devel] [PATCH] fs: Fix bug in gfs2_freeze_func that can cause deadlock

2023-07-03 Thread Andreas Gruenbacher
Li Dong,

On Tue, Jun 20, 2023 at 5:47 AM Li Dong  wrote:
> Function gfs2_freeze_func causes a deadlock, because sd_freeze_mutex is
> not released when returning on error
>
> Signed-off-by: Li Dong 
> ---
>  fs/gfs2/super.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
> --- a/fs/gfs2/super.c
> +++ b/fs/gfs2/super.c
> @@ -741,8 +741,10 @@ void gfs2_freeze_func(struct work_struct *work)
> set_bit(SDF_FROZEN, &sdp->sd_flags);
>
> error = gfs2_do_thaw(sdp);
> -   if (error)
> +   if (error) {
> +   mutex_unlock(&sdp->sd_freeze_mutex);
> goto out;
> +   }

thank you for this bug report. I have chosen to fold this fix into
commit "gfs2: Rework freeze / thaw logic" and clean up
gfs2_freeze_func() a little along the way; see the current for-next
branch.

Thanks,
Andreas

> clear_bit(SDF_FROZEN, &sdp->sd_flags);
>
> --
> 2.31.1.windows.1



Re: [Cluster-devel] [PATCH v3 0/6] gfs2: kmap{_atomic} conversion to kmap_local_{page/folio}

2023-07-03 Thread Andreas Gruenbacher
Hi Deepak,

On Thu, Jun 29, 2023 at 11:48 PM Deepak R Varma  wrote:
> This patch series proposes to replace the kmap/kmap_atomic implementation
> with the preferred kmap_local_* APIs.
>
> The code blocks for this module where kmap/kmap_atomic calls are implemented 
> do
> not appear to depend on disabling page-faults or preemption. Hence such code
> blocks are safe for converting to improved kmap_local_{page,folio} APIs.
>
> Note: The proposed patches are build tested only.
>
> Initially, only a single patch was sent and now being converted into a patch
> series including the other files/functions of this module. Hence all patches,
> that are included for the first time in this series are also marked as v3.
>
> Changes in v3:
>- Patch set introduced to include all gfs2 kmap conversions
>- Patches 3/6 through 6/6 are included to build the series
>- Initial stand-alone patch split into 2 patches [1/6 and 2/6]

I have already merged version 2 of this patch series and I've fixed up
the remaining issues in follow-up patches; see the cluster-devel
mailing list:

https://listman.redhat.com/archives/cluster-devel/2023-June/024391.html
https://listman.redhat.com/archives/cluster-devel/2023-June/024392.html
https://listman.redhat.com/archives/cluster-devel/2023-June/024393.html

As well as our for-next branch:

https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git/log/?h=for-next

As far as I can see, there is nothing in v3 of your patches that I
haven't addressed already. Please speak out if I've missed anything.

Thanks,
Andreas


>
> Changes in v2:
>- 3/6 to 6/6: None.
>- 1/6 + 2/6: Correct patch description for the replacement function name 
> from
>  kmap_local_folio to kmap_local_page
>
> Deepak R Varma (6):
>   gfs2: Replace kmap_atomic() by kmap_local_page() in stuffed_readpage
>   gfs2: Replace kmap_atomic()+memcpy by memcpy_from_page()
>   gfs2: Replace kmap() by kmap_local_page() in gfs2_unstuffer_page
>   gfs2: Replace kmap_atomic() by kmap_local_page() in lops.c
>   gfs2: Replace kmap() by kmap_local_page() in gfs2_read_super
>   gfs2: Replace kmap_atomic() by kmap_local_page() in
> gfs2_write_buf_to_page
>
>  fs/gfs2/aops.c   | 13 ++---
>  fs/gfs2/bmap.c   |  4 ++--
>  fs/gfs2/lops.c   | 12 ++--
>  fs/gfs2/ops_fstype.c |  4 ++--
>  fs/gfs2/quota.c  |  4 ++--
>  5 files changed, 18 insertions(+), 19 deletions(-)
>
> --
> 2.34.1
>
>
>



[Cluster-devel] [PATCH 2/2] gfs2: Use memcpy_{from, to}_page where appropriate

2023-06-27 Thread Andreas Gruenbacher
Replace kmap_local_page() + memcpy() + kunmap_local() sequences with
memcpy_{from,to}_page() where we are not doing anything else with the
mapped page.
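For readers unfamiliar with these helpers, a minimal userspace model of what they amount to. The struct page and PAGE_SIZE below are simplified stand-ins for illustration, not the kernel definitions:

```c
#include <string.h>

#define PAGE_SIZE 4096

/* Simplified stand-in for struct page; real pages are not plain arrays. */
struct page {
	unsigned char data[PAGE_SIZE];
};

/* What memcpy_from_page() amounts to, with the kmap_local_page() /
 * kunmap_local() bracketing that the kernel helper hides done implicitly. */
static void memcpy_from_page(void *to, struct page *page,
			     size_t offset, size_t len)
{
	memcpy(to, page->data + offset, len);
}

/* Likewise for memcpy_to_page(). */
static void memcpy_to_page(struct page *page, size_t offset,
			   const void *from, size_t len)
{
	memcpy(page->data + offset, from, len);
}
```

Using the combined helpers means the mapping can never outlive the copy, which is exactly why they are preferred when nothing else is done with the mapped page.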

Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/aops.c  |  5 +
 fs/gfs2/lops.c  | 12 +---
 fs/gfs2/quota.c |  5 +
 3 files changed, 7 insertions(+), 15 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index c45bac9b5408..b11ec198183b 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -488,7 +488,6 @@ int gfs2_internal_read(struct gfs2_inode *ip, char *buf, 
loff_t *pos,
unsigned copied = 0;
unsigned amt;
struct page *page;
-   void *p;
 
do {
page = read_cache_page(mapping, index, gfs2_read_folio, NULL);
@@ -497,12 +496,10 @@ int gfs2_internal_read(struct gfs2_inode *ip, char *buf, 
loff_t *pos,
continue;
return PTR_ERR(page);
}
-   p = kmap_local_page(page);
amt = size - copied;
if (offset + size > PAGE_SIZE)
amt = PAGE_SIZE - offset;
-   memcpy(buf + copied, p + offset, amt);
-   kunmap_local(p);
+   memcpy_from_page(buf + copied, page, offset, amt);
put_page(page);
copied += amt;
index++;
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 17641d90394b..251322b01631 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -697,14 +697,12 @@ static void gfs2_before_commit(struct gfs2_sbd *sdp, 
unsigned int limit,
lock_buffer(bd2->bd_bh);
 
if (buffer_escaped(bd2->bd_bh)) {
-   void *kaddr;
+   void *p;
+
page = mempool_alloc(gfs2_page_pool, GFP_NOIO);
-   ptr = page_address(page);
-   kaddr = kmap_local_page(bd2->bd_bh->b_page);
-   memcpy(ptr, kaddr + bh_offset(bd2->bd_bh),
-  bd2->bd_bh->b_size);
-   kunmap_local(kaddr);
-   *(__be32 *)ptr = 0;
+   p = page_address(page);
+   memcpy_from_page(p, bd2->bd_bh->b_page, bh_offset(bd2->bd_bh), bd2->bd_bh->b_size);
+   *(__be32 *)p = 0;
clear_buffer_escaped(bd2->bd_bh);
unlock_buffer(bd2->bd_bh);
brelse(bd2->bd_bh);
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index 42a3f1e6b553..7765346e3617 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -712,7 +712,6 @@ static int gfs2_write_buf_to_page(struct gfs2_inode *ip, 
unsigned long index,
struct address_space *mapping = inode->i_mapping;
struct page *page;
struct buffer_head *bh;
-   void *kaddr;
u64 blk;
unsigned bsize = sdp->sd_sb.sb_bsize, bnum = 0, boff = 0;
unsigned to_write = bytes, pg_off = off;
@@ -764,9 +763,7 @@ static int gfs2_write_buf_to_page(struct gfs2_inode *ip, 
unsigned long index,
}
 
/* Write to the page, now that we have setup the buffer(s) */
-   kaddr = kmap_local_page(page);
-   memcpy(kaddr + off, buf, bytes);
-   kunmap_local(kaddr);
+   memcpy_to_page(page, off, buf, bytes);
flush_dcache_page(page);
unlock_page(page);
put_page(page);
-- 
2.40.0



[Cluster-devel] [PATCH 1/2] gfs2: Convert remaining kmap_atomic calls to kmap_local_page

2023-06-27 Thread Andreas Gruenbacher
Replace the remaining instances of kmap_atomic() ... kunmap_atomic()
with kmap_local_page() ... kunmap_local().

Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/lops.c  | 13 +++--
 fs/gfs2/quota.c |  4 ++--
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 1902413d5d12..17641d90394b 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -427,10 +427,11 @@ static bool gfs2_jhead_pg_srch(struct gfs2_jdesc *jd,
 {
struct gfs2_sbd *sdp = GFS2_SB(jd->jd_inode);
struct gfs2_log_header_host lh;
-   void *kaddr = kmap_atomic(page);
+   void *kaddr;
unsigned int offset;
bool ret = false;
 
+   kaddr = kmap_local_page(page);
for (offset = 0; offset < PAGE_SIZE; offset += sdp->sd_sb.sb_bsize) {
if (!__get_log_header(sdp, kaddr + offset, 0, )) {
if (lh.lh_sequence >= head->lh_sequence)
@@ -441,7 +442,7 @@ static bool gfs2_jhead_pg_srch(struct gfs2_jdesc *jd,
}
}
}
-   kunmap_atomic(kaddr);
+   kunmap_local(kaddr);
return ret;
 }
 
@@ -626,11 +627,11 @@ static void gfs2_check_magic(struct buffer_head *bh)
__be32 *ptr;
 
clear_buffer_escaped(bh);
-   kaddr = kmap_atomic(bh->b_page);
+   kaddr = kmap_local_page(bh->b_page);
ptr = kaddr + bh_offset(bh);
if (*ptr == cpu_to_be32(GFS2_MAGIC))
set_buffer_escaped(bh);
-   kunmap_atomic(kaddr);
+   kunmap_local(kaddr);
 }
 
 static int blocknr_cmp(void *priv, const struct list_head *a,
@@ -699,10 +700,10 @@ static void gfs2_before_commit(struct gfs2_sbd *sdp, 
unsigned int limit,
void *kaddr;
page = mempool_alloc(gfs2_page_pool, GFP_NOIO);
ptr = page_address(page);
-   kaddr = kmap_atomic(bd2->bd_bh->b_page);
+   kaddr = kmap_local_page(bd2->bd_bh->b_page);
memcpy(ptr, kaddr + bh_offset(bd2->bd_bh),
   bd2->bd_bh->b_size);
-   kunmap_atomic(kaddr);
+   kunmap_local(kaddr);
*(__be32 *)ptr = 0;
clear_buffer_escaped(bd2->bd_bh);
unlock_buffer(bd2->bd_bh);
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index 386ca770ce2e..42a3f1e6b553 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -764,10 +764,10 @@ static int gfs2_write_buf_to_page(struct gfs2_inode *ip, 
unsigned long index,
}
 
/* Write to the page, now that we have setup the buffer(s) */
-   kaddr = kmap_atomic(page);
+   kaddr = kmap_local_page(page);
memcpy(kaddr + off, buf, bytes);
+   kunmap_local(kaddr);
flush_dcache_page(page);
-   kunmap_atomic(kaddr);
unlock_page(page);
put_page(page);
 
-- 
2.40.0



Re: [Cluster-devel] [PATCH v2] gfs2: Replace deprecated kmap_atomic() by kmap_local_page()

2023-06-27 Thread Andreas Gruenbacher
On Mon, Jun 26, 2023 at 8:51 AM Deepak R Varma  wrote:
> kmap_atomic() is deprecated in favor of kmap_local_{folio,page}().

I'll apply this, convert the remaining instances of kmap_atomic(), and
switch to memcpy_{from,to}_page() where appropriate.

Thanks,
Andreas

> Therefore, replace kmap_atomic() with kmap_local_page() in
> gfs2_internal_read() and stuffed_readpage().
>
> kmap_atomic() disables page-faults and preemption (the latter only for
> !PREEMPT_RT kernels), However, the code within the mapping/un-mapping in
> gfs2_internal_read() and stuffed_readpage() does not depend on the
> above-mentioned side effects.
>
> Therefore, a mere replacement of the old API with the new one is all that
> is required (i.e., there is no need to explicitly add any calls to
> pagefault_disable() and/or preempt_disable()).
>
> Signed-off-by: Deepak R Varma 
> ---
> Note: The Patch is build tested only. I will be happy to run recommended 
> testing
> with some guidance if required.
>
> Changes in v2:
>- Update patch description to correct the replacement function name from
>  kmap_local_folio to kmap_local _page
>
>
>  fs/gfs2/aops.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
> index 3b41542d6697..7bd92054d353 100644
> --- a/fs/gfs2/aops.c
> +++ b/fs/gfs2/aops.c
> @@ -432,10 +432,10 @@ static int stuffed_readpage(struct gfs2_inode *ip, 
> struct page *page)
> if (error)
> return error;
>
> -   kaddr = kmap_atomic(page);
> +   kaddr = kmap_local_page(page);
> memcpy(kaddr, dibh->b_data + sizeof(struct gfs2_dinode), dsize);
> memset(kaddr + dsize, 0, PAGE_SIZE - dsize);
> -   kunmap_atomic(kaddr);
> +   kunmap_local(kaddr);
> flush_dcache_page(page);
> brelse(dibh);
> SetPageUptodate(page);
> @@ -498,12 +498,12 @@ int gfs2_internal_read(struct gfs2_inode *ip, char 
> *buf, loff_t *pos,
> continue;
> return PTR_ERR(page);
> }
> -   p = kmap_atomic(page);
> +   p = kmap_local_page(page);
> amt = size - copied;
> if (offset + size > PAGE_SIZE)
> amt = PAGE_SIZE - offset;
> memcpy(buf + copied, p + offset, amt);
> -   kunmap_atomic(p);
> +   kunmap_local(p);
> put_page(page);
> copied += amt;
> index++;
> --
> 2.34.1
>
>
>



[Cluster-devel] [PATCH] gfs2: Get rid of unnecessary locking in inode_go_dump

2023-06-27 Thread Andreas Gruenbacher
Commit 27a2660f1ef9 ("gfs2: Dump nrpages for inodes and their glocks")
added some locking around reading inode->i_data.nrpages.  That locking
doesn't do anything really, so get rid of it.

With that, the glock argument to ->go_dump() can be made const again as
well.

Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/glops.c  | 17 ++---
 fs/gfs2/incore.h |  2 +-
 2 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index 7c48c7afd6a4..54319328b16b 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -236,7 +236,7 @@ static void rgrp_go_inval(struct gfs2_glock *gl, int flags)
truncate_inode_pages_range(mapping, start, end);
 }
 
-static void gfs2_rgrp_go_dump(struct seq_file *seq, struct gfs2_glock *gl,
+static void gfs2_rgrp_go_dump(struct seq_file *seq, const struct gfs2_glock 
*gl,
  const char *fs_id_buf)
 {
struct gfs2_rgrpd *rgd = gl->gl_object;
@@ -536,28 +536,23 @@ static int inode_go_held(struct gfs2_holder *gh)
  *
  */
 
-static void inode_go_dump(struct seq_file *seq, struct gfs2_glock *gl,
+static void inode_go_dump(struct seq_file *seq, const struct gfs2_glock *gl,
  const char *fs_id_buf)
 {
struct gfs2_inode *ip = gl->gl_object;
-   struct inode *inode;
-   unsigned long nrpages;
+   const struct inode *inode = &ip->i_inode;
 
if (ip == NULL)
return;
 
-   inode = &ip->i_inode;
-   xa_lock_irq(&inode->i_data.i_pages);
-   nrpages = inode->i_data.nrpages;
-   xa_unlock_irq(&inode->i_data.i_pages);
-
gfs2_print_dbg(seq, "%s I: n:%llu/%llu t:%u f:0x%02lx d:0x%08x s:%llu "
   "p:%lu\n", fs_id_buf,
  (unsigned long long)ip->i_no_formal_ino,
  (unsigned long long)ip->i_no_addr,
- IF2DT(ip->i_inode.i_mode), ip->i_flags,
+ IF2DT(inode->i_mode), ip->i_flags,
  (unsigned int)ip->i_diskflags,
- (unsigned long long)i_size_read(inode), nrpages);
+ (unsigned long long)i_size_read(inode),
+ inode->i_data.nrpages);
 }
 
 /**
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 8b1b50d19de1..04f2d78e8658 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -221,7 +221,7 @@ struct gfs2_glock_operations {
int (*go_demote_ok) (const struct gfs2_glock *gl);
int (*go_instantiate) (struct gfs2_glock *gl);
int (*go_held)(struct gfs2_holder *gh);
-   void (*go_dump)(struct seq_file *seq, struct gfs2_glock *gl,
+   void (*go_dump)(struct seq_file *seq, const struct gfs2_glock *gl,
const char *fs_id_buf);
void (*go_callback)(struct gfs2_glock *gl, bool remote);
void (*go_free)(struct gfs2_glock *gl);
-- 
2.40.0



[Cluster-devel] [PATCH] gfs2: retry interrupted internal reads

2023-06-13 Thread Andreas Gruenbacher
The iomap-based read operations done by gfs2 for its system files, such
as rindex, may sometimes be interrupted and return -EINTR.  This
confuses some users of gfs2_internal_read().  Fix that by retrying
interrupted reads.
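The retry logic can be modeled in isolation. In this sketch, flaky_read() is a hypothetical stand-in for read_cache_page() failing with -EINTR a few times before succeeding; the loop mirrors the `continue` added in the patch below:

```c
#include <errno.h>

/* hypothetical operation: fails with -EINTR 'pending' times, then succeeds */
static int flaky_read(int *pending)
{
	if (*pending > 0) {
		(*pending)--;
		return -EINTR;
	}
	return 0;
}

/* retry-on-EINTR loop: keep reissuing the read while it is interrupted */
static int read_retrying(int *pending)
{
	int ret;

	do {
		ret = flaky_read(pending);
	} while (ret == -EINTR);
	return ret;
}
```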

Signed-off-by: Bob Peterson 
Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/aops.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index d95125714ebb..dacc21b1ae00 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -491,13 +491,16 @@ int gfs2_internal_read(struct gfs2_inode *ip, char *buf, 
loff_t *pos,
void *p;
 
do {
-   amt = size - copied;
-   if (offset + size > PAGE_SIZE)
-   amt = PAGE_SIZE - offset;
page = read_cache_page(mapping, index, gfs2_read_folio, NULL);
-   if (IS_ERR(page))
+   if (IS_ERR(page)) {
+   if (PTR_ERR(page) == -EINTR)
+   continue;
return PTR_ERR(page);
+   }
p = kmap_atomic(page);
+   amt = size - copied;
+   if (offset + size > PAGE_SIZE)
+   amt = PAGE_SIZE - offset;
memcpy(buf + copied, p + offset, amt);
kunmap_atomic(p);
put_page(page);
-- 
2.40.0



Re: [Cluster-devel] [PATCH 6/8] gfs2: Rework freeze / thaw logic

2023-06-13 Thread Andreas Gruenbacher
On Tue, Jun 13, 2023 at 3:05 PM Alexander Aring  wrote:
> Hi Andreas,
>
> On Mon, Jun 12, 2023 at 12:33 PM Andreas Gruenbacher
>  wrote:
> ...
> >
> > @@ -152,24 +151,18 @@ static void signal_our_withdraw(struct gfs2_sbd *sdp)
> >  */
> > clear_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags);
> > if (!sb_rdonly(sdp->sd_vfs)) {
> > -   struct gfs2_holder freeze_gh;
> > -
> > -   gfs2_holder_mark_uninitialized(&freeze_gh);
> > -   if (sdp->sd_freeze_gl &&
> > -   !gfs2_glock_is_locked_by_me(sdp->sd_freeze_gl)) {
> > -   ret = gfs2_freeze_lock_shared(sdp, &freeze_gh,
> > -   log_write_allowed ? 0 : 
> > LM_FLAG_TRY);
> > -   if (ret == GLR_TRYFAILED)
> > -   ret = 0;
> > -   }
> > -   if (!ret)
> > -   gfs2_make_fs_ro(sdp);
> > +   bool locked = mutex_trylock(&sdp->sd_freeze_mutex);
> > +
> > +   gfs2_make_fs_ro(sdp);
> > +
> > +   if (locked)
> > +   mutex_unlock(&sdp->sd_freeze_mutex);
>
> I am not sure if I overlooked something here, for me it looks like the
> application does not care about if sd_freeze_mutex is locked or not
> because the introduced locked boolean will never be evaluated?
>
> What am I missing here?

This is to withdraw the filesystem. We're trying to acquire
sd_freeze_mutex to prevent local races, but if we can't get it, we
still go ahead and mark the filesystem read-only. Then we unlock
sd_freeze_mutex, but only if we've locked it before. This is a bit
ugly, but I don't have any better ideas right now.
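The pattern described above can be sketched in userspace, with a pthread mutex standing in for sd_freeze_mutex. Names mirror the discussion, but this is only a model of the control flow, not the kernel code:

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t freeze_mutex = PTHREAD_MUTEX_INITIALIZER;
static bool fs_readonly;

/* The withdraw path must mark the fs read-only even when it cannot get
 * the freeze mutex; it unlocks only if the trylock actually succeeded. */
static void withdraw_make_ro(void)
{
	bool locked = pthread_mutex_trylock(&freeze_mutex) == 0;

	fs_readonly = true;	/* done whether or not we hold the lock */

	if (locked)
		pthread_mutex_unlock(&freeze_mutex);
}
```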

Thanks,
Andreas



Re: [Cluster-devel] [PATCH 6/8] gfs2: Rework freeze / thaw logic

2023-06-13 Thread Andreas Gruenbacher
On Tue, Jun 13, 2023 at 12:29 PM Andrew Price  wrote:
> On 12/06/2023 17:33, Andreas Gruenbacher wrote:
> > So far, at mount time, gfs2 would take the freeze glock in shared mode
> > and then immediately drop it again, turning it into a cached glock that
> > can be reclaimed at any time.  To freeze the filesystem cluster-wide,
> > the node initiating the freeze would take the freeze glock in exclusive
> > mode, which would cause the freeze glock's freeze_go_sync() callback to
> > run on each node.  There, gfs2 would freeze the filesystem and schedule
> > gfs2_freeze_func() to run.  gfs2_freeze_func() would re-acquire the
> > freeze glock in shared mode, thaw the filesystem, and drop the freeze
> > glock again.  The initiating node would keep the freeze glock held in
> > exclusive mode.  To thaw the filesystem, the initiating node would drop
> > the freeze glock again, which would allow gfs2_freeze_func() to resume
> > on all nodes, leaving the filesystem in the thawed state.
> >
> > It turns out that in freeze_go_sync(), we cannot reliably and safely
> > freeze the filesystem.  This is primarily because the final unmount of a
> > filesystem takes a write lock on the s_umount rw semaphore before
> > calling into gfs2_put_super(), and freeze_go_sync() needs to call
> > freeze_super() which also takes a write lock on the same semaphore,
> > causing a deadlock.  We could work around this by trying to take an
> > active reference on the super block first, which would prevent unmount
> > from running at the same time.  But that can fail, and freeze_go_sync()
> > isn't actually allowed to fail.
> >
> > To get around this, this patch changes the freeze glock locking scheme
> > as follows:
> >
> > At mount time, each node takes the freeze glock in shared mode.  To
> > freeze a filesystem, the initiating node first freezes the filesystem
> > locally and then drops and re-acquires the freeze glock in exclusive
> > mode.  All other nodes notice that there is contention on the freeze
> > glock in their go_callback callbacks, and they schedule
> > gfs2_freeze_func() to run.  There, they freeze the filesystem locally
> > and drop and re-acquire the freeze glock before re-thawing the
> > filesystem.  This is happening outside of the glock state engine, so
> > there, we are allowed to fail.
> >
> >  From a cluster point of view, taking and immediately dropping a glock is
> > indistinguishable from taking the glock and only dropping it upon
> > contention, so this new scheme is compatible with the old one.
>
> Nice!
>
> > Signed-off-by: Andreas Gruenbacher 
> > ---
> >   fs/gfs2/glops.c  |  52 +
> >   fs/gfs2/log.c|   2 -
> >   fs/gfs2/ops_fstype.c |   5 +-
> >   fs/gfs2/recovery.c   |  24 +++---
> >   fs/gfs2/super.c  | 179 ++-
> >   fs/gfs2/super.h  |   1 +
> >   fs/gfs2/util.c   |  32 +++-
> >   7 files changed, 185 insertions(+), 110 deletions(-)
> >
> > diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
> > index 01d433ed6ce7..7c48c7afd6a4 100644
> > --- a/fs/gfs2/glops.c
> > +++ b/fs/gfs2/glops.c
> > @@ -561,47 +561,33 @@ static void inode_go_dump(struct seq_file *seq, 
> > struct gfs2_glock *gl,
> >   }
> >
> >   /**
> > - * freeze_go_sync - promote/demote the freeze glock
> > + * freeze_go_callback - A cluster node is requesting a freeze
> >* @gl: the glock
> > + * @remote: true if this came from a different cluster node
> >*/
> >
> > -static int freeze_go_sync(struct gfs2_glock *gl)
> > +static void freeze_go_callback(struct gfs2_glock *gl, bool remote)
> >   {
> > - int error = 0;
> >   struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
> > + struct super_block *sb = sdp->sd_vfs;
> > +
> > + if (!remote ||
> > + gl->gl_state != LM_ST_SHARED ||
> > + gl->gl_demote_state != LM_ST_UNLOCKED)
> > + return;
> >
> >   /*
> > -  * We need to check gl_state == LM_ST_SHARED here and not gl_req ==
> > -  * LM_ST_EXCLUSIVE. That's because when any node does a freeze,
> > -  * all the nodes should have the freeze glock in SH mode and they all
> > -  * call do_xmote: One for EX and the others for UN. They ALL must
> > -  * freeze locally, and they ALL must queue freeze work. The 
> > freeze_work
> > -  * calls freeze_func, which tries to reacquire the freeze glock in SH,
> > -  * effectively waiting for the thaw on the node who holds it in EX.

Re: [Cluster-devel] [PATCH] gfs2: Fix possible data races in gfs2_show_options()

2023-06-13 Thread Andreas Gruenbacher
On Tue, Jun 13, 2023 at 5:07 AM Tuo Li  wrote:
> Some fields such as gt_logd_secs of the struct gfs2_tune are accessed
> without holding the lock gt_spin in gfs2_show_options():
>
>   val = sdp->sd_tune.gt_logd_secs;
>   if (val != 30)
> seq_printf(s, ",commit=%d", val);
>
> And thus can cause data races when gfs2_show_options() and other functions
> such as gfs2_reconfigure() are concurrently executed:
>
>   spin_lock(&gt->gt_spin);
>   gt->gt_logd_secs = newargs->ar_commit;
>
> To fix these possible data races, the lock sdp->sd_tune.gt_spin is
> acquired before accessing the fields of gfs2_tune and released after these
> accesses.
>
> Reported-by: BassCheck 
> Signed-off-by: Tuo Li 
> ---
>  fs/gfs2/super.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
> index a84bf6444bba..671adf38fe03 100644
> --- a/fs/gfs2/super.c
> +++ b/fs/gfs2/super.c
> @@ -1059,6 +1059,7 @@ static int gfs2_show_options(struct seq_file *s, struct 
> dentry *root)
> }
> if (args->ar_discard)
> seq_puts(s, ",discard");
> +   spin_lock(&sdp->sd_tune.gt_spin);
> val = sdp->sd_tune.gt_logd_secs;
> if (val != 30)
> seq_printf(s, ",commit=%d", val);
> @@ -1070,6 +1071,7 @@ static int gfs2_show_options(struct seq_file *s, struct 
> dentry *root)
> val = sdp->sd_tune.gt_quota_quantum;
> if (val != 60)
> seq_printf(s, ",quota_quantum=%d", val);
> +   spin_unlock(&sdp->sd_tune.gt_spin);
> if (args->ar_statfs_percent)
> seq_printf(s, ",statfs_percent=%d", args->ar_statfs_percent);
> if (args->ar_errors != GFS2_ERRORS_DEFAULT) {
> --
> 2.34.1
>

Added to for-next (with minor adjustments).

Thanks,
Andreas
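The fix quoted above amounts to snapshotting all tunables under one lock acquisition instead of reading each one unlocked. A userspace model, with pthread_mutex_t standing in for the kernel spinlock and a pared-down gfs2_tune (field names mirror the patch; everything else is illustrative):

```c
#include <pthread.h>

/* pared-down model of struct gfs2_tune */
struct gfs2_tune {
	pthread_mutex_t gt_spin;	/* stands in for the gt_spin spinlock */
	int gt_logd_secs;
	int gt_quota_quantum;
};

struct tune_snapshot {
	int logd_secs;
	int quota_quantum;
};

/* read all fields under a single lock hold, so a concurrent writer
 * (e.g. a remount path) cannot be observed mid-update */
static struct tune_snapshot tune_snapshot(struct gfs2_tune *gt)
{
	struct tune_snapshot s;

	pthread_mutex_lock(&gt->gt_spin);
	s.logd_secs = gt->gt_logd_secs;
	s.quota_quantum = gt->gt_quota_quantum;
	pthread_mutex_unlock(&gt->gt_spin);
	return s;
}
```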



Re: [Cluster-devel] [PATCH v3 00/14] gfs2/buffer folio changes for 6.5

2023-06-12 Thread Andreas Gruenbacher
On Mon, Jun 12, 2023 at 11:02 PM Matthew Wilcox (Oracle)
 wrote:
> This kind of started off as a gfs2 patch series, then became entwined
> with buffer heads once I realised that gfs2 was the only remaining
> caller of __block_write_full_page().  For those not in the gfs2 world,
> the big point of this series is that block_write_full_page() should now
> handle large folios correctly.

This is great, thank you. For the gfs2 bits:

Reviewed-by: Andreas Gruenbacher 

> Andrew, if you want, I'll drop it into the pagecache tree, or you
> can just take it.
>
> v3:
>  - Fix a patch title
>  - Fix some checks against i_size to be >= instead of >
>  - Call folio_mark_dirty() instead of folio_set_dirty()
>
> Matthew Wilcox (Oracle) (14):
>   gfs2: Use a folio inside gfs2_jdata_writepage()
>   gfs2: Pass a folio to __gfs2_jdata_write_folio()
>   gfs2: Convert gfs2_write_jdata_page() to gfs2_write_jdata_folio()
>   buffer: Convert __block_write_full_page() to
> __block_write_full_folio()
>   gfs2: Support ludicrously large folios in gfs2_trans_add_databufs()
>   buffer: Make block_write_full_page() handle large folios correctly
>   buffer: Convert block_page_mkwrite() to use a folio
>   buffer: Convert __block_commit_write() to take a folio
>   buffer: Convert page_zero_new_buffers() to folio_zero_new_buffers()
>   buffer: Convert grow_dev_page() to use a folio
>   buffer: Convert init_page_buffers() to folio_init_buffers()
>   buffer: Convert link_dev_buffers to take a folio
>   buffer: Use a folio in __find_get_block_slow()
>   buffer: Convert block_truncate_page() to use a folio
>
>  fs/buffer.c | 257 ++--
>  fs/ext4/inode.c |   4 +-
>  fs/gfs2/aops.c  |  69 +-
>  fs/gfs2/aops.h  |   2 +-
>  fs/ntfs/aops.c  |   2 +-
>  fs/reiserfs/inode.c |   9 +-
>  include/linux/buffer_head.h |   4 +-
>  7 files changed, 172 insertions(+), 175 deletions(-)
>
> --
> 2.39.2
>



[Cluster-devel] [PATCH] gfs2: Fix duplicate should_fault_in_pages() call

2023-06-12 Thread Andreas Gruenbacher
From: Bob Peterson 

In gfs2_file_buffered_write(), we currently jump from the second call of
function should_fault_in_pages() to above the first call, so
should_fault_in_pages() is getting called twice in a row, causing it to
accidentally fall back to single-page writes rather than trying the more
efficient multi-page writes first.

Fix that by moving the retry label to the correct place, behind the
first call to should_fault_in_pages().
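The effect of the label position can be shown with a call counter. This is a pure control-flow model: should_fault_in_pages() is reduced to a counter, and the retry path jumps back to the fault-in step itself rather than to the check, so the check runs exactly once per write:

```c
static int check_calls;

/* stand-in for should_fault_in_pages(): just counts invocations */
static int should_fault_in_pages(void)
{
	check_calls++;
	return 1;
}

/* model of the fixed gfs2_file_buffered_write() control flow */
static int buffered_write(int retries)
{
	check_calls = 0;
	if (should_fault_in_pages()) {
retry:
		if (retries-- > 0)
			goto retry;	/* retries fault-in, not the check */
	}
	return check_calls;	/* with the fix, always 1 */
}
```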

Fixes: e1fa9ea85ce8 ("gfs2: Stop using glock holder auto-demotion for now")
Signed-off-by: Bob Peterson 
Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index a6ad64cee885..464ae4bf209e 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -1033,8 +1033,8 @@ static ssize_t gfs2_file_buffered_write(struct kiocb 
*iocb,
}
 
gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, gh);
-retry:
if (should_fault_in_pages(from, iocb, _count, _size)) {
+retry:
window_size -= fault_in_iov_iter_readable(from, window_size);
if (!window_size) {
ret = -EFAULT;
-- 
2.40.0



[Cluster-devel] [PATCH 8/8] gfs2: gfs2_freeze_lock_shared cleanup

2023-06-12 Thread Andreas Gruenbacher
All the remaining users of gfs2_freeze_lock_shared() set freeze_gh to
&sdp->sd_freeze_gh and flags to 0, so remove those two parameters.

Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/ops_fstype.c | 2 +-
 fs/gfs2/super.c  | 4 ++--
 fs/gfs2/util.c   | 9 +++--
 fs/gfs2/util.h   | 4 +---
 4 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 24acd17e530c..9375409fd0c5 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -1267,7 +1267,7 @@ static int gfs2_fill_super(struct super_block *sb, struct 
fs_context *fc)
}
}
 
-   error = gfs2_freeze_lock_shared(sdp, >sd_freeze_gh, 0);
+   error = gfs2_freeze_lock_shared(sdp);
if (error)
goto fail_per_node;
 
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 828297e9cbca..f2fa0fb80ee1 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -381,7 +381,7 @@ static int gfs2_lock_fs_check_clean(struct gfs2_sbd *sdp)
gfs2_freeze_unlock(&sdp->sd_freeze_gh);
 
 relock_shared:
-   error2 = gfs2_freeze_lock_shared(sdp, &sdp->sd_freeze_gh, 0);
+   error2 = gfs2_freeze_lock_shared(sdp);
gfs2_assert_withdraw(sdp, !error2);
 
 out:
@@ -709,7 +709,7 @@ static int gfs2_do_thaw(struct gfs2_sbd *sdp)
struct super_block *sb = sdp->sd_vfs;
int error;
 
-   error = gfs2_freeze_lock_shared(sdp, &sdp->sd_freeze_gh, 0);
+   error = gfs2_freeze_lock_shared(sdp);
if (error)
goto fail;
error = thaw_super(sb);
diff --git a/fs/gfs2/util.c b/fs/gfs2/util.c
index 00494dffb47c..b9db034c7f58 100644
--- a/fs/gfs2/util.c
+++ b/fs/gfs2/util.c
@@ -95,17 +95,14 @@ int check_journal_clean(struct gfs2_sbd *sdp, struct 
gfs2_jdesc *jd,
 /**
  * gfs2_freeze_lock_shared - hold the freeze glock
  * @sdp: the superblock
- * @freeze_gh: pointer to the requested holder
- * @caller_flags: any additional flags needed by the caller
  */
-int gfs2_freeze_lock_shared(struct gfs2_sbd *sdp, struct gfs2_holder 
*freeze_gh,
-   int caller_flags)
+int gfs2_freeze_lock_shared(struct gfs2_sbd *sdp)
 {
-   int flags = LM_FLAG_NOEXP | GL_EXACT | caller_flags;
+   int flags = LM_FLAG_NOEXP | GL_EXACT;
int error;
 
error = gfs2_glock_nq_init(sdp->sd_freeze_gl, LM_ST_SHARED, flags,
-  freeze_gh);
+  &sdp->sd_freeze_gh);
if (error && error != GLR_TRYFAILED)
fs_err(sdp, "can't lock the freeze glock: %d\n", error);
return error;
diff --git a/fs/gfs2/util.h b/fs/gfs2/util.h
index 3291e33e81e9..cdb839529175 100644
--- a/fs/gfs2/util.h
+++ b/fs/gfs2/util.h
@@ -149,9 +149,7 @@ int gfs2_io_error_i(struct gfs2_sbd *sdp, const char 
*function,
 
 extern int check_journal_clean(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd,
   bool verbose);
-extern int gfs2_freeze_lock_shared(struct gfs2_sbd *sdp,
-  struct gfs2_holder *freeze_gh,
-  int caller_flags);
+extern int gfs2_freeze_lock_shared(struct gfs2_sbd *sdp);
 extern void gfs2_freeze_unlock(struct gfs2_holder *freeze_gh);
 
 #define gfs2_io_error(sdp) \
-- 
2.40.0



[Cluster-devel] [PATCH 6/8] gfs2: Rework freeze / thaw logic

2023-06-12 Thread Andreas Gruenbacher
So far, at mount time, gfs2 would take the freeze glock in shared mode
and then immediately drop it again, turning it into a cached glock that
can be reclaimed at any time.  To freeze the filesystem cluster-wide,
the node initiating the freeze would take the freeze glock in exclusive
mode, which would cause the freeze glock's freeze_go_sync() callback to
run on each node.  There, gfs2 would freeze the filesystem and schedule
gfs2_freeze_func() to run.  gfs2_freeze_func() would re-acquire the
freeze glock in shared mode, thaw the filesystem, and drop the freeze
glock again.  The initiating node would keep the freeze glock held in
exclusive mode.  To thaw the filesystem, the initiating node would drop
the freeze glock again, which would allow gfs2_freeze_func() to resume
on all nodes, leaving the filesystem in the thawed state.

It turns out that in freeze_go_sync(), we cannot reliably and safely
freeze the filesystem.  This is primarily because the final unmount of a
filesystem takes a write lock on the s_umount rw semaphore before
calling into gfs2_put_super(), and freeze_go_sync() needs to call
freeze_super() which also takes a write lock on the same semaphore,
causing a deadlock.  We could work around this by trying to take an
active reference on the super block first, which would prevent unmount
from running at the same time.  But that can fail, and freeze_go_sync()
isn't actually allowed to fail.

To get around this, this patch changes the freeze glock locking scheme
as follows:

At mount time, each node takes the freeze glock in shared mode.  To
freeze a filesystem, the initiating node first freezes the filesystem
locally and then drops and re-acquires the freeze glock in exclusive
mode.  All other nodes notice that there is contention on the freeze
glock in their go_callback callbacks, and they schedule
gfs2_freeze_func() to run.  There, they freeze the filesystem locally
and drop and re-acquire the freeze glock before re-thawing the
filesystem.  This is happening outside of the glock state engine, so
there, we are allowed to fail.

From a cluster point of view, taking and immediately dropping a glock is
indistinguishable from taking the glock and only dropping it upon
contention, so this new scheme is compatible with the old one.

Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/glops.c  |  52 +
 fs/gfs2/log.c|   2 -
 fs/gfs2/ops_fstype.c |   5 +-
 fs/gfs2/recovery.c   |  24 +++---
 fs/gfs2/super.c  | 179 ++-
 fs/gfs2/super.h  |   1 +
 fs/gfs2/util.c   |  32 +++-
 7 files changed, 185 insertions(+), 110 deletions(-)

diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index 01d433ed6ce7..7c48c7afd6a4 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -561,47 +561,33 @@ static void inode_go_dump(struct seq_file *seq, struct 
gfs2_glock *gl,
 }
 
 /**
- * freeze_go_sync - promote/demote the freeze glock
+ * freeze_go_callback - A cluster node is requesting a freeze
  * @gl: the glock
+ * @remote: true if this came from a different cluster node
  */
 
-static int freeze_go_sync(struct gfs2_glock *gl)
+static void freeze_go_callback(struct gfs2_glock *gl, bool remote)
 {
-   int error = 0;
struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
+   struct super_block *sb = sdp->sd_vfs;
+
+   if (!remote ||
+   gl->gl_state != LM_ST_SHARED ||
+   gl->gl_demote_state != LM_ST_UNLOCKED)
+   return;
 
/*
-* We need to check gl_state == LM_ST_SHARED here and not gl_req ==
-* LM_ST_EXCLUSIVE. That's because when any node does a freeze,
-* all the nodes should have the freeze glock in SH mode and they all
-* call do_xmote: One for EX and the others for UN. They ALL must
-* freeze locally, and they ALL must queue freeze work. The freeze_work
-* calls freeze_func, which tries to reacquire the freeze glock in SH,
-* effectively waiting for the thaw on the node who holds it in EX.
-* Once thawed, the work func acquires the freeze glock in
-* SH and everybody goes back to thawed.
+* Try to get an active super block reference to prevent racing with
+* unmount (see trylock_super()).  But note that unmount isn't the only
+* place where a write lock on s_umount is taken, and we can fail here
+* because of things like remount as well.
 */
-   if (gl->gl_state == LM_ST_SHARED && !gfs2_withdrawn(sdp) &&
-   !test_bit(SDF_NORECOVERY, &sdp->sd_flags)) {
-   atomic_set(&sdp->sd_freeze_state, SFS_STARTING_FREEZE);
-   error = freeze_super(sdp->sd_vfs);
-   if (error) {
-   fs_info(sdp, "GFS2: couldn't freeze filesystem: %d\n",
-   error);
-   if (gfs2_withdrawn(sdp)) {
-   atomic_set(&sdp->sd_freeze_state,

[Cluster-devel] [PATCH 7/8] gfs2: Replace sd_freeze_state with SDF_FROZEN flag

2023-06-12 Thread Andreas Gruenbacher
Replace sd_freeze_state with a new SDF_FROZEN flag.

There no longer is a need for indicating that a freeze is in progress
(SDF_STARTING_FREEZE); we are now protecting the critical sections with
the sd_freeze_mutex.
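The three-state enum collapsing into a single flag bit can be modeled with plain bit helpers. These are simplified userspace stand-ins for the kernel's atomic set_bit()/clear_bit()/test_bit(), using the SDF_FROZEN bit number from the patch:

```c
#include <stdbool.h>

enum { SDF_FROZEN = 17 };	/* bit number from the patch */

/* simplified, non-atomic stand-ins for the kernel bitops */
static void set_bit(int nr, unsigned long *addr)
{
	*addr |= 1UL << nr;
}

static void clear_bit(int nr, unsigned long *addr)
{
	*addr &= ~(1UL << nr);
}

static bool test_bit(int nr, const unsigned long *addr)
{
	return (*addr >> nr) & 1;
}

/* the old SFS_UNFROZEN/SFS_STARTING_FREEZE/SFS_FROZEN enum reduces to:
 * frozen iff the bit is set */
static bool sdp_frozen(const unsigned long *sd_flags)
{
	return test_bit(SDF_FROZEN, sd_flags);
}
```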

Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/incore.h |  8 +---
 fs/gfs2/log.c|  9 -
 fs/gfs2/ops_fstype.c |  1 -
 fs/gfs2/recovery.c   |  2 +-
 fs/gfs2/super.c  | 25 +
 fs/gfs2/sys.c|  2 ++
 fs/gfs2/trans.c  |  3 +--
 7 files changed, 18 insertions(+), 32 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 1f10529ebcf5..8b1b50d19de1 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -608,12 +608,7 @@ enum {
 withdrawing */
SDF_DEACTIVATING= 15,
SDF_EVICTING= 16,
-};
-
-enum gfs2_freeze_state {
-   SFS_UNFROZEN= 0,
-   SFS_STARTING_FREEZE = 1,
-   SFS_FROZEN  = 2,
+   SDF_FROZEN  = 17,
 };
 
 #define GFS2_FSNAME_LEN256
@@ -841,7 +836,6 @@ struct gfs2_sbd {
 
/* For quiescing the filesystem */
struct gfs2_holder sd_freeze_gh;
-   atomic_t sd_freeze_state;
struct mutex sd_freeze_mutex;
 
char sd_fsname[GFS2_FSNAME_LEN + 3 * sizeof(int) + 2];
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index dca535311dee..aa568796207c 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -914,9 +914,8 @@ void gfs2_write_log_header(struct gfs2_sbd *sdp, struct 
gfs2_jdesc *jd,
 static void log_write_header(struct gfs2_sbd *sdp, u32 flags)
 {
blk_opf_t op_flags = REQ_PREFLUSH | REQ_FUA | REQ_META | REQ_SYNC;
-   enum gfs2_freeze_state state = atomic_read(&sdp->sd_freeze_state);
 
-   gfs2_assert_withdraw(sdp, (state != SFS_FROZEN));
+   gfs2_assert_withdraw(sdp, !test_bit(SDF_FROZEN, &sdp->sd_flags));
 
if (test_bit(SDF_NOBARRIERS, &sdp->sd_flags)) {
gfs2_ordered_wait(sdp);
@@ -1036,7 +1035,7 @@ void gfs2_log_flush(struct gfs2_sbd *sdp, struct 
gfs2_glock *gl, u32 flags)
 {
struct gfs2_trans *tr = NULL;
unsigned int reserved_blocks = 0, used_blocks = 0;
-   enum gfs2_freeze_state state = atomic_read(&sdp->sd_freeze_state);
+   bool frozen = test_bit(SDF_FROZEN, &sdp->sd_flags);
unsigned int first_log_head;
unsigned int reserved_revokes = 0;
 
@@ -1067,7 +1066,7 @@ void gfs2_log_flush(struct gfs2_sbd *sdp, struct 
gfs2_glock *gl, u32 flags)
if (tr) {
sdp->sd_log_tr = NULL;
tr->tr_first = first_log_head;
-   if (unlikely (state == SFS_FROZEN)) {
+   if (unlikely(frozen)) {
if (gfs2_assert_withdraw_delayed(sdp,
   !tr->tr_num_buf_new && 
!tr->tr_num_databuf_new))
goto out_withdraw;
@@ -1092,7 +1091,7 @@ void gfs2_log_flush(struct gfs2_sbd *sdp, struct 
gfs2_glock *gl, u32 flags)
if (flags & GFS2_LOG_HEAD_FLUSH_SHUTDOWN)
clear_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags);
 
-   if (unlikely(state == SFS_FROZEN))
+   if (unlikely(frozen))
if (gfs2_assert_withdraw_delayed(sdp, !reserved_revokes))
goto out_withdraw;
 
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 4ce5718719ac..24acd17e530c 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -135,7 +135,6 @@ static struct gfs2_sbd *init_sbd(struct super_block *sb)
init_rwsem(&sdp->sd_log_flush_lock);
atomic_set(&sdp->sd_log_in_flight, 0);
init_waitqueue_head(&sdp->sd_log_flush_wait);
-   atomic_set(&sdp->sd_freeze_state, SFS_UNFROZEN);
mutex_init(&sdp->sd_freeze_mutex);
 
return sdp;
diff --git a/fs/gfs2/recovery.c b/fs/gfs2/recovery.c
index afeda936e2be..9c7a9f640bad 100644
--- a/fs/gfs2/recovery.c
+++ b/fs/gfs2/recovery.c
@@ -467,7 +467,7 @@ void gfs2_recover_func(struct work_struct *work)
if (!(head.lh_flags & GFS2_LOG_HEAD_UNMOUNT)) {
mutex_lock(&sdp->sd_freeze_mutex);
 
-   if (atomic_read(&sdp->sd_freeze_state) != SFS_UNFROZEN) {
+   if (test_bit(SDF_FROZEN, &sdp->sd_flags)) {
mutex_unlock(&sdp->sd_freeze_mutex);
fs_warn(sdp, "jid=%u: Can't replay: filesystem "
"is frozen\n", jd->jd_jid);
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index eb41ab32695a..828297e9cbca 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -689,26 +689,19 @@ static int gfs2_freeze_locally(struct gfs2_sbd *sdp)
struct super_block *sb = sdp->sd_vfs;
int error;
 
-   atomic_set(&sdp->sd_freeze_state, SFS_STARTING_FREEZE);
-
error = freeze_super(sb);
if (error)
-   goto fail;
+   return 

[Cluster-devel] [PATCH 5/8] gfs2: Rename SDF_{FS_FROZEN => FREEZE_INITIATOR}

2023-06-12 Thread Andreas Gruenbacher
Rename the SDF_FS_FROZEN flag to SDF_FREEZE_INITIATOR to indicate more
clearly that the node that has this flag set is the initiator of the
freeze.

Signed-off-by: Andreas Gruenbacher 
---
-   clear_bit_unlock(SDF_FS_FROZEN, &sdp->sd_flags);
-   wake_up_bit(&sdp->sd_flags, SDF_FS_FROZEN);
+   clear_bit_unlock(SDF_FREEZE_INITIATOR, &sdp->sd_flags);
+   wake_up_bit(&sdp->sd_flags, SDF_FREEZE_INITIATOR);
return;
 }
 
@@ -735,7 +735,7 @@ static int gfs2_freeze_super(struct super_block *sb)
fs_err(sdp, "retrying...\n");
msleep(1000);
}
-   set_bit(SDF_FS_FROZEN, >sd_flags);
+   set_bit(SDF_FREEZE_INITIATOR, >sd_flags);
 out:
mutex_unlock(&sdp->sd_freeze_mutex);
return error;
@@ -760,7 +760,7 @@ static int gfs2_thaw_super(struct super_block *sb)
 
gfs2_freeze_unlock(&sdp->sd_freeze_gh);
mutex_unlock(&sdp->sd_freeze_mutex);
-   return wait_on_bit(&sdp->sd_flags, SDF_FS_FROZEN, TASK_INTERRUPTIBLE);
+   return wait_on_bit(&sdp->sd_flags, SDF_FREEZE_INITIATOR, 
TASK_INTERRUPTIBLE);
 }
 
 /**
diff --git a/fs/gfs2/sys.c b/fs/gfs2/sys.c
index 454dc2ff8b5e..bc752de01573 100644
--- a/fs/gfs2/sys.c
+++ b/fs/gfs2/sys.c
@@ -111,7 +111,7 @@ static ssize_t status_show(struct gfs2_sbd *sdp, char *buf)
 test_bit(SDF_RORECOVERY, &f),
 test_bit(SDF_SKIP_DLM_UNLOCK, &f),
 test_bit(SDF_FORCE_AIL_FLUSH, &f),
-test_bit(SDF_FS_FROZEN, &f),
+test_bit(SDF_FREEZE_INITIATOR, &f),
 test_bit(SDF_WITHDRAWING, &f),
 test_bit(SDF_WITHDRAW_IN_PROG, &f),
 test_bit(SDF_REMOTE_WITHDRAW, &f),
diff --git a/fs/gfs2/util.c b/fs/gfs2/util.c
index 817c4f42ae7b..1743caee5505 100644
--- a/fs/gfs2/util.c
+++ b/fs/gfs2/util.c
@@ -187,7 +187,7 @@ static void signal_our_withdraw(struct gfs2_sbd *sdp)
}
sdp->sd_jinode_gh.gh_flags |= GL_NOCACHE;
gfs2_glock_dq(&sdp->sd_jinode_gh);
-   if (test_bit(SDF_FS_FROZEN, &sdp->sd_flags)) {
+   if (test_bit(SDF_FREEZE_INITIATOR, &sdp->sd_flags)) {
/* Make sure gfs2_thaw_super works if partially-frozen */
flush_work(&sdp->sd_freeze_work);
atomic_set(&sdp->sd_freeze_state, SFS_FROZEN);
-- 
2.40.0



[Cluster-devel] [PATCH 4/8] gfs2: Reconfiguring frozen filesystem already rejected

2023-06-12 Thread Andreas Gruenbacher
Reconfiguring a frozen filesystem is already rejected in
reconfigure_super(), so there is no need to check for that condition
again at the filesystem level.

Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/ops_fstype.c | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 4fb25496e92f..12eedba45dbb 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -1590,12 +1590,6 @@ static int gfs2_reconfigure(struct fs_context *fc)
fc->sb_flags |= SB_RDONLY;
 
if ((sb->s_flags ^ fc->sb_flags) & SB_RDONLY) {
-   struct gfs2_holder freeze_gh;
-
-   error = gfs2_freeze_lock_shared(sdp, &freeze_gh, 0);
-   if (error)
-   return -EINVAL;
-
if (fc->sb_flags & SB_RDONLY) {
gfs2_make_fs_ro(sdp);
} else {
@@ -1603,7 +1597,6 @@ static int gfs2_reconfigure(struct fs_context *fc)
if (error)
errorfc(fc, "unable to remount read-write");
}
-   gfs2_freeze_unlock(&freeze_gh);
}
sdp->sd_args = *newargs;
 
-- 
2.40.0



[Cluster-devel] [PATCH 3/8] gfs2: Rename gfs2_freeze_lock{ => _shared }

2023-06-12 Thread Andreas Gruenbacher
Rename gfs2_freeze_lock to gfs2_freeze_lock_shared to make it a bit more
obvious that this function establishes the "thawed" state of the freeze
glock.

Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/ops_fstype.c |  4 ++--
 fs/gfs2/recovery.c   |  2 +-
 fs/gfs2/super.c  |  2 +-
 fs/gfs2/util.c   | 10 +-
 fs/gfs2/util.h   |  5 +++--
 5 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 3b93e4a22dfd..4fb25496e92f 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -1269,7 +1269,7 @@ static int gfs2_fill_super(struct super_block *sb, struct 
fs_context *fc)
}
}
 
-   error = gfs2_freeze_lock(sdp, &freeze_gh, 0);
+   error = gfs2_freeze_lock_shared(sdp, &freeze_gh, 0);
if (error)
goto fail_per_node;
 
@@ -1592,7 +1592,7 @@ static int gfs2_reconfigure(struct fs_context *fc)
if ((sb->s_flags ^ fc->sb_flags) & SB_RDONLY) {
struct gfs2_holder freeze_gh;
 
-   error = gfs2_freeze_lock(sdp, &freeze_gh, 0);
+   error = gfs2_freeze_lock_shared(sdp, &freeze_gh, 0);
if (error)
return -EINVAL;
 
diff --git a/fs/gfs2/recovery.c b/fs/gfs2/recovery.c
index d8e522f389aa..61ef07da40b2 100644
--- a/fs/gfs2/recovery.c
+++ b/fs/gfs2/recovery.c
@@ -470,7 +470,7 @@ void gfs2_recover_func(struct work_struct *work)
 
/* Acquire a shared hold on the freeze glock */
 
-   error = gfs2_freeze_lock(sdp, &freeze_gh, LM_FLAG_PRIORITY);
+   error = gfs2_freeze_lock_shared(sdp, &freeze_gh, 
LM_FLAG_PRIORITY);
if (error)
goto fail_gunlock_ji;
 
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index dbda7e253a9c..0fecd1887d0c 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -677,7 +677,7 @@ void gfs2_freeze_func(struct work_struct *work)
struct super_block *sb = sdp->sd_vfs;
 
atomic_inc(&sb->s_active);
-   error = gfs2_freeze_lock(sdp, &freeze_gh, 0);
+   error = gfs2_freeze_lock_shared(sdp, &freeze_gh, 0);
if (error) {
gfs2_assert_withdraw(sdp, 0);
} else {
diff --git a/fs/gfs2/util.c b/fs/gfs2/util.c
index 16ac3b3e7b88..817c4f42ae7b 100644
--- a/fs/gfs2/util.c
+++ b/fs/gfs2/util.c
@@ -93,13 +93,13 @@ int check_journal_clean(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd,
 }
 
 /**
- * gfs2_freeze_lock - hold the freeze glock
+ * gfs2_freeze_lock_shared - hold the freeze glock
  * @sdp: the superblock
  * @freeze_gh: pointer to the requested holder
  * @caller_flags: any additional flags needed by the caller
  */
-int gfs2_freeze_lock(struct gfs2_sbd *sdp, struct gfs2_holder *freeze_gh,
-int caller_flags)
+int gfs2_freeze_lock_shared(struct gfs2_sbd *sdp, struct gfs2_holder *freeze_gh,
+   int caller_flags)
 {
int flags = LM_FLAG_NOEXP | GL_EXACT | caller_flags;
int error;
@@ -157,8 +157,8 @@ static void signal_our_withdraw(struct gfs2_sbd *sdp)
gfs2_holder_mark_uninitialized(&freeze_gh);
if (sdp->sd_freeze_gl &&
!gfs2_glock_is_locked_by_me(sdp->sd_freeze_gl)) {
-   ret = gfs2_freeze_lock(sdp, &freeze_gh,
-  log_write_allowed ? 0 : LM_FLAG_TRY);
+   ret = gfs2_freeze_lock_shared(sdp, &freeze_gh,
+   log_write_allowed ? 0 : LM_FLAG_TRY);
if (ret == GLR_TRYFAILED)
ret = 0;
}
diff --git a/fs/gfs2/util.h b/fs/gfs2/util.h
index 78ec190f4155..3291e33e81e9 100644
--- a/fs/gfs2/util.h
+++ b/fs/gfs2/util.h
@@ -149,8 +149,9 @@ int gfs2_io_error_i(struct gfs2_sbd *sdp, const char *function,
 
 extern int check_journal_clean(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd,
   bool verbose);
-extern int gfs2_freeze_lock(struct gfs2_sbd *sdp,
-   struct gfs2_holder *freeze_gh, int caller_flags);
+extern int gfs2_freeze_lock_shared(struct gfs2_sbd *sdp,
+  struct gfs2_holder *freeze_gh,
+  int caller_flags);
 extern void gfs2_freeze_unlock(struct gfs2_holder *freeze_gh);
 
 #define gfs2_io_error(sdp) \
-- 
2.40.0



[Cluster-devel] [PATCH 2/8] gfs2: Rename the {freeze, thaw}_super callbacks

2023-06-12 Thread Andreas Gruenbacher
Rename gfs2_freeze to gfs2_freeze_super and gfs2_unfreeze to
gfs2_thaw_super to match the names of the corresponding super
operations.

Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/super.c | 12 ++--
 fs/gfs2/util.c  |  2 +-
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 03f45711100d..dbda7e253a9c 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -697,12 +697,12 @@ void gfs2_freeze_func(struct work_struct *work)
 }
 
 /**
- * gfs2_freeze - prevent further writes to the filesystem
+ * gfs2_freeze_super - prevent further writes to the filesystem
  * @sb: the VFS structure for the filesystem
  *
  */
 
-static int gfs2_freeze(struct super_block *sb)
+static int gfs2_freeze_super(struct super_block *sb)
 {
struct gfs2_sbd *sdp = sb->s_fs_info;
int error;
@@ -742,12 +742,12 @@ static int gfs2_freeze(struct super_block *sb)
 }
 
 /**
- * gfs2_unfreeze - reallow writes to the filesystem
+ * gfs2_thaw_super - reallow writes to the filesystem
  * @sb: the VFS structure for the filesystem
  *
  */
 
-static int gfs2_unfreeze(struct super_block *sb)
+static int gfs2_thaw_super(struct super_block *sb)
 {
struct gfs2_sbd *sdp = sb->s_fs_info;
 
@@ -1526,8 +1526,8 @@ const struct super_operations gfs2_super_ops = {
.evict_inode= gfs2_evict_inode,
.put_super  = gfs2_put_super,
.sync_fs= gfs2_sync_fs,
-   .freeze_super   = gfs2_freeze,
-   .thaw_super = gfs2_unfreeze,
+   .freeze_super   = gfs2_freeze_super,
+   .thaw_super = gfs2_thaw_super,
.statfs = gfs2_statfs,
.drop_inode = gfs2_drop_inode,
.show_options   = gfs2_show_options,
diff --git a/fs/gfs2/util.c b/fs/gfs2/util.c
index c84242ef4903..16ac3b3e7b88 100644
--- a/fs/gfs2/util.c
+++ b/fs/gfs2/util.c
@@ -188,7 +188,7 @@ static void signal_our_withdraw(struct gfs2_sbd *sdp)
sdp->sd_jinode_gh.gh_flags |= GL_NOCACHE;
gfs2_glock_dq(&sdp->sd_jinode_gh);
if (test_bit(SDF_FS_FROZEN, &sdp->sd_flags)) {
-   /* Make sure gfs2_unfreeze works if partially-frozen */
+   /* Make sure gfs2_thaw_super works if partially-frozen */
flush_work(&sdp->sd_freeze_work);
atomic_set(&sdp->sd_freeze_state, SFS_FROZEN);
thaw_super(sdp->sd_vfs);
-- 
2.40.0



[Cluster-devel] [PATCH 1/8] gfs2: Rename remaining "transaction" glock references

2023-06-12 Thread Andreas Gruenbacher
The transaction glock was repurposed to serve as the new freeze glock
years ago.  Don't refer to it as the transaction glock anymore.

Also, to be more precise, call it the "freeze glock" instead of the
"freeze lock".  Ditto for the journal glock.

Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/glock.c  | 4 ++--
 fs/gfs2/ops_fstype.c | 2 +-
 fs/gfs2/recovery.c   | 8 
 fs/gfs2/super.c  | 2 +-
 fs/gfs2/util.c   | 2 +-
 5 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 5adc7d85dbf3..1438e7465e30 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -145,8 +145,8 @@ static void gfs2_glock_dealloc(struct rcu_head *rcu)
  *
  * We need to allow some glocks to be enqueued, dequeued, promoted, and demoted
  * when we're withdrawn. For example, to maintain metadata integrity, we should
- * disallow the use of inode and rgrp glocks when withdrawn. Other glocks, like
- * iopen or the transaction glocks may be safely used because none of their
+ * disallow the use of inode and rgrp glocks when withdrawn. Other glocks like
+ * the iopen or freeze glock may be safely used because none of their
  * metadata goes through the journal. So in general, we should disallow all
  * glocks that are journaled, and allow all the others. One exception is:
  * we need to allow our active journal to be promoted and demoted so others
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 9af9ddb61ca0..3b93e4a22dfd 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -434,7 +434,7 @@ static int init_locking(struct gfs2_sbd *sdp, struct gfs2_holder *mount_gh,
error = gfs2_glock_get(sdp, GFS2_FREEZE_LOCK, &gfs2_freeze_glops,
   CREATE, &sdp->sd_freeze_gl);
if (error) {
-   fs_err(sdp, "can't create transaction glock: %d\n", error);
+   fs_err(sdp, "can't create freeze glock: %d\n", error);
goto fail_rename;
}
 
diff --git a/fs/gfs2/recovery.c b/fs/gfs2/recovery.c
index 2bb085a72e8e..d8e522f389aa 100644
--- a/fs/gfs2/recovery.c
+++ b/fs/gfs2/recovery.c
@@ -420,10 +420,10 @@ void gfs2_recover_func(struct work_struct *work)
if (sdp->sd_args.ar_spectator)
goto fail;
if (jd->jd_jid != sdp->sd_lockstruct.ls_jid) {
-   fs_info(sdp, "jid=%u: Trying to acquire journal lock...\n",
+   fs_info(sdp, "jid=%u: Trying to acquire journal glock...\n",
jd->jd_jid);
jlocked = 1;
-   /* Acquire the journal lock so we can do recovery */
+   /* Acquire the journal glock so we can do recovery */
 
error = gfs2_glock_nq_num(sdp, jd->jd_jid, &gfs2_journal_glops,
  LM_ST_EXCLUSIVE,
@@ -465,10 +465,10 @@ void gfs2_recover_func(struct work_struct *work)
ktime_ms_delta(t_jhd, t_jlck));
 
if (!(head.lh_flags & GFS2_LOG_HEAD_UNMOUNT)) {
-   fs_info(sdp, "jid=%u: Acquiring the transaction lock...\n",
+   fs_info(sdp, "jid=%u: Acquiring the freeze glock...\n",
jd->jd_jid);
 
-   /* Acquire a shared hold on the freeze lock */
+   /* Acquire a shared hold on the freeze glock */
 
error = gfs2_freeze_lock(sdp, &thaw_gh, LM_FLAG_PRIORITY);
if (error)
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 3a7e7c31eb9c..03f45711100d 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -463,7 +463,7 @@ static int gfs2_write_inode(struct inode *inode, struct writeback_control *wbc)
  * @flags: The type of dirty
  *
  * Unfortunately it can be called under any combination of inode
- * glock and transaction lock, so we have to check carefully.
+ * glock and freeze glock, so we have to check carefully.
  *
  * At the moment this deals only with atime - it should be possible
  * to expand that role in future, once a review of the locking has
diff --git a/fs/gfs2/util.c b/fs/gfs2/util.c
index 7a6aeffcdf5c..c84242ef4903 100644
--- a/fs/gfs2/util.c
+++ b/fs/gfs2/util.c
@@ -107,7 +107,7 @@ int gfs2_freeze_lock(struct gfs2_sbd *sdp, struct gfs2_holder *freeze_gh,
error = gfs2_glock_nq_init(sdp->sd_freeze_gl, LM_ST_SHARED, flags,
   freeze_gh);
if (error && error != GLR_TRYFAILED)
-   fs_err(sdp, "can't lock the freeze lock: %d\n", error);
+   fs_err(sdp, "can't lock the freeze glock: %d\n", error);
return error;
 }
 
-- 
2.40.0



[Cluster-devel] [PATCH 0/8] gfs2: Rework freeze / thaw logic

2023-06-12 Thread Andreas Gruenbacher
Currently, filesystem freeze is implemented inside freeze_go_sync(), a
glock state transition callback.  It turns out that in some scenarios,
freezing the filesystem in such a callback can deadlock.  To prevent
that from happening, rework the freeze / thaw logic and move it out of
the glock state engine.

Andreas Gruenbacher (8):
  gfs2: Rename remaining "transaction" glock references
  gfs2: Rename the {freeze,thaw}_super callbacks
  gfs2: Rename gfs2_freeze_lock{ => _shared }
  gfs2: Reconfiguring frozen filesystem already rejected
  gfs2: Rename SDF_{FS_FROZEN => FREEZE_INITIATOR}
  gfs2: Rework freeze / thaw logic
  gfs2: Replace sd_freeze_state with SDF_FROZEN flag
  gfs2: gfs2_freeze_lock_shared cleanup

 fs/gfs2/glock.c  |   4 +-
 fs/gfs2/glops.c  |  52 +---
 fs/gfs2/incore.h |  10 +--
 fs/gfs2/log.c|  11 +--
 fs/gfs2/ops_fstype.c |  15 +---
 fs/gfs2/recovery.c   |  28 +++
 fs/gfs2/super.c  | 186 +--
 fs/gfs2/super.h  |   1 +
 fs/gfs2/sys.c|   4 +-
 fs/gfs2/trans.c  |   3 +-
 fs/gfs2/util.c   |  45 ---
 fs/gfs2/util.h   |   3 +-
 12 files changed, 206 insertions(+), 156 deletions(-)

-- 
2.40.0



Re: [Cluster-devel] [PATCH] gfs2: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method

2023-06-12 Thread Andreas Gruenbacher
On Mon, Jun 12, 2023 at 7:54 AM Christoph Hellwig  wrote:
> Since commit a2ad63daa88b ("VFS: add FMODE_CAN_ODIRECT file flag") file
> systems can just set the FMODE_CAN_ODIRECT flag at open time instead of
> wiring up a dummy direct_IO method to indicate support for direct I/O.
>
> Remove .direct_IO from gfs2_aops, and set FMODE_CAN_ODIRECT in
> gfs2_open_common for regular files that do not use data journalling.

Thanks, added to for-next.

Andreas

> Signed-off-by: Christoph Hellwig 
> ---
>  fs/gfs2/aops.c | 1 -
>  fs/gfs2/file.c | 3 +++
>  2 files changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
> index a5f4be6b9213ed..d95125714ebb38 100644
> --- a/fs/gfs2/aops.c
> +++ b/fs/gfs2/aops.c
> @@ -750,7 +750,6 @@ static const struct address_space_operations gfs2_aops = {
> .release_folio = iomap_release_folio,
> .invalidate_folio = iomap_invalidate_folio,
> .bmap = gfs2_bmap,
> -   .direct_IO = noop_direct_IO,
> .migrate_folio = filemap_migrate_folio,
> .is_partially_uptodate = iomap_is_partially_uptodate,
> .error_remove_page = generic_error_remove_page,
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index 300844f50dcd28..dcb2b7dd2269cf 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -630,6 +630,9 @@ int gfs2_open_common(struct inode *inode, struct file *file)
> ret = generic_file_open(inode, file);
> if (ret)
> return ret;
> +
> +   if (!gfs2_is_jdata(GFS2_I(inode)))
> +   file->f_mode |= FMODE_CAN_ODIRECT;
> }
>
> fp = kzalloc(sizeof(struct gfs2_file), GFP_NOFS);
> --
> 2.39.2
>



Re: [Cluster-devel] [PATCH 7/9] gfs2: update ctime when quota is updated

2023-06-09 Thread Andreas Gruenbacher
Jeff,

On Fri, Jun 9, 2023 at 2:50 PM Jeff Layton  wrote:
> Signed-off-by: Jeff Layton 
> ---
>  fs/gfs2/quota.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
> index 1ed17226d9ed..6d283e071b90 100644
> --- a/fs/gfs2/quota.c
> +++ b/fs/gfs2/quota.c
> @@ -869,7 +869,7 @@ static int gfs2_adjust_quota(struct gfs2_inode *ip, loff_t loc,
> size = loc + sizeof(struct gfs2_quota);
> if (size > inode->i_size)
> i_size_write(inode, size);
> -   inode->i_mtime = inode->i_atime = current_time(inode);
> +   inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);

I don't think we need to worry about the ctime of the quota inode as
that inode is internal to the filesystem only.

> mark_inode_dirty(inode);
> set_bit(QDF_REFRESH, &qd->qd_flags);
> }
> --
> 2.40.1
>

Thanks,
Andreas



Re: [Cluster-devel] [GIT PULL] gfs2 fix

2023-06-06 Thread Andreas Gruenbacher
On Tue, Jun 6, 2023 at 2:55 PM Linus Torvalds wrote:
> On Tue, Jun 6, 2023 at 5:48 AM Andreas Gruenbacher  
> wrote:
> >
> > - Don't get stuck writing page onto itself under direct I/O.
>
> Btw, is there a test for this DIO case?

The previous test case I wrote for these kinds of page faults is:

  "generic: Test page faults during read and write"
  https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/commit/?id=d3cbdabff

I've added a check for this specific case, but this change hasn't been
posted/merged yet:

  "generic/728: Add mmap + DIO write test"
  https://gitlab.com/agruenba/xfstests/-/commit/8c37de03

> We've had the deadlock issue on the page lock (or for inode locks or
> whatever) for normal IO when faulting in the same page that is written
> to, and we have a pattern for solving that and I think there are
> filesystem tests that trigger this.
>
> But the DIO pattern is a bit different, with the whole "invalidate
> page cache" issue, and the fact that you send this patch now (rather
> than years ago) makes me wonder about test coverage for this all?

Yes, this case wasn't covered so far. The other page fault issues are
covered since 2021, and were fixed in gfs2 back then.

Thanks,
Andreas



[Cluster-devel] [GIT PULL] gfs2 fix

2023-06-06 Thread Andreas Gruenbacher
Hi Linus,

please consider pulling the following fix.

Thanks,
Andreas

The following changes since commit 48b1320a674e1ff5de2fad8606bee38f724594dc:

  Merge tag 'for-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux (2023-05-30 17:23:50 -0400)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git tags/gfs2-v6.4-rc4-fix

for you to fetch changes up to fa58cc888d67e640e354d8b3ceef877ea167b0cf:

  gfs2: Don't get stuck writing page onto itself under direct I/O (2023-06-01 14:55:43 +0200)


gfs2 fix

- Don't get stuck writing page onto itself under direct I/O.


Andreas Gruenbacher (1):
  gfs2: Don't get stuck writing page onto itself under direct I/O

 fs/gfs2/file.c | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)



Re: [Cluster-devel] [PATCH 3/6] gfs2: Convert gfs2_write_jdata_page() to gfs2_write_jdata_folio()

2023-06-04 Thread Andreas Gruenbacher
On Sun, Jun 4, 2023 at 5:38 AM Matthew Wilcox  wrote:
>
> On Sat, Jun 03, 2023 at 11:34:14AM +0200, Andreas Gruenbacher wrote:
> > >   * This is the same as calling block_write_full_page, but it also
> > >   * writes pages outside of i_size
> > >   */
> > > -static int gfs2_write_jdata_page(struct page *page,
> > > +static int gfs2_write_jdata_folio(struct folio *folio,
> > >  struct writeback_control *wbc)
> > >  {
> > > -   struct inode * const inode = page->mapping->host;
> > > +   struct inode * const inode = folio->mapping->host;
> > > loff_t i_size = i_size_read(inode);
> > > -   const pgoff_t end_index = i_size >> PAGE_SHIFT;
> > > -   unsigned offset;
> > >
> > > +   if (folio_pos(folio) >= i_size)
> > > +   return 0;
> >
> > Function gfs2_write_jdata_page was originally introduced as
> > gfs2_write_full_page in commit fd4c5748b8d3 ("gfs2: writeout truncated
> > pages") to allow writing pages even when they are beyond EOF, as the
> > function description documents.
>
> Well, that was stupid of me.
>
> > This hack was added because simply skipping journaled pages isn't
> > enough on gfs2; before a journaled page can be freed, it needs to be
> > marked as "revoked" in the journal. Journal recovery will then skip
> > the revoked blocks, which allows them to be reused for regular,
> > non-journaled data. We can end up here in contexts in which we cannot
> > "revoke" pages, so instead, we write the original pages even when they
> > are beyond EOF. This hack could be revisited, but it's pretty nasty
> > code to pick apart.
> >
> > So at least the above if needs to go for now.
>
> Understood.  So we probably don't want to waste time zeroing the folio
> if it is entirely beyond i_size, right?  Because at the moment we'd
> zero some essentially random part of the folio if I just take out the
> check.  Should it look like this?
>
> if (folio_pos(folio) < i_size &&
> i_size < folio_pos(folio) + folio_size(folio))
>folio_zero_segment(folio, offset_in_folio(folio, i_size),
> folio_size(folio));

Yes, looking good, thanks.

If you haven't already, could you please consider my other comment as
well before you repost?

https://lore.kernel.org/linux-fsdevel/cahc6fu6gowptfx-mgriqqwzzj0r-85c9exc2pnkbkyscgut...@mail.gmail.com/

Thanks,
Andreas



Re: [Cluster-devel] [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions

2023-06-03 Thread Andreas Gruenbacher
On Thu, Jun 1, 2023 at 9:10 PM Alexander Aring  wrote:
> Hi,
>
> On Thu, Jun 1, 2023 at 1:11 PM Andreas Gruenbacher  
> wrote:
> >
> > On Thu, Jun 1, 2023 at 6:28 PM Alexander Aring  wrote:
> > > Hi,
> > >
> > > On Tue, May 30, 2023 at 1:40 PM Andreas Gruenbacher  
> > > wrote:
> > > >
> > > > On Tue, May 30, 2023 at 4:08 PM Alexander Aring  
> > > > wrote:
> > > > > Hi,
> > > > >
> > > > > On Tue, May 30, 2023 at 7:01 AM Andreas Gruenbacher 
> > > > >  wrote:
> > > > > >
> > > > > > On Tue, May 30, 2023 at 12:19 AM Alexander Aring 
> > > > > >  wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > On Thu, May 25, 2023 at 11:02 AM Andreas Gruenbacher
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Wed, May 24, 2023 at 6:02 PM Alexander Aring 
> > > > > > > >  wrote:
> > > > > > > > > This patch fixes a possible plock op collisions when using 
> > > > > > > > > F_SETLKW lock
> > > > > > > > > requests and fsid, number and owner are not enough to 
> > > > > > > > > identify a result
> > > > > > > > > for a pending request. The ltp testcases [0] and [1] are 
> > > > > > > > > examples when
> > > > > > > > > this is not enough in case of using classic posix locks with 
> > > > > > > > > threads and
> > > > > > > > > open filedescriptor posix locks.
> > > > > > > > >
> > > > > > > > > The idea to fix the issue here is to place all lock request 
> > > > > > > > > in order. In
> > > > > > > > > case of non F_SETLKW lock request (indicated if wait is set 
> > > > > > > > > or not) the
> > > > > > > > > lock requests are ordered inside the recv_list. If a result 
> > > > > > > > > comes back
> > > > > > > > > the right plock op can be found by the first plock_op in 
> > > > > > > > > recv_list which
> > > > > > > > > has not info.wait set. This can be done only by non F_SETLKW 
> > > > > > > > > plock ops as
> > > > > > > > > dlm_controld always reads a specific plock op 
> > > > > > > > > (list_move_tail() from
> > > > > > > > > send_list to recv_mlist) and write the result immediately 
> > > > > > > > > back.
> > > > > > > > >
> > > > > > > > > This behaviour is for F_SETLKW not possible as multiple 
> > > > > > > > > waiters can be
> > > > > > > > > get a result back in an random order. To avoid a collisions 
> > > > > > > > > in cases
> > > > > > > > > like [0] or [1] this patch adds more fields to compare the 
> > > > > > > > > plock
> > > > > > > > > operations as the lock request is the same. This is also 
> > > > > > > > > being made in
> > > > > > > > > NFS to find an result for an asynchronous F_SETLKW lock 
> > > > > > > > > request [2][3]. We
> > > > > > > > > still can't find the exact lock request for a specific result 
> > > > > > > > > if the
> > > > > > > > > lock request is the same, but if this is the case we don't 
> > > > > > > > > care the
> > > > > > > > > order how the identical lock requests get their result back 
> > > > > > > > > to grant the
> > > > > > > > > lock.
> > > > > > > >
> > > > > > > > When the recv_list contains multiple indistinguishable 
> > > > > > > > requests, this
> > > > > > > > can only be because they originated from multiple threads of 
> > > > > > > > the same
> > > > > > > > process. In that case, I agree that it doesn't matter which of 
> > > > > > > > those
> > > > > > > > requests we "c

Re: [Cluster-devel] [PATCH 5/6] gfs2: Support ludicrously large folios in gfs2_trans_add_databufs()

2023-06-03 Thread Andreas Gruenbacher
On Tue, May 23, 2023 at 3:37 PM Matthew Wilcox  wrote:
> On Tue, May 23, 2023 at 02:46:07PM +0200, Andreas Gruenbacher wrote:
> > >  void gfs2_trans_add_databufs(struct gfs2_inode *ip, struct folio *folio,
> > > -unsigned int from, unsigned int len)
> > > +size_t from, size_t len)
> > >  {
> > > struct buffer_head *head = folio_buffers(folio);
> > > unsigned int bsize = head->b_size;
> >
> > This only makes sense if the to, start, and end variables in
> > gfs2_trans_add_databufs() are changed from unsigned int to size_t as
> > well.
>
> The history of this patch is that I started doing conversions from page
> -> folio in gfs2, then you came out with a very similar series.  This
> patch is the remainder after rebasing my patches on yours.  So we can
> either drop this patch or just apply it.  I wasn't making a concerted
> effort to make gfs2 support 4GB+ sized folios, it's just part of the
> conversion that I do.

Right. What do we do with these patches now, though? We probably don't
want to put them in the gfs2 tree given the buffer.c changes. Shall I
post a revised version? Will you?

> > >  extern void gfs2_trans_add_databufs(struct gfs2_inode *ip, struct folio *folio,
> > > -   unsigned int from, unsigned int len);
> > > +   size_t from, size_t len);

Thanks,
Andreas



Re: [Cluster-devel] [PATCH 3/6] gfs2: Convert gfs2_write_jdata_page() to gfs2_write_jdata_folio()

2023-06-03 Thread Andreas Gruenbacher
Hi Willy,

thanks for these patches. This particular one looks problematic:

On Wed, May 17, 2023 at 5:24 AM Matthew Wilcox (Oracle)
 wrote:
> This function now supports large folios, even if nothing around it does.
>
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  fs/gfs2/aops.c | 27 ++-
>  1 file changed, 14 insertions(+), 13 deletions(-)
>
> diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
> index 749135252d52..0f92e3e117da 100644
> --- a/fs/gfs2/aops.c
> +++ b/fs/gfs2/aops.c
> @@ -82,33 +82,34 @@ static int gfs2_get_block_noalloc(struct inode *inode, sector_t lblock,
>  }
>
>  /**
> - * gfs2_write_jdata_page - gfs2 jdata-specific version of block_write_full_page
> - * @page: The page to write
> + * gfs2_write_jdata_folio - gfs2 jdata-specific version of block_write_full_page
> + * @folio: The folio to write
>   * @wbc: The writeback control
>   *
>   * This is the same as calling block_write_full_page, but it also
>   * writes pages outside of i_size
>   */
> -static int gfs2_write_jdata_page(struct page *page,
> +static int gfs2_write_jdata_folio(struct folio *folio,
>  struct writeback_control *wbc)
>  {
> -   struct inode * const inode = page->mapping->host;
> +   struct inode * const inode = folio->mapping->host;
> loff_t i_size = i_size_read(inode);
> -   const pgoff_t end_index = i_size >> PAGE_SHIFT;
> -   unsigned offset;
>
> +   if (folio_pos(folio) >= i_size)
> +   return 0;

Function gfs2_write_jdata_page was originally introduced as
gfs2_write_full_page in commit fd4c5748b8d3 ("gfs2: writeout truncated
pages") to allow writing pages even when they are beyond EOF, as the
function description documents.

This hack was added because simply skipping journaled pages isn't
enough on gfs2; before a journaled page can be freed, it needs to be
marked as "revoked" in the journal. Journal recovery will then skip
the revoked blocks, which allows them to be reused for regular,
non-journaled data. We can end up here in contexts in which we cannot
"revoke" pages, so instead, we write the original pages even when they
are beyond EOF. This hack could be revisited, but it's pretty nasty
code to pick apart.

So at least the above if needs to go for now.
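To make the constraint concrete, here is a small userspace model of the decision being discussed (plain C, not gfs2 code; the enum and function names are invented for illustration): a journaled folio entirely beyond EOF must still be written so its blocks can be revoked later, and only a folio that actually straddles i_size needs its tail zeroed.

```c
#include <assert.h>
#include <stdint.h>

enum jdata_action {
	WRITE_WHOLE,		/* folio entirely within i_size */
	WRITE_ZERO_TAIL,	/* folio straddles i_size: zero from i_size on */
	WRITE_BEYOND_EOF,	/* folio entirely beyond i_size: written anyway,
				 * unlike ordinary writeback, so the journaled
				 * blocks can be revoked and reused later */
};

/* Illustrative helper, not a gfs2 function: classify a folio at byte
 * offset pos of length size against the current inode size i_size. */
enum jdata_action jdata_writeback_action(uint64_t pos, uint64_t size,
					 uint64_t i_size)
{
	if (pos >= i_size)
		return WRITE_BEYOND_EOF;
	if (i_size < pos + size)
		return WRITE_ZERO_TAIL;
	return WRITE_WHOLE;
}
```

The first case is exactly what the `if (folio_pos(folio) >= i_size) return 0;` in the patch would skip, which is why it has to go.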

> /*
> -* The page straddles i_size.  It must be zeroed out on each and every
> +* The folio straddles i_size.  It must be zeroed out on each and 
> every
>  * writepage invocation because it may be mmapped.  "A file is mapped
>  * in multiples of the page size.  For a file that is not a multiple 
> of
> -* the  page size, the remaining memory is zeroed when mapped, and
> +* the page size, the remaining memory is zeroed when mapped, and
>  * writes to that region are not written out to the file."
>  */
> -   offset = i_size & (PAGE_SIZE - 1);
> -   if (page->index == end_index && offset)
> -   zero_user_segment(page, offset, PAGE_SIZE);
> +   if (i_size < folio_pos(folio) + folio_size(folio))
> +   folio_zero_segment(folio, offset_in_folio(folio, i_size),
> +   folio_size(folio));
>
> -   return __block_write_full_page(inode, page, gfs2_get_block_noalloc, wbc,
> +   return __block_write_full_page(inode, &folio->page,
> +  gfs2_get_block_noalloc, wbc,
>end_buffer_async_write);
>  }
>
> @@ -137,7 +138,7 @@ static int __gfs2_jdata_write_folio(struct folio *folio,
> }
> gfs2_trans_add_databufs(ip, folio, 0, folio_size(folio));
> }
> -   return gfs2_write_jdata_page(&folio->page, wbc);
> +   return gfs2_write_jdata_folio(folio, wbc);
>  }
>
>  /**
> --
> 2.39.2
>

Thanks,
Andreas



Re: [Cluster-devel] [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions

2023-06-01 Thread Andreas Gruenbacher
On Thu, Jun 1, 2023 at 6:28 PM Alexander Aring  wrote:
> Hi,
>
> On Tue, May 30, 2023 at 1:40 PM Andreas Gruenbacher  
> wrote:
> >
> > On Tue, May 30, 2023 at 4:08 PM Alexander Aring  wrote:
> > > Hi,
> > >
> > > On Tue, May 30, 2023 at 7:01 AM Andreas Gruenbacher  
> > > wrote:
> > > >
> > > > On Tue, May 30, 2023 at 12:19 AM Alexander Aring  
> > > > wrote:
> > > > > Hi,
> > > > >
> > > > > On Thu, May 25, 2023 at 11:02 AM Andreas Gruenbacher
> > > > >  wrote:
> > > > > >
> > > > > > On Wed, May 24, 2023 at 6:02 PM Alexander Aring 
> > > > > >  wrote:
> > > > > > > This patch fixes a possible plock op collisions when using 
> > > > > > > F_SETLKW lock
> > > > > > > requests and fsid, number and owner are not enough to identify a 
> > > > > > > result
> > > > > > > for a pending request. The ltp testcases [0] and [1] are examples 
> > > > > > > when
> > > > > > > this is not enough in case of using classic posix locks with 
> > > > > > > threads and
> > > > > > > open filedescriptor posix locks.
> > > > > > >
> > > > > > > The idea to fix the issue here is to place all lock request in 
> > > > > > > order. In
> > > > > > > case of non F_SETLKW lock request (indicated if wait is set or 
> > > > > > > not) the
> > > > > > > lock requests are ordered inside the recv_list. If a result comes 
> > > > > > > back
> > > > > > > the right plock op can be found by the first plock_op in 
> > > > > > > recv_list which
> > > > > > > has not info.wait set. This can be done only by non F_SETLKW 
> > > > > > > plock ops as
> > > > > > > dlm_controld always reads a specific plock op (list_move_tail() 
> > > > > > > from
> > > > > > > send_list to recv_mlist) and write the result immediately back.
> > > > > > >
> > > > > > > This behaviour is for F_SETLKW not possible as multiple waiters 
> > > > > > > can be
> > > > > > > get a result back in an random order. To avoid a collisions in 
> > > > > > > cases
> > > > > > > like [0] or [1] this patch adds more fields to compare the plock
> > > > > > > operations as the lock request is the same. This is also being 
> > > > > > > made in
> > > > > > > NFS to find an result for an asynchronous F_SETLKW lock request 
> > > > > > > [2][3]. We
> > > > > > > still can't find the exact lock request for a specific result if 
> > > > > > > the
> > > > > > > lock request is the same, but if this is the case we don't care 
> > > > > > > the
> > > > > > > order how the identical lock requests get their result back to 
> > > > > > > grant the
> > > > > > > lock.
> > > > > >
> > > > > > When the recv_list contains multiple indistinguishable requests, 
> > > > > > this
> > > > > > can only be because they originated from multiple threads of the 
> > > > > > same
> > > > > > process. In that case, I agree that it doesn't matter which of those
> > > > > > requests we "complete" in dev_write() as long as we only complete 
> > > > > > one
> > > > > > request. We do need to compare the additional request fields in
> > > > > > dev_write() to find a suitable request, so that makes sense as well.
> > > > > > We need to compare all of the fields that identify a request 
> > > > > > (optype,
> > > > > > ex, wait, pid, nodeid, fsid, number, start, end, owner) to find the
> > > > > > "right" request (or in case there is more than one identical 
> > > > > > request,
> > > > > > a "suitable" request).
> > > > > >
> > > > >
> > > > > In my "definition" why this works is as you said the "identical
> > > > > request". There is a more deeper definition of "when is a request
> 

Re: [Cluster-devel] [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions

2023-05-30 Thread Andreas Gruenbacher
On Tue, May 30, 2023 at 4:08 PM Alexander Aring  wrote:
> Hi,
>
> On Tue, May 30, 2023 at 7:01 AM Andreas Gruenbacher  
> wrote:
> >
> > On Tue, May 30, 2023 at 12:19 AM Alexander Aring  
> > wrote:
> > > Hi,
> > >
> > > On Thu, May 25, 2023 at 11:02 AM Andreas Gruenbacher
> > >  wrote:
> > > >
> > > > On Wed, May 24, 2023 at 6:02 PM Alexander Aring  
> > > > wrote:
> > > > > This patch fixes a possible plock op collisions when using F_SETLKW 
> > > > > lock
> > > > > requests and fsid, number and owner are not enough to identify a 
> > > > > result
> > > > > for a pending request. The ltp testcases [0] and [1] are examples when
> > > > > this is not enough in case of using classic posix locks with threads 
> > > > > and
> > > > > open filedescriptor posix locks.
> > > > >
> > > > > The idea to fix the issue here is to place all lock request in order. 
> > > > > In
> > > > > case of non F_SETLKW lock request (indicated if wait is set or not) 
> > > > > the
> > > > > lock requests are ordered inside the recv_list. If a result comes back
> > > > > the right plock op can be found by the first plock_op in recv_list 
> > > > > which
> > > > > has not info.wait set. This can be done only by non F_SETLKW plock 
> > > > > ops as
> > > > > dlm_controld always reads a specific plock op (list_move_tail() from
> > > > > send_list to recv_mlist) and write the result immediately back.
> > > > >
> > > > > This behaviour is for F_SETLKW not possible as multiple waiters can be
> > > > > get a result back in an random order. To avoid a collisions in cases
> > > > > like [0] or [1] this patch adds more fields to compare the plock
> > > > > operations as the lock request is the same. This is also being made in
> > > > > NFS to find an result for an asynchronous F_SETLKW lock request 
> > > > > [2][3]. We
> > > > > still can't find the exact lock request for a specific result if the
> > > > > lock request is the same, but if this is the case we don't care the
> > > > > order how the identical lock requests get their result back to grant 
> > > > > the
> > > > > lock.
> > > >
> > > > When the recv_list contains multiple indistinguishable requests, this
> > > > can only be because they originated from multiple threads of the same
> > > > process. In that case, I agree that it doesn't matter which of those
> > > > requests we "complete" in dev_write() as long as we only complete one
> > > > request. We do need to compare the additional request fields in
> > > > dev_write() to find a suitable request, so that makes sense as well.
> > > > We need to compare all of the fields that identify a request (optype,
> > > > ex, wait, pid, nodeid, fsid, number, start, end, owner) to find the
> > > > "right" request (or in case there is more than one identical request,
> > > > a "suitable" request).
> > > >
> > >
> > > In my "definition" why this works is as you said the "identical
> > > request". There is a more deeper definition of "when is a request
> > > identical" and in my opinion it is here as: "A request A is identical
> > > to request B when they get granted under the same 'time'" which is all
> > > the fields you mentioned.
> > >
> > > Even with cancellation (F_SETLKW only) it does not matter which
> > > "identical request" you cancel because the kernel and user
> > > (dlm_controld) makes no relation between a lock request instance. You
> > > need to have at least the same amount of "results" coming back from
> > > user space as the amount you are waiting for a result for the same
> > > "identical request".
> >
> > That's not incorrect per se, but cancellations create an additional
> > difficulty: they can either succeed or fail. To indicate that a
> > cancellation has succeeded, a new type of message can be introduced
> > (say, "CANCELLED"), and it's obvious that a CANCELLED message can only
> > belong to a locking request that is being cancelled. When cancelling a
> > locking request fails, the kernel will see a "locking request granted"
>

Re: [Cluster-devel] [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions

2023-05-30 Thread Andreas Gruenbacher
On Tue, May 30, 2023 at 12:19 AM Alexander Aring  wrote:
> Hi,
>
> On Thu, May 25, 2023 at 11:02 AM Andreas Gruenbacher
>  wrote:
> >
> > On Wed, May 24, 2023 at 6:02 PM Alexander Aring  wrote:
> > > This patch fixes a possible plock op collisions when using F_SETLKW lock
> > > requests and fsid, number and owner are not enough to identify a result
> > > for a pending request. The ltp testcases [0] and [1] are examples when
> > > this is not enough in case of using classic posix locks with threads and
> > > open filedescriptor posix locks.
> > >
> > > The idea to fix the issue here is to place all lock request in order. In
> > > case of non F_SETLKW lock request (indicated if wait is set or not) the
> > > lock requests are ordered inside the recv_list. If a result comes back
> > > the right plock op can be found by the first plock_op in recv_list which
> > > has not info.wait set. This can be done only by non F_SETLKW plock ops as
> > > dlm_controld always reads a specific plock op (list_move_tail() from
> > > send_list to recv_mlist) and write the result immediately back.
> > >
> > > This behaviour is for F_SETLKW not possible as multiple waiters can be
> > > get a result back in an random order. To avoid a collisions in cases
> > > like [0] or [1] this patch adds more fields to compare the plock
> > > operations as the lock request is the same. This is also being made in
> > > NFS to find an result for an asynchronous F_SETLKW lock request [2][3]. We
> > > still can't find the exact lock request for a specific result if the
> > > lock request is the same, but if this is the case we don't care the
> > > order how the identical lock requests get their result back to grant the
> > > lock.
> >
> > When the recv_list contains multiple indistinguishable requests, this
> > can only be because they originated from multiple threads of the same
> > process. In that case, I agree that it doesn't matter which of those
> > requests we "complete" in dev_write() as long as we only complete one
> > request. We do need to compare the additional request fields in
> > dev_write() to find a suitable request, so that makes sense as well.
> > We need to compare all of the fields that identify a request (optype,
> > ex, wait, pid, nodeid, fsid, number, start, end, owner) to find the
> > "right" request (or in case there is more than one identical request,
> > a "suitable" request).
> >
>
> In my "definition" why this works is as you said the "identical
> request". There is a more deeper definition of "when is a request
> identical" and in my opinion it is here as: "A request A is identical
> to request B when they get granted under the same 'time'" which is all
> the fields you mentioned.
>
> Even with cancellation (F_SETLKW only) it does not matter which
> "identical request" you cancel because the kernel and user
> (dlm_controld) makes no relation between a lock request instance. You
> need to have at least the same amount of "results" coming back from
> user space as the amount you are waiting for a result for the same
> "identical request".

That's not incorrect per se, but cancellations create an additional
difficulty: they can either succeed or fail. To indicate that a
cancellation has succeeded, a new type of message can be introduced
(say, "CANCELLED"), and it's obvious that a CANCELLED message can only
belong to a locking request that is being cancelled. When cancelling a
locking request fails, the kernel will see a "locking request granted"
message though, and when multiple identical locking requests are
queued and only some of them have been cancelled, it won't be obvious
which locking request a "locking request granted" message should be
assigned to anymore. You really don't want to mix things up in that
case.

This complication makes it a whole lot more difficult to reason about
the correctness of the code. All that complexity is avoidable by
sticking with a fixed mapping of requests and replies (i.e., a unique
request identifier).

To put it differently, you can shoot yourself in the foot and still
hop along on the other leg, but it may not be the best of all possible
ideas.

> > The above patch description doesn't match the code anymore, and the
> > code doesn't fully revert the recv_list splitting of the previous
> > version.
> >
>
> This isn't a revert. Is it a new patch version, I did drop the
> recv_setlkw_list here, dropping in means of removing the
> recv_setlkw_list and handling everything in the recv_lis

Re: [Cluster-devel] [PATCH 07/11] iomap: update ki_pos in iomap_file_buffered_write

2023-05-25 Thread Andreas Gruenbacher
On Wed, May 24, 2023 at 8:54 AM Christoph Hellwig  wrote:
> All callers of iomap_file_buffered_write need to updated ki_pos, move it
> into common code.

Thanks for this set of cleanups, especially for the patch killing
current->backing_dev_info.

Reviewed-by: Andreas Gruenbacher 

> Signed-off-by: Christoph Hellwig 
> Acked-by: Damien Le Moal 
> Reviewed-by: Darrick J. Wong 
> ---
>  fs/gfs2/file.c | 4 +---
>  fs/iomap/buffered-io.c | 9 ++---
>  fs/xfs/xfs_file.c  | 2 --
>  fs/zonefs/file.c   | 4 +---
>  4 files changed, 8 insertions(+), 11 deletions(-)
>
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index 904a0d6ac1a1a9..c6a7555d5ad8bb 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -1044,10 +1044,8 @@ static ssize_t gfs2_file_buffered_write(struct kiocb 
> *iocb,
> pagefault_disable();
> ret = iomap_file_buffered_write(iocb, from, _iomap_ops);
> pagefault_enable();
> -   if (ret > 0) {
> -   iocb->ki_pos += ret;
> +   if (ret > 0)
> written += ret;
> -   }
>
> if (inode == sdp->sd_rindex)
> gfs2_glock_dq_uninit(statfs_gh);
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 063133ec77f49e..550525a525c45c 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -864,16 +864,19 @@ iomap_file_buffered_write(struct kiocb *iocb, struct 
> iov_iter *i,
> .len= iov_iter_count(i),
> .flags  = IOMAP_WRITE,
> };
> -   int ret;
> +   ssize_t ret;
>
> if (iocb->ki_flags & IOCB_NOWAIT)
> iter.flags |= IOMAP_NOWAIT;
>
> while ((ret = iomap_iter(, ops)) > 0)
> iter.processed = iomap_write_iter(, i);
> -   if (iter.pos == iocb->ki_pos)
> +
> +   if (unlikely(ret < 0))
> return ret;
> -   return iter.pos - iocb->ki_pos;
> +   ret = iter.pos - iocb->ki_pos;
> +   iocb->ki_pos += ret;
> +   return ret;
>  }
>  EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 431c3fd0e2b598..d57443db633637 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -720,8 +720,6 @@ xfs_file_buffered_write(
> trace_xfs_file_buffered_write(iocb, from);
> ret = iomap_file_buffered_write(iocb, from,
> _buffered_write_iomap_ops);
> -   if (likely(ret >= 0))
> -   iocb->ki_pos += ret;
>
> /*
>  * If we hit a space limit, try to free up some lingering preallocated
> diff --git a/fs/zonefs/file.c b/fs/zonefs/file.c
> index 132f01d3461f14..e212d0636f848e 100644
> --- a/fs/zonefs/file.c
> +++ b/fs/zonefs/file.c
> @@ -643,9 +643,7 @@ static ssize_t zonefs_file_buffered_write(struct kiocb 
> *iocb,
> goto inode_unlock;
>
> ret = iomap_file_buffered_write(iocb, from, _write_iomap_ops);
> -   if (ret > 0)
> -   iocb->ki_pos += ret;
> -   else if (ret == -EIO)
> +   if (ret == -EIO)
> zonefs_io_error(inode, true);
>
>  inode_unlock:
> --
> 2.39.2
>



Re: [Cluster-devel] [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions

2023-05-25 Thread Andreas Gruenbacher
On Wed, May 24, 2023 at 6:02 PM Alexander Aring  wrote:
> This patch fixes a possible plock op collisions when using F_SETLKW lock
> requests and fsid, number and owner are not enough to identify a result
> for a pending request. The ltp testcases [0] and [1] are examples when
> this is not enough in case of using classic posix locks with threads and
> open filedescriptor posix locks.
>
> The idea to fix the issue here is to place all lock request in order. In
> case of non F_SETLKW lock request (indicated if wait is set or not) the
> lock requests are ordered inside the recv_list. If a result comes back
> the right plock op can be found by the first plock_op in recv_list which
> has not info.wait set. This can be done only by non F_SETLKW plock ops as
> dlm_controld always reads a specific plock op (list_move_tail() from
> send_list to recv_mlist) and write the result immediately back.
>
> This behaviour is for F_SETLKW not possible as multiple waiters can be
> get a result back in an random order. To avoid a collisions in cases
> like [0] or [1] this patch adds more fields to compare the plock
> operations as the lock request is the same. This is also being made in
> NFS to find an result for an asynchronous F_SETLKW lock request [2][3]. We
> still can't find the exact lock request for a specific result if the
> lock request is the same, but if this is the case we don't care the
> order how the identical lock requests get their result back to grant the
> lock.

When the recv_list contains multiple indistinguishable requests, this
can only be because they originated from multiple threads of the same
process. In that case, I agree that it doesn't matter which of those
requests we "complete" in dev_write() as long as we only complete one
request. We do need to compare the additional request fields in
dev_write() to find a suitable request, so that makes sense as well.
We need to compare all of the fields that identify a request (optype,
ex, wait, pid, nodeid, fsid, number, start, end, owner) to find the
"right" request (or in case there is more than one identical request,
a "suitable" request).

The above patch description doesn't match the code anymore, and the
code doesn't fully revert the recv_list splitting of the previous
version.

> [0] 
> https://gitlab.com/netcoder/ltp/-/blob/dlm_fcntl_owner_testcase/testcases/kernel/syscalls/fcntl/fcntl40.c
> [1] 
> https://gitlab.com/netcoder/ltp/-/blob/dlm_fcntl_owner_testcase/testcases/kernel/syscalls/fcntl/fcntl41.c
> [2] 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/lockd/lockd.h?h=v6.4-rc1#n373
> [3] 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svclock.c?h=v6.4-rc1#n731
>
> Cc: sta...@vger.kernel.org
> Signed-off-by: Alexander Aring 
> ---
> change since v2:
>  - don't split recv_list into recv_setlkw_list
>
>  fs/dlm/plock.c | 43 ++-
>  1 file changed, 30 insertions(+), 13 deletions(-)
>
> diff --git a/fs/dlm/plock.c b/fs/dlm/plock.c
> index 31bc601ee3d8..53d17dbbb716 100644
> --- a/fs/dlm/plock.c
> +++ b/fs/dlm/plock.c
> @@ -391,7 +391,7 @@ static ssize_t dev_read(struct file *file, char __user 
> *u, size_t count,
> if (op->info.flags & DLM_PLOCK_FL_CLOSE)
> list_del(>list);
> else
> -   list_move(>list, _list);
> +   list_move_tail(>list, _list);

^ This should be obsolete, but it won't hurt, either.

> memcpy(, >info, sizeof(info));
> }
> spin_unlock(_lock);
> @@ -430,19 +430,36 @@ static ssize_t dev_write(struct file *file, const char 
> __user *u, size_t count,
> return -EINVAL;
>
> spin_lock(_lock);
> -   list_for_each_entry(iter, _list, list) {
> -   if (iter->info.fsid == info.fsid &&
> -   iter->info.number == info.number &&
> -   iter->info.owner == info.owner) {
> -   list_del_init(>list);
> -   memcpy(>info, , sizeof(info));
> -   if (iter->data)
> -   do_callback = 1;
> -   else
> -   iter->done = 1;
> -   op = iter;
> -   break;
> +   if (info.wait) {

We should be able to use the same list_for_each_entry() loop for
F_SETLKW requests (which have info.wait set) as for all other requests
as far as I can see.

> +   list_for_each_entry(iter, _list, list) {
> +   if (iter->info.fsid == info.fsid &&
> +   iter->info.number == info.number &&
> +   iter->info.owner == info.owner &&
> +   iter->info.pid == info.pid &&
> +   iter->info.start == info.start &&
> +   iter->info.end == info.end &&
> +  

Re: [Cluster-devel] [PATCH 5/6] gfs2: Support ludicrously large folios in gfs2_trans_add_databufs()

2023-05-23 Thread Andreas Gruenbacher
On Wed, May 17, 2023 at 5:24 AM Matthew Wilcox (Oracle)
 wrote:
> We may someday support folios larger than 4GB, so use a size_t for
> the byte count within a folio to prevent unpleasant truncations.
>
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  fs/gfs2/aops.c | 2 +-
>  fs/gfs2/aops.h | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
> index e97462a5302e..8da4397aafc6 100644
> --- a/fs/gfs2/aops.c
> +++ b/fs/gfs2/aops.c
> @@ -38,7 +38,7 @@
>
>
>  void gfs2_trans_add_databufs(struct gfs2_inode *ip, struct folio *folio,
> -unsigned int from, unsigned int len)
> +size_t from, size_t len)
>  {
> struct buffer_head *head = folio_buffers(folio);
> unsigned int bsize = head->b_size;

This only makes sense if the to, start, and end variables in
gfs2_trans_add_databufs() are changed from unsigned int to size_t as
well.

> diff --git a/fs/gfs2/aops.h b/fs/gfs2/aops.h
> index 09db1914425e..f08322ef41cf 100644
> --- a/fs/gfs2/aops.h
> +++ b/fs/gfs2/aops.h
> @@ -10,6 +10,6 @@
>
>  extern void adjust_fs_space(struct inode *inode);
>  extern void gfs2_trans_add_databufs(struct gfs2_inode *ip, struct folio 
> *folio,
> -   unsigned int from, unsigned int len);
> +   size_t from, size_t len);
>
>  #endif /* __AOPS_DOT_H__ */
> --
> 2.39.2
>

Thanks,
Andreas



[Cluster-devel] [GIT PULL] gfs2: Don't deref jdesc in evict

2023-05-10 Thread Andreas Gruenbacher
Hello Linus,

could you please pull the following gfs2 fix for 6.4?

Thanks,
Andreas

The following changes since commit e0fcc9c68d1147ca33159d57332b02ca8bac6ab9:

  Merge tag 'gfs2-v6.3-rc3-fixes' of 
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 (2023-04-26 
09:28:15 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git 
tags/gfs2-v6.3-fix

for you to fetch changes up to 504a10d9e46bc37b23d0a1ae2f28973c8516e636:

  gfs2: Don't deref jdesc in evict (2023-05-10 17:15:18 +0200)


gfs2 fix

- Fix a NULL pointer dereference when mounting corrupted filesystems.


Bob Peterson (1):
  gfs2: Don't deref jdesc in evict

 fs/gfs2/super.c | 8 
 1 file changed, 8 insertions(+)



[Cluster-devel] [PATCH] dlm_controld: remove old build workaround

2023-05-03 Thread Andreas Gruenbacher
Remove the old build workaround from 2011, when DLM_PLOCK_FL_CLOSE
wasn't always defined in .

Signed-off-by: Andreas Gruenbacher 
---
 dlm_controld/plock.c | 20 +---
 1 file changed, 1 insertion(+), 19 deletions(-)

diff --git a/dlm_controld/plock.c b/dlm_controld/plock.c
index 7f632888..a91aecb0 100644
--- a/dlm_controld/plock.c
+++ b/dlm_controld/plock.c
@@ -9,15 +9,6 @@
 #include "dlm_daemon.h"
 #include 
 
-/* FIXME: remove this once everyone is using the version of
- * dlm_plock.h which defines it */
-
-#ifndef DLM_PLOCK_FL_CLOSE
-#warning DLM_PLOCK_FL_CLOSE undefined. Enabling build workaround.
-#define DLM_PLOCK_FL_CLOSE 1
-#define DLM_PLOCK_BUILD_WORKAROUND 1
-#endif
-
 static uint32_t plock_read_count;
 static uint32_t plock_recv_count;
 static uint32_t plock_rate_delays;
@@ -757,11 +748,7 @@ static void do_unlock(struct lockspace *ls, struct 
dlm_plock_info *in,
 
rv = unlock_internal(ls, r, in);
 
-#ifdef DLM_PLOCK_BUILD_WORKAROUND
-   if (in->pad & DLM_PLOCK_FL_CLOSE) {
-#else
if (in->flags & DLM_PLOCK_FL_CLOSE) {
-#endif
clear_waiters(ls, r, in);
/* no replies for unlock-close ops */
goto skip_result;
@@ -1595,13 +1582,8 @@ void process_plocks(int ci)
return;
 
  fail:
-#ifdef DLM_PLOCK_BUILD_WORKAROUND
-   if (!(info.pad & DLM_PLOCK_FL_CLOSE)) {
-#else
-   if (!(info.flags & DLM_PLOCK_FL_CLOSE)) {
-#endif
+   if (!(info.flags & DLM_PLOCK_FL_CLOSE))
write_result(, rv);
-   }
 }
 
 void process_saved_plocks(struct lockspace *ls)
-- 
2.40.0



[Cluster-devel] [GIT PULL] gfs2 fixes for 6.4

2023-04-25 Thread Andreas Gruenbacher
Hi Linus,

please consider pulling the following gfs2 fixes.

Thanks a lot,
Andreas

The following changes since commit 1e760fa3596e8c7f08412712c168288b79670d78:

  Merge tag 'gfs2-v6.3-rc3-fix' of 
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 (2023-03-23 
15:25:49 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git 
tags/gfs2-v6.3-rc3-fixes

for you to fetch changes up to 644f6bf762fa903f64c59c2ec0f4d0d753527053:

  gfs2: gfs2_ail_empty_gl no log flush on error (2023-04-25 11:07:16 +0200)


gfs2 fixes

- Fix revoke processing at unmount and on read-only remount.

- Refuse reading in inodes with an impossible indirect block height.

- Various minor cleanups.


Andreas Gruenbacher (1):
  gfs2: Fix inode height consistency check

Andrew Price (3):
  gfs2: Remove duplicate i_nlink check from gfs2_link()
  gfs2: Remove ghs[] from gfs2_link
  gfs2: Remove ghs[] from gfs2_unlink

Bob Peterson (6):
  gfs2: Eliminate gfs2_trim_blocks
  gfs2: Use gfs2_holder_initialized for jindex
  gfs2: return errors from gfs2_ail_empty_gl
  gfs2: Perform second log flush in gfs2_make_fs_ro
  gfs2: Issue message when revokes cannot be written
  gfs2: gfs2_ail_empty_gl no log flush on error

Markus Elfring (1):
  gfs2: Move variable assignment behind a null pointer check in 
inode_go_dump

 fs/gfs2/bmap.c   |  8 
 fs/gfs2/bmap.h   |  1 -
 fs/gfs2/glops.c  | 23 +++
 fs/gfs2/inode.c  | 47 ++-
 fs/gfs2/ops_fstype.c |  9 +++--
 fs/gfs2/super.c  |  9 +
 6 files changed, 49 insertions(+), 48 deletions(-)



Re: [Cluster-devel] [PATCH 05/17] filemap: update ki_pos in generic_perform_write

2023-04-24 Thread Andreas Gruenbacher
On Mon, Apr 24, 2023 at 8:22 AM Christoph Hellwig  wrote:
> All callers of generic_perform_write need to updated ki_pos, move it into
> common code.

We've actually got a similar situation with
iomap_file_buffered_write() and its callers. Would it make sense to
fix that up as well?

> Signed-off-by: Christoph Hellwig 
> ---
>  fs/ceph/file.c | 2 --
>  fs/ext4/file.c | 9 +++--
>  fs/f2fs/file.c | 1 -
>  fs/nfs/file.c  | 1 -
>  mm/filemap.c   | 8 
>  5 files changed, 7 insertions(+), 14 deletions(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index f4d8bf7dec88a8..feeb9882ef635a 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1894,8 +1894,6 @@ static ssize_t ceph_write_iter(struct kiocb *iocb, 
> struct iov_iter *from)
>  * can not run at the same time
>  */
> written = generic_perform_write(iocb, from);
> -   if (likely(written >= 0))
> -   iocb->ki_pos = pos + written;
> ceph_end_io_write(inode);
> }
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 0b8b4499e5ca18..1026acaf1235a0 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -291,12 +291,9 @@ static ssize_t ext4_buffered_write_iter(struct kiocb 
> *iocb,
>
>  out:
> inode_unlock(inode);
> -   if (likely(ret > 0)) {
> -   iocb->ki_pos += ret;
> -   ret = generic_write_sync(iocb, ret);
> -   }
> -
> -   return ret;
> +   if (unlikely(ret <= 0))
> +   return ret;
> +   return generic_write_sync(iocb, ret);
>  }
>
>  static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t 
> offset,
> diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> index f4ab23efcf85f8..5a9ae054b6da7d 100644
> --- a/fs/f2fs/file.c
> +++ b/fs/f2fs/file.c
> @@ -4511,7 +4511,6 @@ static ssize_t f2fs_buffered_write_iter(struct kiocb 
> *iocb,
> current->backing_dev_info = NULL;
>
> if (ret > 0) {
> -   iocb->ki_pos += ret;
> f2fs_update_iostat(F2FS_I_SB(inode), inode,
> APP_BUFFERED_IO, ret);
> }
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index 893625eacab9fa..abdae2b29369be 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -666,7 +666,6 @@ ssize_t nfs_file_write(struct kiocb *iocb, struct 
> iov_iter *from)
> goto out;
>
> written = result;
> -   iocb->ki_pos += written;
> nfs_add_stats(inode, NFSIOS_NORMALWRITTENBYTES, written);
>
> if (mntflags & NFS_MOUNT_WRITE_EAGER) {
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 2723104cc06a12..0110bde3708b3f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3960,7 +3960,10 @@ ssize_t generic_perform_write(struct kiocb *iocb, 
> struct iov_iter *i)
> balance_dirty_pages_ratelimited(mapping);
> } while (iov_iter_count(i));
>
> -   return written ? written : status;
> +   if (!written)
> +   return status;
> +   iocb->ki_pos += written;

Could be turned into:
iocb->ki_pos = pos;

> +   return written;
>  }
>  EXPORT_SYMBOL(generic_perform_write);
>
> @@ -4039,7 +4042,6 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, 
> struct iov_iter *from)
> endbyte = pos + status - 1;
> err = filemap_write_and_wait_range(mapping, pos, endbyte);
> if (err == 0) {
> -   iocb->ki_pos = endbyte + 1;
> written += status;
> invalidate_mapping_pages(mapping,
>  pos >> PAGE_SHIFT,
> @@ -4052,8 +4054,6 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, 
> struct iov_iter *from)
> }
> } else {
> written = generic_perform_write(iocb, from);
> -   if (likely(written > 0))
> -   iocb->ki_pos += written;
> }
>  out:
> current->backing_dev_info = NULL;
> --
> 2.39.2
>

Thanks,
Andreas



Re: [Cluster-devel] [GFS2 PATCH 2/4] gfs2: Perform second log flush in gfs2_make_fs_ro

2023-04-24 Thread Andreas Gruenbacher
On Fri, Apr 21, 2023 at 9:07 PM Bob Peterson  wrote:
> Before this patch function gfs2_make_fs_ro called gfs2_log_flush once to
> finalize the log. However, if there's dirty metadata, log flushes tend
> to sync the metadata and formulate revokes. Before this patch, those
> revokes may not be written out to the journal immediately, which meant
> unresolved glocks could still have revokes in their ail lists. When the
> glock worker runs, it tries to transition the glock, but the unresolved
> revokes in the ail still need to be written, so it tries to start a
> transaction. It's impossible to start a transaction because at that
> point the GFS2_LOG_HEAD_FLUSH_SHUTDOWN has been set by gfs2_make_fs_ro.

Do you mean that at that point, the SDF_JOURNAL_LIVE flag has already
been cleared?

> That cause the glock worker to get an error, and unable to write the
> revokes. The calling sequence looked something like this:
>
> gfs2_make_fs_ro
>gfs2_log_flush - with GFS2_LOG_HEAD_FLUSH_SHUTDOWN flag set
> if (flags & GFS2_LOG_HEAD_FLUSH_SHUTDOWN)
> clear_bit(SDF_JOURNAL_LIVE, >sd_flags);
> ...meanwhile...
> glock_work_func
>do_xmote
>   rgrp_go_sync (or possibly inode_go_sync)
>  ...
>  gfs2_ail_empty_gl
> __gfs2_trans_begin
>if (unlikely(!test_bit(SDF_JOURNAL_LIVE, >sd_flags))) {
>...
>   return -EROFS;
>
> The previous patch in the series ("gfs2: return errors from
> gfs2_ail_empty_gl") now causes the transaction error to no longer be
> ignored, so it causes a warning from MOST of the xfstests:
>
> WARNING: CPU: 11 PID: X at fs/gfs2/super.c:603 gfs2_put_super [gfs2]
>
> which corresponds to:
>
> WARN_ON(gfs2_withdrawing(sdp));
>
> The withdraw was triggered silently from do_xmote by:
>
> if (unlikely(sdp->sd_log_error && !gfs2_withdrawn(sdp)))
> gfs2_withdraw_delayed(sdp);
>
> This patch adds a second log_flush to gfs2_make_fs_ro: one to sync the
> data and one to sync any outstanding revokes and finalize the journal.
> Note that both of these log flushes need to be "special," in other
> words, not GFS2_LOG_HEAD_FLUSH_NORMAL.
>
> Signed-off-by: Bob Peterson 
> ---
>  fs/gfs2/super.c | 9 +
>  1 file changed, 9 insertions(+)
>
> diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
> index a83fa62106f0..5eed8c237500 100644
> --- a/fs/gfs2/super.c
> +++ b/fs/gfs2/super.c
> @@ -552,6 +552,15 @@ void gfs2_make_fs_ro(struct gfs2_sbd *sdp)
> gfs2_quota_sync(sdp->sd_vfs, 0);
> gfs2_statfs_sync(sdp->sd_vfs, 0);
>
> +   /* We do two log flushes here. The first one commits dirty 
> inodes
> +* and rgrps to the journal, but queues up revokes to the ail 
> list.
> +* The second flush writes out and removes the revokes.
> +*
> +* The first must be done before the FLUSH_SHUTDOWN code
> +* clears the LIVE flag, otherwise it will not be able to 
> start
> +* a transaction to write its revokes, and the error will 
> cause
> +* a withdraw of the file system. */
> +   gfs2_log_flush(sdp, NULL, GFS2_LFC_MAKE_FS_RO);
> gfs2_log_flush(sdp, NULL, GFS2_LOG_HEAD_FLUSH_SHUTDOWN |
>GFS2_LFC_MAKE_FS_RO);
> wait_event_timeout(sdp->sd_log_waitq,
> --
> 2.39.2
>

Thanks,
Andreas



Re: [Cluster-devel] [GFS2 PATCH 1/4] gfs2: return errors from gfs2_ail_empty_gl

2023-04-24 Thread Andreas Gruenbacher
On Fri, Apr 21, 2023 at 9:07 PM Bob Peterson  wrote:
> Before this patch, function gfs2_ail_empty_gl did not return errors it
> encountered from __gfs2_trans_begin. Those errors usually came from the
> fact that the file system was made read-only, often due to unmount,
> (but theoretically could be due to -o remount,ro) which prevented
> the transaction from starting.
>
> The inability to start a transaction prevented its revokes from being
> properly written to the journal for glocks during unmount (and
> transition to ro).
>
> That meant glocks could be unlocked without the metadata properly
> revoked in the journal. So other nodes could grab the glock thinking
> that their lvb values were correct, but in fact corresponded to the
> glock without its revokes properly synced. That presented as lvb
> mismatch errors.
>
> This patch allows gfs2_ail_empty_gl to return the error properly to
> the caller.

Alright,

> Signed-off-by: Bob Peterson 
> ---
>  fs/gfs2/glops.c | 16 +---
>  1 file changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
> index 6e33c8058059..8e245d793e6b 100644
> --- a/fs/gfs2/glops.c
> +++ b/fs/gfs2/glops.c
> @@ -90,7 +90,7 @@ static int gfs2_ail_empty_gl(struct gfs2_glock *gl)
> struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
> struct gfs2_trans tr;
> unsigned int revokes;
> -   int ret;
> +   int ret = 0;
>
> revokes = atomic_read(>gl_ail_count);
>
> @@ -124,15 +124,15 @@ static int gfs2_ail_empty_gl(struct gfs2_glock *gl)
> memset(, 0, sizeof(tr));
> set_bit(TR_ONSTACK, _flags);
> ret = __gfs2_trans_begin(, sdp, 0, revokes, _RET_IP_);
> -   if (ret)
> -   goto flush;
> -   __gfs2_ail_flush(gl, 0, revokes);
> -   gfs2_trans_end(sdp);
> +   if (!ret) {
> +   __gfs2_ail_flush(gl, 0, revokes);
> +   gfs2_trans_end(sdp);
> +   }

but the above hunk doesn't help, so let me skip that.

>  flush:
> gfs2_log_flush(sdp, NULL, GFS2_LOG_HEAD_FLUSH_NORMAL |
>GFS2_LFC_AIL_EMPTY_GL);
> -   return 0;
> +   return ret;
>  }
>
>  void gfs2_ail_flush(struct gfs2_glock *gl, bool fsync)
> @@ -326,7 +326,9 @@ static int inode_go_sync(struct gfs2_glock *gl)
> ret = gfs2_inode_metasync(gl);
> if (!error)
> error = ret;
> -   gfs2_ail_empty_gl(gl);
> +   ret = gfs2_ail_empty_gl(gl);
> +   if (!error)
> +   error = ret;
> /*
>  * Writeback of the data mapping may cause the dirty flag to be set
>  * so we have to clear it again here.
> --
> 2.39.2
>

Thanks,
Andreas



Re: [Cluster-devel] [PATCH v3 11/19] gfs: use __bio_add_page for adding single page to bio

2023-04-19 Thread Andreas Gruenbacher
On Wed, Apr 19, 2023 at 4:10 PM Johannes Thumshirn  wrote:
>
> From: Johannes Thumshirn 
>
> The GFS superblock reading code uses bio_add_page() to add a page to a
> newly created bio. bio_add_page() can fail, but the return value is never
> checked.

It's GFS2, not GFS, but otherwise this is obviously fine, thanks.

> Use __bio_add_page() as adding a single page to a newly created bio is
> guaranteed to succeed.
>
> This brings us a step closer to marking bio_add_page() as __must_check.
>
> Signed-off-by: Johannes Thumshirn 
> Reviewed-by: Damien Le Moal 

Reviewed-by: Andreas Gruenbacher 

> ---
>  fs/gfs2/ops_fstype.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
> index 6de901c3b89b..e0cd0d43b12f 100644
> --- a/fs/gfs2/ops_fstype.c
> +++ b/fs/gfs2/ops_fstype.c
> @@ -254,7 +254,7 @@ static int gfs2_read_super(struct gfs2_sbd *sdp, sector_t 
> sector, int silent)
>
> bio = bio_alloc(sb->s_bdev, 1, REQ_OP_READ | REQ_META, GFP_NOFS);
> bio->bi_iter.bi_sector = sector * (sb->s_blocksize >> 9);
> -   bio_add_page(bio, page, PAGE_SIZE, 0);
> +   __bio_add_page(bio, page, PAGE_SIZE, 0);
>
> bio->bi_end_io = end_bio_io_page;
> bio->bi_private = page;
> --
> 2.39.2
>



Re: [Cluster-devel] [PATCH] gfs2: Move a variable assignment behind a null pointer check in inode_go_dump()

2023-04-18 Thread Andreas Gruenbacher
Hi Markus,

On Thu, Apr 13, 2023 at 9:23 PM Markus Elfring  wrote:
> Date: Thu, 13 Apr 2023 20:54:30 +0200
>
> The address of a data structure member was determined before
> a corresponding null pointer check in the implementation of
> the function “inode_go_dump”.
>
> Thus avoid the risk for undefined behaviour by moving the assignment
> for the variable “inode” behind the null pointer check.
>
> This issue was detected by using the Coccinelle software.
>
> Fixes: 27a2660f1ef944724956d92e8a312b6da0936fae ("gfs2: Dump nrpages for 
> inodes and their glocks")

Okay, that's a worthwhile cleanup. It doesn't actually fix a bug, so
I'm not going to add a Fixes tag, though.

> Signed-off-by: Markus Elfring 
> ---
>  fs/gfs2/glops.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
> index b65950e76be5..6e33c8058059 100644
> --- a/fs/gfs2/glops.c
> +++ b/fs/gfs2/glops.c
> @@ -535,12 +535,13 @@ static void inode_go_dump(struct seq_file *seq, struct 
> gfs2_glock *gl,
>   const char *fs_id_buf)
>  {
> struct gfs2_inode *ip = gl->gl_object;
> -   struct inode *inode = >i_inode;
> +   struct inode *inode;
> unsigned long nrpages;
>
> if (ip == NULL)
> return;
>
> +   inode = >i_inode;
> xa_lock_irq(>i_data.i_pages);
> nrpages = inode->i_data.nrpages;
> xa_unlock_irq(>i_data.i_pages);
> --
> 2.40.0
>

Thanks,
Andreas



Re: [Cluster-devel] [PATCH] gfs2 FS: Fix UBSAN array-index-out-of-bounds in __gfs2_iomap_get

2023-03-27 Thread Andreas Gruenbacher
Hello Ivan,

On Wed, Mar 15, 2023 at 10:06 AM Ivan Orlov  wrote:
> Syzkaller reported the following issue:
>
> UBSAN: array-index-out-of-bounds in fs/gfs2/bmap.c:901:46
> index 11 is out of range for type 'u64 [11]'
> CPU: 0 PID: 5067 Comm: syz-executor164 Not tainted 
> 6.1.0-syzkaller-13031-g77856d911a8c #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 10/26/2022
> Call Trace:
>  
>  __dump_stack lib/dump_stack.c:88 [inline]
>  dump_stack_lvl+0x1b1/0x290 lib/dump_stack.c:106
>  ubsan_epilogue lib/ubsan.c:151 [inline]
>  __ubsan_handle_out_of_bounds+0xe0/0x110 lib/ubsan.c:282
>  __gfs2_iomap_get+0x4a4/0x16e0 fs/gfs2/bmap.c:901
>  gfs2_iomap_get fs/gfs2/bmap.c:1399 [inline]
>  gfs2_block_map+0x28f/0x7f0 fs/gfs2/bmap.c:1214
>  gfs2_write_alloc_required+0x441/0x6e0 fs/gfs2/bmap.c:2322
>  gfs2_jdesc_check+0x1b9/0x290 fs/gfs2/super.c:114
>  init_journal+0x5a4/0x22c0 fs/gfs2/ops_fstype.c:804
>  init_inodes+0xdc/0x340 fs/gfs2/ops_fstype.c:889
>  gfs2_fill_super+0x1bb2/0x2700 fs/gfs2/ops_fstype.c:1247
>  get_tree_bdev+0x400/0x620 fs/super.c:1282
>  gfs2_get_tree+0x50/0x210 fs/gfs2/ops_fstype.c:1330
>  vfs_get_tree+0x88/0x270 fs/super.c:1489
>  do_new_mount+0x289/0xad0 fs/namespace.c:3145
>  do_mount fs/namespace.c:3488 [inline]
>  __do_sys_mount fs/namespace.c:3697 [inline]
>  __se_sys_mount+0x2d3/0x3c0 fs/namespace.c:3674
>  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>  do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
>  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> RIP: 0033:0x7f2c63567aca
> Code: 83 c4 08 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 c3 66 2e 0f 1f 84 00 00 
> 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 
> 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:7ffd0e3a28d8 EFLAGS: 0282 ORIG_RAX: 00a5
> RAX: ffda RBX: 0003 RCX: 7f2c63567aca
> RDX: 20037f40 RSI: 20037f80 RDI: 7ffd0e3a28e0
> RBP: 7ffd0e3a28e0 R08: 7ffd0e3a2920 R09: 00043350
> R10: 0211 R11: 0282 R12: 0004
> R13: 567192c0 R14: 7ffd0e3a2920 R15: 
>  
>
> This issue is caused by the 'while' loop in the '__gfs2_iomap_get' function,
> which increments the 'height' variable until it matches the condition. If
> 'height' is larger than or equal to the size of the 'sd_heightsize' array
> (which is GFS2_MAX_META_HEIGHT + 1), the array index goes out of bounds.
> Therefore we need to add an extra condition to the while loop to prevent
> this issue.
>
> Additionally, if the 'height' variable equals GFS2_MAX_META_HEIGHT after
> the loop ends, the 'find_metapath' call right after the loop will break,
> because it assumes that its 'height' parameter is never larger than the
> length of metapath's 'mp_list' array, which is GFS2_MAX_META_HEIGHT. So we
> need to check 'height' after the loop ends, and if its value is too big,
> stop executing the function and return a proper error.

Thanks for the patch, but the actual problem here is that we're
starting out with an invalid height; when starting with a valid
height, the loop will always terminate. Here's a proper fix:

https://listman.redhat.com/archives/cluster-devel/2023-March/023644.html

While looking into this, I had difficulties getting repro.c to work.
The reason is that it always uses /dev/loop0 instead of creating its
own loop device. Here's a partial fix for that (but note that this
doesn't include the necessary cleanup code at the end):

--- a/repro.c
+++ b/repro.c
@@ -26,8 +26,6 @@
 #define __NR_memfd_create 319
 #endif

-static unsigned long long procid;
-
 //% This code is derived from puff.{c,h}, found in the zlib development. The
 //% original files come with the following copyright notice:

@@ -423,8 +421,15 @@ static long syz_mount_image(volatile long fsarg,
volatile long dir,
   char* source = NULL;
   char loopname[64];
   if (need_loop_device) {
+int loopctlfd;
+long devnr;
+
+loopctlfd = open("/dev/loop-control", O_RDWR);
+devnr = ioctl(loopctlfd, LOOP_CTL_GET_FREE);
+close(loopctlfd);
+
 memset(loopname, 0, sizeof(loopname));
-snprintf(loopname, sizeof(loopname), "/dev/loop%llu", procid);
+snprintf(loopname, sizeof(loopname), "/dev/loop%lu", devnr);
 if (setup_loop_device(data, size, loopname, &loopfd) == -1)
   return -1;
 source = loopname;

Thanks,
Andreas

> Tested via syzbot.
>
> Reported-by: syzbot+45d4691b1ed3c48eb...@syzkaller.appspotmail.com
> Link: 
> https://syzkaller.appspot.com/bug?id=42296ea544c870f4dde3b25efa4cc1b89515d38e
> Signed-off-by: Ivan Orlov 
> ---
>  fs/gfs2/bmap.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
> index eedf6926c652..9b7358165f96 100644
> --- a/fs/gfs2/bmap.c
> +++ b/fs/gfs2/bmap.c
> @@ -895,8 +895,16 @@ static int __gfs2_iomap_get(struct inode *inode, 

[Cluster-devel] [PATCH] gfs2: Fix inode height consistency check

2023-03-27 Thread Andreas Gruenbacher
The maximum allowed height of an inode's metadata tree depends on the
filesystem block size; it is lower for bigger-block filesystems.  When
reading in an inode, make sure that the height doesn't exceed the
maximum allowed height.

Arrays like sd_heightsize are sized to be big enough for any filesystem
block size; they will often be slightly bigger than what's needed for a
specific filesystem.

Reported-by: syzbot+45d4691b1ed3c48eb...@syzkaller.appspotmail.com
Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/glops.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index 4d99cc77a29b..b65950e76be5 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -396,6 +396,7 @@ static int inode_go_demote_ok(const struct gfs2_glock *gl)
 
 static int gfs2_dinode_in(struct gfs2_inode *ip, const void *buf)
 {
> +   struct gfs2_sbd *sdp = GFS2_SB(&ip->i_inode);
const struct gfs2_dinode *str = buf;
struct timespec64 atime;
u16 height, depth;
@@ -442,7 +443,7 @@ static int gfs2_dinode_in(struct gfs2_inode *ip, const void 
*buf)
/* i_diskflags and i_eattr must be set before gfs2_set_inode_flags() */
gfs2_set_inode_flags(inode);
height = be16_to_cpu(str->di_height);
-   if (unlikely(height > GFS2_MAX_META_HEIGHT))
+   if (unlikely(height > sdp->sd_max_height))
goto corrupt;
ip->i_height = (u8)height;
 
-- 
2.39.2



Re: [Cluster-devel] [PATCH 0/3] gfs2_(un)link cleanups

2023-03-27 Thread Andreas Gruenbacher
On Tue, Mar 14, 2023 at 2:18 PM Andrew Price  wrote:
> Some trivial cleanups from my O_TMPFILE branch. That work isn't ready
> yet but there was no reason not to send these patches.

Applied, thanks.

Andreas



[Cluster-devel] Andy gfs2 patches?

2023-03-23 Thread Andreas Gruenbacher


[Cluster-devel] [GIT PULL] gfs2 fix for v6.3-rc4

2023-03-23 Thread Andreas Gruenbacher
From: Linus Torvalds 

Hi Linus,

please consider pulling the following fix.

Thanks,
Andreas

The following changes since commit e8d018dd0257f744ca50a729e3d042cf2ec9da65:

  Linux 6.3-rc3 (2023-03-19 13:27:55 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git 
gfs2-v6.3-rc3-fix

for you to fetch changes up to 260595b439776c473cc248f0de63fe78d964d849:

  Reinstate "GFS2: free disk inode which is deleted by remote node -V2" 
(2023-03-23 19:37:56 +0100)


gfs2 fix

- Reinstate commit 970343cd4904 ("GFS2: free disk inode which is deleted
  by remote node -V2") as reverting that commit could cause
  gfs2_put_super() to hang.


Bob Peterson (1):
  Reinstate "GFS2: free disk inode which is deleted by remote node -V2"

 fs/gfs2/dentry.c | 18 ++
 1 file changed, 18 insertions(+)



Re: [Cluster-devel] [PATCH] dlm_controld: set posix_lock flags to zero

2023-03-21 Thread Andreas Gruenbacher
On Tue, Mar 21, 2023 at 2:17 PM Alexander Aring  wrote:
> This patch sets another flags variable to zero which is a leftover of
> commit 0834ed4b ("dlm_controld: initialize waiter->flags").
> ---
>  dlm_controld/plock.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/dlm_controld/plock.c b/dlm_controld/plock.c
> index ad9b0f27..7f632888 100644
> --- a/dlm_controld/plock.c
> +++ b/dlm_controld/plock.c
> @@ -1956,6 +1956,7 @@ void receive_plocks_data(struct lockspace *ls, struct 
> dlm_header *hd, int len)
> po->pid = le32_to_cpu(pp->pid);
> po->nodeid  = le32_to_cpu(pp->nodeid);
> po->ex  = pp->ex;
> +   po->flags   = 0;
> list_add_tail(&po->list, &r->locks);
> } else {
> w = malloc(sizeof(struct lock_waiter));

^ And I think w->flags also isn't being initialized.

Andreas



Re: [Cluster-devel] [PATCH 1/4] fs: add i_blocksize_mask()

2023-03-09 Thread Andreas Gruenbacher
On Thu, Mar 9, 2023 at 10:43 AM Yangtao Li  wrote:
> Introduce i_blocksize_mask() to simplify code, which replace
> (i_blocksize(node) - 1). Like done in commit
> 93407472a21b("fs: add i_blocksize()").

I would call this i_blockmask().

Note that it could be used in ocfs2_is_io_unaligned() as well.

>
> Signed-off-by: Yangtao Li 
> ---
>  include/linux/fs.h | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index c85916e9f7db..db335bd9c256 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -711,6 +711,11 @@ static inline unsigned int i_blocksize(const struct 
> inode *node)
> return (1 << node->i_blkbits);
>  }
>
> +static inline unsigned int i_blocksize_mask(const struct inode *node)
> +{
> +   return i_blocksize(node) - 1;
> +}
> +
>  static inline int inode_unhashed(struct inode *inode)
>  {
> return hlist_unhashed(&inode->i_hash);
> --
> 2.25.1
>

Thanks,
Andreas



Re: [Cluster-devel] return an ERR_PTR from __filemap_get_folio v3

2023-03-07 Thread Andreas Gruenbacher
On Tue, Mar 7, 2023 at 4:07 PM Christoph Hellwig  wrote:
> __filemap_get_folio and its wrappers can return NULL for three different
> conditions, which in some cases requires the caller to reverse engineer
> the decision making.  This is fixed by returning an ERR_PTR instead of
> NULL and thus transporting the reason for the failure.  But to make
> that work we first need to ensure that no xa_value special case is
> returned and thus retire the FGP_ENTRY flag.  It turns out that flag
> is barely used and can usually be dealt with in a better way.

Thanks for working on this cleanup; looking good at a first glance.

Andreas



Re: [Cluster-devel] [PATCH dlm-tool 13/14] dlm_controld: plock log lock state

2023-03-03 Thread Andreas Gruenbacher
On Fri, Mar 3, 2023 at 4:53 PM Andreas Gruenbacher  wrote:
> diff --git a/dlm_controld/plock.c b/dlm_controld/plock.c
> index 39bdd1f6..588bcaaa 100644
> --- a/dlm_controld/plock.c
> +++ b/dlm_controld/plock.c
> @@ -8,6 +8,7 @@
>
>  #include "dlm_daemon.h"
>  #include 
> +#include 
>
>  /* FIXME: remove this once everyone is using the version of
>   * dlm_plock.h which defines it */
> @@ -211,6 +212,11 @@ static uint64_t dt_usec(const struct timeval *start, 
> const struct timeval *stop)
>  static void plock_print_start_waiter(const struct lockspace *ls,
>  struct lock_waiter *w)
>  {
> +   const struct dlm_plock_info *info = &w->info;
> +
> +   DTRACE_PROBE7(dlm_controld, plock_wait_begin, info->number, w, 
> info->start,
> + info->end, info->nodeid, info->pid, info->owner);
> +
> log_plock(ls, "state waiter start %llx %p %llx-%llx %d/%u/%llx",
>   (unsigned long long)w->info.number,
>   w,

An additional question I have about those events is which information
to log. We need to be able to identify which inode the request is for
(info->number), the locking range (info->start and info->end), whether
it is read or write lock, and which context in the kernel the request
refers to (this seems to be info->owner, but I'm not entirely sure
about that). The pid may be interesting as well, but are w or
info->nodeid really useful? We should try to avoid exposing
unnecessary implementation details like the addresses of objects
inside dlm_controld.

> @@ -223,6 +229,10 @@ static void plock_print_start_waiter(const struct 
> lockspace *ls,
>  static void plock_print_end_waiter(const struct lockspace *ls,
>const struct lock_waiter *w)
>  {
> +   const struct dlm_plock_info *info = &w->info;
> +
> +   DTRACE_PROBE2(dlm_controld, plock_wait_end, info->number, w);
> +
> log_plock(ls, "state waiter end %llx %p",
>   (unsigned long long)w->info.number, w);
>  }
> @@ -230,6 +240,9 @@ static void plock_print_end_waiter(const struct lockspace 
> *ls,
>  static void plock_print_start_plock(const struct lockspace *ls, uint64_t 
> number,
> const struct posix_lock *po)
>  {
> +   DTRACE_PROBE8(dlm_controld, plock_lock_begin, number, po, po->ex, 
> po->start,
> + po->end, po->nodeid, po->pid, po->owner);
> +
> log_plock(ls, "state plock start %llx %p %s %llx-%llx %d/%u/%llx",
>   (unsigned long long)number,
>   po,
> @@ -243,6 +256,8 @@ static void plock_print_start_plock(const struct 
> lockspace *ls, uint64_t number,
>  static void plock_print_end_plock(const struct lockspace *ls, uint64_t 
> number,
>   const struct posix_lock *po)
>  {
> +   DTRACE_PROBE1(dlm_controld, plock_lock_end, po);
> +
> log_plock(ls, "state plock end %llx %p",
>   (unsigned long long)number, po);
>  }
> --
> 2.39.0
>

Thanks,
Andreas



Re: [Cluster-devel] [PATCH dlm-tool 13/14] dlm_controld: plock log lock state

2023-03-03 Thread Andreas Gruenbacher
Now, let me get to the core of the matter.  We've been talking about
using user-space (SDT) trace points for collecting the data, and I still
think that that's what we should do instead of introducing a new
dlm_controld log file.  In the dlm_controld code, this would look like
the below patch.

Note that  is part of the systemtap-sdt-devel package, so a
"BuildRequires: systemtap-sdt-devel" dependency will be needed in
dlm.spec.

With that, we can use standard tools like perf, bpftrace, etc. for
collecting all the relevant information without any further
modifications to dlm_controld.  We can also collect additional kernel
and user-space trace point data at the same time with very little
additional effort.

For example, here is how to register the four plock dlm_controld trace
points in perf:

  for ev in \
  sdt_dlm_controld:plock_lock_begin \
  sdt_dlm_controld:plock_lock_end \
  sdt_dlm_controld:plock_wait_begin \
  sdt_dlm_controld:plock_wait_end; do \
perf probe -x /usr/sbin/dlm_controld $ev; \
  done

The events can then be recorded with "perf record":

  perf record \
-e sdt_dlm_controld:plock_lock_begin \
-e sdt_dlm_controld:plock_lock_end \
-e sdt_dlm_controld:plock_wait_begin \
-e sdt_dlm_controld:plock_wait_end \
[...]

We've already gone through how the resulting log can be processed with
"perf script".  One possible result would be the log file format that
lockdb_plot expects, but there are countless other possibilities.

Other useful "tricks":

  $ bpftrace -l 'usdt:/usr/sbin/dlm_controld:*'

  $ readelf -n /usr/sbin/dlm_controld | sed -ne '/\.note\.stapsdt/,/^$/p'


Thanks,
Andreas

-- 
 dlm_controld/plock.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/dlm_controld/plock.c b/dlm_controld/plock.c
index 39bdd1f6..588bcaaa 100644
--- a/dlm_controld/plock.c
+++ b/dlm_controld/plock.c
@@ -8,6 +8,7 @@
 
 #include "dlm_daemon.h"
 #include 
+#include 
 
 /* FIXME: remove this once everyone is using the version of
  * dlm_plock.h which defines it */
@@ -211,6 +212,11 @@ static uint64_t dt_usec(const struct timeval *start, const 
struct timeval *stop)
 static void plock_print_start_waiter(const struct lockspace *ls,
 struct lock_waiter *w)
 {
+   const struct dlm_plock_info *info = &w->info;
+
+   DTRACE_PROBE7(dlm_controld, plock_wait_begin, info->number, w, 
info->start,
+ info->end, info->nodeid, info->pid, info->owner);
+
log_plock(ls, "state waiter start %llx %p %llx-%llx %d/%u/%llx",
  (unsigned long long)w->info.number,
  w,
@@ -223,6 +229,10 @@ static void plock_print_start_waiter(const struct 
lockspace *ls,
 static void plock_print_end_waiter(const struct lockspace *ls,
   const struct lock_waiter *w)
 {
+   const struct dlm_plock_info *info = &w->info;
+
+   DTRACE_PROBE2(dlm_controld, plock_wait_end, info->number, w);
+
log_plock(ls, "state waiter end %llx %p",
  (unsigned long long)w->info.number, w);
 }
@@ -230,6 +240,9 @@ static void plock_print_end_waiter(const struct lockspace 
*ls,
 static void plock_print_start_plock(const struct lockspace *ls, uint64_t 
number,
const struct posix_lock *po)
 {
+   DTRACE_PROBE8(dlm_controld, plock_lock_begin, number, po, po->ex, 
po->start,
+ po->end, po->nodeid, po->pid, po->owner);
+
log_plock(ls, "state plock start %llx %p %s %llx-%llx %d/%u/%llx",
  (unsigned long long)number,
  po,
@@ -243,6 +256,8 @@ static void plock_print_start_plock(const struct lockspace 
*ls, uint64_t number,
 static void plock_print_end_plock(const struct lockspace *ls, uint64_t number,
  const struct posix_lock *po)
 {
+   DTRACE_PROBE1(dlm_controld, plock_lock_end, po);
+
log_plock(ls, "state plock end %llx %p",
  (unsigned long long)number, po);
 }
-- 
2.39.0



Re: [Cluster-devel] [PATCH dlm-tool 13/14] dlm_controld: plock log lock state

2023-03-03 Thread Andreas Gruenbacher
Alex,

can you please prefix this patch with the following to make this easier
to read?

Thanks,
Andreas

--

dlm_controld: pass lockspace and lock number to add_lock()

The next patch will make use of the additional arguments.
---
 dlm_controld/plock.c | 41 ++---
 1 file changed, 22 insertions(+), 19 deletions(-)

diff --git a/dlm_controld/plock.c b/dlm_controld/plock.c
index b93863f7..6709d205 100644
--- a/dlm_controld/plock.c
+++ b/dlm_controld/plock.c
@@ -466,8 +466,9 @@ static int is_conflict(struct resource *r, struct 
dlm_plock_info *in, int get)
return 0;
 }
 
-static int add_lock(struct resource *r, uint32_t nodeid, uint64_t owner,
-   uint32_t pid, int ex, uint64_t start, uint64_t end)
+static int add_lock(const struct lockspace *ls, struct resource *r,
+   uint32_t nodeid, uint64_t owner, uint32_t pid,
+   int ex, uint64_t start, uint64_t end, uint64_t number)
 {
struct posix_lock *po;
 
@@ -491,8 +492,8 @@ static int add_lock(struct resource *r, uint32_t nodeid, 
uint64_t owner,
1. add new lock for non-overlap area of RE, orig mode
2. convert RE to RN range and mode */
 
-static int lock_case1(struct posix_lock *po, struct resource *r,
- struct dlm_plock_info *in)
+static int lock_case1(const struct lockspace *ls, struct posix_lock *po,
+ struct resource *r, struct dlm_plock_info *in)
 {
uint64_t start2, end2;
int rv;
@@ -508,7 +509,8 @@ static int lock_case1(struct posix_lock *po, struct 
resource *r,
po->end = in->end;
po->ex = in->ex;
 
-   rv = add_lock(r, in->nodeid, in->owner, in->pid, !in->ex, start2, end2);
+   rv = add_lock(ls, r, in->nodeid, in->owner, in->pid, !in->ex, start2,
+ end2, in->number);
  out:
return rv;
 }
@@ -518,19 +520,20 @@ static int lock_case1(struct posix_lock *po, struct 
resource *r,
2. add new lock for back fragment, orig mode
3. convert RE to RN range and mode */
 
-static int lock_case2(struct posix_lock *po, struct resource *r,
- struct dlm_plock_info *in)
+static int lock_case2(const struct lockspace *ls, struct posix_lock *po,
+ struct resource *r, struct dlm_plock_info *in)
 
 {
int rv;
 
-   rv = add_lock(r, in->nodeid, in->owner, in->pid,
- !in->ex, po->start, in->start - 1);
+   rv = add_lock(ls, r, in->nodeid, in->owner, in->pid,
+ !in->ex, po->start, in->start - 1,
+ in->number);
if (rv)
goto out;
 
-   rv = add_lock(r, in->nodeid, in->owner, in->pid,
- !in->ex, in->end + 1, po->end);
+   rv = add_lock(ls, r, in->nodeid, in->owner, in->pid,
+ !in->ex, in->end + 1, po->end, in->number);
if (rv)
goto out;
 
@@ -569,14 +572,14 @@ static int lock_internal(struct lockspace *ls, struct 
resource *r,
if (po->ex == in->ex)
goto out;
 
-   rv = lock_case1(po, r, in);
+   rv = lock_case1(ls, po, r, in);
goto out;
 
case 2:
if (po->ex == in->ex)
goto out;
 
-   rv = lock_case2(po, r, in);
+   rv = lock_case2(ls, po, r, in);
goto out;
 
case 3:
@@ -597,8 +600,8 @@ static int lock_internal(struct lockspace *ls, struct 
resource *r,
}
}
 
-   rv = add_lock(r, in->nodeid, in->owner, in->pid,
- in->ex, in->start, in->end);
+   rv = add_lock(ls, r, in->nodeid, in->owner, in->pid,
+ in->ex, in->start, in->end, in->number);
  out:
return rv;
 
@@ -638,8 +641,8 @@ static int unlock_internal(struct lockspace *ls, struct 
resource *r,
/* RN within RE - shrink and update RE to be front
 * fragment, and add a new lock for back fragment */
 
-   rv = add_lock(r, in->nodeid, in->owner, in->pid,
- po->ex, in->end + 1, po->end);
+   rv = add_lock(ls, r, in->nodeid, in->owner, in->pid,
+ po->ex, in->end + 1, po->end, in->number);
po->end = in->start - 1;
goto out;
 
@@ -1346,8 +1349,8 @@ static void _receive_sync(struct lockspace *ls, struct 
dlm_header *hd, int len)
}
 
if (hd->type == DLM_MSG_PLOCK_SYNC_LOCK)
-   add_lock(r, info.nodeid, info.owner, info.pid, info.ex, 
-info.start, info.end);
+   add_lock(ls, r, info.nodeid, info.owner, info.pid, info.ex,
+info.start, info.end, 

[Cluster-devel] [GIT PULL] gfs2 fixes

2023-02-22 Thread Andreas Gruenbacher
Hi Linus,

please consider pulling the following gfs2 fixes.

These fixes are based on the latest iomap-for-next branch, for which
Darrick has sent a pull request earlier today:

https://lore.kernel.org/linux-fsdevel/167703901677.1909640.1798642413122202835.stg-ugh@magnolia/

Thanks,
Andreas


The following changes since commit 63510d9f2f6e6337960499a3d72d5a457b19c287:

  Merge branch 'iomap-for-next' of 
git://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git (2023-01-24 12:51:39 +0100)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git 
tags/gfs2-v6.2-rc5-fixes

for you to fetch changes up to c1b0c3cfcbad25d2c412863c27638c933f1d911b:

  gfs2: Convert gfs2_page_add_databufs to folios (2023-02-22 12:06:20 +0100)


gfs2 fixes

- Fix a race when disassociating inodes from their glocks after
  iget_failed().

- On filesystems with a block size smaller than the page size, make
  sure that ->writepages() writes out all buffers of journaled inodes.

- Various improvements to the way the delete workqueue is drained to
  speed up unmount and prevent leftover inodes.  At unmount time, evict
  deleted inodes cooperatively across the cluster to avoid unnecessary
  timeouts.

- Various minor cleanups and fixes.

----
Andreas Gruenbacher (12):
  gfs2: gl_object races fix
  gfs2: Improve gfs2_upgrade_iopen_glock comment
  gfs2: Clean up gfs2_scan_glock_lru
  gfs2: Make glock lru list scanning safer
  gfs2: Get rid of GLF_PENDING_DELETE flag
  gfs2: Move delete workqueue into super block
  gfs2: Split the two kinds of glock "delete" work
  gfs2: Flush delete work before shrinking inode cache
  gfs2: Evict inodes cooperatively
  gfs2: Improve gfs2_make_fs_rw error handling
  gfs2: jdata writepage fix
  gfs2: Convert gfs2_page_add_databufs to folios

Bob Peterson (4):
  gfs2: check gl_object in rgrp glops
  gfs2: Add SDF_DEACTIVATING super block flag
  gfs2: Cease delete work during unmount
  Revert "GFS2: free disk inode which is deleted by remote node -V2"

 fs/gfs2/aops.c   |   9 ++--
 fs/gfs2/aops.h   |   4 +-
 fs/gfs2/bmap.c   |   4 +-
 fs/gfs2/dentry.c |  18 
 fs/gfs2/glock.c  | 128 +--
 fs/gfs2/glock.h  |   4 +-
 fs/gfs2/glops.c  |  21 -
 fs/gfs2/incore.h |  11 -
 fs/gfs2/inode.c  |   8 
 fs/gfs2/ops_fstype.c |  71 +++-
 fs/gfs2/rgrp.c   |   2 +-
 fs/gfs2/super.c  |  49 ++--
 fs/gfs2/sys.c|   2 +
 13 files changed, 204 insertions(+), 127 deletions(-)



[Cluster-devel] [PATCH] gfs2: Improve gfs2_make_fs_rw error handling

2023-01-31 Thread Andreas Gruenbacher
In gfs2_make_fs_rw(), make sure to call gfs2_consist() to report an
inconsistency and mark the filesystem as withdrawn when
gfs2_find_jhead() fails.

At the end of gfs2_make_fs_rw(), when we discover that the filesystem
has been withdrawn, make sure we report an error.  This also replaces
the gfs2_withdrawn() check after gfs2_find_jhead().

Reported-by: Tetsuo Handa 
Cc: syzbot+f51cb4b9afbd87ec0...@syzkaller.appspotmail.com
Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/super.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 52c3502de58c..a83fa62106f0 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -138,8 +138,10 @@ int gfs2_make_fs_rw(struct gfs2_sbd *sdp)
return -EIO;
 
error = gfs2_find_jhead(sdp->sd_jdesc, &head, false);
-   if (error || gfs2_withdrawn(sdp))
+   if (error) {
+   gfs2_consist(sdp);
return error;
+   }
 
if (!(head.lh_flags & GFS2_LOG_HEAD_UNMOUNT)) {
gfs2_consist(sdp);
@@ -151,7 +153,9 @@ int gfs2_make_fs_rw(struct gfs2_sbd *sdp)
gfs2_log_pointers_init(sdp, head.lh_blkno);
 
error = gfs2_quota_init(sdp);
-   if (!error && !gfs2_withdrawn(sdp))
+   if (!error && gfs2_withdrawn(sdp))
+   error = -EIO;
+   if (!error)
+   set_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags);
return error;
 }
-- 
2.39.0



Re: [Cluster-devel] [RFC v6 05/10] iomap/gfs2: Get page in page_prepare handler

2023-01-31 Thread Andreas Gruenbacher
On Tue, Jan 31, 2023 at 8:37 PM Matthew Wilcox  wrote:
> On Sun, Jan 08, 2023 at 08:40:29PM +0100, Andreas Gruenbacher wrote:
> > +static struct folio *
> > +gfs2_iomap_page_prepare(struct iomap_iter *iter, loff_t pos, unsigned len)
> >  {
> > + struct inode *inode = iter->inode;
> >   unsigned int blockmask = i_blocksize(inode) - 1;
> >   struct gfs2_sbd *sdp = GFS2_SB(inode);
> >   unsigned int blocks;
> > + struct folio *folio;
> > + int status;
> >
> >   blocks = ((pos & blockmask) + len + blockmask) >> inode->i_blkbits;
> > - return gfs2_trans_begin(sdp, RES_DINODE + blocks, 0);
> > + status = gfs2_trans_begin(sdp, RES_DINODE + blocks, 0);
> > + if (status)
> > + return ERR_PTR(status);
> > +
> > + folio = iomap_get_folio(iter, pos);
> > + if (IS_ERR(folio))
> > + gfs2_trans_end(sdp);
> > + return folio;
> >  }
>
> Hi Andreas,

Hello,

> I didn't think to mention this at the time, but I was reading through
> buffered-io.c and this jumped out at me.  For filesystems which support
> folios, we pass the entire length of the write (or at least the length
> of the remaining iomap length).  That's intended to allow us to decide
> how large a folio to allocate at some point in the future.
>
> For GFS2, we do this:
>
> if (!mapping_large_folio_support(iter->inode->i_mapping))
> len = min_t(size_t, len, PAGE_SIZE - offset_in_page(pos));
>
> I'd like to drop that and pass the full length of the write to
> ->get_folio().  It looks like you'll have to clamp it yourself at this
> point.

sounds reasonable to me.

I see that gfs2_page_add_databufs() hasn't been folio-ized yet, but it
looks like it might just work anyway. So gfs2_iomap_get_folio() ...
gfs2_iomap_put_folio() should, in principle, work for requests bigger
than PAGE_SIZE.

Is there a reasonable way of trying it out?

We still want to keep the transaction size somewhat reasonable, but
the maximum size gfs2_iomap_begin() will return for a write is 509
blocks on a 4k-block filesystem, or slightly less than 2 MiB, which
should be fine.

>  I am kind of curious why you do one transaction per page --
> I would have thought you'd rather do one transaction for the entire write.

Only for journaled data writes. We could probably do bigger
transactions even in that case, but we'd rather get rid of data
journaling than encourage it, so we're also not spending a lot of time
on optimizing this case.

Thanks,
Andreas



Re: [Cluster-devel] [syzbot] INFO: task hung in freeze_super (3)

2023-01-31 Thread Andreas Gruenbacher
Hello,

On Wed, Jan 4, 2023 at 3:37 PM Tetsuo Handa
 wrote:
> syzbot is reporting hung task at freeze_super() after emitting
> "gfs2: fsid=loop0.0: can't make FS RW: -5" message due to gfs2_find_jhead()
>  from gfs2_make_fs_rw() from gfs2_fill_super() failing with -EIO.
>
> When hung task message is printed, I can observe that glock_workqueue is
> trying to hold type->s_umount_key from glock_workqueue work function
> whereas the reproducer is trying to flush glock_workqueue workqueue with
> type->s_umount_key held; leading to silent deadlock.
>
> [  259.867348] task:kworker/3:1Hstate:D stack:13736 pid:2497  ppid:2  
> flags:0x4000
> [  259.870666] Workqueue: glock_workqueue glock_work_func
> [  259.872886] Call Trace:
> [  259.874279]  
> [  259.875452]  __schedule+0x498/0x590
> [  259.877255]  schedule+0x55/0x90
> [  259.878932]  rwsem_down_write_slowpath+0x385/0x760
> [  259.881206]  freeze_super+0x29/0x1f0
> [  259.883022]  freeze_go_sync+0xa6/0x1f0
> [  259.884905]  do_xmote+0x1ae/0xa60
> [  259.886625]  glock_work_func+0x19a/0x220
> [  259.888559]  process_one_work+0x306/0x530
> [  259.890548]  worker_thread+0x357/0x630
> [  259.892443]  kthread+0x133/0x170
> [  259.894430]  ? rcu_lock_release+0x30/0x30
> [  259.896408]  ? kthread_blkcg+0x60/0x60
> [  259.898289]  ret_from_fork+0x1f/0x30
> [  259.900114]  
> [  259.901407] 3 locks held by kworker/3:1H/2497:
> [  259.903534]  #0: 88810653c338 
> ((wq_completion)glock_workqueue){+.+.}-{0:0}, at: process_one_work+0x2a7/0x530
> [  259.907874]  #1: c900013dbe58 
> ((work_completion)(&(&gl->gl_work)->work)){+.+.}-{0:0}, at: 
> process_one_work+0x2d1/0x530
> [  259.912529]  #2: 88812be0d0e0 (&type->s_umount_key#51){+.+.}-{3:3}, 
> at: freeze_super+0x29/0x1f0
>
> [  259.975640] task:a.out   state:D stack:12664 pid:4579  ppid:3638   
> flags:0x4006
> [  259.979256] Call Trace:
> [  259.980634]  
> [  259.981886]  __schedule+0x498/0x590
> [  259.983653]  schedule+0x55/0x90
> [  259.985291]  schedule_timeout+0x9e/0x1d0
> [  259.987219]  do_wait_for_common+0xff/0x180
> [  259.989212]  ? console_conditional_schedule+0x40/0x40
> [  259.991585]  wait_for_completion+0x4a/0x60
> [  259.993573]  __flush_workqueue+0x360/0x820
> [  259.995584]  gfs2_gl_hash_clear+0x58/0x1b0
> [  259.997579]  ? _raw_spin_unlock_irqrestore+0x43/0xb0
> [  259.999883]  ? lockdep_hardirqs_on+0x99/0x140
> [  260.001970]  gfs2_fill_super+0xe24/0x1110
> [  260.003934]  ? gfs2_reconfigure+0x4d0/0x4d0
> [  260.005955]  get_tree_bdev+0x228/0x2f0
> [  260.007821]  gfs2_get_tree+0x2b/0xe0
> [  260.009620]  vfs_get_tree+0x30/0xe0
> [  260.011570]  do_new_mount+0x1d7/0x540
> [  260.013408]  path_mount+0x3c5/0xb50
> [  260.015176]  __se_sys_mount+0x298/0x2f0
> [  260.017073]  do_syscall_64+0x41/0x90
> [  260.018870]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> [  260.021262] RIP: 0033:0x7fc273f26eae
> [  260.023077] RSP: 002b:7ebc6f38 EFLAGS: 0282 ORIG_RAX: 
> 00a5
> [  260.026419] RAX: ffda RBX:  RCX: 
> 7fc273f26eae
> [  260.029573] RDX: 200124c0 RSI: 20012500 RDI: 
> 7ebc6fb0
> [  260.032728] RBP: 7ebc7100 R08: 7ebc6ff0 R09: 
> 
> [  260.035903] R10:  R11: 0282 R12: 
> 7ebc7278
> [  260.039063] R13: 559922083246 R14: 559922097cb8 R15: 
> 7fc27413b040
> [  260.042241]  
> [  260.043522] 1 lock held by a.out/4579:
> [  260.045379]  #0: 88812be0d0e0 (&type->s_umount_key#50/1){+.+.}-{3:3}, 
> at: alloc_super+0x102/0x450
>
> It is difficult to reproduce lockdep warning because this is timing dependent.
> So far I was able to reproduce lockdep warning only once.
>
> [  111.928183] [ T4537] gfs2: fsid=loop0.0: can't make FS RW: -5
> [  111.931578] [   T88]
> [  111.937591] [   T88] ==
> [  111.940620] [   T88] WARNING: possible circular locking dependency detected
> [  111.943553] [   T88] 6.2.0-rc1-next-20221226-2-gc99a3711d8e3-dirty #28 
> Not tainted
> [  111.946879] [   T88] --
> [  111.949832] [   T88] kworker/2:1H/88 is trying to acquire lock:
> [  111.952383] [   T88] 888128ff50e0 
> (&type->s_umount_key#51){+.+.}-{3:3}, at: freeze_super+0x29/0x1f0
> [  111.956324] [   T88]
> [  111.956324] [   T88] but task is already holding lock:
> [  111.959406] [   T88] c9bfbe58 
> ((work_completion)(&(&gl->gl_work)->work)){+.+.}-{0:0}, at: 
> process_one_work+0x2d1/0x530
> [  111.964081] [   T88]
> [  111.964081] [   T88] which lock already depends on the new lock.
> [  111.964081] [   T88]
> [  111.968219] [   T88]
> [  111.968219] [   T88] the existing dependency chain (in reverse order) is:
> [  111.971959] [   T88]
> [  111.971959] [   T88] -> #2 
> ((work_completion)(&(&gl->gl_work)->work)){+.+.}-{0:0}:
> [  111.976006] [   T88]process_one_work+0x2f3/0x530
> [  111.978307] [   T88]worker_thread+0x357/0x630

Re: [Cluster-devel] [PATCH] gfs2: Fix uaf for qda in gfs2_quota_sync

2023-01-30 Thread Andreas Gruenbacher
Hello Edward,

On Fri, Jan 27, 2023 at 6:12 AM  wrote:
> From: Edward Adam Davis 
>
> [   81.372851][ T5532] CPU: 1 PID: 5532 Comm: syz-executor.0 Not tainted 
> 6.2.0-rc1-syzkaller-dirty #0
> [   81.382080][ T5532] Hardware name: Google Google Compute Engine/Google 
> Compute Engine, BIOS Google 01/12/2023
> [   81.392343][ T5532] Call Trace:
> [   81.395654][ T5532]  
> [   81.398603][ T5532]  dump_stack_lvl+0x1b1/0x290
> [   81.418421][ T5532]  gfs2_assert_warn_i+0x19a/0x2e0
> [   81.423480][ T5532]  gfs2_quota_cleanup+0x4c6/0x6b0
> [   81.428611][ T5532]  gfs2_make_fs_ro+0x517/0x610
> [   81.457802][ T5532]  gfs2_withdraw+0x609/0x1540
> [   81.481452][ T5532]  gfs2_inode_refresh+0xb2d/0xf60
> [   81.506658][ T5532]  gfs2_instantiate+0x15e/0x220
> [   81.511504][ T5532]  gfs2_glock_wait+0x1d9/0x2a0
> [   81.516352][ T5532]  do_sync+0x485/0xc80
> [   81.554943][ T5532]  gfs2_quota_sync+0x3da/0x8b0
> [   81.559738][ T5532]  gfs2_sync_fs+0x49/0xb0
> [   81.564063][ T5532]  sync_filesystem+0xe8/0x220
> [   81.568740][ T5532]  generic_shutdown_super+0x6b/0x310
> [   81.574112][ T5532]  kill_block_super+0x79/0xd0
> [   81.578779][ T5532]  deactivate_locked_super+0xa7/0xf0
> [   81.584064][ T5532]  cleanup_mnt+0x494/0x520
> [   81.593753][ T5532]  task_work_run+0x243/0x300
> [   81.608837][ T5532]  exit_to_user_mode_loop+0x124/0x150
> [   81.614232][ T5532]  exit_to_user_mode_prepare+0xb2/0x140
> [   81.619820][ T5532]  syscall_exit_to_user_mode+0x26/0x60
> [   81.625287][ T5532]  do_syscall_64+0x49/0xb0
> [   81.629710][ T5532]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [   81.636292][ T5532] RIP: 0033:0x7efdd688d517
> [   81.640728][ T5532] Code: ff ff ff f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 
> 44 00 00 31 f6 e9 09 00 00 00 66 0f 1f 84 00 00 00 00 00 b8 a6 00 00 00 0f 05 
> <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> [   81.660550][ T5532] RSP: 002b:7fff34520ce8 EFLAGS: 0246 ORIG_RAX: 
> 00a6
> [   81.669413][ T5532] RAX:  RBX:  RCX: 
> 7efdd688d517
> [   81.677403][ T5532] RDX: 7fff34520db9 RSI: 000a RDI: 
> 7fff34520db0
> [   81.685388][ T5532] RBP: 7fff34520db0 R08:  R09: 
> 7fff34520b80
> [   81.695973][ T5532] R10: 55ca38b3 R11: 0246 R12: 
> 7efdd68e6b24
> [   81.704152][ T5532] R13: 7fff34521e70 R14: 55ca3810 R15: 
> 7fff34521eb0
> [   81.712868][ T5532]  
>
> The function "gfs2_quota_cleanup()" may be called from within "do_sync()".
> This causes the qda entries obtained in "qd_check_sync()" to be
> released, resulting in a use-after-free.
> To avoid this use-after-free, check whether "sdp->sd_quota_bitmap" has
> been freed by "gfs2_quota_cleanup()" to confirm that
> "sdp->sd_quota_list" has already been released.

I can see that there is a problem in the gfs2 quota code, but this
unfortunately doesn't look like an acceptable fix. A better approach
would be to use proper reference counting for gfs2_quota_data objects.
In this case, gfs2_quota_sync() is still holding a reference, so the
underlying object shouldn't be freed.

Fixing this properly will require more than a handful of lines.
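To illustrate the reference-counting approach suggested above, here is a minimal userspace sketch. It is purely hypothetical: the real fix would use the kernel's refcount_t/kref machinery, and none of the names below (qd_get, qd_put, sync_vs_cleanup_demo) are actual gfs2 API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct gfs2_quota_data with a refcount, so a
 * sync path's reference keeps the object alive even while the cleanup
 * path drops the quota list's reference. */
struct quota_data {
	atomic_int qd_count;
	int qd_sync_gen;
};

static struct quota_data *qd_get(struct quota_data *qd)
{
	atomic_fetch_add(&qd->qd_count, 1);
	return qd;
}

/* Returns 1 if this put dropped the last reference and freed the object. */
static int qd_put(struct quota_data *qd)
{
	if (atomic_fetch_sub(&qd->qd_count, 1) == 1) {
		free(qd);
		return 1;
	}
	return 0;
}

/* Simulate the race: the sync path holds a reference while "cleanup"
 * drops the list's reference; the object must survive until the sync
 * path is done with it.  Returns 0 on success. */
static int sync_vs_cleanup_demo(void)
{
	struct quota_data *qd = calloc(1, sizeof(*qd));

	atomic_init(&qd->qd_count, 1);	/* reference held by the quota list */
	qd_get(qd);			/* sync path takes its own reference */
	if (qd_put(qd))			/* cleanup drops the list's ref ... */
		return -1;		/* ... which must NOT free it yet */
	qd->qd_sync_gen = 42;		/* safe: our reference keeps qd valid */
	if (!qd_put(qd))		/* last reference: now it is freed */
		return -1;
	return 0;
}
```

With this lifecycle, the window in gfs2_quota_sync() between building the qda array and unlocking its entries can no longer touch freed memory, regardless of when cleanup runs.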

> Link: https://lore.kernel.org/all/2b5e2405f14e8...@google.com
> Reported-and-tested-by: syzbot+3f6a670108ce43356...@syzkaller.appspotmail.com
> Signed-off-by: Edward Adam Davis 
> ---
>  fs/gfs2/quota.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
> index 1ed1722..4cf66bd 100644
> --- a/fs/gfs2/quota.c
> +++ b/fs/gfs2/quota.c
> @@ -1321,6 +1321,9 @@ int gfs2_quota_sync(struct super_block *sb, int type)
> qda[x]->qd_sync_gen =
> sdp->sd_quota_sync_gen;
>
> +   if (!sdp->sd_quota_bitmap)
> +   break;
> +
> for (x = 0; x < num_qd; x++)
> qd_unlock(qda[x]);
> }
> --
> 2.39.0
>

Thanks,
Andreas



[Cluster-devel] [GIT PULL] gfs2 writepage fix

2023-01-22 Thread Andreas Gruenbacher
From: Linus Torvalds 

Hi Linus,

please consider pulling the following gfs2 fix.

Thank you very much,
Andreas

The following changes since commit 5dc4c995db9eb45f6373a956eb1f69460e69e6d4:

  Linux 6.2-rc4 (2023-01-15 09:22:43 -0600)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git tags/gfs2-v6.2-rc4-fix

for you to fetch changes up to 95ecbd0f162fc06ef4c4045a66f653f47b62a2d3:

  Revert "gfs2: stop using generic_writepages in gfs2_ail1_start_one" (2023-01-22 09:46:14 +0100)


gfs2 writepage fix

- Fix a regression introduced by commit "gfs2: stop using
  generic_writepages in gfs2_ail1_start_one".

--------
Andreas Gruenbacher (1):
  Revert "gfs2: stop using generic_writepages in gfs2_ail1_start_one"

 fs/gfs2/log.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)



Re: [Cluster-devel] [PATCH] Revert "gfs2: stop using generic_writepages in gfs2_ail1_start_one"

2023-01-22 Thread Andreas Gruenbacher
On Sat, Jan 21, 2023 at 3:29 PM Christoph Hellwig  wrote:
>
> > + struct address_space *mapping = data;
> > + int ret = mapping->a_ops->writepage(page, wbc);
> > + mapping_set_error(mapping, ret);
> > + return ret;
>
> I guess beggars can't be choosers, but is there a chance to directly
> call the relevant gfs2 writepage methods here instead of the
> ->writepage call?

Yes, we could wrap struct address_space_operations and move the
writepage method into its wrapper structure relatively easily, but
that would still leave things in a messy state. So I'd really like to
reassess the validity of commit 5ac048bb7ea6 ("GFS2: Use
filemap_fdatawrite() to write back the AIL") before deciding to go
that way.

Also, we're really trying to iterate the list of inodes that are part
of the transaction here, not the list of blocks, and if we stick with
that, an actual list of inodes would help. That would be the
complement of our list of ordered inodes in a sense.

Until then, I'd like to stick with the simplest possible solution
though, which seems to be this.

> Otherwise this looks good:
>
> Acked-by: Christoph Hellwig 

Thanks a lot,
Andreas



[Cluster-devel] [PATCH] Revert "gfs2: stop using generic_writepages in gfs2_ail1_start_one"

2023-01-20 Thread Andreas Gruenbacher
Commit b2b0a5e97855 switched from generic_writepages() to
filemap_fdatawrite_wbc() in gfs2_ail1_start_one() on the path to
replacing ->writepage() with ->writepages() and eventually eliminating
the former.  Function gfs2_ail1_start_one() is called from
gfs2_log_flush(), our main function for flushing the filesystem log.

Unfortunately, at least as implemented today, ->writepage() and
->writepages() are entirely different operations for journaled data
inodes: while the former creates and submits transactions covering the
data to be written, the latter flushes dirty buffers out to disk.

With gfs2_ail1_start_one() now calling ->writepages(), we end up
creating filesystem transactions while we are in the course of a log
flush, which immediately deadlocks on the sdp->sd_log_flush_lock
semaphore.

Work around that by going back to how things used to work before commit
b2b0a5e97855 for now; figuring out a superior solution will take time we
don't have available right now.  However ...

Since the removal of generic_writepages() is imminent, open-code it
here.  We're already inside a blk_start_plug() ...  blk_finish_plug()
section here, so skip that part of the original generic_writepages().

This reverts commit b2b0a5e978552e348f85ad9c7568b630a5ede659.

Signed-off-by: Andreas Gruenbacher 
---
 fs/gfs2/log.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 723639376ae2..61323deb80bc 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -80,6 +80,15 @@ void gfs2_remove_from_ail(struct gfs2_bufdata *bd)
brelse(bd->bd_bh);
 }
 
+static int __gfs2_writepage(struct page *page, struct writeback_control *wbc,
+  void *data)
+{
+   struct address_space *mapping = data;
+   int ret = mapping->a_ops->writepage(page, wbc);
+   mapping_set_error(mapping, ret);
+   return ret;
+}
+
 /**
  * gfs2_ail1_start_one - Start I/O on a transaction
  * @sdp: The superblock
@@ -131,7 +140,7 @@ __acquires(&sdp->sd_ail_lock)
if (!mapping)
continue;
spin_unlock(&sdp->sd_ail_lock);
-   ret = filemap_fdatawrite_wbc(mapping, wbc);
+   ret = write_cache_pages(mapping, wbc, __gfs2_writepage, mapping);
if (need_resched()) {
blk_finish_plug(plug);
cond_resched();
-- 
2.39.0



Re: [Cluster-devel] [PATCH 1/4] gfs2: stop using generic_writepages in gfs2_ail1_start_one

2023-01-18 Thread Andreas Gruenbacher
Hi Christoph,

On Tue, Jul 19, 2022 at 7:07 AM Christoph Hellwig  wrote:
>
> Use filemap_fdatawrite_wbc instead of generic_writepages in
> gfs2_ail1_start_one so that the function can also cope with address_space
> operations that only implement ->writepages and to properly account
> for cgroup writeback.
>
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Andreas Gruenbacher 
> ---
>  fs/gfs2/log.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
> index f0ee3ff6f9a87..a66e3b1f6d178 100644
> --- a/fs/gfs2/log.c
> +++ b/fs/gfs2/log.c
> @@ -131,7 +131,7 @@ __acquires(&sdp->sd_ail_lock)
> if (!mapping)
> continue;
> spin_unlock(&sdp->sd_ail_lock);
> -   ret = generic_writepages(mapping, wbc);
> +   ret = filemap_fdatawrite_wbc(mapping, wbc);

this patch unfortunately breaks journaled data inodes.

We're in function gfs2_ail1_start_one() here, which is usually called
via gfs2_log_flush() -> empty_ail1_list() -> gfs2_ail1_start() ->
gfs2_ail1_flush() -> gfs2_ail1_start_one(), and we're going through
the list of buffer heads in the transaction to be flushed. We used to
submit each dirty buffer for I/O individually, but since commit
5ac048bb7ea6 ("GFS2: Use filemap_fdatawrite() to write back the AIL"),
we're submitting all the dirty pages in the metadata address space
this buffer head belongs to. That's slightly bizarre, but it happens
to catch exactly the same buffer heads that are in the transaction, so
we end up with the same result. From what I'm being told, this was
done as a performance optimization -- but nobody actually knows the
details anymore.

The above change means that instead of calling generic_writepages(),
we end up calling filemap_fdatawrite_wbc() -> do_writepages() ->
mapping->a_ops->writepages(). But that's something completely
different; the writepages address space operation operates is outward
facing, while we really only want to write out the dirty buffers /
pages in the underlying address space. In case of journaled data
inodes, gfs2_jdata_writepages() actually ends up trying to create a
filesystem transaction, which immediately hangs because we're in the
middle of a log flush.
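The control-flow difference described above can be sketched in a toy model. This is not the real kernel API: write_cache_pages() and ->writepages() have different signatures in the kernel, and all "toy_" names here are invented for illustration. The point is only that the per-page-callback style submits each dirty page directly, without delegating to an operation that might open a transaction.

```c
#include <stddef.h>

/* Toy analogue of the per-page-callback style: walk the dirty pages and
 * hand each one to a caller-supplied callback, instead of delegating the
 * whole range to a ->writepages()-like operation that could start a new
 * filesystem transaction. */
struct toy_page { int dirty; };

typedef int (*toy_writepage_cb)(struct toy_page *pg, void *data);

static int toy_write_cache_pages(struct toy_page *pages, size_t n,
				 toy_writepage_cb cb, void *data)
{
	int ret = 0;

	for (size_t i = 0; i < n && ret == 0; i++)
		if (pages[i].dirty)
			ret = cb(&pages[i], data);
	return ret;
}

static int toy_submit_page(struct toy_page *pg, void *data)
{
	int *submitted = data;

	pg->dirty = 0;		/* "write" the page directly, no transaction */
	(*submitted)++;
	return 0;
}

/* Flush a mapping with three dirty pages; returns how many were submitted. */
static int toy_flush_demo(void)
{
	struct toy_page pages[4] = { {1}, {0}, {1}, {1} };
	int submitted = 0;

	if (toy_write_cache_pages(pages, 4, toy_submit_page, &submitted))
		return -1;
	return submitted;
}
```

In this model only the dirty pages are touched and nothing resembling a transaction is created, which is why routing gfs2_ail1_start_one() through a per-page callback avoids the deadlock.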

So I'm tempted to revert the following two of your commits; luckily
that's independent from the iomap_writepage() removal:

  d3d71901b1ea ("gfs2: remove ->writepage")
  b2b0a5e97855 ("gfs2: stop using generic_writepages in gfs2_ail1_start_one")

I think we could go through iomap_writepages() instead of
generic_writepages() here as well,  but that's for another day.

As far as cgroup writeback goes, this is journal I/O and I don't have
the faintest idea how the accounting for that is even supposed to
work.

> if (need_resched()) {
> blk_finish_plug(plug);
> cond_resched();
> @@ -222,8 +222,7 @@ void gfs2_ail1_flush(struct gfs2_sbd *sdp, struct writeback_control *wbc)
> spin_unlock(&sdp->sd_ail_lock);
> blk_finish_plug(&plug);
> if (ret) {
> -   gfs2_lm(sdp, "gfs2_ail1_start_one (generic_writepages) "
> -   "returned: %d\n", ret);
> +   gfs2_lm(sdp, "gfs2_ail1_start_one returned: %d\n", ret);
> gfs2_withdraw(sdp);
> }
> trace_gfs2_ail_flush(sdp, wbc, 0);
> --
> 2.30.2
>

Thanks,
Andreas


