Re: [Cluster-devel] [PATCH] gfs2: Don't free rgrp clone bitmaps until go_inval

2023-05-11 Thread Steven Whitehouse
Hi,


On Wed, 2023-05-10 at 15:08 -0400, Bob Peterson wrote:
> Before this patch, every time an rgrp was synced (go_sync) the
> clone bitmaps were freed. We do not need to free the bitmaps in many
> common cases. For example when demoting the glock from EXCLUSIVE to
> SHARED. This is especially wasteful in cases where we unlink lots of
> files: the rgrps are transitioned to EX, then back to SH multiple
> times as it looks at the dinode allocation states, then frees them,
> but the clones prevent allocations until the files are evicted.
> Subsequent uses often cause the rgrp glock to be transitioned from
> SH to EX and back again in rapid succession.
> 
> In these cases it's proper to sync the rgrp bitmaps to the storage
> media but wasteful to free the clones, because the very next unlink
> needs to reallocate the clone bitmaps again. So in short, today we
> have:
> 
> 1. SH->EX (for unlink or other)
> 2. Allocate (kmalloc) a clone bitmap.
> 3. Clear the bits in original bitmap.
> 4. Keep original state in the clone bitmap to prevent re-allocation
>    until the last user closes the file.
> 5. EX->SH
> 6. Sync bitmap to storage media.
> 7. Free the clone bitmaps.
> 8. Go to 1.
> 
> This repeated kmalloc -> kfree -> kmalloc -> kfree is a waste of
> time:

It is a waste of time. However, if the clones are kept around for lots
of rgrps, then that is a waste of space. The question is really what
the correct balance is.

Can we not solve the problem at source and prevent the large number of
lock transitions referred to above? If not then it might be a good plan
to document why that is the case,

Steve.

> We only need to free the clone bitmaps when the glock is invalidated
> (i.e. when transitioning the glock to UN or DF so another node's view
> is consistent.) However, we still need to re-sync the clones with the
> real bitmap. This patch allows rgrp bitmaps to stick around until we
> have an invalidate of the glock. So in short:
> 
> 1. SH->EX (for unlink or other)
> 2. Only the first time, allocate (kmalloc) a clone bitmap.
> 3. Free the bits in original bitmap.
> 4. Keep original state in the clone bitmap to prevent re-allocation
>    until the last user closes the file.
> 5. EX->SH
> 6. Sync bitmap to storage media.
> 7. Go to 1.
> 
> Other transitions, like EX->UN still sync and free the clone bitmaps.
> And, of course, transition from SH->EX cannot have dirty buffers, so
> will not have clone bitmaps.
> 
> Signed-off-by: Bob Peterson 
> ---
>  fs/gfs2/glops.c |  4 +++-
>  fs/gfs2/rgrp.c  | 13 +++++++++++++
>  fs/gfs2/rgrp.h  |  1 +
>  3 files changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
> index 01d433ed6ce7..58cf2004548e 100644
> --- a/fs/gfs2/glops.c
> +++ b/fs/gfs2/glops.c
> @@ -205,7 +205,8 @@ static int rgrp_go_sync(struct gfs2_glock *gl)
> error = gfs2_rgrp_metasync(gl);
> if (!error)
> error = gfs2_ail_empty_gl(gl);
> -   gfs2_free_clones(rgd);
> +   if (!test_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags))
> +   gfs2_sync_clones(rgd);
> return error;
>  }
>  
> @@ -229,6 +230,7 @@ static void rgrp_go_inval(struct gfs2_glock *gl, int flags)
>  
> if (!rgd)
> return;
> +   gfs2_free_clones(rgd);
> start = (rgd->rd_addr * bsize) & PAGE_MASK;
> end = PAGE_ALIGN((rgd->rd_addr + rgd->rd_length) * bsize) - 1;
> gfs2_rgrp_brelse(rgd);
> diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
> index 3b9b76e980ad..6e212e0eb74e 100644
> --- a/fs/gfs2/rgrp.c
> +++ b/fs/gfs2/rgrp.c
> @@ -616,6 +616,19 @@ void gfs2_free_clones(struct gfs2_rgrpd *rgd)
> }
>  }
>  
> +void gfs2_sync_clones(struct gfs2_rgrpd *rgd)
> +{
> +   int x;
> +
> +   for (x = 0; x < rgd->rd_length; x++) {
> +   struct gfs2_bitmap *bi = rgd->rd_bits + x;
> +   if (bi->bi_clone)
> +   memcpy(bi->bi_clone + bi->bi_offset,
> +  bi->bi_bh->b_data + bi->bi_offset,
> +  bi->bi_bytes);
> +   }
> +}
> +
>  static void dump_rs(struct seq_file *seq, const struct gfs2_blkreserv *rs,
>     const char *fs_id_buf)
>  {
> diff --git a/fs/gfs2/rgrp.h b/fs/gfs2/rgrp.h
> index 00b30cf893af..254188cf2d7b 100644
> --- a/fs/gfs2/rgrp.h
> +++ b/fs/gfs2/rgrp.h
> @@ -32,6 +32,7 @@ extern void gfs2_clear_rgrpd(struct gfs2_sbd *sdp);
>  extern int gfs2_rindex_update(struct gfs2_sbd *sdp);
>  extern void gfs2_free_clones(struct gfs2_rgrpd *rgd);
>  extern int gfs2_rgrp_go_instantiate(struct gfs2_glock *gl);
> +extern void gfs2_sync_clones(struct gfs2_rgrpd *rgd);
>  extern void gfs2_rgrp_brelse(struct gfs2_rgrpd *rgd);
>  
>  extern struct gfs2_alloc *gfs2_alloc_get(struct gfs2_inode *ip);
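The allocate-once / resync / free-on-invalidate lifecycle the patch describes can be modelled in a few lines of userspace C. The names here (`rgrp_model`, `take_clone`, `sync_clone`, `inval_clone`) are hypothetical stand-ins, not the kernel API; the point is simply that the clone allocation now happens once, a repeat sync is just a memcpy, and the kfree moves to the invalidate path:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy model of an rgrp with one bitmap and an optional clone. */
struct rgrp_model {
	unsigned char bitmap[8];	/* stands in for bi_bh->b_data */
	unsigned char *clone;		/* stands in for bi_clone */
};

/* First EX use: allocate the clone once and snapshot the bitmap. */
static void take_clone(struct rgrp_model *r)
{
	if (!r->clone) {
		r->clone = malloc(sizeof(r->bitmap));
		assert(r->clone);
	}
	memcpy(r->clone, r->bitmap, sizeof(r->bitmap));
}

/* go_sync (new behaviour): resync the clone, do not free it. */
static void sync_clone(struct rgrp_model *r)
{
	if (r->clone)
		memcpy(r->clone, r->bitmap, sizeof(r->bitmap));
}

/* go_inval: now the only place the clone is freed. */
static void inval_clone(struct rgrp_model *r)
{
	free(r->clone);
	r->clone = NULL;
}
```

Under this model, repeated SH->EX->SH cycles touch the allocator only on the first transition, which is the saving the patch is after; Steve's space-vs-time concern is that the clone now stays allocated until an invalidate, however many rgrps are involved.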



Re: [Cluster-devel] [PATCH] gfs2: ignore rindex_update failure in dinode_dealloc

2023-05-05 Thread Steven Whitehouse
On Fri, 2023-05-05 at 08:44 +0100, Andrew Price wrote:
> Hi Bob,
> 
> On 04/05/2023 18:43, Bob Peterson wrote:
> > Before this patch function gfs2_dinode_dealloc would abort if it
> > got a
> > bad return code from gfs2_rindex_update. The problem is that it
> > left the
> > dinode in the unlinked (not free) state, which meant subsequent
> > fsck
> > would clean it up and flag an error. That meant some of our QE
> > tests
> > would fail.
> 
> As I understand it the test is an interrupted rename loop workload
> and 
> gfs2_grow at the same time, and the bad return code is -EINTR, right?
> 
> > The sole purpose of gfs2_rindex_update, in this code path, is to
> > read in
> > any newer rgrps added by gfs2_grow. But since this is a delete
> > operation
> > it won't actually use any of those new rgrps. It can really only
> > twiddle
> > the bits from "Unlinked" to "Free" in an existing rgrp. Therefore
> > the
> > error should not prevent the transition from unlinked to free.
> > 
> > This patch makes gfs2_dinode_dealloc ignore the bad return code and
> > proceed with freeing the dinode so the QE tests will not be tripped
> > up.
> 
> Is it really ok to ignore all potential errors here? I wonder if it 
> should just ignore -EINTR (or whichever error the test produces) so
> that 
> it can still fail well for errors like -EIO.
> 
> Cheers,
> Andy
> 
Perhaps the more important question is why there are errors there in
the first place?

Steve.

> > 
> > Signed-off-by: Bob Peterson 
> > ---
> >   fs/gfs2/super.c | 4 +---
> >   1 file changed, 1 insertion(+), 3 deletions(-)
> > 
> > diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
> > index d3b5c6278be0..1f23d7845123 100644
> > --- a/fs/gfs2/super.c
> > +++ b/fs/gfs2/super.c
> > @@ -1131,9 +1131,7 @@ static int gfs2_dinode_dealloc(struct gfs2_inode *ip)
> > return -EIO;
> > }
> >   
> > -   error = gfs2_rindex_update(sdp);
> > -   if (error)
> > -   return error;
> > +   gfs2_rindex_update(sdp);
> >   
> > error = gfs2_quota_hold(ip, NO_UID_QUOTA_CHANGE,
> > NO_GID_QUOTA_CHANGE);
> > if (error)
> 



Re: [Cluster-devel] question about gfs2 multiple device support

2023-04-24 Thread Steven Whitehouse
Hi,

On Sat, 2023-04-22 at 09:20 +0800, Wang Yugui wrote:
> Hi,
> 
> Is there some work for gfs2 multiple device support?
> 
Do you mean multiple devices generically, or specifically the md
driver?

> if multiple device support,
> 1, No need of RAID 0/1/5/6 support.
>    nvme SSD is fast enough for single thread write.
I'm not sure I understand this. Multiple device support generally means
at least one of the RAID modes.

> 
> 2, can we limit one journal into one device?
The filesystem always assumes a single device with one or more
journals. If multiple devices are used, that is done at the block
layer, which is below the filesystem layer.

> 
> 3, can we just write lock one device, so better write throughput?
Do you have a specific application in mind? Or certain performance
levels that you need to hit? The write performance will depend a lot on
the I/O pattern, and the underlying device performance. We'll need a
bit more detail to be more specific I'm afraid,

Steve.

> 
> Best Regards
> Wang Yugui (wangyu...@e16-tech.com)
> 2023/04/22
> 
> 



Re: [Cluster-devel] [PATCH dlm-tool] dlm_controld: better uevent filtering

2023-01-16 Thread Steven Whitehouse
Hi,

On Fri, 2023-01-13 at 17:43 -0500, Alexander Aring wrote:
> When I did test with dlm_locktorture module I got several log
> messages
> about:
> 
> uevent message has 3 args: add@/module/dlm_locktorture
> uevent message has 3 args: remove@/module/dlm_locktorture
> 
> which are not expected and cannot be parsed by dlm_controld's
> process_uevent() function, because of a mismatch in argument counts.
> Debugging it more, I figured out that those uevents are for
> loading/unloading the dlm_locktorture module and there are uevents
> for
> loading and unloading modules which have nothing to do with dlm
> lockspace
> uevent handling.
> 
> The current filter works as:
> 
> if (!strstr(buf, "dlm"))
> 
I think that is the problem. If you want to filter out all events
except those sent by the DLM module, then you need to look at the
variables sent along with the request. Otherwise whatever string you
look for here might appear in some other random request from a
different subsystem. The event variables are much easier to parse than
the actual event string...

See a similar example in decode_uevent(), etc here:

https://github.com/andyprice/gfs2-utils/blob/91c3e9a69ed70d3d522f5b47015da5e5868722ec/group/gfs_controld/main.c

There are probably nicer ways of doing that than what I did there, but
it makes it easier to get at the variables that are passed with the
notification than to try and parse the first line of the response,

Steve.


> for matching the dlm joining/leaving uevent string which looks like:
> 
> offline@/kernel/dlm/locktorture
> 
> to avoid matching with other uevent which has somehow the string
> "dlm"
> in it, we switch to the match "/dlm/" which should match only to dlm
> uevent system events. Uevent uses itself '/' as a separator in the
> hope
> that uevents cannot put a '/' as application data for an event.
> ---
>  dlm_controld/main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/dlm_controld/main.c b/dlm_controld/main.c
> index 7cf6348e..40689c5c 100644
> --- a/dlm_controld/main.c
> +++ b/dlm_controld/main.c
> @@ -704,7 +704,7 @@ static void process_uevent(int ci)
> return;
> }
>  
> -   if (!strstr(buf, "dlm"))
> +   if (!strstr(buf, "/dlm/"))
> return;
>  
> log_debug("uevent: %s", buf);
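The difference between the two filters is easy to check in isolation, using the example strings from the commit message itself:

```c
#include <assert.h>
#include <string.h>

/* Old filter: matches any uevent buffer that merely contains "dlm". */
static int old_filter(const char *buf)
{
	return strstr(buf, "dlm") != NULL;
}

/* New filter: matches only paths with a "/dlm/" component. */
static int new_filter(const char *buf)
{
	return strstr(buf, "/dlm/") != NULL;
}
```

As Steve notes, even "/dlm/" is still a string match on the first line; parsing the KEY=VALUE variables that accompany the uevent would be the more robust filter.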



Re: [Cluster-devel] BUG? racy access to i_diskflags

2022-06-15 Thread Steven Whitehouse
Hi,

On Tue, 2010-08-17 at 13:28 +0900, 홍신 shin hong wrote:
> Hi. I am reporting an issue suspected as racy
> while I read inode_go_lock() at gfs2/glops.c in Linux 2.6.35.
> 
> Since I do not have much background on GFS2, I am not certain
> whether the issue is serious or not. But please examine the issue
> and let me know your opinion.
> 
> It seems that inode_go_lock() accesses gfs2_inode's i_diskflags field
> without any lock held.
> 
> But, as do_gfs2_set_flags() updates gfs2_inode's i_diskflags,
> concurrent executions with inode_go_lock() might result in
> race conditions.
> 
> Could you examine the issue please?
> 
> Sincerely
> Shin Hong

Yes, inode_go_lock() does examine those flags, but the layers above
that call should ensure that it is single threaded in effect. The
setting of flags requires that a glock is held, and inode_go_lock() would be
called as part of the glock acquisition, and it is single threaded even
if a shared lock is requested, so it will have completed before
do_gfs2_set_flags() is called. Or perhaps I should say, it should have
completed before then unless you have found a code path where that is
not the case?

Steve.




Re: [Cluster-devel] message in syslog: shrink_slab: gfs2_glock_shrink_scan+0x0/0x240 [gfs2] negative objects to delete nr=xxxxxxxxxxxx

2021-12-08 Thread Steven Whitehouse
Hi,

On Wed, 2021-12-08 at 17:50 +0100, Lentes, Bernd wrote:
> Hi,
> 
> i hope this is the right place for asking about GFS2.
> Yesterday one of my two nodes HA-cluster got slower and slower, until
> it was fenced.
> In /var/log/messages i found this message repeated often before the
> system got slower:
> shrink_slab: gfs2_glock_shrink_scan+0x0/0x240 [gfs2] negative objects
> to delete nr=
> What does it mean ? Is it a problem ?
> 
> My Setup:
> SuSE SLES 12 SP5
> Kernel 4.12.14-122.46-default
> Pacemaker 1.1.23
> corosync 2.3.6-9.13.1
> gfs2-utils-3.1.6-1.101.x86_64
> 
> Thanks.
> 
> Bernd
> 

That message in itself is a consequence of (I believe) a race relating
to shrinking of that slab cache, but it is harmless. However the fact
that you've seen it suggests that the system might be short on free
memory. So I would check to see if a process is hogging memory as that
would explain the slowness too,

Steve.




Re: [Cluster-devel] [PATCH v6 10/19] gfs2: Introduce flag for glock holder auto-demotion

2021-08-24 Thread Steven Whitehouse
Hi,

On Mon, 2021-08-23 at 17:18 +0200, Andreas Gruenbacher wrote:
> On Mon, Aug 23, 2021 at 10:14 AM Steven Whitehouse <
> swhit...@redhat.com> wrote:
> > On Fri, 2021-08-20 at 17:22 +0200, Andreas Gruenbacher wrote:
> > > On Fri, Aug 20, 2021 at 3:11 PM Bob Peterson wrote:
> > [snip]
> > > > You can almost think of this as a performance enhancement. This
> > > > concept
> > > > allows a process to hold a glock for much longer periods of
> > > > time,
> > > > at a
> > > > lower priority, for example, when gfs2_file_read_iter needs to
> > > > hold
> > > > the
> > > > glock for very long-running iterative reads.
> > > 
> > > Consider a process that allocates a somewhat large buffer and
> > > reads
> > > into it in chunks that are not page aligned. The buffer initially
> > > won't be faulted in, so we fault in the first chunk and write
> > > into
> > > it.
> > > Then, when reading the second chunk, we find that the first page
> > > of
> > > the second chunk is already present. We fill it, set the
> > > HIF_MAY_DEMOTE flag, fault in more pages, and clear the
> > > HIF_MAY_DEMOTE
> > > flag. If we then still have the glock (which is very likely), we
> > > resume the read. Otherwise, we return a short result.
> > > 
> > > Thanks,
> > > Andreas
> > > 
> > 
> > If the goal here is just to allow the glock to be held for a longer
> > period of time, but with occasional interruptions to prevent
> > starvation, then we have a potential model for this. There is
> > cond_resched_lock() which does this for spin locks.
> 
> This isn't an appropriate model for what I'm trying to achieve here.
> In the cond_resched case, we know at the time of the cond_resched
> call
> whether or not we want to schedule. If we do, we want to drop the
> spin
> lock, schedule, and then re-acquire the spin lock. In the case we're
> looking at here, we want to fault in user pages. There is no way of
> knowing beforehand if the glock we're currently holding will have to
> be dropped to achieve that. In fact, it will almost never have to be
> dropped. But if it does, we need to drop it straight away to allow
> the
> conflicting locking request to succeed.
> 
> Have a look at how the patch queue uses gfs2_holder_allow_demote()
> and
> gfs2_holder_disallow_demote():
> 
> https://listman.redhat.com/archives/cluster-devel/2021-August/msg00128.html
> https://listman.redhat.com/archives/cluster-devel/2021-August/msg00134.html
> 
> Thanks,
> Andreas
> 

Ah, now I see! Apologies if I've misunderstood the intent here,
initially. It is complicated and not so obvious - at least to me!

You've added a lot of context to this patch in your various replies on
this thread. The original patch description explains how the feature is
implemented, but doesn't really touch on why - that is left to the
other patches that you pointed to above. A short paragraph or two on
the "why" side of things added to the patch description would be
helpful I think.

Your message on Friday (20 Aug 2021 15:17:41 +0200 (20/08/21 14:17:41))
has a good explanation in the second part of it, which with what you've
said above (or similar) would be a good basis.

Apologies again for not understanding what is going on,

Steve.




Re: [Cluster-devel] [PATCH v6 10/19] gfs2: Introduce flag for glock holder auto-demotion

2021-08-23 Thread Steven Whitehouse
On Fri, 2021-08-20 at 17:22 +0200, Andreas Gruenbacher wrote:
> On Fri, Aug 20, 2021 at 3:11 PM Bob Peterson 
> wrote:
> > 
[snip]
> > 
> > You can almost think of this as a performance enhancement. This
> > concept
> > allows a process to hold a glock for much longer periods of time,
> > at a
> > lower priority, for example, when gfs2_file_read_iter needs to hold
> > the
> > glock for very long-running iterative reads.
> 
> Consider a process that allocates a somewhat large buffer and reads
> into it in chunks that are not page aligned. The buffer initially
> won't be faulted in, so we fault in the first chunk and write into
> it.
> Then, when reading the second chunk, we find that the first page of
> the second chunk is already present. We fill it, set the
> HIF_MAY_DEMOTE flag, fault in more pages, and clear the
> HIF_MAY_DEMOTE
> flag. If we then still have the glock (which is very likely), we
> resume the read. Otherwise, we return a short result.
> 
> Thanks,
> Andreas
> 

If the goal here is just to allow the glock to be held for a longer
period of time, but with occasional interruptions to prevent
starvation, then we have a potential model for this. There is
cond_resched_lock() which does this for spin locks. So perhaps we might
do something similar:

/**
 * gfs2_glock_cond_regain - Conditionally drop and regain glock
 * @gl: The glock
 * @gh: A granted holder for the glock
 *
 * If there is a pending demote request for this glock, drop and 
 * requeue a lock request for this glock. If there is no pending
 * demote request, this is a no-op. In either case the glock is
 * held on both entry and exit.
 *
 * Returns: 0 if no pending demote, 1 if lock dropped and regained
 */
int gfs2_glock_cond_regain(struct gfs2_glock *gl, struct gfs2_holder
*gh);

That seems more easily understood, and clearly documents places where
the lock may be dropped and regained. I think that the implementation
should be simpler and cleaner, compared with the current proposed
patch. There are only two bit flags related to pending demotes, for
example, so the check should be trivial.

It may need a few changes depending on the exact circumstances, but
hopefully that illustrates the concept,

Steve.
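The proposed helper can be modelled in userspace with a plain flag standing in for the glock's pending-demote bits. All names here are sketches following the doc comment above, not existing kernel functions:

```c
#include <assert.h>

/* Minimal stand-in for a glock: a demote-pending flag and held state. */
struct toy_glock {
	int demote_pending;
	int held;
};

static void toy_dq(struct toy_glock *gl)
{
	gl->held = 0;
	gl->demote_pending = 0;	/* dropping lets the demote through */
}

static void toy_nq(struct toy_glock *gl)
{
	gl->held = 1;
}

/*
 * Model of the proposed gfs2_glock_cond_regain(): if a demote request
 * is pending, drop and re-queue the lock and return 1; otherwise it is
 * a no-op returning 0. The lock is held on both entry and exit.
 */
static int toy_cond_regain(struct toy_glock *gl)
{
	if (!gl->demote_pending)
		return 0;
	toy_dq(gl);	/* let the pending demote be serviced */
	toy_nq(gl);	/* then re-acquire before returning */
	return 1;
}
```

As with cond_resched_lock(), the value of the pattern is that every point where the lock may be lost and regained is explicit in the caller.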





Re: [Cluster-devel] [PATCH v6 10/19] gfs2: Introduce flag for glock holder auto-demotion

2021-08-20 Thread Steven Whitehouse
Hi,

On Fri, 2021-08-20 at 15:17 +0200, Andreas Gruenbacher wrote:
> On Fri, Aug 20, 2021 at 11:35 AM Steven Whitehouse <
> swhit...@redhat.com> wrote:
> > On Thu, 2021-08-19 at 21:40 +0200, Andreas Gruenbacher wrote:
> > > From: Bob Peterson 
> > > 
> > > This patch introduces a new HIF_MAY_DEMOTE flag and
> > > infrastructure
> > > that will allow glocks to be demoted automatically on locking
> > > conflicts.
> > > When a locking request comes in that isn't compatible with the
> > > locking
> > > state of a holder and that holder has the HIF_MAY_DEMOTE flag
> > > set, the
> > > holder will be demoted automatically before the incoming locking
> > > request
> > > is granted.
> > 
> > I'm not sure I understand what is going on here. When there are
> > locking
> > conflicts we generate call backs and those result in glock
> > demotion.
> > There is no need for a flag to indicate that I think, since it is
> > the
> > default behaviour anyway. Or perhaps the explanation is just a bit
> > confusing...
> 
> When a glock has active holders (with the HIF_HOLDER flag set), the
> glock won't be demoted to a state incompatible with any of those
> holders.
> 
Ok, that is a much clearer explanation of what the patch does. Active
holders have always prevented demotions previously.

> > > Processes that allow a glock holder to be taken away indicate
> > > this by
> > > calling gfs2_holder_allow_demote().  When they need the glock
> > > again,
> > > they call gfs2_holder_disallow_demote() and then they check if
> > > the
> > > holder is still queued: if it is, they're still holding the
> > > glock; if
> > > it isn't, they need to re-acquire the glock.
> > > 
> > > This allows processes to hang on to locks that could become part
> > > of a
> > > cyclic locking dependency.  The locks will be given up when a
> > > (rare)
> > > conflicting locking request occurs, and don't need to be given up
> > > prematurely.
> > 
> > This seems backwards to me. We already have the glock layer cache
> > the
> > locks until they are required by another node. We also have the min
> > hold time to make sure that we don't bounce locks too much. So what
> > is
> > the problem that you are trying to solve here I wonder?
> 
> This solves the problem of faulting in pages during read and write
> operations: on the one hand, we want to hold the inode glock across
> those operations. On the other hand, those operations may fault in
> pages, which may require taking the same or other inode glocks,
> directly or indirectly, which can deadlock.
> 
> So before we fault in pages, we indicate with
> gfs2_holder_allow_demote(gh) that we can cope if the glock is taken
> away from us. After faulting in the pages, we indicate with
> gfs2_holder_disallow_demote(gh) that we now actually need the glock
> again. At that point, we either still have the glock (i.e., the
> holder
> is still queued and it has the HIF_HOLDER flag set), or we don't.
> 
> The different kinds of read and write operations differ in how they
> handle the latter case:
> 
>  * When a buffered read or write loses the inode glock, it returns a
>    short result. This prevents torn writes and reading things that
>    have never existed on disk in that form.
> 
>  * When a direct read or write loses the inode glock, it re-acquires
>    it before resuming the operation. Direct I/O is not expected to
>    return partial results and doesn't provide any kind of
>    synchronization among processes.
> 
> We could solve this kind of problem in other ways, for example, by
> keeping a glock generation number, dropping the glock before faulting
> in pages, re-acquiring it afterwards, and checking if the generation
> number has changed. This would still be an additional piece of glock
> infrastructure, but more heavyweight than the HIF_MAY_DEMOTE flag
> which uses the existing glock holder infrastructure.

This is working towards the "why" but could probably be summarised a
bit more. We always used to manage to avoid holding fs locks when
copying to/from userspace to avoid these complications. If that is no
longer possible then it would be good to document what the new
expectations are somewhere suitable in Documentation/filesystems/...
just so we make sure it is clear what the new system is, and everyone
will be on the same page,

Steve.
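The holder pattern Andreas describes reduces to a small state machine: mark the holder demotable, do the work that may fault, unmark it, then check whether the holder survived. A userspace sketch of that window (toy types and the remote-demote helper are hypothetical; only the allow/disallow names follow the quoted API):

```c
#include <assert.h>

struct toy_holder {
	int queued;		/* still holding the glock? */
	int may_demote;		/* HIF_MAY_DEMOTE analogue */
};

static void holder_allow_demote(struct toy_holder *gh)
{
	gh->may_demote = 1;
}

static void holder_disallow_demote(struct toy_holder *gh)
{
	gh->may_demote = 0;
}

/* A conflicting request from elsewhere: it only succeeds while the
 * holder has opted in to being demoted. Hypothetical helper. */
static int try_remote_demote(struct toy_holder *gh)
{
	if (!gh->may_demote)
		return 0;
	gh->queued = 0;
	return 1;
}

/* The read/write pattern: open the window, fault in pages (where a
 * conflict may or may not arrive), close the window, then check.
 * Returns 1 if we kept the glock throughout. */
static int do_chunk(struct toy_holder *gh, int conflict)
{
	holder_allow_demote(gh);
	if (conflict)			/* page-fault window */
		try_remote_demote(gh);
	holder_disallow_demote(gh);
	return gh->queued;		/* else: short result or re-acquire */
}
```

In the common case the conflict never arrives and the holder keeps the glock for the whole operation; only the rare conflicting request pays the cost.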





Re: [Cluster-devel] [PATCH v6 10/19] gfs2: Introduce flag for glock holder auto-demotion

2021-08-20 Thread Steven Whitehouse
Hi,

On Fri, 2021-08-20 at 08:11 -0500, Bob Peterson wrote:
> On 8/20/21 4:35 AM, Steven Whitehouse wrote:
> > Hi,
> > 
> > On Thu, 2021-08-19 at 21:40 +0200, Andreas Gruenbacher wrote:
> > > From: Bob Peterson 
> > > 
> > > This patch introduces a new HIF_MAY_DEMOTE flag and
> > > infrastructure
> > > that
> > > will allow glocks to be demoted automatically on locking
> > > conflicts.
> > > When a locking request comes in that isn't compatible with the
> > > locking
> > > state of a holder and that holder has the HIF_MAY_DEMOTE flag
> > > set,
> > > the
> > > holder will be demoted automatically before the incoming locking
> > > request
> > > is granted.
> > > 
> > I'm not sure I understand what is going on here. When there are
> > locking
> > conflicts we generate call backs and those result in glock
> > demotion.
> > There is no need for a flag to indicate that I think, since it is
> > the
> > default behaviour anyway. Or perhaps the explanation is just a bit
> > confusing...
> 
> I agree that the whole concept and explanation are confusing.
> Andreas 
> and I went through several heated arguments about the semantics, 
> comments, patch descriptions, etc. We played around with many
> different 
> flag name ideas, etc. We did not agree on the best way to describe
> the 
> whole concept. He didn't like my explanation and I didn't like his.
> So 
> yes, it is confusing.
> 
That seems to be a good reason to take a step back and look at this a
bit closer. If we are finding this confusing, then someone else looking
at it at a future date, who may not be steeped in GFS2 knowledge is
likely to find it almost impossible.

So at least the description needs some work here I think, to make it
much clearer what the overall aim is. It would be good to start with a
statement of the problem that it is trying to solve which Andreas has
hinted at in his reply just now,

Steve.




Re: [Cluster-devel] [PATCH v6 10/19] gfs2: Introduce flag for glock holder auto-demotion

2021-08-20 Thread Steven Whitehouse
Hi,

On Thu, 2021-08-19 at 21:40 +0200, Andreas Gruenbacher wrote:
> From: Bob Peterson 
> 
> This patch introduces a new HIF_MAY_DEMOTE flag and infrastructure
> that
> will allow glocks to be demoted automatically on locking conflicts.
> When a locking request comes in that isn't compatible with the
> locking
> state of a holder and that holder has the HIF_MAY_DEMOTE flag set,
> the
> holder will be demoted automatically before the incoming locking
> request
> is granted.
> 
I'm not sure I understand what is going on here. When there are locking
conflicts we generate call backs and those result in glock demotion.
There is no need for a flag to indicate that I think, since it is the
default behaviour anyway. Or perhaps the explanation is just a bit
confusing...

> Processes that allow a glock holder to be taken away indicate this by
> calling gfs2_holder_allow_demote().  When they need the glock again,
> they call gfs2_holder_disallow_demote() and then they check if the
> holder is still queued: if it is, they're still holding the glock; if
> it
> isn't, they need to re-acquire the glock.
> 
> This allows processes to hang on to locks that could become part of a
> cyclic locking dependency.  The locks will be given up when a (rare)
> conflicting locking request occurs, and don't need to be given up
> prematurely.
This seems backwards to me. We already have the glock layer cache the
locks until they are required by another node. We also have the min
hold time to make sure that we don't bounce locks too much. So what is
the problem that you are trying to solve here I wonder?

> 
> Signed-off-by: Bob Peterson 
> ---
>  fs/gfs2/glock.c  | 221 +++
> 
>  fs/gfs2/glock.h  |  20 +
>  fs/gfs2/incore.h |   1 +
>  3 files changed, 206 insertions(+), 36 deletions(-)
> 
> diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
> index f24db2ececfb..d1b06a09ce2f 100644
> --- a/fs/gfs2/glock.c
> +++ b/fs/gfs2/glock.c
> @@ -58,6 +58,7 @@ struct gfs2_glock_iter {
>  typedef void (*glock_examiner) (struct gfs2_glock * gl);
>  
>  static void do_xmote(struct gfs2_glock *gl, struct gfs2_holder *gh,
>       unsigned int target);
> +static void __gfs2_glock_dq(struct gfs2_holder *gh);
>  
>  static struct dentry *gfs2_root;
>  static struct workqueue_struct *glock_workqueue;
> @@ -197,6 +198,12 @@ static int demote_ok(const struct gfs2_glock *gl)
>  
>   if (gl->gl_state == LM_ST_UNLOCKED)
>   return 0;
> + /*
> +  * Note that demote_ok is used for the lru process of disposing of
> +  * glocks. For this purpose, we don't care if the glock's holders
> +  * have the HIF_MAY_DEMOTE flag set or not. If someone is using
> +  * them, don't demote.
> +  */
>   if (!list_empty(&gl->gl_holders))
>   return 0;
>   if (glops->go_demote_ok)
> @@ -379,7 +386,7 @@ static void do_error(struct gfs2_glock *gl, const int ret)
>   struct gfs2_holder *gh, *tmp;
>  
>   list_for_each_entry_safe(gh, tmp, &gl->gl_holders, gh_list) {
> - if (test_bit(HIF_HOLDER, &gh->gh_iflags))
> + if (!test_bit(HIF_WAIT, &gh->gh_iflags))
>   continue;
>   if (ret & LM_OUT_ERROR)
>   gh->gh_error = -EIO;
> @@ -393,6 +400,40 @@ static void do_error(struct gfs2_glock *gl, const int ret)
>   }
>  }
>  
> +/**
> + * demote_incompat_holders - demote incompatible demoteable holders
> + * @gl: the glock we want to promote
> + * @new_gh: the new holder to be promoted
> + */
> +static void demote_incompat_holders(struct gfs2_glock *gl,
> + struct gfs2_holder *new_gh)
> +{
> + struct gfs2_holder *gh;
> +
> + /*
> +  * Demote incompatible holders before we make ourselves eligible.
> +  * (This holder may or may not allow auto-demoting, but we don't want
> +  * to demote the new holder before it's even granted.)
> +  */
> + list_for_each_entry(gh, &gl->gl_holders, gh_list) {
> + /*
> +  * Since holders are at the front of the list, we stop when we
> +  * find the first non-holder.
> +  */
> + if (!test_bit(HIF_HOLDER, &gh->gh_iflags))
> + return;
> + if (test_bit(HIF_MAY_DEMOTE, &gh->gh_iflags) &&
> +     !may_grant(gl, new_gh, gh)) {
> + /*
> +  * We should not recurse into do_promote because
> +  * __gfs2_glock_dq only calls handle_callback,
> +  * gfs2_glock_add_to_lru and __gfs2_glock_queue_work.
> +  */
> + __gfs2_glock_dq(gh);
> + }
> + }
> +}
> +
>  /**
>   * find_first_holder - find the first "holder" gh
>   * @gl: the glock
> @@ -411,6 +452,26 @@ static inline struct gfs2_holder *find_first_holder(const struct gfs2_glock *gl)
>   return NULL;
>  }
>  
> +/**
> + * find_first_strong_holder - find 

Re: [Cluster-devel] [GFS2 PATCH 10/10] gfs2: replace sd_aspace with sd_inode

2021-07-28 Thread Steven Whitehouse
Hi,

On Wed, 2021-07-28 at 08:50 +0200, Andreas Gruenbacher wrote:
> On Tue, Jul 13, 2021 at 9:34 PM Bob Peterson 
> wrote:
> > On 7/13/21 1:26 PM, Steven Whitehouse wrote:
> > 
> > Hi,
> > 
> > On Tue, 2021-07-13 at 13:09 -0500, Bob Peterson wrote:
> > 
> > Before this patch, gfs2 kept its own address space for rgrps, but
> > this
> > caused a lockdep problem because vfs assumes a 1:1 relationship
> > between
> > address spaces and their inode. One problematic area is this:
> > 
> > I don't think that is the case. The reason that the address space
> > is a
> > separate structure in the first place is to allow them to exist
> > without
> > an inode. Maybe that has changed, but we should see why that is, in
> > that case rather than just making this change immediately.
> > 
> > I can't see any reason why if we have to have an inode here that it
> > needs to be hashed... what would need to look it up via the hashes?
> > 
> > Steve.
> > 
> > Hi,
> > 
> > The actual use case, which is easily demonstrated with lockdep, is
> > given
> > in the patch text shortly after where you placed your comment. This
> > goes
> > back to this discussion from April 2018:
> > 
> > https://listman.redhat.com/archives/cluster-devel/2018-April/msg00017.html
> > 
> > in which Jan Kara pointed out that:
> > 
> > "The problem is we really do expect mapping->host->i_mapping ==
> > mapping as
> > we pass mapping and inode interchangeably in the mm code. The
> > address_space
> > and inodes are separate structures because you can have many inodes
> > pointing to one address space (block devices). However it is not
> > allowed
> > for several address_spaces to point to one inode!"
> 
> This is fundamentally at odds with how we manage inodes: we have
> inode->i_mapping which is the logical address space of the inode, and
> we have gfs2_glock2aspace(GFS2_I(inode)->i_gl) which is the metadata
> address space of the inode. The most important function of the
> metadata address space is to remove the inode's metadata from memory
> by truncating the metadata address space (inode_go_inval). We need
> that when moving an inode to another node. I don't have the faintest
> idea how we could otherwise achieve that in a somewhat efficient way.
> 
> Thanks,
> Andreas
> 

In addition, I'm fairly sure also that we were told to use this
solution (i.e. a separate address space) back in the day because it was
expected that they didn't have a 1:1 relationship with inodes. I don't
think we'd have used that solution otherwise. I've not had enough time
to go digging back in my email to check, but it might be worth looking
to see when we introduced the use of the second address space (removing
a whole additional inode structure) and any discussions around that
change,

Steve.





Re: [Cluster-devel] [syzbot] WARNING in __set_page_dirty

2021-07-22 Thread Steven Whitehouse
Hi,

On Thu, 2021-07-22 at 08:16 -0500, Bob Peterson wrote:
> On 7/21/21 4:58 PM, Andrew Morton wrote:
> > (cc gfs2 maintainers)
> > 
> > On Tue, 20 Jul 2021 19:07:25 -0700 syzbot <
> > syzbot+0d5b462a6f0744799...@syzkaller.appspotmail.com> wrote:
> > 
> > > Hello,
> > > 
> > > syzbot found the following issue on:
> > > 
> > > HEAD commit:d936eb238744 Revert "Makefile: Enable -Wimplicit-
> > > fallthrou..
> > > git tree:   upstream
> > > console output: 
> > > https://syzkaller.appspot.com/x/log.txt?x=1512834a30
> > > kernel config:  
> > > https://syzkaller.appspot.com/x/.config?x=f1b998c1afc13578
> > > dashboard link: 
> > > https://syzkaller.appspot.com/bug?extid=0d5b462a6f07447991b3
> > > userspace arch: i386
> > > 
> > > Unfortunately, I don't have any reproducer for this issue yet.
> > > 
> > > IMPORTANT: if you fix the issue, please add the following tag to
> > > the commit:
> > > Reported-by: 
> > > syzbot+0d5b462a6f0744799...@syzkaller.appspotmail.com
> > > 
> > > [ cut here ]
> > > WARNING: CPU: 0 PID: 8696 at include/linux/backing-dev.h:283
> > > inode_to_wb include/linux/backing-dev.h:283 [inline]
> > > WARNING: CPU: 0 PID: 8696 at include/linux/backing-dev.h:283
> > > account_page_dirtied mm/page-writeback.c:2435 [inline]
> > > WARNING: CPU: 0 PID: 8696 at include/linux/backing-dev.h:283
> > > __set_page_dirty+0xace/0x1070 mm/page-writeback.c:2483
> >  
> 
> Okay, sorry for the brain fart earlier. After taking a better look, I
> know exactly what this is.
> This goes back to this discussion from April 2018:
> 
> https://listman.redhat.com/archives/cluster-devel/2018-April/msg00017.html
> 
> in which Jan Kara pointed out that:
> 
> "The problem is we really do expect mapping->host->i_mapping ==
> mapping as
> we pass mapping and inode interchangeably in the mm code. The
> address_space
> and inodes are separate structures because you can have many inodes
> pointing to one address space (block devices). However it is not
> allowed
> for several address_spaces to point to one inode!"
> The problem is that GFS2 keeps separate address spaces for its
> glocks, and they
> don't correspond 1:1 to any inode. So mapping->host is not really an
> inode for these,
> and there's really almost no relation between the glock->mapping and
> the inode it
> points to.
> 
> Even in the recent past, GFS2 did this for all metadata for both kinds
> of media-backed glocks: resource groups and inodes.
> 
> I recently posted a patch set to cluster-devel ("gfs2: replace
> sd_aspace with sd_inode" -
> https://listman.redhat.com/archives/cluster-devel/2021-July/msg00066.html) in
> which
> I fixed half the problem, which is the resource group case.
> 
> Unfortunately, for inode glocks it gets a lot trickier and I haven't
> found a proper solution.
> But as I said, it's been a known issue for several years now. The
> errors only appear
> if LOCKDEP is turned on. It would be ideal if address spaces were
> treated as fully
> independent from their inodes, but no one seemed to jump on that
> idea, nor even try to
> explain why we make the assumptions Jan Kara pointed out.
> 
> In the meantime, I'll keep looking for a more proper solution. This
> won't be an easy
> thing to fix or I would have already fixed it.
> 
> Regards,
> 
> Bob Peterson
> 
> 

The reason for having address_spaces pointed to by many inodes is to
allow for stackable filesystems so that you can make the file content
available on the upper layer by just pointing the upper layer inode at
the lower layer address_space. That is presumably what Jan is thinking
of.

This however seems to be an issue with a page flag, so it isn't clear
why that would relate to the address_space? If the page is metadata
which would be the most usual case for something being unpinned, then
that page should definitely be up to date.

Looking back at the earlier rgrp fix mentioned above, the fix is not
unreasonable since there only needs to be a single inode to contain all
the rgrps. For the inode metadata that is not the case, there is a one
to one mapping between inodes and metadata address_spaces, and if the
working assumption is that multiple address_spaces per inode is not
allowed, then I think that has changed over time. I'm pretty sure that
I had checked the expectations way back when we adopted this solution,
and that there were no issues with the multiple address_spaces per
inode case. We definitely don't want to go back to adding an additional
struct inode structure (which does nothing except take up space!) to
each "real" inode in cache, because it is a big overhead in case of a
filesystem with many small files.

Still if this is only a lockdep issue, then we likely have some time to
figure out a good long term solution,

Steve.





Re: [Cluster-devel] [GFS2 PATCH 08/10] gfs2: New log flush watchdog

2021-07-14 Thread Steven Whitehouse
Hi,

On Tue, 2021-07-13 at 15:03 -0500, Bob Peterson wrote:
> On 7/13/21 1:41 PM, Steven Whitehouse wrote:
> > Hi,
> > 
> > On Tue, 2021-07-13 at 13:09 -0500, Bob Peterson wrote:
> > > This patch adds a new watchdog whose sole purpose is to complain
> > > when
> > > gfs2_log_flush operations are taking too long.
> > > 
> > This one is a bit confusing. It says that it is to check if the log
> > flush is taking too long, but it appears to set a timeout based on
> > the
> > amount of dirty data that will be written back, so it isn't really
> > the
> > log flush, but the writeback and log flush that is being timed I
> > think?
> > 
> > It also looks like the timeout is entirely dependent upon the
> > number of
> > dirty pages too, and not on the log flush size. I wonder about the
> > performance impact of traversing the list of dirty pages too. If
> > that
> > can be avoided it should make the implementation rather more
> > efficient,
> > 
> > Steve.
> 
> Well, perhaps my patch description was misleading. The watchdog is
> meant
> to time how long function gfs2_log_flush() holds the
> sd_log_flush_lock rwsem
> in write mode. 

I think it needs looking at a bit more carefully. That lock is really
an implementation detail, and one that we expect will change in the not
too distant future as the log code improves.

As you say the description is confusing, and the log messages even more
so, since they give a page count that refers to the ordered writes and
not to the log writes at all. 

Also, we have tools already that should be able to diagnose this issue
(slow I/O) such as blktrace, although I know that is more tricky to
catch after the fact. So I think we need to look at this again to see
if there is a better solution.


> That includes writing the ordered writes as well as the metadata. The
> metadata portion is almost always outweighed 100:1 (or more) by the
> ordered writes. The length of time it will take to do the ordered
> writes should be based on the number of dirty pages. I don't think
> running the ordered writes list will impact performance too badly, and
> that's one reason I chose to do it before we actually take the
> sd_log_flush_lock. It does, however, hold the sd_ordered_lock lock
> during its count. Still, it's small compared to counting the actual
> pages or something, and modern cpus can run lists very quickly.
> 
What limits do we have on the ordered write list length? I seem to
remember we had addressed that issue at some point in the past.
Generally though iterating over what might be quite a long list is not
a good plan from a performance perspective,

Steve.

> My initial version didn't count at all; it just used an arbitrary
> number of
> seconds any log flush _ought_ to take. However, Barry pointed out
> that older
> hardware can be slow when driven to extremes and I didn't want false
> positives.
> 
> I also thought about keeping an interactive count whenever pages are
> dirtied, or when inodes are added to the ordered writes list, but
> that 
> seemed
> like overkill. But it is a reasonable alternative.
> 
> The timeout value is also somewhat arbitrary, but I'm open to
> suggestions.
> 
> In my case, faulty hardware caused log flushes to take a very long
> time, which caused many transactions and glocks to be blocked a long
> time and eventually hit the 120-second kernel watchdog, which gives
> the impression glocks are being held a very long time (which they are)
> for some unknown reason.
> 
> This can manifest on many (often non-faulty) nodes, since glocks can
> be tied up indefinitely waiting for a process who has it locked EX but
> now must wait until it can acquire the transaction lock, which is
> blocked on the log flush:
> My goal was to make hardware problems (like faulty HBAs and fibre
> switches)
> NOT seem like cascading gfs2 file system problems or slowdowns.
> 
> These messages will hopefully prompt operations people to investigate
> the
> cause of the slowdown.
> 
> I tested this patch with faulty hardware, and it yielded messages
> like:
> 
> [ 2127.027527] gfs2: fsid=bobsrhel8:test.0: log flush pid 256206 took > 20 seconds to write 98 pages.
> [ 2348.979535] gfs2: fsid=bobsrhel8:test.0: log flush pid 256681 took > 1 seconds to write 1 pages.
> [ 3643.571505] gfs2: fsid=bobsrhel8:test.0: log flush pid 262385 took > 4 seconds to write 16 pages.
> 
> Regards,
> 
> Bob Peterson
> 
> 



Re: [Cluster-devel] [GFS2 PATCH 08/10] gfs2: New log flush watchdog

2021-07-13 Thread Steven Whitehouse
Hi,

On Tue, 2021-07-13 at 13:09 -0500, Bob Peterson wrote:
> This patch adds a new watchdog whose sole purpose is to complain when
> gfs2_log_flush operations are taking too long.
> 
This one is a bit confusing. It says that it is to check if the log
flush is taking too long, but it appears to set a timeout based on the
amount of dirty data that will be written back, so it isn't really the
log flush, but the writeback and log flush that is being timed I think?

It also looks like the timeout is entirely dependent upon the number of
dirty pages too, and not on the log flush size. I wonder about the
performance impact of traversing the list of dirty pages too. If that
can be avoided it should make the implementation rather more efficient,

Steve.

> Signed-off-by: Bob Peterson 
> ---
>  fs/gfs2/incore.h |  6 ++
>  fs/gfs2/log.c| 47
> 
>  fs/gfs2/log.h|  1 +
>  fs/gfs2/main.c   |  8 
>  fs/gfs2/ops_fstype.c |  2 ++
>  fs/gfs2/sys.c|  6 --
>  6 files changed, 68 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
> index 6f31a067a5f2..566c0053b7c5 100644
> --- a/fs/gfs2/incore.h
> +++ b/fs/gfs2/incore.h
> @@ -683,6 +683,8 @@ struct local_statfs_inode {
>   unsigned int si_jid; /* journal id this statfs inode corresponds to */
>  };
>  
> +#define GFS2_LOG_FLUSH_TIMEOUT (HZ / 10) /* arbitrary: 1/10 second
> per page */
> +
>  struct gfs2_sbd {
>   struct super_block *sd_vfs;
>   struct gfs2_pcpu_lkstats __percpu *sd_lkstats;
> @@ -849,6 +851,10 @@ struct gfs2_sbd {
>   unsigned long sd_last_warning;
>   struct dentry *debugfs_dir;/* debugfs directory */
>   unsigned long sd_glock_dqs_held;
> +
> + struct delayed_work sd_log_flush_watchdog;
> + unsigned long sd_dirty_pages;
> + unsigned long sd_log_flush_start;
>  };
>  
>  static inline void gfs2_glstats_inc(struct gfs2_glock *gl, int
> which)
> diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
> index f0ee3ff6f9a8..bd2ff5ef4b91 100644
> --- a/fs/gfs2/log.c
> +++ b/fs/gfs2/log.c
> @@ -19,6 +19,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "gfs2.h"
>  #include "incore.h"
> @@ -32,8 +33,22 @@
>  #include "trace_gfs2.h"
>  #include "trans.h"
>  
> +extern struct workqueue_struct *gfs2_log_flush_wq;
> +
>  static void gfs2_log_shutdown(struct gfs2_sbd *sdp);
>  
> +void gfs2_log_flush_watchdog_func(struct work_struct *work)
> +{
> + struct delayed_work *dwork = to_delayed_work(work);
> + struct gfs2_sbd *sdp = container_of(dwork, struct gfs2_sbd,
> + sd_log_flush_watchdog);
> +
> + fs_err(sdp, "log flush pid %u took > %lu secs to write %lu pages.\n",
> +sdp->sd_logd_process ? pid_nr(task_pid(sdp->sd_logd_process)) : 0,
> +(jiffies - sdp->sd_log_flush_start) / HZ,
> +sdp->sd_dirty_pages);
> +}
> +
>  /**
>   * gfs2_struct2blk - compute stuff
>   * @sdp: the filesystem
> @@ -1016,6 +1031,26 @@ static void trans_drain(struct gfs2_trans *tr)
>   }
>  }
>  
> +/**
> + * count_dirty_pages - rough count the dirty ordered writes pages
> + * @sdp: the filesystem
> + *
> + * This is not meant to be exact. It's simply a rough estimate of how
> + * many dirty pages are on the ordered writes list. The actual number
> + * of pages may change because we don't keep the lock held during the
> + * log flush.
> + */
> +static unsigned long count_dirty_pages(struct gfs2_sbd *sdp)
> +{
> + struct gfs2_inode *ip;
> + unsigned long dpages = 0;
> +
> + spin_lock(&sdp->sd_ordered_lock);
> + list_for_each_entry(ip, &sdp->sd_log_ordered, i_ordered)
> + dpages += ip->i_inode.i_mapping->nrpages;
> + spin_unlock(&sdp->sd_ordered_lock);
> + return dpages;
> +}
> +
>  /**
>   * gfs2_log_flush - flush incore transaction(s)
>   * @sdp: The filesystem
> @@ -1031,8 +1066,19 @@ void gfs2_log_flush(struct gfs2_sbd *sdp,
> struct gfs2_glock *gl, u32 flags)
>   enum gfs2_freeze_state state = atomic_read(&sdp->sd_freeze_state);
>   unsigned int first_log_head;
>   unsigned int reserved_revokes = 0;
> + unsigned long dpages;
> +
> + dpages = count_dirty_pages(sdp);
>  
>   down_write(&sdp->sd_log_flush_lock);
> + if (dpages)
> + if (queue_delayed_work(gfs2_log_flush_wq,
> +&sdp->sd_log_flush_watchdog,
> +round_up(dpages * GFS2_LOG_FLUSH_TIMEOUT,
> +HZ))) {
> + sdp->sd_dirty_pages = dpages;
> + sdp->sd_log_flush_start = jiffies;
> + }
>   trace_gfs2_log_flush(sdp, 1, flags);
>  
>  repeat:
> @@ -1144,6 +1190,7 @@ void gfs2_log_flush(struct gfs2_sbd *sdp,
> struct gfs2_glock *gl, u32 flags)
>   gfs2_assert_withdraw_delayed(sdp, used_blocks <
> reserved_blocks);
>   

Re: [Cluster-devel] [GFS2 PATCH 10/10] gfs2: replace sd_aspace with sd_inode

2021-07-13 Thread Steven Whitehouse
Hi,

On Tue, 2021-07-13 at 13:09 -0500, Bob Peterson wrote:
> Before this patch, gfs2 kept its own address space for rgrps, but
> this
> caused a lockdep problem because vfs assumes a 1:1 relationship
> between
> address spaces and their inode. One problematic area is this:
> 
I don't think that is the case. The reason that the address space is a
separate structure in the first place is to allow them to exist without
an inode. Maybe that has changed, but if so we should see why that is,
rather than just making this change immediately.

I can't see any reason why, if we have to have an inode here, it
needs to be hashed... what would need to look it up via the hashes?

Steve.

> gfs2_unpin
>mark_buffer_dirty(bh);
>   mapping = page_mapping(page);
>  __set_page_dirty(page, mapping, memcg, 0);
> xa_lock_irqsave(&mapping->i_pages, flags);
>  ^---locks page->mapping->i_pages
> account_page_dirtied(page, mapping)
>  struct inode *inode = mapping->host;
>  ^---assumes the mapping points to an inode
>inode_to_wb(inode)
>   WARN_ON_ONCE(!lockdep_is_held(&inode->i_mapping->i_pages.xa_lock))
> 
> It manifests as a lockdep warning you see in the last line.
> 
> This patch removes sd_aspace in favor of an entire inode, sd_inode.
> Functions that need to access the address space may use a new
> function
> that follows the inode to its address space. This creates the 1:1
> relation
> between the inode and its address space, so lockdep doesn't complain.
> This is how some other file systems manage their metadata, such as
> btrfs.
> 
> Signed-off-by: Bob Peterson 
> ---
>  fs/gfs2/glops.c  |  4 ++--
>  fs/gfs2/incore.h |  7 ++-
>  fs/gfs2/meta_io.c|  2 +-
>  fs/gfs2/meta_io.h|  2 --
>  fs/gfs2/ops_fstype.c | 27 ---
>  fs/gfs2/super.c  |  2 +-
>  6 files changed, 26 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
> index 744cacd27213..5d755d30d91c 100644
> --- a/fs/gfs2/glops.c
> +++ b/fs/gfs2/glops.c
> @@ -162,7 +162,7 @@ void gfs2_ail_flush(struct gfs2_glock *gl, bool
> fsync)
>  static int gfs2_rgrp_metasync(struct gfs2_glock *gl)
>  {
>   struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
> - struct address_space *metamapping = &sdp->sd_aspace;
> + struct address_space *metamapping = gfs2_aspace(sdp);
>   struct gfs2_rgrpd *rgd = gfs2_glock2rgrp(gl);
>   const unsigned bsize = sdp->sd_sb.sb_bsize;
>   loff_t start = (rgd->rd_addr * bsize) & PAGE_MASK;
> @@ -219,7 +219,7 @@ static int rgrp_go_sync(struct gfs2_glock *gl)
>  static void rgrp_go_inval(struct gfs2_glock *gl, int flags)
>  {
>   struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
> - struct address_space *mapping = &sdp->sd_aspace;
> + struct address_space *mapping = gfs2_aspace(sdp);
>   struct gfs2_rgrpd *rgd = gfs2_glock2rgrp(gl);
>   const unsigned bsize = sdp->sd_sb.sb_bsize;
>   loff_t start = (rgd->rd_addr * bsize) & PAGE_MASK;
> diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
> index 566c0053b7c5..075e5db1d654 100644
> --- a/fs/gfs2/incore.h
> +++ b/fs/gfs2/incore.h
> @@ -797,7 +797,7 @@ struct gfs2_sbd {
>  
>   /* Log stuff */
>  
> - struct address_space sd_aspace;
> + struct inode *sd_inode;
>  
>   spinlock_t sd_log_lock;
>  
> @@ -857,6 +857,11 @@ struct gfs2_sbd {
>   unsigned long sd_log_flush_start;
>  };
>  
> +static inline struct address_space *gfs2_aspace(struct gfs2_sbd
> *sdp)
> +{
> + return sdp->sd_inode->i_mapping;
> +}
> +
>  static inline void gfs2_glstats_inc(struct gfs2_glock *gl, int
> which)
>  {
>   gl->gl_stats.stats[which]++;
> diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
> index 7c9619997355..0123437d9c12 100644
> --- a/fs/gfs2/meta_io.c
> +++ b/fs/gfs2/meta_io.c
> @@ -120,7 +120,7 @@ struct buffer_head *gfs2_getbuf(struct gfs2_glock
> *gl, u64 blkno, int create)
>   unsigned int bufnum;
>  
>   if (mapping == NULL)
> - mapping = &sdp->sd_aspace;
> + mapping = gfs2_aspace(sdp);
>  
>   shift = PAGE_SHIFT - sdp->sd_sb.sb_bsize_shift;
>   index = blkno >> shift; /* convert block to page */
> diff --git a/fs/gfs2/meta_io.h b/fs/gfs2/meta_io.h
> index 21880d72081a..70b9c41ecb46 100644
> --- a/fs/gfs2/meta_io.h
> +++ b/fs/gfs2/meta_io.h
> @@ -42,8 +42,6 @@ static inline struct gfs2_sbd
> *gfs2_mapping2sbd(struct address_space *mapping)
>   struct inode *inode = mapping->host;
>   if (mapping->a_ops == &gfs2_meta_aops)
>   return (((struct gfs2_glock *)mapping) - 1)->gl_name.ln_sbd;
> - else if (mapping->a_ops == &gfs2_rgrp_aops)
> - return container_of(mapping, struct gfs2_sbd, sd_aspace);
>   else
>   return inode->i_sb->s_fs_info;
>  }
> diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
> index b09e61457b23..3e252cfa7f17 100644
> --- 

Re: [Cluster-devel] [RFC 4/9] gfs2: Fix mmap + page fault deadlocks (part 1)

2021-06-13 Thread Steven Whitehouse
Hi,

On Sat, 2021-06-12 at 21:35 +, Al Viro wrote:
> On Sat, Jun 12, 2021 at 09:05:40PM +, Al Viro wrote:
> 
> > Is the above an accurate description of the mainline situation
> > there?
> > In particular, normal read doesn't seem to bother with locks at
> > all.
> > What exactly are those cluster locks for in O_DIRECT read?
> 
> BTW, assuming the lack of contention, how costly is
> dropping/regaining
> such cluster lock?
> 

The answer is that it depends...

The locking modes for glocks for inodes look like this:

==========  ==========   ==============   ==========   ==============
Glock mode  Cache data   Cache Metadata   Dirty Data   Dirty Metadata
==========  ==========   ==============   ==========   ==============
UN          No           No               No           No
SH          Yes          Yes              No           No
DF          No           Yes              No           No
EX          Yes          Yes              Yes          Yes
==========  ==========   ==============   ==========   ==============

The above is a copy & paste from Documentation/filesystems/gfs2-
glocks.rst. If you think of these locks as cache control, then it makes
a lot more sense.

The DF (deferred) mode is there only for DIO. It is a shared lock mode
that is incompatible with the normal SH mode. That is because it is ok
to cache data pages under SH but not under DF. That the only other
difference between the two shared modes. DF is used for both read and
write under DIO meaning that it is possible for multiple nodes to read
& write the same file at the same time with DIO, leaving any
synchronisation to the application layer. As soon as one performs an
operation which alters the metadata tree (truncate, extend, hole
filling) then we drop back to the normal EX mode, so DF is only used
for preallocated files.

Your original question though was about the cost of locking, and there
is a wide variation according to circumstances. The glock layer caches
the results of the DLM requests and will continue to hold glocks gained
from remote nodes until either memory pressure or requests to drop the
lock from another node is received.

When no other nodes are interested in a lock, all such cluster lock
activity is local. There is a cost to it though, and if (for example)
you tried to take and drop the cluster lock on every page, that would
definitely be noticeable. There are probably optimisations that could
be done on what is quite a complex code path, but in general thats what
we've discovered from testing. The introduction of ->readpages() vs the
old ->readpage() made a measurable difference and likewise on the write
side, iomap has also show performance increases due to the reduction in
locking on multi-page writes.

If there is another node that has an interest in a lock, then it can
get very expensive in terms of latency to regain a lock. To drop the
lock to a lower mode may involve I/O (from EX mode) and journal
flush(es) and to get the lock back again involves I/O to other nodes
and then a wait while they finish what they are doing. To avoid
starvation there is a "minimum hold time" so that when a node gains a
glock, it is allowed to retain it, in the absence of local requests,
for a short period. The idea being that if a large number of glock
requests are being made on a node, each for a short time, we allow
several of those to complete before we do the expensive glock release
to another node.

See Documentation/filesystems/gfs2-glocks.rst for a longer explanation
and locking order/rules between different lock types,

Steve.




Re: [Cluster-devel] [PATCH gfs2-utils] man: gfs2.5: remove barrier automatically turned off note

2021-03-29 Thread Steven Whitehouse

Hi,

On 26/03/2021 13:46, Alexander Ahring Oder Aring wrote:

Hi,

On Fri, Mar 26, 2021 at 6:41 AM Andrew Price  wrote:

On 25/03/2021 17:58, Alexander Aring wrote:

This patch removes a note that the barrier option is automatically turned
off if the underlying device doesn't support I/O barriers. As far as I
understand it's on by default, meaning the "barriers" option is applied, which
should not cause any problems whether or not the underlying device supports
barriers. Neither the kernel nor gfs2-utils does any automatic detection
that changes this mount option.

Hm, should there be automatic detection? Has there ever been? I'd like
to get to the bottom of why this language is here before removing it.


no idea if there was ever an auto detection or there exists currently
one. I didn't find any auto detection during my research. The related
part came in by: 06b5fb87 ("gfs2: man page updates"). My understanding
is that this option is default "barrier" and you should do "nobarrier"
in cases when you know what you are doing. I even don't know if such
automatic detection is possible, the man-page says "(e.g. its on a
UPS, or it doesn't have a write cache)" in regards to block devices. I
think there is no way in the kernel/user space to check if the block
device is behind a UPS. Maybe there exists some in user space over
hdparm but then things need to be right connected? Regarding cache
handling, you need to know a lot about the used architecture.

I am not sure here as well. I was reading about such automatic
detection and wanted to see how it's done with the result: there is no
auto detection (in gfs2(kernel)/gfs2-utils software)?

- Alex



Bear in mind that the naming of the barrier mount option is historic and 
that it is now implemented using flush commands rather than barriers. I 
don't think there is any automated way to discover if it is safe to run 
without flushes,


Steve.




Re: [Cluster-devel] [GFS2 PATCH] gfs2: Add new sysfs file for gfs2 status

2021-03-19 Thread Steven Whitehouse



On 19/03/2021 12:06, Bob Peterson wrote:

This patch adds a new file: /sys/fs/gfs2/*/status which will report
the status of the file system. Catting this file dumps the current
status of the file system according to various superblock variables.
For example:

Journal Checked:  1
Journal Live: 1
Journal ID:   0
Spectator:0
Withdrawn:0
No barriers:  0
No recovery:  0
Demote:   0
No Journal ID:1
Mounted RO:   0
RO Recovery:  0
Skip DLM Unlock:  0
Force AIL Flush:  0
FS Frozen:0
Withdrawing:  0
Withdraw In Prog: 0
Remote Withdraw:  0
Withdraw Recovery:0
sd_log_lock held: 0
statfs_spin held: 0
sd_rindex_spin:   0
sd_jindex_spin:   0
sd_trunc_lock:0
sd_bitmap_lock:   0
sd_ordered_lock:  0
sd_ail_lock:  0
sd_log_error: 0
sd_log_flush_lock:0
sd_log_num_revoke:0
sd_log_in_flight: 0
sd_log_blks_needed:   0
sd_log_blks_free: 32768
sd_log_flush_head:0
sd_log_flush_tail:5384
sd_log_blks_reserved: 0
sd_log_revokes_available: 503

Signed-off-by: Bob Peterson 


It looks like it might be missing some locking on some of those variables?

Steve.



---
  fs/gfs2/sys.c | 83 +++
  1 file changed, 83 insertions(+)

diff --git a/fs/gfs2/sys.c b/fs/gfs2/sys.c
index c3e72dba7418..57f53c13866e 100644
--- a/fs/gfs2/sys.c
+++ b/fs/gfs2/sys.c
@@ -63,6 +63,87 @@ static ssize_t id_show(struct gfs2_sbd *sdp, char *buf)
MAJOR(sdp->sd_vfs->s_dev), MINOR(sdp->sd_vfs->s_dev));
  }
  
+static ssize_t status_show(struct gfs2_sbd *sdp, char *buf)

+{
+   unsigned long f = sdp->sd_flags;
+   ssize_t s;
+
+   s = snprintf(buf, PAGE_SIZE,
+"Journal Checked:  %d\n"
+"Journal Live: %d\n"
+"Journal ID:   %d\n"
+"Spectator:%d\n"
+"Withdrawn:%d\n"
+"No barriers:  %d\n"
+"No recovery:  %d\n"
+"Demote:   %d\n"
+"No Journal ID:%d\n"
+"Mounted RO:   %d\n"
+"RO Recovery:  %d\n"
+"Skip DLM Unlock:  %d\n"
+"Force AIL Flush:  %d\n"
+"FS Frozen:%d\n"
+"Withdrawing:  %d\n"
+"Withdraw In Prog: %d\n"
+"Remote Withdraw:  %d\n"
+"Withdraw Recovery:%d\n"
+"sd_log_lock held: %d\n"
+"statfs_spin held: %d\n"
+"sd_rindex_spin:   %d\n"
+"sd_jindex_spin:   %d\n"
+"sd_trunc_lock:%d\n"
+"sd_bitmap_lock:   %d\n"
+"sd_ordered_lock:  %d\n"
+"sd_ail_lock:  %d\n"
+"sd_log_error: %d\n"
+"sd_log_flush_lock:%d\n"
+"sd_log_num_revoke:%u\n"
+"sd_log_in_flight: %d\n"
+"sd_log_blks_needed:   %d\n"
+"sd_log_blks_free: %d\n"
+"sd_log_flush_head:%d\n"
+"sd_log_flush_tail:%d\n"
+"sd_log_blks_reserved: %d\n"
+"sd_log_revokes_available: %d\n",
+test_bit(SDF_JOURNAL_CHECKED, &f),
+test_bit(SDF_JOURNAL_LIVE, &f),
+(sdp->sd_jdesc ? sdp->sd_jdesc->jd_jid : 0),
+(sdp->sd_args.ar_spectator ? 1 : 0),
+test_bit(SDF_WITHDRAWN, &f),
+test_bit(SDF_NOBARRIERS, &f),
+test_bit(SDF_NORECOVERY, &f),
+test_bit(SDF_DEMOTE, &f),
+test_bit(SDF_NOJOURNALID, &f),
+(sb_rdonly(sdp->sd_vfs) ? 1 : 0),
+test_bit(SDF_RORECOVERY, &f),
+test_bit(SDF_SKIP_DLM_UNLOCK, &f),
+test_bit(SDF_FORCE_AIL_FLUSH, &f),
+test_bit(SDF_FS_FROZEN, &f),
+test_bit(SDF_WITHDRAWING, &f),
+test_bit(SDF_WITHDRAW_IN_PROG, &f),
+test_bit(SDF_REMOTE_WITHDRAW, &f),
+test_bit(SDF_WITHDRAW_RECOVERY, &f),
+spin_is_locked(&sdp->sd_log_lock),
+spin_is_locked(&sdp->sd_statfs_spin),
+

Re: [Cluster-devel] Recording extents in GFS2

2021-02-22 Thread Steven Whitehouse

Hi,

On 20/02/2021 09:48, Andreas Gruenbacher wrote:

Hi all,

once we change the journal format, in addition to recording block 
numbers as extents, there are some additional issues we should address 
at the same time:


I. The current transaction format of our journals is as follows:

  * One METADATA log descriptor block for each [503 / 247 / 119 / 55]
metadata blocks, followed by those metadata blocks. For each
metadata block, the log descriptor records the 64-bit block number.
  * One JDATA log descriptor block for each [251 / 123 / 59 / 27]
metadata blocks, followed by those metadata blocks. For each
metadata block, the log descriptor records the 64-bit block number
and another 64-bit field for indicating whether the block needed
escaping.
  * One REVOKE log descriptor block for the initial [503 / 247 / 119 /
55] revokes, followed by a metadata header (not to be confused
with the log header) for each additional [509 / 253 / 125 / 61]
revokes. Each revoke is recorded as a 64-bit block number in its
REVOKE log descriptor or metadata header.
  * One log header with various necessary and useful metadata that
acts as a COMMIT record. If the log header is incorrect or
missing, the preceding log descriptors are ignored.


 succeeding? (I hope!)
We should change that so that a single log descriptor contains a 
number of records. There should be records for METADATA and JDATA 
blocks that follow, as well as for REVOKES and for COMMIT. If a 
transaction contains metadata and/or jdata blocks, those will 
obviously need a precursor and a commit block like today, but we 
shouldn't need separate blocks for metadata and journaled data in many 
cases. Small transactions that only consist of revokes and of a commit 
should frequently fit into a single block entirely, though.


Yes, it makes sense to try and condense what we are writing. Why would 
we not need to have separate blocks for journaled data though? That one 
seems difficult to avoid, and since it is used so infrequently, perhaps 
not such an important issue.



Right now, we're writing log headers ("commits") with REQ_PREFLUSH to 
make sure all the log descriptors of a transaction make it to disk 
before the log header. Depending on the device, this is often costly. 
If we can fit an entire transaction into a single block, REQ_PREFLUSH 
won't be needed anymore.


I'm not sure I agree. The purpose of the preflush is to ensure that the 
data and the preceding log blocks are really on disk before we write the 
commit record. That will still be required while we use ordered writes, 
even if we can use (as you suggest below) a checksum to cover the whole 
transaction, and thus check for a complete log record after the fact. 
Also, we would still have to issue the flush in the case of a fsync 
derived log flush too.





III. We could also checksum entire transactions to avoid REQ_PREFLUSH. 
At replay time, all the blocks that make up a transaction will either 
be there and the checksum will match, or the transaction will be 
invalid. This should be less prohibitively expensive with CPU support 
for CRC32C nowadays, but depending on the hardware, it may make sense 
to turn this off.


IV. We need recording of unwritten blocks / extents (allocations / 
fallocate) as this will significantly speed up moving glocks from one 
node to another:


That would definitely be a step forward.




At the moment, data=ordered is implemented by keeping a list of all 
inodes that did an ordered write. When it comes time to flush the log, 
the data of all those ordered inodes is flushed first. When all we 
want is to flush a single glock in order to move it to a different 
node, we currently flush all the ordered inodes as well as the journal.


If we only flushed the ordered data of the glock being moved plus the 
entire journal, the ordering guarantees for the other ordered inodes 
in the journal would be violated. In that scenario, unwritten blocks 
could (and would) show up in files after crashes.


If we instead record unwritten blocks in the journal, we'll know which 
blocks need to be zeroed out at recovery time. Once an unwritten block 
is written, we record a REVOKE entry for that block.


This comes at the cost of tracking those blocks of course, but with 
that in place, moving a glock from one node to another will only 
require flushing the underlying inode (assuming it's an inode glock) 
and the journal. And most likely, we won't have to bother with 
implementing "simple" transactions as described in 
https://bugzilla.redhat.com/show_bug.cgi?id=1631499.


Thanks,
Andreas


That would be another way of looking at the problem, yes. It does add a 
lot to the complexity though, and it doesn't scale very well on systems 
with large amounts of memory (and therefore potentially lots of 
unwritten extents to record & keep track of). If there are lots of small 
transactions, then each one might be significantly expanded by 

Re: [Cluster-devel] [gfs2 PATCH] gfs2: Don't skip dlm unlock if glock has an lvb

2021-02-08 Thread Steven Whitehouse

Hi,

Longer term we should review whether this is really the correct fix. It 
seems a bit strange that we have to do something different according to 
whether there is an LVB or not. We are gradually increasing LVB use over 
time too. So should we fix the DLM so that either it can cope with locks 
with LVBs at lockspace shutdown time, or should we simply send an unlock 
for all DLM locks anyway? That would seem to make more sense than having 
two different systems depending on LVB existence, or otherwise,


Steve.

On 05/02/2021 18:50, Bob Peterson wrote:

Patch fb6791d100d1bba20b5cdbc4912e1f7086ec60f8 was designed to allow
gfs2 to unmount quicker by skipping the step where it tells dlm to
unlock glocks in EX with lvbs. This was done because when gfs2 unmounts
a file system, it destroys the dlm lockspace shortly after it destroys
the glocks so it doesn't need to unlock them all: the unlock is implied
when the lockspace is destroyed by dlm.

However, that patch introduced a use-after-free in dlm: as part of its
normal dlm_recoverd process, it can call ls_recovery to recover dead
locks. In so doing, it can call recover_rsbs which calls recover_lvb for
any mastered rsbs. Func recover_lvb runs through the list of lkbs queued
to the given rsb (if the glock is cached but unlocked, its lkb will
still be queued to the rsb, but in NL--Unlocked--mode) and if it has an lvb,
copies it to the rsb, thus trying to preserve the lkb. However, when
gfs2 skips the dlm unlock step, it frees the glock and its lvb, which
means dlm's function recover_lvb references the now freed lvb pointer,
copying the freed lvb memory to the rsb.

This patch changes the check in gdlm_put_lock so that it calls dlm_unlock
for all glocks that contain an lvb pointer.

Signed-off-by: Bob Peterson 
Fixes: fb6791d100d1 ("GFS2: skip dlm_unlock calls in unmount")
---
  fs/gfs2/lock_dlm.c | 8 ++--
  1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
index 9f2b5609f225..153272f82984 100644
--- a/fs/gfs2/lock_dlm.c
+++ b/fs/gfs2/lock_dlm.c
@@ -284,7 +284,6 @@ static void gdlm_put_lock(struct gfs2_glock *gl)
  {
struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
-   int lvb_needs_unlock = 0;
int error;
  
  	if (gl->gl_lksb.sb_lkid == 0) {

@@ -297,13 +296,10 @@ static void gdlm_put_lock(struct gfs2_glock *gl)
gfs2_sbstats_inc(gl, GFS2_LKS_DCOUNT);
gfs2_update_request_times(gl);
  
-	/* don't want to skip dlm_unlock writing the lvb when lock is ex */

-
-   if (gl->gl_lksb.sb_lvbptr && (gl->gl_state == LM_ST_EXCLUSIVE))
-   lvb_needs_unlock = 1;
+   /* don't want to skip dlm_unlock writing the lvb when lock has one */
  
  	if (test_bit(SDF_SKIP_DLM_UNLOCK, &sdp->sd_flags) &&

-   !lvb_needs_unlock) {
+   !gl->gl_lksb.sb_lvbptr) {
gfs2_glock_free(gl);
return;
}





Re: [Cluster-devel] Recording extents in GFS2

2021-02-02 Thread Steven Whitehouse

Hi,

On 24/01/2021 06:44, Abhijith Das wrote:

Hi all,

I've been looking at rgrp.c:gfs2_alloc_blocks(), which is called from 
various places to allocate single/multiple blocks for inodes. I've 
come up with some data structures to accomplish recording of these 
allocations as extents.


I'm proposing we add a new metadata type for journal blocks that will 
hold these extent records.


#define GFS2_METATYPE_EX 15 /* New metadata type for a block that will 
hold extents */


This structure below will be at the start of the block, followed by a 
number of alloc_ext structures.


struct gfs2_extents {/* This structure is 32 bytes long */
struct gfs2_meta_header ex_header;
__be32 ex_count; /* number of alloc_ext structs that follow this 
header. */

__be32 __pad;
};
/* flags for the alloc_ext struct */
#define AE_FL_XXX

struct alloc_ext {/* This structure is 48 bytes long */
struct gfs2_inum ae_num;/* The inode this allocation/deallocation 
belongs to */
__be32 ae_flags;/* specifies if we're allocating/deallocating, 
data/metadata, etc. */

__be64 ae_start;/* starting physical block number of the extent */
__be64 ae_len;   /* length of the extent */
__be32 ae_uid;   /* user this belongs to, for quota accounting */
__be32 ae_gid;   /* group this belongs to, for quota accounting */
__be32 __pad;
};

The gfs2_inum structure is a bit OTT for this I think. A single 64 bit 
inode number should be enough? Also, it is quite likely we may have 
multiple extents for the same inode... so should we split this into two 
so we can have something like this? It is more complicated, but should 
save space in the average case.


struct alloc_hdr {

    __be64 inum;

    __be32 uid; /* This is duplicated from the inode... various options 
here depending on whether we think this is something we should do. 
Should we also consider logging chown using this structure? We will have 
to carefully check chown sequence wrt to allocations/deallocations for 
quota purposes */


    __be32 gid;

    __u8 num_extents; /* Never likely to have huge numbers of extents 
per header, due to block size! */


    /* padding... or is there something else we could/should add here? */

};

followed by num_extents copies of:

struct alloc_extent {

    __be64 phys_start;

    __be64 logical_start; /* Do we need a logical & physical start? 
Maybe we don't care about the logical start? */


    __be32 length; /* Max extent length is limited by rgrp length... 
only need 32 bits */


    __be32 flags; /* Can we support unwritten, zero extents with this? 
Need to indicate alloc/free/zero, data/metadata */


};

Just wondering if there is also some shorthand we might be able to use 
in case we have multiple extents all separated by either one metadata 
block, or a very small number of metadata blocks (which will be the case 
for streaming writes). Again it increases the complexity, but will 
likely reduce the amount we have to write into the new journal blocks 
quite a lot. Not much point having a 32 bit length, but never filling it 
with a value above 509 (4k block size)...



With 4k block sizes, we can fit 84 extents (10 for 512b, 20 for 1k, 42 
for 2k block sizes) in one block. As we process more allocs/deallocs, 
we keep creating more such alloc_ext records and tack them to the back 
of this block if there's space or else create a new block. For smaller 
extents, this might not be efficient, so we might just want to revert 
to the old method of recording the bitmap blocks instead.
During journal replay, we decode these new blocks and flip the 
corresponding bitmaps for each of the blocks represented in the 
extents. For the ones where we just recorded the bitmap blocks the 
old-fashioned way, we also replay them the old-fashioned way. This way 
we're also backward compatible with an older version of gfs2 that only 
records the bitmaps.
Since we record the uid/gid with each extent, we can do the quota 
accounting without relying on the quota change file. We might need to 
keep the quota change file around for backward compatibility and for 
the cases where we might want to record allocs/deallocs the 
old-fashioned way.


I'm going to play around with this and come up with some patches to 
see if this works and what kind of performance improvements we get. 
These data structures will mostly likely need reworking and renaming, 
but this is the general direction I'm thinking along.


Please let me know what you think.

Cheers!
--Abhi


That all sounds good. I'm sure it will take a little while to figure out 
how to get this right,


Steve.




Re: [Cluster-devel] [PATCH v3 08/20] gfs2: Get rid of on-stack transactions

2021-01-28 Thread Steven Whitehouse

Hi,

On 27/01/2021 21:07, Andreas Gruenbacher wrote:

On-stack transactions were introduced to work around a transaction glock
deadlock in gfs2_trans_begin in commit d8348de06f70 ("GFS2: Fix deadlock
on journal flush").  Subsequently, transaction glocks were eliminated in
favor of the more efficient freeze glocks in commit 24972557b12c ("GFS2:
remove transaction glock") without also removing the on-stack
transactions.

It has now turned out that committing on-stack transactions
significantly complicates journal free space accounting when no system
transaction (sdp->sd_log_tr) is active at the time.  It doesn't seem
that on-stack transactions provide a significant benefit beyond their
original purpose (as an optimization), so remove them to allow fixing
the journal free space accounting in a reasonable way in a subsequent
patch.

FIXME: Can we better handle a gfs2_trans_begin failure in gfs2_ail_empty_gl?
If we skip the __gfs2_ail_flush, we'll just end up with leftover items on
gl_ail_list.


The reason for the on-stack allocation is to avoid the GFP_NOFAIL 
allocation here. Please don't add it back, we have gradually been 
working to eliminate those. Those allocations may not fail, but they 
might also take a long enough time that it would make little difference 
if they did. So perhaps we need to look at another solution?


Steve.




Signed-off-by: Andreas Gruenbacher 
---
  fs/gfs2/glops.c  | 29 +++--
  fs/gfs2/incore.h |  1 -
  fs/gfs2/log.c|  1 -
  fs/gfs2/trans.c  | 25 +
  fs/gfs2/trans.h  |  2 ++
  5 files changed, 22 insertions(+), 36 deletions(-)

diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index 3faa421568b0..853e590ccc15 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -84,18 +84,11 @@ static void __gfs2_ail_flush(struct gfs2_glock *gl, bool 
fsync,
  
  static int gfs2_ail_empty_gl(struct gfs2_glock *gl)

  {
+   unsigned int revokes = atomic_read(&gl->gl_ail_count);
struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
-   struct gfs2_trans tr;
int ret;
  
-	memset(&tr, 0, sizeof(tr));

-   INIT_LIST_HEAD(&tr.tr_buf);
-   INIT_LIST_HEAD(&tr.tr_databuf);
-   INIT_LIST_HEAD(&tr.tr_ail1_list);
-   INIT_LIST_HEAD(&tr.tr_ail2_list);
-   tr.tr_revokes = atomic_read(&gl->gl_ail_count);
-
-   if (!tr.tr_revokes) {
+   if (!revokes) {
bool have_revokes;
bool log_in_flight;
  
@@ -122,20 +115,12 @@ static int gfs2_ail_empty_gl(struct gfs2_glock *gl)

return 0;
}
  
-	/* A shortened, inline version of gfs2_trans_begin()

- * tr->alloced is not set since the transaction structure is
- * on the stack */
-   tr.tr_reserved = 1 + gfs2_struct2blk(sdp, tr.tr_revokes);
-   tr.tr_ip = _RET_IP_;
-   ret = gfs2_log_reserve(sdp, tr.tr_reserved);
-   if (ret < 0)
-   return ret;
-   WARN_ON_ONCE(current->journal_info);
-   current->journal_info = &tr;
-
-   __gfs2_ail_flush(gl, 0, tr.tr_revokes);
-
+   ret = __gfs2_trans_begin(sdp, 0, revokes, GFP_NOFS | __GFP_NOFAIL);
+   if (ret)
+   goto flush;
+   __gfs2_ail_flush(gl, 0, revokes);
gfs2_trans_end(sdp);
+
  flush:
gfs2_log_flush(sdp, NULL, GFS2_LOG_HEAD_FLUSH_NORMAL |
   GFS2_LFC_AIL_EMPTY_GL);
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 8e1ab8ed4abc..958810e533ad 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -490,7 +490,6 @@ struct gfs2_quota_data {
  enum {
TR_TOUCHED = 1,
TR_ATTACHED = 2,
-   TR_ALLOCED = 3,
  };
  
  struct gfs2_trans {

diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index e4dc23a24569..721d2d7f0efd 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -1114,7 +1114,6 @@ static void log_refund(struct gfs2_sbd *sdp, struct 
gfs2_trans *tr)
if (sdp->sd_log_tr) {
gfs2_merge_trans(sdp, tr);
} else if (tr->tr_num_buf_new || tr->tr_num_databuf_new) {
-   gfs2_assert_withdraw(sdp, test_bit(TR_ALLOCED, >tr_flags));
sdp->sd_log_tr = tr;
set_bit(TR_ATTACHED, >tr_flags);
}
diff --git a/fs/gfs2/trans.c b/fs/gfs2/trans.c
index 7705f04621f4..4f461ab37ced 100644
--- a/fs/gfs2/trans.c
+++ b/fs/gfs2/trans.c
@@ -37,8 +37,8 @@ static void gfs2_print_trans(struct gfs2_sbd *sdp, const 
struct gfs2_trans *tr)
tr->tr_num_revoke, tr->tr_num_revoke_rm);
  }
  
-int gfs2_trans_begin(struct gfs2_sbd *sdp, unsigned int blocks,

-unsigned int revokes)
+int __gfs2_trans_begin(struct gfs2_sbd *sdp, unsigned int blocks,
+  unsigned int revokes, gfp_t gfp_mask)
  {
struct gfs2_trans *tr;
int error;
@@ -52,7 +52,7 @@ int gfs2_trans_begin(struct gfs2_sbd *sdp, unsigned int 
blocks,
	if (!test_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags))
return -EROFS;
  
-	tr = kmem_cache_zalloc(gfs2_trans_cachep, GFP_NOFS);

+   tr = kmem_cache_zalloc(gfs2_trans_cachep, gfp_mask);

Re: [Cluster-devel] [GFS2 PATCH] gfs2: make recovery workqueue operate on a gfs2 mount point, not journal

2021-01-04 Thread Steven Whitehouse

Hi,

On 22/12/2020 20:38, Bob Peterson wrote:

Hi,

Before this patch, journal recovery was done by a workqueue function that
operated on a per-journal basis. The problem is, these could run simultaneously
which meant that they could all use the same bio, sd_log_bio, to do their
writing to all the various journals. These operations overwrote one another
eventually causing memory corruption.


Why not just add more bios so that this issue goes away? It would make 
more sense than preventing recovery from running in parallel. In general 
recovery should be spread among nodes anyway, so the case of having 
multiple recoveries running on the same node in parallel should be 
fairly rare too,


Steve.




This patch makes the recovery workqueue operate on a per-superblock basis,
which means a mount point using, for example journal0, could do recovery
for all journals that need recovery. This is done consecutively so the
sd_log_bio is only referenced by one recovery at a time, thus avoiding the
chaos.

Since the journal recovery requests can come in any order, and unpredictably,
the new work func loops until there are no more journals to be recovered.

Since multiple processes may request recovery of a journal, and since they
all now use the same sdp-based workqueue, it's okay for them to get an
error from queue_work: Queueing work while there's already work queued.

Signed-off-by: Bob Peterson 
---
  fs/gfs2/incore.h |  2 +-
  fs/gfs2/ops_fstype.c |  2 +-
  fs/gfs2/recovery.c   | 32 
  3 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 8e1ab8ed4abc..b393cbf9efeb 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -529,7 +529,6 @@ struct gfs2_jdesc {
struct list_head jd_list;
struct list_head extent_list;
unsigned int nr_extents;
-   struct work_struct jd_work;
struct inode *jd_inode;
unsigned long jd_flags;
  #define JDF_RECOVERY 1
@@ -746,6 +745,7 @@ struct gfs2_sbd {
struct completion sd_locking_init;
struct completion sd_wdack;
struct delayed_work sd_control_work;
+   struct work_struct sd_recovery_work;
  
  	/* Inode Stuff */
  
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c

index 61fce59cb4d3..3d9a6d6d42cb 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -93,6 +93,7 @@ static struct gfs2_sbd *init_sbd(struct super_block *sb)
	init_completion(&sdp->sd_locking_init);
	init_completion(&sdp->sd_wdack);
	spin_lock_init(&sdp->sd_statfs_spin);
+   INIT_WORK(&sdp->sd_recovery_work, gfs2_recover_func);
  
  	spin_lock_init(&sdp->sd_rindex_spin);

sdp->sd_rindex_tree.rb_node = NULL;
@@ -586,7 +587,6 @@ static int gfs2_jindex_hold(struct gfs2_sbd *sdp, struct 
gfs2_holder *ji_gh)
		INIT_LIST_HEAD(&jd->extent_list);
		INIT_LIST_HEAD(&jd->jd_revoke_list);
  
-		INIT_WORK(&jd->jd_work, gfs2_recover_func);

		jd->jd_inode = gfs2_lookupi(sdp->sd_jindex, &name, 1);
if (IS_ERR_OR_NULL(jd->jd_inode)) {
if (!jd->jd_inode)
diff --git a/fs/gfs2/recovery.c b/fs/gfs2/recovery.c
index c26c68ebd29d..cd3e66cdb560 100644
--- a/fs/gfs2/recovery.c
+++ b/fs/gfs2/recovery.c
@@ -399,9 +399,8 @@ static void recover_local_statfs(struct gfs2_jdesc *jd,
return;
  }
  
-void gfs2_recover_func(struct work_struct *work)

+static void gfs2_recover_one(struct gfs2_jdesc *jd)
  {
-   struct gfs2_jdesc *jd = container_of(work, struct gfs2_jdesc, jd_work);
struct gfs2_inode *ip = GFS2_I(jd->jd_inode);
struct gfs2_sbd *sdp = GFS2_SB(jd->jd_inode);
struct gfs2_log_header_host head;
@@ -562,16 +561,41 @@ void gfs2_recover_func(struct work_struct *work)
	wake_up_bit(&jd->jd_flags, JDF_RECOVERY);
  }
  
+void gfs2_recover_func(struct work_struct *work)

+{
+   struct gfs2_sbd *sdp = container_of(work, struct gfs2_sbd,
+   sd_recovery_work);
+   struct gfs2_jdesc *jd;
+   int count, recovered = 0;
+
+   do {
+   count = 0;
+   spin_lock(&sdp->sd_jindex_spin);
+   list_for_each_entry(jd, &sdp->sd_jindex_list, jd_list) {
+   if (test_bit(JDF_RECOVERY, &jd->jd_flags)) {
+   spin_unlock(&sdp->sd_jindex_spin);
+   gfs2_recover_one(jd);
+   spin_lock(&sdp->sd_jindex_spin);
+   count++;
+   recovered++;
+   }
+   }
+   spin_unlock(&sdp->sd_jindex_spin);
+   } while (count);
+   if (recovered > 1)
+   fs_err(sdp, "Journals recovered: %d\n", recovered);
+}
+
  int gfs2_recover_journal(struct gfs2_jdesc *jd, bool wait)
  {
+   struct gfs2_sbd *sdp = GFS2_SB(jd->jd_inode);
int rv;
  
  	if (test_and_set_bit(JDF_RECOVERY, &jd->jd_flags))

return -EBUSY;
  
  	

Re: [Cluster-devel] [PATCH 07/12] gfs2: Get rid of on-stack transactions

2020-12-14 Thread Steven Whitehouse

Hi,

On 14/12/2020 14:02, Bob Peterson wrote:

Hi,

- Original Message -

+   ret = __gfs2_trans_begin(sdp, 0, revokes, GFP_NOFS | __GFP_NOFAIL);

The addition of __GFP_NOFAIL means that this operation can now block.
Looking at the code, I don't think it will be a problem because it can
already block in the log_flush operations that precede it, but it
makes me nervous. Obviously, we need to test this really well.

Bob

Not sure of the context here exactly, but why are we adding an instance 
of __GFP_NOFAIL? There is already a return code there so that we can 
fail in that case if required,


Steve.




Re: [Cluster-devel] Recording extents in GFS2

2020-12-14 Thread Steven Whitehouse

Hi,

On 11/12/2020 16:38, Abhijith Das wrote:

Hi all,

With a recent set of patches, we nearly eliminated the per_node statfs
change files by recording that info in the journal. The files and some
recovery code remain only for backward compatibility. Similarly, I'd
like to get rid of the per_node quota change files and record that
info in the journal as well.

I've been talking to Andreas and Bob a bit about this and I'm
investigating how we can record extents as we allocate and deallocate
blocks instead of writing whole blocks. I'm looking into how XFS does
this.

We could have a new journal block type that adds a list of extents to
inodes with alloc/dealloc info. We could add in quota (uid/gid) info
to this as well. If we can do this right, the representation of
alloc/dealloc becomes compact and consequently we use journal space
more efficiently. We can hopefully avoid cases where we need to zero
out blocks during allocation as well.

I'm sending this out to start a discussion and to get ideas/comments/pointers.

Cheers!
--Abhi

I think you need to propose something a bit more concrete. For example 
what will the data structures look like? How many entries will fit in a 
journal block at different block sizes? How will we ensure that this is 
backwards compatible? That will make it easier to have the discussions,


Steve.




Re: [Cluster-devel] [PATCH] gfs2: Take inode glock exclusively when mounted without noatime

2020-11-25 Thread Steven Whitehouse

Hi,

On 24/11/2020 16:42, Andreas Gruenbacher wrote:

Commit 20f82c38 ("gfs2: Rework read and page fault locking") has lifted the
glock lock taking from the low-level ->readpage and ->readahead address space
operations to the higher-level ->read_iter file and ->fault vm operations.  The
glocks are still taken in LM_ST_SHARED mode only.  On filesystems mounted
without the noatime option, ->read_iter needs to update the atime as well
though, so we currently run into a failed locking mode assertion in
gfs2_dirty_inode.  Fix that by taking the glock in LM_ST_EXCLUSIVE mode on
filesystems mounted without the noatime mount option.

Faulting in pages doesn't update the atime, so in the ->fault vm operation,
taking the glock in LM_ST_SHARED mode is enough.


I don't think this makes any sense to do. It is going to reduce the 
scalability quite a lot, I suspect. Even if you have multiple nodes 
reading a file, the atime updates would not be synchronous with the 
reads, so why insist on an exclusive lock here?


Steve.




Reported-by: Alexander Aring 
Fixes: 20f82c38 ("gfs2: Rework read and page fault locking")
Cc: sta...@vger.kernel.org # v5.8+
Signed-off-by: Andreas Gruenbacher 

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index b39b339feddc..162a81873dcd 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -849,6 +849,7 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, 
struct iov_iter *to)
struct gfs2_inode *ip;
struct gfs2_holder gh;
size_t written = 0;
+   unsigned int state;
ssize_t ret;
  
  	if (iocb->ki_flags & IOCB_DIRECT) {

@@ -871,7 +872,8 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, 
struct iov_iter *to)
return ret;
}
ip = GFS2_I(iocb->ki_filp->f_mapping->host);
-   gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
+   state = IS_NOATIME(&ip->i_inode) ? LM_ST_SHARED : LM_ST_EXCLUSIVE;
+   gfs2_holder_init(ip->i_gl, state, 0, &gh);
	ret = gfs2_glock_nq(&gh);
if (ret)
goto out_uninit;





Re: [Cluster-devel] [GFS2 PATCH 06/12] gfs2: Create transaction for inodes with i_nlink != 0

2020-08-27 Thread Steven Whitehouse

Hi,

On 27/08/2020 07:00, Andreas Gruenbacher wrote:

On Fri, Aug 21, 2020 at 7:33 PM Bob Peterson  wrote:

Before this patch, function gfs2_evict_inode would check if i_nlink
was non-zero, and if so, go to label out. The problem is, the evicted
file may still have outstanding pages that need invalidating, but
the call to truncate_inode_pages_final at label out doesn't start a
transaction. It needs a transaction in order to write revokes for any
pages it has to invalidate.

This is only true for jdata inodes though, right? If so, I'd rather
just create transactions in the jdata case.


Yes, and also if the inode is being deallocated, then we might be able 
to skip that step. We'll no doubt have to retain it in case this is just 
an unlink and there are still openers somewhere,


Steve.



This patch removes the early check for i_nlink in gfs2_evict_inode.
Not much further down in the code, there's another check for i_nlink
that skips to out_truncate. That one is proper because the calls
to truncate_inode_pages after out_truncate use a proper transaction,
so the page invalidates and subsequent revokes may be done properly.

Signed-off-by: Bob Peterson 
---
  fs/gfs2/super.c | 21 +
  1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 80ac446f0110..1f3dee740431 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -1344,7 +1344,7 @@ static void gfs2_evict_inode(struct inode *inode)
 return;
 }

-   if (inode->i_nlink || sb_rdonly(sb))
+   if (sb_rdonly(sb))
 goto out;
 if (test_bit(GIF_ALLOC_FAILED, &ip->i_flags)) {
@@ -1370,15 +1370,19 @@ static void gfs2_evict_inode(struct inode *inode)
 }

 if (gfs2_inode_already_deleted(ip->i_gl, ip->i_no_formal_ino))
-   goto out_truncate;
+   goto out_flush;
 error = gfs2_check_blk_type(sdp, ip->i_no_addr, GFS2_BLKST_UNLINKED);
-   if (error)
-   goto out_truncate;
+   if (error) {
+   error = 0;
+   goto out_flush;
+   }

 if (test_bit(GIF_INVALID, &ip->i_flags)) {
 error = gfs2_inode_refresh(ip);
-   if (error)
-   goto out_truncate;
+   if (error) {
+   error = 0;
+   goto out_flush;
+   }
 }

 /*
@@ -1392,7 +1396,7 @@ static void gfs2_evict_inode(struct inode *inode)
 test_bit(HIF_HOLDER, &ip->i_iopen_gh.gh_iflags)) {
 if (!gfs2_upgrade_iopen_glock(inode)) {
 gfs2_holder_uninit(&ip->i_iopen_gh);
-   goto out_truncate;
+   goto out_flush;
 }
 }

@@ -1424,7 +1428,7 @@ static void gfs2_evict_inode(struct inode *inode)
 gfs2_inode_remember_delete(ip->i_gl, ip->i_no_formal_ino);
 goto out_unlock;

-out_truncate:
+out_flush:
 gfs2_log_flush(sdp, ip->i_gl, GFS2_LOG_HEAD_FLUSH_NORMAL |
GFS2_LFC_EVICT_INODE);
 metamapping = gfs2_glock2aspace(ip->i_gl);
@@ -1435,6 +1439,7 @@ static void gfs2_evict_inode(struct inode *inode)
 write_inode_now(inode, 1);
 gfs2_ail_flush(ip->i_gl, 0);

+out_truncate:
 nr_revokes = inode->i_mapping->nrpages + metamapping->nrpages;
 if (!nr_revokes)
 goto out_unlock;
--
2.26.2


Thanks,
Andreas





Re: [Cluster-devel] [PATCH 1/3] gfs2: Don't write updates to local statfs file

2020-08-20 Thread Steven Whitehouse

Hi,

On 20/08/2020 12:04, Abhijith Das wrote:



On Wed, Aug 19, 2020 at 12:07 PM Bob Peterson wrote:


- Original Message -
> We store the local statfs info in the journal header now so
> there's no need to write to the local statfs file anymore.
>
> Signed-off-by: Abhi Das <a...@redhat.com>
> ---
>  fs/gfs2/lops.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
> index cb2a11b458c6..53d2dbf6605e 100644
> --- a/fs/gfs2/lops.c
> +++ b/fs/gfs2/lops.c
> @@ -104,7 +104,15 @@ static void gfs2_unpin(struct gfs2_sbd
*sdp, struct
> buffer_head *bh,
>       BUG_ON(!buffer_pinned(bh));
>
>       lock_buffer(bh);
> -     mark_buffer_dirty(bh);
> +     /*
> +      * We want to eliminate the local statfs file eventually.
> +      * But, for now, we're simply not going to update it by
> +      * never marking its buffers dirty
> +      */
> +     if (!(bd->bd_gl->gl_name.ln_type == LM_TYPE_INODE &&
> +           bd->bd_gl->gl_object == GFS2_I(sdp->sd_sc_inode)))
> +             mark_buffer_dirty(bh);
> +
>       clear_buffer_pinned(bh);
>
>       if (buffer_is_rgrp(bd))
> --
> 2.20.1

Hi,

This seems dangerous to me. It can only get to gfs2_unpin by trying to
commit buffers for a transaction. If the buffers aren't marked dirty,
that means transactions will be queued to the ail1 list that won't be
fully written. So what happens to them? Do they eventually get freed?

I'm also concerned about a potential impact to performance, since
gfs2_unpin gets called with every metadata buffer that's used.
The additional if checks may not costs us much time-wise, but it's a
pretty hot function.

Can't we accomplish the same thing by making function update_statfs()
never add the buffers to the transaction in the first place?
IOW, by just removing the line:
        gfs2_trans_add_meta(m_ip->i_gl, m_bh);
That way we don't need to worry about its buffer getting pinned,
unpinned and queued to the ail.

Regards,

Bob Peterson

Fair point. I'll post an updated version of this patch that doesn't 
queue the buffer in the first place.


Cheers!
--Abhi


You need to think about correctness at recovery time. It may be faster 
to not write the data into the journal for the local statfs file, but 
how will that affect recovery depending on whether that recovery is 
performed by either a newer or older kernel? Being backwards compatible 
might be more important in this case, so worth looking at carefully,


Steve.




Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails

2020-07-09 Thread Steven Whitehouse

Hi,

On 08/07/2020 17:54, Christoph Hellwig wrote:

On Wed, Jul 08, 2020 at 02:54:37PM +0100, Matthew Wilcox wrote:

Direct I/O isn't deterministic though.  If the file isn't shared, then
it works great, but as soon as you get mixed buffered and direct I/O,
everything is already terrible.  Direct I/Os perform pagecache lookups
already, but instead of using the data that we found in the cache, we
(if it's dirty) write it back, wait for the write to complete, remove
the page from the pagecache and then perform another I/O to get the data
that we just wrote out!  And then the app that's using buffered I/O has
to read it back in again.

Mostly agreed.  That being said I suspect invalidating clean cache
might still be a good idea.  The original idea was mostly on how
to deal with invalidation failures of any kind, but falling back for
any kind of dirty cache also makes at least some sense.


I have had an objection raised off-list.  In a scenario with a block
device shared between two systems and an application which does direct
I/O, everything is normally fine.  If one of the systems uses tar to
back up the contents of the block device then the application on that
system will no longer see the writes from the other system because
there's nothing to invalidate the pagecache on the first system.

Err, WTF?  If someone accesses shared block devices with random
applications all bets are off anyway.


On GFS2 the locking should take care of that. Not 100% sure about OCFS2 
without looking, but I'm fairly sure that they have a similar 
arrangement. So this shouldn't be a problem unless there is an 
additional cluster fs that I'm not aware of that they are using in this 
case. It would be good to confirm which fs they are using,


Steve.




Re: [Cluster-devel] Disentangling address_space and inode

2020-06-10 Thread Steven Whitehouse

Hi,

On 09/06/2020 13:41, Matthew Wilcox wrote:

I have a modest proposal ...

  struct inode {
-   struct address_space i_data;
  }

+struct minode {
+   struct inode i;
+   struct address_space m;
+};

  struct address_space {
-   struct inode *host;
  }

This saves one pointer per inode, and cuts all the pagecache support
from inodes which don't need to have a page cache (symlinks, directories,
pipes, sockets, char devices).

This was born from the annoyance of going from a struct page to a filesystem:
page->mapping->host->i_sb->s_type

That's four pointer dereferences.  This would bring it down to three:
i_host(page->mapping)->i_sb->s_type

I could see (eventually) interfaces changing to pass around a
struct minode *mapping instead of a struct address_space *mapping.  But
I know mapping->host and inode->i_mapping sometimes have some pretty
weird relationships and maybe there's a legitimate usage that can't be
handled by this change.

Every filesystem which does use the page cache would have to be changed
to use a minode instead of an inode, which is why this proposal is so
very modest.  But before I start looking into it properly, I thought
somebody might know why this isn't going to work.

I know about raw devices:
 file_inode(filp)->i_mapping =
 bdev->bd_inode->i_mapping;

and this seems like it should work for that.  I know about coda:
 coda_inode->i_mapping = host_inode->i_mapping;

and this seems like it should work there too.

DAX just seems confused:
 inode->i_mapping = __dax_inode->i_mapping;
 inode->i_mapping->host = __dax_inode;
 inode->i_mapping->a_ops = _dax_aops;

GFS2 might need to embed an entire minode instead of just a mapping in its
glocks and its superblock:
fs/gfs2/glock.c:mapping->host = s->s_bdev->bd_inode;
fs/gfs2/ops_fstype.c:   mapping->host = sb->s_bdev->bd_inode;


I don't think that will scale. We did gain a big reduction in overhead 
for each cached inode when we stopped using two struct inodes and just 
embedded an address_space in the glock. However, I'm fairly sure that 
for the glock address_space case, we already have our own way to find 
the associated inode. So it might well be ok to do this anyway, and not 
need to embed a full minode.


Also, if there was a better way to track metadata on a per inode basis, 
then that would be an even better solution, but a much bigger project too.


The issue that you might run across is for stacked filesystems... will 
you land up finding the correct layer in the stack?


Steve.




NILFS ... I don't understand at all.  It seems to allocate its own
private address space in nilfs_inode_info instead of using i_data (why?)
and also allocate more address spaces for metadata inodes.
fs/nilfs2/page.c:   mapping->host = inode;

So that will need to be understood, but is there a fundamental reason
this won't work?

Advantages:
  - Eliminates a pointer dereference when moving from mapping to host
  - Shrinks all inodes by one pointer
  - Shrinks inodes used for symlinks, directories, sockets, pipes & char
devices by an entire struct address_space.

Disadvantages:
  - Churn
  - Seems like it'll grow a few data structures in less common filesystems
(but may be important for some users)





Re: [Cluster-devel] [GFS2 PATCH] gfs2: fix trans slab error when withdraw occurs inside log_flush

2020-06-09 Thread Steven Whitehouse

Hi,

On 09/06/2020 14:55, Bob Peterson wrote:

Hi,

Log flush operations (gfs2_log_flush()) can target a specific transaction.
But if the function encounters errors (e.g. io errors) and withdraws,
the transaction was only freed if it was queued to one of the ail lists.
If the withdraw occurred before the transaction was queued to the ail1
list, function ail_drain never freed it. The result was:

BUG gfs2_trans: Objects remaining in gfs2_trans on __kmem_cache_shutdown()

This patch makes log_flush() add the targeted transaction to the ail1
list so that function ail_drain() will find and free it properly.

Signed-off-by: Bob Peterson 
---
  fs/gfs2/log.c | 10 ++
  1 file changed, 10 insertions(+)

diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 3e4734431783..2b05415bbc13 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -1002,6 +1002,16 @@ void gfs2_log_flush(struct gfs2_sbd *sdp, struct gfs2_glock *gl, u32 flags)
  
  out:

if (gfs2_withdrawn(sdp)) {
+   /**
+* If the tr_list is empty, we're withdrawing during a log
+* flush that targets a transaction, but the transaction was
+* never queued onto any of the ail lists. Here we add it to
+* ail1 just so that ail_drain() will find and free it.
+*/
+   spin_lock(&sdp->sd_ail_lock);
+   if (tr && list_empty(&tr->tr_list))
+   list_add(&tr->tr_list, &sdp->sd_ail1_list);
+   spin_unlock(&sdp->sd_ail_lock);
ail_drain(sdp); /* frees all transactions */
tr = NULL;
}

I'm not sure quite what the aim is here... are you sure that it is ok to 
move something to the AIL list if there has been a withdrawal? If the 
log flush has not completed correctly then we should not be moving 
anything to the AIL lists I think,


Steve.




Re: [Cluster-devel] [PATCH] mkfs.gfs2: Don't use optimal_io_size when equal to minimum

2020-05-27 Thread Steven Whitehouse

Hi,

On 27/05/2020 11:02, Andrew Price wrote:

On 27/05/2020 09:53, Steven Whitehouse wrote:

Hi,

On 27/05/2020 09:29, Andrew Price wrote:
Some devices report an optimal_io_size of 512 instead of 0 when it's not
larger than the minimum_io_size. Currently mkfs.gfs2 uses the non-zero
value to choose the block size, which is almost certainly not what we
want when it's 512. Update the suitability check for optimal_io_size to
avoid using it when it's the same as minimum_io_size.  The effect is
that we fall back to using the default block size, 4096.

Resolves: rhbz#1839219
Signed-off-by: Andrew Price 


What about for other sizes? We don't really want to select a block 
size to be anything other than 4k in most cases, even if the block 
device offers a lower minimum/optimal I/O size,


I think it would be unusual for a device to have an optimal_io_size > 
512 and < 4K, and I expect it's only 512 for the tested device because 
of some design or configuration error, so we're still covering most 
cases anyway. I figured that any other optimal_io_size is likely to be 
one that's there for a good reason and so we should probably use it 
(if suitable).


That's just my reasoning for this patch though. I can see the value in 
ignoring optimal_io_size < 4K, but it poses the question of whether 
there's any value in allowing mkfs.gfs2 to create < 4K block size 
filesystems at all.


Andy

By default, I can't see any reason why we'd want a block sizes less than 
4k. We might want to allow someone to do that for special cases, but 
generally the lower block sizes cause issue with larger file sizes, due 
to the increased height of the metadata tree. As such we should try and 
avoid them, and ignoring all hints of below 4k seems like a sensible plan.


If someone specifically requests a smaller block size on the command 
line, then that is another thing, but we should try and protect people 
from devices which advertise really small optimal I/O sizes. Really we 
should be using that in combination with the alignment information when 
laying out the larger structures on disk, and not using it for selecting 
the block size - assuming again that these sizes have been set by the 
device to something sensible in the first place,


Steve.





---
  gfs2/mkfs/main_mkfs.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gfs2/mkfs/main_mkfs.c b/gfs2/mkfs/main_mkfs.c
index 846b341f..8b97f3d2 100644
--- a/gfs2/mkfs/main_mkfs.c
+++ b/gfs2/mkfs/main_mkfs.c
@@ -505,7 +505,7 @@ static unsigned choose_blocksize(struct mkfs_opts *opts)

  }
  if (!opts->got_bsize && got_topol) {
  if (dev->optimal_io_size <= getpagesize() &&
-    dev->optimal_io_size >= dev->minimum_io_size)
+    dev->optimal_io_size > dev->minimum_io_size)
  bsize = dev->optimal_io_size;
  else if (dev->physical_sector_size <= getpagesize() &&
   dev->physical_sector_size >= GFS2_DEFAULT_BSIZE)





Re: [Cluster-devel] [PATCH] mkfs.gfs2: Don't use optimal_io_size when equal to minimum

2020-05-27 Thread Steven Whitehouse

Hi,

On 27/05/2020 09:29, Andrew Price wrote:

Some devices report an optimal_io_size of 512 instead of 0 when it's not
larger than the minimum_io_size. Currently mkfs.gfs2 uses the non-zero
value to choose the block size, which is almost certainly not what we
want when it's 512. Update the suitability check for optimal_io_size to
avoid using it when it's the same as minimum_io_size.  The effect is
that we fall back to using the default block size, 4096.

Resolves: rhbz#1839219
Signed-off-by: Andrew Price 


What about for other sizes? We don't really want to select a block size 
to be anything other than 4k in most cases, even if the block device 
offers a lower minimum/optimal I/O size,


Steve.



---
  gfs2/mkfs/main_mkfs.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gfs2/mkfs/main_mkfs.c b/gfs2/mkfs/main_mkfs.c
index 846b341f..8b97f3d2 100644
--- a/gfs2/mkfs/main_mkfs.c
+++ b/gfs2/mkfs/main_mkfs.c
@@ -505,7 +505,7 @@ static unsigned choose_blocksize(struct mkfs_opts *opts)
}
if (!opts->got_bsize && got_topol) {
if (dev->optimal_io_size <= getpagesize() &&
-   dev->optimal_io_size >= dev->minimum_io_size)
+   dev->optimal_io_size > dev->minimum_io_size)
bsize = dev->optimal_io_size;
else if (dev->physical_sector_size <= getpagesize() &&
 dev->physical_sector_size >= GFS2_DEFAULT_BSIZE)




Re: [Cluster-devel] [PATCH] Move struct gfs2_rgrp_lvb out of gfs2_ondisk.h

2020-01-15 Thread Steven Whitehouse

Hi,

On 15/01/2020 09:24, Andreas Gruenbacher wrote:

On Wed, Jan 15, 2020 at 9:58 AM Steven Whitehouse  wrote:

On 15/01/2020 08:49, Andreas Gruenbacher wrote:

There's no point in sharing the internal structure of lock value blocks
with user space.

The reason that is in ondisk is that changing that structure is
something that needs to follow the same rules as changing the on disk
structures. So it is there as a reminder of that,

I can see a point in that. The reason I've posted this is because Bob
was complaining that changes to include/uapi/linux/gfs2_ondisk.h break
his out-of-tree module build process. (One of the patches I'm working
on adds an inode LVB.) The same would be true of on-disk format
changes as well of course, and those definitely need to be shared with
user space. I'm not usually building gfs2 out of tree, so I'm
indifferent to this change.

Thanks,
Andreas

Why would we need to be able to build gfs2 (at least I assume it is 
gfs2) out of tree anyway?


Steve.




Re: [Cluster-devel] [PATCH] Move struct gfs2_rgrp_lvb out of gfs2_ondisk.h

2020-01-15 Thread Steven Whitehouse

Hi,

On 15/01/2020 08:49, Andreas Gruenbacher wrote:

There's no point in sharing the internal structure of lock value blocks
with user space.


The reason that is in ondisk is that changing that structure is 
something that needs to follow the same rules as changing the on disk 
structures. So it is there as a reminder of that,


Steve.




Signed-off-by: Andreas Gruenbacher 
---
  fs/gfs2/glock.h  |  1 +
  fs/gfs2/incore.h |  1 +
  fs/gfs2/rgrp.c   | 10 ++
  include/uapi/linux/gfs2_ondisk.h | 10 --
  4 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
index b8adaf80e4c5..d2f2dba05a94 100644
--- a/fs/gfs2/glock.h
+++ b/fs/gfs2/glock.h
@@ -306,4 +306,5 @@ static inline void glock_clear_object(struct gfs2_glock *gl, void *object)
	spin_unlock(&gl->gl_lockref.lock);
  }
  
+

  #endif /* __GLOCK_DOT_H__ */
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index b5d9c11f4901..5155389e9b5c 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -33,6 +33,7 @@ struct gfs2_trans;
  struct gfs2_jdesc;
  struct gfs2_sbd;
  struct lm_lockops;
+struct gfs2_rgrp_lvb;
  
  typedef void (*gfs2_glop_bh_t) (struct gfs2_glock *gl, unsigned int ret);
  
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c

index 2466bb44a23c..1165627274cf 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -46,6 +46,16 @@
  #define LBITSKIP00 (0x0000000000000000UL)
  #endif
  
+struct gfs2_rgrp_lvb {

+   __be32 rl_magic;
+   __be32 rl_flags;
+   __be32 rl_free;
+   __be32 rl_dinodes;
+   __be64 rl_igeneration;
+   __be32 rl_unlinked;
+   __be32 __pad;
+};
+
  /*
   * These routines are used by the resource group routines (rgrp.c)
   * to keep track of block allocation.  Each block is represented by two
diff --git a/include/uapi/linux/gfs2_ondisk.h b/include/uapi/linux/gfs2_ondisk.h
index 2dc10a034de1..4e9a80941bec 100644
--- a/include/uapi/linux/gfs2_ondisk.h
+++ b/include/uapi/linux/gfs2_ondisk.h
@@ -171,16 +171,6 @@ struct gfs2_rindex {
  #define GFS2_RGF_NOALLOC  0x00000008
  #define GFS2_RGF_TRIMMED  0x00000010
  
-struct gfs2_rgrp_lvb {

-   __be32 rl_magic;
-   __be32 rl_flags;
-   __be32 rl_free;
-   __be32 rl_dinodes;
-   __be64 rl_igeneration;
-   __be32 rl_unlinked;
-   __be32 __pad;
-};
-
  struct gfs2_rgrp {
struct gfs2_meta_header rg_header;
  




Re: [Cluster-devel] [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read

2019-11-27 Thread Steven Whitehouse

Hi,

On 25/11/2019 17:05, Linus Torvalds wrote:

On Mon, Nov 25, 2019 at 2:53 AM Steven Whitehouse  wrote:

Linus, is that roughly what you were thinking of?

So the concept looks ok, but I don't really like the new flags as they
seem to be gfs2-specific rather than generic.

That said, I don't _hate_ them either, since they aren't in any
critical code sequence, and it's not like they do anything really odd.

I still think the whole gfs2 approach is broken. You're magically ok
with using stale data from the cache just because it's cached, even if
another client might have truncated the file or something.


If another node tries to truncate the file, that will require an 
exclusive glock, and in turn that means the all the other nodes will 
have to drop their glock(s) shared or exclusive. That process 
invalidates the page cache on those nodes, such that any further 
requests on those nodes will find the cache empty and have to call into 
the filesystem.


If a page is truncated on another node, then when the local node gives 
up its glock, after any copying (e.g. for read) has completed then the 
truncate will take place. The local node will then have to reread any 
data relating to new pages or return an error in case the next page to 
be read has vanished due to the truncate. It is a pretty small window, 
and the advantage is that in cases where the page is in cache, we can 
directly use the cached page without having to call into the filesystem 
at all. So it is page atomic in that sense.


The overall aim here is to avoid taking (potentially slow) cluster locks 
when at all possible, yet at the same time deliver close to local fs 
semantics whenever we can. You can think of GFS2's glock concept (at 
least as far as the inodes we are discussing here) as providing a 
combination of (page) cache control and cluster (dlm) locking.




So you're ok with saying "the file used to be X bytes in size, so
we'll just give you this data because we trust that the X is correct".

But you're not ok to say "oh, the file used to be X bytes in size, but
we don't want to give you a short read because it might not be correct
any more".

See the disconnect? You trust the size in one situation, but not in another one.


Well we are not trusting the size at all... the original algorithm 
worked entirely off "is this page in cache and uptodate?" and for 
exactly the reason that we know the size in the inode might be out of 
date, if we are not currently holding a glock in either shared or 
exclusive mode. We also know that if there is a page in cache and 
uptodate then we must be holding the glock too.





I also don't really see that you *need* the new flag at all. Since
you're doing to do a speculative read and then a real read anyway, and
since the only thing that you seem to care about is the file size
(because the *data* you will trust if it is cached), then why don't
you just use the *existing* generic read, and *IFF* you get a
truncated return value, then you go and double-check that the file
hasn't changed in size?

See what I'm saying? I think gfs2 is being very inconsistent in when
it trusts the file size, and I don't see that you even need the new
behavior that patch gives, because you might as well just use the
existing code (just move the i_size check earlier, and then teach gfs2
to double-check the "I didn't get as much as I expected" case).

  Linus


I'll leave the finer details to Andreas here, since it is his patch, and 
hopefully we can figure out a good path forward. We are perhaps also a 
bit reluctant to go off and (nearly) duplicate code that is already in 
the core vfs library functions, since that often leads to things getting 
out of sync (our implementation of ->writepages is one case where that 
happened in the past) and missing important bug fixes/features in some 
cases. Hopefully though we can iterate on this a bit and come up with 
something which will resolve all the issues,


Steve.








Re: [Cluster-devel] [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read

2019-11-25 Thread Steven Whitehouse

Hi,

On 22/11/2019 23:59, Andreas Grünbacher wrote:

Hi,

Am Do., 31. Okt. 2019 um 12:43 Uhr schrieb Steven Whitehouse
:

Andreas, Bob, have I missed anything here?

I've looked into this a bit, and it seems that there's a reasonable
way to get rid of the lock taking in ->readpage and ->readpages
without a lot of code duplication. My proposal for that consists of
multiple patches, so I've posted it separately:

https://lore.kernel.org/linux-fsdevel/20191122235324.17245-1-agrue...@redhat.com/T/#t

Thanks,
Andreas


Andreas, thanks for taking a look at this.

Linus, is that roughly what you were thinking of?

Ronnie, Steve, can the same approach perhaps work for CIFS?

Steve.







Re: [Cluster-devel] [PATCH 00/32] gfs2: misc recovery patch collection

2019-11-14 Thread Steven Whitehouse

Hi,

There are a lot of useful fixes in this series. We should consider how 
many of them should go to -stable I think. Also we should start to get 
them integrated upstream. Might be a good plan to sort out the more 
obvious ones and send those right away, and then do anything which needs 
a bit more review in a second pass,


Steve.

On 13/11/2019 21:29, Bob Peterson wrote:

This is my latest collection of patches to address the myriad of gfs2
recovery problems I've found. I'm not convinced we need all of these
but I thought I'd send them anyway and get feedback.

Some of these are just bugs and may be pushed separately.

Bob Peterson (32):
   gfs2: Introduce concept of a pending withdraw
   gfs2: clear ail1 list when gfs2 withdraws
   gfs2: Rework how rgrp buffer_heads are managed
   gfs2: fix infinite loop in gfs2_ail1_flush on io error
   gfs2: log error reform
   gfs2: Only complain the first time an io error occurs in quota or log
   gfs2: Ignore dlm recovery requests if gfs2 is withdrawn
   gfs2: move check_journal_clean to util.c for future use
   gfs2: Allow some glocks to be used during withdraw
   gfs2: Don't loop forever in gfs2_freeze if withdrawn
   gfs2: Make secondary withdrawers wait for first withdrawer
   gfs2: Don't write log headers after file system withdraw
   gfs2: Force withdraw to replay journals and wait for it to finish
   gfs2: fix infinite loop when checking ail item count before go_inval
   gfs2: Add verbose option to check_journal_clean
   gfs2: Abort gfs2_freeze if io error is seen
   gfs2: Issue revokes more intelligently
   gfs2: Prepare to withdraw as soon as an IO error occurs in log write
   gfs2: Check for log write errors before telling dlm to unlock
   gfs2: new slab for transactions
   gfs2: Close timing window with GLF_INVALIDATE_IN_PROGRESS
   gfs2: Do log_flush in gfs2_ail_empty_gl even if ail list is empty
   gfs2: Don't skip log flush if glock still has revokes
   gfs2: initialize tr_ail1_list when creating transactions
   gfs2: Withdraw in gfs2_ail1_flush if write_cache_pages returns error
   gfs2: drain the ail2 list after io errors
   gfs2: make gfs2_log_shutdown static
   gfs2: Eliminate GFS2_RDF_UPTODATE flag in favor of buffer existence
   gfs2: if finish_open returns error, clean up iopen glock mess
   gfs2: Don't demote a glock until its revokes are written
   gfs2: Do proper error checking for go_sync family of glops functions
   gfs2: fix glock reference problem in gfs2_trans_add_unrevoke

  fs/gfs2/aops.c   |   4 +-
  fs/gfs2/file.c   |   2 +-
  fs/gfs2/glock.c  | 140 ++
  fs/gfs2/glops.c  | 153 ++--
  fs/gfs2/incore.h |  21 ++--
  fs/gfs2/inode.c  |   6 ++
  fs/gfs2/lock_dlm.c   |  52 ++
  fs/gfs2/log.c| 231 +-
  fs/gfs2/log.h|   2 +-
  fs/gfs2/lops.c   |  12 ++-
  fs/gfs2/main.c   |  23 +
  fs/gfs2/meta_io.c|   6 +-
  fs/gfs2/ops_fstype.c |  51 +-
  fs/gfs2/quota.c  |  10 +-
  fs/gfs2/recovery.c   |   5 +
  fs/gfs2/rgrp.c   |  82 +--
  fs/gfs2/rgrp.h   |   1 -
  fs/gfs2/super.c  |  97 --
  fs/gfs2/sys.c|   2 +-
  fs/gfs2/trans.c  |  38 +--
  fs/gfs2/trans.h  |   1 +
  fs/gfs2/util.c   | 235 +--
  fs/gfs2/util.h   |  16 +++
  23 files changed, 924 insertions(+), 266 deletions(-)





Re: [Cluster-devel] [PATCH 30/32] gfs2: Don't demote a glock until its revokes are written

2019-11-14 Thread Steven Whitehouse

Hi,

On 13/11/2019 21:30, Bob Peterson wrote:

Before this patch, run_queue would demote glocks based on whether
there are any more holders. But if the glock has pending revokes that
haven't been written to the media, giving up the glock might end in
file system corruption if the revokes never get written due to
io errors, node crashes and fences, etc. In that case, another node
will replay the metadata blocks associated with the glock, but
because the revoke was never written, it could replay that block
even though the glock had since been granted to another node who
might have made changes.

This patch changes the logic in run_queue so that it never demotes
a glock until its count of pending revokes reaches zero.

Signed-off-by: Bob Peterson 


I'm not sure this makes sense... if we demote the glock then the revokes 
should be written out during that process. So if that is not happening 
it is a bug. I don't think we should need to change the logic for 
deciding what we are going to demote?


Steve.


---
  fs/gfs2/glock.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index ab72797e3ba1..082f70eb96db 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -712,6 +712,9 @@ __acquires(&gl->gl_lockref.lock)
goto out_unlock;
if (nonblock)
goto out_sched;
+   smp_mb();
+   if (atomic_read(&gl->gl_revokes) != 0)
+   goto out_sched;
set_bit(GLF_DEMOTE_IN_PROGRESS, >gl_flags);
GLOCK_BUG_ON(gl, gl->gl_demote_state == LM_ST_EXCLUSIVE);
gl->gl_target = gl->gl_demote_state;




Re: [Cluster-devel] [PATCH 28/32] gfs2: Eliminate GFS2_RDF_UPTODATE flag in favor of buffer existence

2019-11-14 Thread Steven Whitehouse

Hi,

On 13/11/2019 21:30, Bob Peterson wrote:

Before this patch, the rgrp code used two different methods to check
if the rgrp information was up-to-date: (1) The GFS2_RDF_UPTODATE flag
in the rgrp and (2) the existence (or not) of valid buffer_head
pointers in the first bitmap. When the buffer_heads are read in from
media, the rgrp is, by definition, up to date. When the rgrp glock is
invalidated, the buffer_heads are released, thereby indicating the
rgrp is no longer up to date (another node may have changed it).
So we don't need both of these flags. This patch eliminates the flag
in favor of simply checking if the buffer_head pointers exist.
This simplifies the code. It also makes it more bullet-proof:
if there are two methods, they can possibly get out of sync. With
one method, there's no way to get out of sync, and debugging is
easier.

Signed-off-by: Bob Peterson 


These are two different things... the buffer_head flags signal whether 
the buffer head is up to date with respect to what is on disk. The 
GFS2_RDF_UPTODATE flag is there to indicate whether the internal copy of 
the various fields in the resource group is up to date.


These might match depending on how the rgrp's internal copy of the 
fields is maintained, but not sure that this is guaranteed. Has this 
been tested with the rgrplvb option? We should make sure that is all 
still working correctly,


Steve.



---
  fs/gfs2/glops.c  |  3 ---
  fs/gfs2/incore.h |  1 -
  fs/gfs2/rgrp.c   | 22 +++---
  3 files changed, 11 insertions(+), 15 deletions(-)

diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index 4072f37e4278..183fd7cbdbc1 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -213,9 +213,6 @@ static void rgrp_go_inval(struct gfs2_glock *gl, int flags)
  
  	WARN_ON_ONCE(!(flags & DIO_METADATA));

truncate_inode_pages_range(mapping, gl->gl_vm.start, gl->gl_vm.end);
-
-   if (rgd)
-   rgd->rd_flags &= ~GFS2_RDF_UPTODATE;
  }
  
  static struct gfs2_inode *gfs2_glock2inode(struct gfs2_glock *gl)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index a15ddd2f9bf4..61be366a2fa7 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -113,7 +113,6 @@ struct gfs2_rgrpd {
u32 rd_flags;
u32 rd_extfail_pt;  /* extent failure point */
   #define GFS2_RDF_CHECK	0x10000000 /* check for unlinked inodes */
-#define GFS2_RDF_UPTODATE	0x20000000 /* rg is up to date */
   #define GFS2_RDF_ERROR	0x40000000 /* error in rg */
   #define GFS2_RDF_PREFERRED	0x80000000 /* This rgrp is preferred */
   #define GFS2_RDF_MASK	0xf0000000 /* mask for internal flags */
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 10d3397ed3cd..e5eba83a1a42 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -939,7 +939,7 @@ static int read_rindex_entry(struct gfs2_inode *ip)
goto fail;
  
  	rgd->rd_rgl = (struct gfs2_rgrp_lvb *)rgd->rd_gl->gl_lksb.sb_lvbptr;

-   rgd->rd_flags &= ~(GFS2_RDF_UPTODATE | GFS2_RDF_PREFERRED);
+   rgd->rd_flags &= ~GFS2_RDF_PREFERRED;
if (rgd->rd_data > sdp->sd_max_rg_data)
sdp->sd_max_rg_data = rgd->rd_data;
	spin_lock(&sdp->sd_rindex_spin);
@@ -1214,15 +1214,15 @@ static int gfs2_rgrp_bh_get(struct gfs2_rgrpd *rgd)
}
}
  
-	if (!(rgd->rd_flags & GFS2_RDF_UPTODATE)) {

-   for (x = 0; x < length; x++)
-   clear_bit(GBF_FULL, &rgd->rd_bits[x].bi_flags);
-   gfs2_rgrp_in(rgd, (rgd->rd_bits[0].bi_bh)->b_data);
-   rgd->rd_flags |= (GFS2_RDF_UPTODATE | GFS2_RDF_CHECK);
-   rgd->rd_free_clone = rgd->rd_free;
-   /* max out the rgrp allocation failure point */
-   rgd->rd_extfail_pt = rgd->rd_free;
-   }
+   for (x = 0; x < length; x++)
+   clear_bit(GBF_FULL, &rgd->rd_bits[x].bi_flags);
+
+   gfs2_rgrp_in(rgd, (rgd->rd_bits[0].bi_bh)->b_data);
+   rgd->rd_flags |= GFS2_RDF_CHECK;
+   rgd->rd_free_clone = rgd->rd_free;
+   /* max out the rgrp allocation failure point */
+   rgd->rd_extfail_pt = rgd->rd_free;
+
if (cpu_to_be32(GFS2_MAGIC) != rgd->rd_rgl->rl_magic) {
rgd->rd_rgl->rl_unlinked = cpu_to_be32(count_unlinked(rgd));
gfs2_rgrp_ondisk2lvb(rgd->rd_rgl,
@@ -1254,7 +1254,7 @@ static int update_rgrp_lvb(struct gfs2_rgrpd *rgd)
  {
u32 rl_flags;
  
-	if (rgd->rd_flags & GFS2_RDF_UPTODATE)

+   if (rgd->rd_bits[0].bi_bh)
return 0;
  
  	if (cpu_to_be32(GFS2_MAGIC) != rgd->rd_rgl->rl_magic)




Re: [Cluster-devel] [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read

2019-10-31 Thread Steven Whitehouse

Hi,

On 30/10/2019 10:54, Linus Torvalds wrote:

On Wed, Oct 30, 2019 at 11:35 AM Steven Whitehouse  wrote:

NFS may be ok here, but it will break GFS2. There may be others too...
OCFS2 is likely one. Not sure about CIFS either. Does it really matter
that we might occasionally allocate a page and then free it again?

Why are gfs2 and cifs doing things wrong?

For CIFS I've added Ronnie and Steve to common on that.

"readpage()" is not for synchronizing metadata. Never has been. You
shouldn't treat it that way, and you shouldn't then make excuses for
filesystems that treat it that way.

Look at mmap, for example. It will do the SIGBUS handling before
calling readpage(). Same goes for the copyfile code. A filesystem that
thinks "I will update size at readpage" is already fundamentally
buggy.

We do _recheck_ the inode size under the page lock, but that's to
handle the races with truncate etc.

 Linus


For the GFS2 side of things, the algorithm looks like this:

 - Is there an uptodate page in cache?

   Yes, return it

   No, call into the fs readpage() to get one

This is designed so that for pages that are available in the page cache, 
we don't even need to call into the filesystem at all. It is all dealt 
with at the page cache level, unless the page doesn't exist. At this 
point we don't know what the i_size might be, and prior to the proposed 
patch, it simply doesn't matter, since we will ask the filesystem via 
->readpage() for all pages which are not in the cache.


If the page doesn't exist, we have to take the cluster level locks 
(glocks in the case of GFS2) which are potentially expensive, certainly 
a lot more expensive than the page lock anyway. That is currently done 
at the ->readpage() level, although we do have to drop the page lock 
first and then get the locks in the correct order, since the lock 
ordering requires the glock to be taken in shared mode ahead of the page 
lock.


We've always in the past been able to just use the generic code, since 
it was written to not assume i_size was valid outside of the fs specific 
locks. The aim has always been to try and use generic code as much as 
possible, even though there are some cases where we've had to depart 
from that for various reasons.


It appears that the filemap_fault issue seems to have not been spotted 
before. I'm not quite sure how that was missed - seems to show that we 
have some missing tests, but I agree that it does need to be fixed. It 
is a while since I last looked at that particular bit of code in detail, 
so my memory may be a bit fuzzy.


Andreas, Bob, have I missed anything here?

Steve.






Re: [Cluster-devel] [PATCH] mm/filemap: do not allocate cache pages beyond end of file at read

2019-10-30 Thread Steven Whitehouse

Hi,

On 29/10/2019 16:52, Linus Torvalds wrote:

On Tue, Oct 29, 2019 at 3:25 PM Konstantin Khlebnikov
 wrote:

I think all network filesystems which synchronize metadata lazily should be
marked. For example as "SB_VOLATILE". And vfs could handle them specially.

No need. The VFS layer doesn't call generic_file_buffered_read()
directly anyway. It's just a helper function for filesystems to use if
they want to.

They could (and should) make sure the inode size is sufficiently
up-to-date before calling it. And if they want something more
synchronous, they can do it themselves.

But NFS, for example, has open/close consistency, so the metadata
revalidation is at open() time, not at read time.

Linus


NFS may be ok here, but it will break GFS2. There may be others too... 
OCFS2 is likely one. Not sure about CIFS either. Does it really matter 
that we might occasionally allocate a page and then free it again?


Ramfs is a simple test case, but at the same time it doesn't represent 
the complexity of a real world filesystem. I'm just back from a few days 
holiday so apologies if I've missed something earlier on in the discussions,


Steve.



Re: [Cluster-devel] Interest in DAX for OCFS2 and/or GFS2?

2019-10-11 Thread Steven Whitehouse

Hi,

On 11/10/2019 08:21, Gang He wrote:

Hello hayes,


-Original Message-
From: cluster-devel-boun...@redhat.com
[mailto:cluster-devel-boun...@redhat.com] On Behalf Of Hayes, Bill
Sent: 2019年10月11日 0:42
To: ocfs2-de...@oss.oracle.com; cluster-devel@redhat.com
Cc: Rocky (The good-looking one) Craig 
Subject: [Cluster-devel] Interest in DAX for OCFS2 and/or GFS2?

We have been experimenting with distributed file systems across multiple
Linux instances connected to a shared block device.  In our setup, the "disk" is
not a legacy SAN or iSCSI.  Instead it is a shared memory-semantic fabric
that is being presented as a Linux block device.

We have been working with both GFS2 and OCFS2 to evaluate the suitability
to work on our shared memory configuration.  Right now we have gotten
both GFS2 and OCFS2 to work with block driver but each file system still does
block copies.  Our goal is to extend mmap() of the file system(s) to allow true
zero-copy load/store access directly to the memory fabric.  We believe
adding DAX support into the OCFS2 and/or GFS2 is an expedient path to use a
block device that fronts our memory fabric with DAX.

Based on the HW that OCFS2 and GFS2 were built for (iSCSI, FC, DRDB, etc)
there probably has been no reason to implement DAX to date.  The advent of
various memory semantic fabrics (Gen-Z, NUMAlink, etc) is driving our
interest in extending OCFS2 and/or GFS2 to take advantage of DAX.  We
have two platforms set up, one based on actual hardware and another based
on VMs and are eager to begin deeper work.

Has there been any discussion or interest in DAX support in OCFS2?

No, but I think this is a very interesting topic/feature.
I hope we can put some effort into investigating how to make OCFS2 support DAX, 
since some local file systems have supported this feature for a long time.


Well, I think it is more accurate to say that the feature has been 
evolving in local filesystems for some time. However, it is moving 
towards time where it makes sense to think about this for clustered 
filesystems, so this is a timely topic for discussion in that sense.




Is there interest from the OCFS2 development community to see DAX support
developed and put upstream?

From my personal view, it is very attractive.
But we are also aware that cluster file systems are usually based on DLM, and DLM 
nodes usually communicate with each other via the network.
That means network latency should be considered.

Thanks
Gang


Hopefully we can come up with a design which avoids the network latency, 
at least in most cases. With GFS2 direct_io for example, the locking is 
designed such that DLM lock requests are only needed in case of block 
allocation/deallocation. Extending the same concept to DAX should allow 
(after the initial page fault) true DSM via the filesystem. It may be 
able to do even better eventually, but that would be a good starting point.


It has not been something that the GFS2 developers have looked at in any 
detail recently, however it is something that would be interesting, and 
we'd be very happy for someone to work on this and send patches in due 
course,


Steve.





Has there been any discussion or interest in DAX support in GFS2?
Is there interest from the GFS2 development community to see DAX support
developed and put upstream?

Regards,
Bill









Re: [Cluster-devel] [GFS2 PATCH 2/2] gfs2: Use async glocks for rename

2019-08-15 Thread Steven Whitehouse

Hi,

On 15/08/2019 14:41, Bob Peterson wrote:

I just noticed the parameter comments for gfs2_glock_async_wait
are wrong, and I've fixed them in a newer version. I can post
the new version after I get people's initial reactions.

Bob


Overall this looks like a much better approach. We know that this 
doesn't happen very often, so the path which involves the timeout should 
be very rarely taken. The problem is how to select a suitable timeout... 
is 2 secs enough? Can we land up with a DoS in certain situations? 
Hopefully not, but we should take care.
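To make the back-off idea concrete, here is a userspace sketch (not the gfs2 implementation — the toy lock type and function names are made up for illustration): try to take every lock in the set, and if any one of them cannot be had, release the partial set and report failure so the caller can back off and retry, rather than blocking in a fixed order while holding some of the locks.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* A toy lock, standing in for a glock holder request. */
struct toylock { atomic_flag held; };

static bool try_lock(struct toylock *l)
{
	return !atomic_flag_test_and_set(&l->held);
}

static void toy_unlock(struct toylock *l)
{
	atomic_flag_clear(&l->held);
}

/* Try to take every lock; on any failure, release what we already hold
 * and report failure so the caller can back off and retry.  This mirrors
 * the idea behind acquiring the glocks asynchronously and waiting for
 * the full set: either we end up holding everything, or we drop out
 * cleanly without holding a partial set in some fixed order. */
static bool lock_all_or_back_off(struct toylock **locks, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!try_lock(locks[i])) {
			while (i--)	/* undo the partial acquisition */
				toy_unlock(locks[i]);
			return false;
		}
	}
	return true;
}
```

The timeout question then becomes how long to wait before giving up on the set and backing off, which is where the DoS concern comes in.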


The shared wait queue might also be an issue in terms of contention, so 
it might be worth looking at how to avoid that. Generally though, this 
is looking very promising I think,


Steve.



- Original Message -

Because s_vfs_rename_mutex is not cluster-wide, multiple nodes can
reverse the roles of which directories are "old" and which are "new"
for the purposes of rename. This can cause deadlocks where two nodes
can lock old-then-new but since their roles are reversed, they wait
for each other.

This patch fixes the problem by acquiring all of gfs2_rename's inode
glocks asynchronously and waiting for all glocks to be acquired.
That way all inodes are locked regardless of the order.

Signed-off-by: Bob Peterson 
---

(snip)

+ * gfs2_glock_async_wait - wait on multiple asynchronous glock acquisitions
+ * @gh: the glock holder

(snip)

+int gfs2_glock_async_wait(unsigned int num_gh, struct gfs2_holder *ghs)






Re: [Cluster-devel] [GFS2 PATCH] gfs2: eliminate circular lock dependency in inode.c

2019-08-12 Thread Steven Whitehouse

Hi,

On 12/08/2019 14:43, Bob Peterson wrote:

- Original Message -

The real problem came with renames, though. Function
gfs2_rename(), which locked a series of inode glocks, did so
in parent-child order due to that patch. But it was still
possible to create circular lock dependencies just by doing the
wrong combination of renames on different nodes. For example:

Node a: mv /mnt/gfs2/sub /mnt/gfs2/tmp_name (rename sub to tmp_name)

a1. Same directory, so rename glock is NOT held
a2. /mnt/gfs2 is locked
a3. Tries to lock sub for rename, but it is locked on node b

Node b: mv /mnt/gfs2/sub /mnt/gfs2/dir1/ (move sub to dir1...
  mv /mnt/gfs2/dir1/sub /mnt/gfs2/  ...then move it back)

b1. Different directory, so rename glock IS held
b2. /mnt/gfs2 is locked
b3. dir1 is locked
b4. sub is moved to dir1 and everything is unlocked
b5. Different directory, so rename glock IS held again
b6. dir1 is locked
b7. Lock for /mnt/gfs2 is requested, but cannot be granted because
  node a locked it in step a2.

If the parents are being locked before the child, as per the correct
locking order, then this cannot happen. The directory in which the child
is located should always be locked first, before the child, so that is
what protects the operation on node a from whatever might be going on on node b.

When you get to step b7, sub is not locked (since it was unlocked in b4)
and not locked again. Thus a3 can complete. So this doesn't look like it
is the right explanation.

Hi,

I guess maybe my explanation is lacking.
It's not so much a relationship between "parent" and "child"
directories as it is "old" and "new" directories.

The comments for function vfs_rename() explain the situations in which
this can happen, and have been prevented on a single node through the
use of s_vfs_rename_mutex. However, that mutex is not cluster-wide,
which means the relationship of which inode is the "old" and which
inode is the "new" can change indiscriminately without notice and
without cluster-wide locking. The whole point of the "a" and "b"
scenarios was to illustrate that one node can lock "old", then "new",
but the other node can reverse the roles of those same inodes (which
is the "old" and which is the "new") and therefore reverse the lock
order without notice.

Since the old-new relationship itself is not protected, we need
some other way to get the lock order correct.

My first attempt to fix this was to extend the "rename" glock to have
a rename-wide reach so it affected both types of renames rather than
today's code which only locks old and new if they're different.
I implemented this with a new i_op called by the vfs (vfs_rename) to make
the rename glock serve as a kind of cluster-wide version of the vfs's
s_vfs_rename_mutex. However, this ended up having a huge performance
penalty for my test.

My second attempt (the patch I posted) was to lock the inodes in
block-number-sort order, because the block number relationships
will never change, regardless of which is old and which is new.
It made no sense to me to reinvent the wheel wrt locking them in
sorted order, so I used gfs2_glock_nq_m which already does that.
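In userspace terms, the sort-before-lock idea looks roughly like this (a minimal sketch; the struct and function names are made up for illustration). Since the disk block number of an inode never changes, every node derives the same acquisition order no matter which inode plays the "old" or "new" role in a given rename:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a glock holder request. */
struct lockreq {
	uint64_t block_no;	/* stable, node-independent key */
};

static int cmp_by_block(const void *a, const void *b)
{
	uint64_t x = (*(const struct lockreq * const *)a)->block_no;
	uint64_t y = (*(const struct lockreq * const *)b)->block_no;

	return (x > y) - (x < y);	/* avoids overflow of x - y */
}

/* Sort the holders by disk block number before acquiring them, so every
 * node takes the same locks in the same order regardless of which inode
 * happens to be "old" and which "new". */
static void sort_lock_order(struct lockreq **reqs, size_t n)
{
	qsort(reqs, n, sizeof(*reqs), cmp_by_block);
}
```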

Regards,

Bob Peterson


We are doing our best to get rid of the _m glock functions. Sorting 
things in block number order is a bit of a hack and it would be better 
to spend the time reducing the number of glocks involved in each 
operation overall.


I have wondered about the performance issues on the rename glock. Simply 
using that for everything is the obvious easy fix, but perhaps not 
surprising that you've seen some performance issues with that approach. 
I wonder if we can come up with a solution to break up the single rename 
glock into separate glocks using a hashing scheme, or some similar 
system. That way we might get the advantages of both: improved speed and 
retention of the parent/child locking.


Either way, changing the lock ordering of lots of other bits of code is 
a non-starter, since then it will be incompatible with the way gfs2 has 
worked since it was created, and also incompatible with the vfs's own 
locking order that is used for local locks too.


Let's see if we can figure out a solution that will just address this 
particular issue on its own,


Steve.




Re: [Cluster-devel] [GFS2 PATCH] gfs2: eliminate circular lock dependency in inode.c

2019-08-12 Thread Steven Whitehouse

Hi,

On 09/08/2019 19:58, Bob Peterson wrote:

Hi,

This patch fixes problems caused by regressions from patch
"GFS2: rm on multiple nodes causes panic" from 2008,
72dbf4790fc6736f9cb54424245114acf0b0038c, which was an earlier
attempt to fix very similar problems.

The original problem for which it was written had to do with
simultaneous link, unlink, rmdir and rename operations on
multiple nodes that interfered with one another, due to the
lock ordering. The problem was that the lock ordering was
not consistent between the operations.

The defective patch put in place to solve it (and hey, it
worked for more than 10 years) changed the lock ordering so
that the parent directory glock was always locked before the
child. This almost always worked. Almost. The rmdir version
was still wrong because the rgrp glock was added to the holder
array, which was sorted, and the locks were acquired in sorted
order. That is counter to the locking requirements documented
in Documentation/filesystems/gfs2-glocks.txt, which states that the
rgrp glock must always be locked after the inode glocks.


Yes, that does need fixing, however it also doesn't entirely make sense, 
because the parent in that case is locked, but is not being removed, so 
its rgrp would not need to be added to the transaction anyway.





The real problem came with renames, though. Function
gfs2_rename(), which locked a series of inode glocks, did so
in parent-child order due to that patch. But it was still
possible to create circular lock dependencies just by doing the
wrong combination of renames on different nodes. For example:

Node a: mv /mnt/gfs2/sub /mnt/gfs2/tmp_name (rename sub to tmp_name)

a1. Same directory, so rename glock is NOT held
a2. /mnt/gfs2 is locked
a3. Tries to lock sub for rename, but it is locked on node b

Node b: mv /mnt/gfs2/sub /mnt/gfs2/dir1/ (move sub to dir1...
 mv /mnt/gfs2/dir1/sub /mnt/gfs2/  ...then move it back)

b1. Different directory, so rename glock IS held
b2. /mnt/gfs2 is locked
b3. dir1 is locked
b4. sub is moved to dir1 and everything is unlocked
b5. Different directory, so rename glock IS held again
b6. dir1 is locked
b7. Lock for /mnt/gfs2 is requested, but cannot be granted because
 node a locked it in step a2.


If the parents are being locked before the child, as per the correct 
locking order, then this cannot happen. The directory in which the child 
is located should always be locked first, before the child, so that is 
what protects the operation on node a from whatever might be going on on node b.


When you get to step b7, sub is not locked (since it was unlocked in b4) 
and not locked again. Thus a3 can complete. So this doesn't look like it 
is the right explanation.





(Note that the nodes must be different, otherwise the vfs inode
level locking prevents the problem on a single node).

Thus, we get into a glock deadlock that looks like this:

host-018:
G:  s:EX n:2/3347 f:DyIqob t:EX d:UN/2368172000 a:0 v:0 r:3 m:150 

[Cluster-devel] The last 64 bits

2019-07-30 Thread Steven Whitehouse

Hi,

We currently have 64 bits left over in the gfs2 metadata header. This is 
currently just padding, although we do zero it out when we add metadata 
blocks to transactions. I would like to ensure that we make the most of 
this space, and I've got a couple of ideas of how best to use it.


Firstly, we should be able to add checksums to our metadata quite 
easily. A crc32 would use half of the space available, and we should 
probably do the checksum at the point where we write the data into the 
log, so that it is then also correct for when it is written back in place.
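For reference, a self-contained version of the reflected CRC-32 in question (the same IEEE polynomial as the kernel's crc32_le(); the function here is a bitwise reimplementation for illustration, not the kernel code):

```c
#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC-32 (IEEE polynomial 0xEDB88320, reflected), matching the
 * behavior of the kernel's crc32_le().  Note the kernel convention: the
 * caller supplies the seed and any final inversion, e.g. a header
 * checksum might be computed as  csum = ~crc32(~0, buf, len)  over the
 * block with the checksum field zeroed. */
static uint32_t crc32_le(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
	}
	return crc;
}
```

With the seed and final inversion applied, this produces the standard CRC-32 check value (0xCBF43926 for the ASCII string "123456789").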


The more tricky issue is what to do with the remaining 32 bits. One 
thought, is to come up with some scheme which (eventually) allows us to 
avoid having to write out revokes to the log. This would significantly 
speed up moving glocks from node to node, halving the number of log 
flushes when metadata has been updated. We could make it into a 
generation number, but is 32 bits enough? By incrementing it 
individually on each bit of metadata each time it goes through the log, 
we would get a better picture of whats going on, rather than just 
copying the low 32 bits of the log sequence number I think. That should 
be enough to make sure that we'd keep out of trouble even if someone is 
using a large (non-default) log size.
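On the "is 32 bits enough?" point: a 32-bit generation number tolerates wraparound if comparisons are done in the usual modular style, as the kernel's time_after() does. A sketch (gen_after is a hypothetical name):

```c
#include <stdint.h>
#include <stdbool.h>

/* Wraparound-safe "a is newer than b" for a 32-bit generation number,
 * in the style of the kernel's time_after() macro.  Valid as long as
 * the two values being compared are within 2^31 of each other; the
 * cast of an unsigned difference to int32_t relies on two's-complement
 * behavior, which the kernel assumes throughout. */
static bool gen_after(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) > 0;
}
```

So the real constraint is not the absolute count but that no comparison ever spans more than half the 32-bit range, which is the concern with large non-default log sizes.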


The question is whether there is anything else we might use those 32 
bits for that might give us an even bigger gain? Any thoughts?


I should mention that this is all for metadata - we'd have to do 
something different for jdata, since it doesn't have a header in this 
sense, but, one thing at a time!


Steve.




Re: [Cluster-devel] [PATCH] fs: gfs2: Fix a null-pointer dereference in gfs2_alloc_inode()

2019-07-24 Thread Steven Whitehouse

Hi,

On 24/07/2019 11:27, Christoph Hellwig wrote:

On Wed, Jul 24, 2019 at 11:22:46AM +0100, Steven Whitehouse wrote:

and it would have the same effect, so far as I can tell. I don't mind
changing it, if that is perhaps a clearer way to write the same thing,
rather than &ip->i_inode;

The cleanest thing is to not rely on any of that magic and write it
like all other file systems:

ip = kmem_cache_alloc(...);
if (!ip)
	return NULL;

...

return &ip->i_inode;

Absolutely not point in trying to be clever here.


Yes, that works too,

Steve.




Re: [Cluster-devel] [PATCH] fs: gfs2: Fix a null-pointer dereference in gfs2_alloc_inode()

2019-07-24 Thread Steven Whitehouse

Hi,

On 24/07/2019 11:02, Christoph Hellwig wrote:

On Wed, Jul 24, 2019 at 09:48:38AM +0100, Steven Whitehouse wrote:

Hi,

On 24/07/2019 09:43, Jia-Ju Bai wrote:

In gfs2_alloc_inode(), when kmem_cache_alloc() on line 1724 returns
NULL, ip is assigned to NULL. In this case, "return &ip->i_inode" will
cause a null-pointer dereference.

To fix this null-pointer dereference, NULL is returned when ip is NULL.

This bug is found by a static analysis tool STCheck written by us.

The bug is in the tool I'm afraid. Since i_inode is the first element of ip,
there is no NULL dereference here. A pointer to ip->i_inode and a pointer to
ip are one and the same (bar the differing types), which is the reason that
we return &ip->i_inode rather than just ip,

But that doesn't help if ip is NULL, as dereferencing a field in it
still remains invalid behavior.


We are not dereferencing it though really, we are taking the address of 
the field... we could have written:


return (struct inode *)ip;

and it would have the same effect, so far as I can tell. I don't mind 
changing it, if that is perhaps a clearer way to write the same thing, 
rather than &ip->i_inode;


Steve.




Re: [Cluster-devel] [BUG] fs: gfs2: possible null-pointer dereferences in gfs2_rgrp_bh_get()

2019-07-24 Thread Steven Whitehouse

Hi,

On 24/07/2019 09:50, Jia-Ju Bai wrote:
In gfs2_rgrp_bh_get, there is an if statement on line 1191 to check 
whether "rgd->rd_bits[0].bi_bh" is NULL.


That is how we detect whether the rgrp has already been read in, so the 
function is skipped in the case that we've already read in the rgrp.




When "rgd->rd_bits[0].bi_bh" is NULL, it is used on line 1216:
    gfs2_rgrp_in(rgd, (rgd->rd_bits[0].bi_bh)->b_data);


No it isn't. See line 1196 where bi_bh is set, and where we also bail 
out (line 1198) in case it has not been set.




and on line 1225:
    gfs2_rgrp_ondisk2lvb(..., rgd->rd_bits[0].bi_bh->b_data);
and on line 1228:
    if (!gfs2_rgrp_lvb_valid(rgd))

Note that in gfs2_rgrp_lvb_valid(rgd), there is a statement on line 1114:
    struct gfs2_rgrp *str = (struct gfs2_rgrp 
*)rgd->rd_bits[0].bi_bh->b_data;


Thus, possible null-pointer dereferences may occur.

These bugs are found by a static analysis tool STCheck written by us.
I do not know how to correctly fix these bugs, so I only report bugs.


Best wishes,
Jia-Ju Bai

So I'm not seeing how there can be a NULL deref in those later lines. I 
think this is another false positive,


Steve.





Re: [Cluster-devel] [PATCH] fs: gfs2: Fix a null-pointer dereference in gfs2_alloc_inode()

2019-07-24 Thread Steven Whitehouse

Hi,

On 24/07/2019 09:43, Jia-Ju Bai wrote:

In gfs2_alloc_inode(), when kmem_cache_alloc() on line 1724 returns
NULL, ip is assigned to NULL. In this case, "return &ip->i_inode" will
cause a null-pointer dereference.

To fix this null-pointer dereference, NULL is returned when ip is NULL.

This bug is found by a static analysis tool STCheck written by us.


The bug is in the tool I'm afraid. Since i_inode is the first element of 
ip, there is no NULL dereference here. A pointer to ip->i_inode and a 
pointer to ip are one and the same (bar the differing types), which is 
the reason that we return &ip->i_inode rather than just ip,
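A minimal demonstration of why this holds (the structs here are hypothetical stand-ins, not the kernel definitions; what matters is only that i_inode is the first member, which C guarantees sits at offset zero):

```c
#include <stddef.h>

struct inode { int i_dummy; };

struct my_gfs2_inode {
	struct inode i_inode;	/* first member, like struct gfs2_inode */
	int i_other;
};

/* Taking &ip->i_inode does not load anything from *ip; it only adds
 * offsetof(struct my_gfs2_inode, i_inode) == 0 to the pointer value,
 * so the result is bit-for-bit the same pointer as ip itself. */
static struct inode *to_inode(struct my_gfs2_inode *ip)
{
	return &ip->i_inode;
}
```

That said, Christoph's point stands: writing the early "return NULL" before forming &ip->i_inode avoids relying on this subtlety at all.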


Steve.




Signed-off-by: Jia-Ju Bai 
---
  fs/gfs2/super.c | 5 +++--
  1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 0acc5834f653..c07c3f4f8451 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -1728,8 +1728,9 @@ static struct inode *gfs2_alloc_inode(struct super_block 
*sb)
 	memset(&ip->i_res, 0, sizeof(ip->i_res));
 	RB_CLEAR_NODE(&ip->i_res.rs_node);
 	ip->i_rahead = 0;
-	}
-	return &ip->i_inode;
+		return &ip->i_inode;
+	} else
+		return NULL;
  }
  
  static void gfs2_free_inode(struct inode *inode)




Re: [Cluster-devel] [GFS2 PATCH v3 00/19] gfs2: misc recovery patch collection

2019-04-30 Thread Steven Whitehouse

Hi,

On 01/05/2019 00:03, Bob Peterson wrote:

Here is version 3 of the patch set I posted on 23 April. It is revised
based on bugs I found testing with xfstests.

This is a collection of patches I've been using to address the myriad
of recovery problems I've found. I'm still finding them, so the battle
is not won yet. I'm not convinced we need all of these but I thought
I'd send them anyway and get feedback. Previously I sent out a version
of the patch "gfs2: Force withdraw to replay journals and wait for it
to finish" that was too big and complex. So I broke it up into four
patches, starting with "move check_journal_clean to util.c for future
use". So those four need to be a set. There aren't many other dependencies
between patches, so the others could probably be taken or rejected
individually. There are more patches I still need to perfect, but maybe
a few of the safer ones can be pushed to for-next.


There is quite a lot of good stuff here, but it would be good to split 
it up and make it clearer what is bug fixes, and what are new features,


Steve.



Bob Peterson (19):
   gfs2: kthread and remount improvements
   gfs2: eliminate tr_num_revoke_rm
   gfs2: log which portion of the journal is replayed
   gfs2: Warn when a journal replay overwrites a rgrp with buffers
   gfs2: Introduce concept of a pending withdraw
   gfs2: log error reform
   gfs2: Only complain the first time an io error occurs in quota or log
   gfs2: Stop ail1 wait loop when withdrawn
   gfs2: Ignore recovery attempts if gfs2 has io error or is withdrawn
   gfs2: move check_journal_clean to util.c for future use
   gfs2: Allow some glocks to be used during withdraw
   gfs2: Don't loop forever in gfs2_freeze if withdrawn
   gfs2: Make secondary withdrawers wait for first withdrawer
   gfs2: Don't write log headers after file system withdraw
   gfs2: Force withdraw to replay journals and wait for it to finish
   gfs2: simply gfs2_freeze by removing case
   gfs2: Add verbose option to check_journal_clean
   gfs2: Check for log write errors before telling dlm to unlock
   gfs2: Do log_flush in gfs2_ail_empty_gl even if ail list is empty

  fs/gfs2/aops.c   |   4 +-
  fs/gfs2/file.c   |   2 +-
  fs/gfs2/glock.c  |  82 +++--
  fs/gfs2/glock.h  |   1 +
  fs/gfs2/glops.c  | 100 +++--
  fs/gfs2/glops.h  |   3 +-
  fs/gfs2/incore.h |  21 -
  fs/gfs2/inode.c  |  14 ++-
  fs/gfs2/lock_dlm.c   |  50 +++
  fs/gfs2/log.c|  55 +++-
  fs/gfs2/log.h|   1 +
  fs/gfs2/lops.c   |  27 +-
  fs/gfs2/meta_io.c|   6 +-
  fs/gfs2/ops_fstype.c |  65 --
  fs/gfs2/quota.c  |  10 ++-
  fs/gfs2/recovery.c   |   3 +-
  fs/gfs2/super.c  |  88 +++---
  fs/gfs2/super.h  |   1 +
  fs/gfs2/sys.c|  14 ++-
  fs/gfs2/trans.c  |   6 +-
  fs/gfs2/util.c   | 206 ---
  fs/gfs2/util.h   |  15 
  22 files changed, 617 insertions(+), 157 deletions(-)





Re: [Cluster-devel] [GFS2 PATCH v3 09/19] gfs2: Ignore recovery attempts if gfs2 has io error or is withdrawn

2019-04-30 Thread Steven Whitehouse

Hi,

On 01/05/2019 00:03, Bob Peterson wrote:

This patch addresses various problems with gfs2/dlm recovery.

For example, suppose a node with a bunch of gfs2 mounts suddenly
reboots due to kernel panic, and dlm determines it should perform
recovery. DLM does so from a pseudo-state machine calling various
callbacks into lock_dlm to perform a sequence of steps. It uses
generation numbers and recover bits in dlm "control" lock lvbs.

Now suppose another node tries to recover the failed node's
journal, but in so doing, encounters an IO error or withdraws
due to unforeseen circumstances, such as an hba driver failure.
In these cases, the recovery would eventually bail out, but it
would still update its generation number in the lvb. The other
nodes would all see the newer generation number and think they
don't need to do recovery because the generation number is newer
than the last one they saw, and therefore someone else has already
taken care of it.

If the file system has an io error or is withdrawn, it cannot
safely replay any journals (its own or others) but someone else
still needs to do it. Therefore we don't want it messing with
the journal recovery generation numbers: the local generation
numbers eventually get put into the lvb generation numbers to be
seen by all nodes.

This patch adds checks to many of the callbacks used by dlm
in its recovery state machine so that the functions are ignored
and skipped if an io error has occurred or if the file system
has been withdrawn.

Signed-off-by: Bob Peterson 


These should probably propagate the error back to the caller of the 
recovery request. We do have a proper notification system for failed 
recovery via uevents,


Steve.


---
  fs/gfs2/lock_dlm.c | 18 ++
  fs/gfs2/util.c | 15 +++
  2 files changed, 25 insertions(+), 8 deletions(-)

diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
index 31df26ed7854..9329f86ffcbe 100644
--- a/fs/gfs2/lock_dlm.c
+++ b/fs/gfs2/lock_dlm.c
@@ -1081,6 +1081,10 @@ static void gdlm_recover_prep(void *arg)
struct gfs2_sbd *sdp = arg;
	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
  
+	if (gfs2_withdrawn(sdp)) {

+   fs_err(sdp, "recover_prep ignored due to withdraw.\n");
+   return;
+   }
	spin_lock(&ls->ls_recover_spin);
	ls->ls_recover_block = ls->ls_recover_start;
	set_bit(DFL_DLM_RECOVERY, &ls->ls_recover_flags);
@@ -1103,6 +1107,11 @@ static void gdlm_recover_slot(void *arg, struct dlm_slot 
*slot)
	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
int jid = slot->slot - 1;
  
+	if (gfs2_withdrawn(sdp)) {

+   fs_err(sdp, "recover_slot jid %d ignored due to withdraw.\n",
+  jid);
+   return;
+   }
	spin_lock(&ls->ls_recover_spin);
if (ls->ls_recover_size < jid + 1) {
fs_err(sdp, "recover_slot jid %d gen %u short size %d\n",
@@ -1127,6 +1136,10 @@ static void gdlm_recover_done(void *arg, struct dlm_slot 
*slots, int num_slots,
struct gfs2_sbd *sdp = arg;
	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
  
+	if (gfs2_withdrawn(sdp)) {

+   fs_err(sdp, "recover_done ignored due to withdraw.\n");
+   return;
+   }
/* ensure the ls jid arrays are large enough */
set_recover_size(sdp, slots, num_slots);
  
@@ -1154,6 +1167,11 @@ static void gdlm_recovery_result(struct gfs2_sbd *sdp, unsigned int jid,

  {
	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
  
+	if (gfs2_withdrawn(sdp)) {

+   fs_err(sdp, "recovery_result jid %d ignored due to withdraw.\n",
+  jid);
+   return;
+   }
	if (test_bit(DFL_NO_DLM_OPS, &ls->ls_recover_flags))
return;
  
diff --git a/fs/gfs2/util.c b/fs/gfs2/util.c

index 0a814ccac41d..7eaea6dfe1cf 100644
--- a/fs/gfs2/util.c
+++ b/fs/gfs2/util.c
@@ -259,14 +259,13 @@ void gfs2_io_error_bh_i(struct gfs2_sbd *sdp, struct 
buffer_head *bh,
const char *function, char *file, unsigned int line,
bool withdraw)
  {
-   if (!test_bit(SDF_SHUTDOWN, &sdp->sd_flags))
-   fs_err(sdp,
-  "fatal: I/O error\n"
-  "  block = %llu\n"
-  "  function = %s, file = %s, line = %u\n",
-  (unsigned long long)bh->b_blocknr,
-  function, file, line);
+   if (gfs2_withdrawn(sdp))
+   return;
+
+   fs_err(sdp, "fatal: I/O error\n"
+  "  block = %llu\n"
+  "  function = %s, file = %s, line = %u\n",
+  (unsigned long long)bh->b_blocknr, function, file, line);
if (withdraw)
gfs2_lm_withdraw(sdp, NULL);
  }
-




Re: [Cluster-devel] [PATCH] libgfs2: Import gfs2_ondisk.h

2019-04-09 Thread Steven Whitehouse

Hi,

On 09/04/2019 13:48, Andrew Price wrote:



On 09/04/2019 13:21, Steven Whitehouse wrote:

Hi,

On 09/04/2019 13:18, Andrew Price wrote:

On 09/04/2019 13:03, Christoph Hellwig wrote:

On Tue, Apr 09, 2019 at 10:41:53AM +0100, Andrew Price wrote:

Give gfs2-utils its own copy of gfs2_ondisk.h which uses userspace
types. This allows us to always support the latest ondisk 
structures and

obsoletes a lot of #ifdef GFS2_HAS_ blocks and configure.ac
checks.

gfs2_ondisk.h was changed simply by search-and-replace of the 
kernel int

types with the uintN_t, i.e.:

:%s/__u\(8\|16\|32\|64\)/uint\1_t/g
:%s/__be\(64\|32\|16\|8\)/uint\1_t/g

and the linux/types.h include replaced with stdint.h


Why?


Because I'd like to be able to build gfs2-utils on FreeBSD one day. 
Plus we get the handy stuff in inttypes.h to use, Linux doesn't have 
that.



At least the be types give you really useful type checking with
sparse, which can be trivially wired up in userspace as well.


If you mean the bitwise annotations that only sparse checks, we're 
fairly safe in gfs2-utils in that anything represented by a struct 
is going to have been parsed through one of the libgfs2/ondisk.c 
functions so will be the right endianness. I run sparse over this 
code very rarely anyway.


Those conversion functions are not sensible, that's why we got rid of 
them from the kernel code. 


Is it the functions that aren't sensible or the use of the 
gfs2_ondisk.h structs as the containers for the native endian data? 
I'm not sure I get why the kernel functions like gfs2_dinode_in() are 
considered sensible and gfs2-utils' gfs2_dinode_in(), which does a 
similar thing but with a different struct, isn't sensible.


Well in general we don't want to convert lots of fields in what is 
basically a copy. The inode, when it is read in is an exception to that 
mainly because we have to in order to make sure that the vfs level data 
is all up to date. Keeping the structs as containers is useful, so yes 
we want to retain that. In many cases though we only need a few fields 
from what can be quite large data structures, so in those cases we 
should read/update the fields that we care about for that particular 
operation, rather than converting the whole data structure each time. We 
got a fair speed up when we made that change in the kernel.


So generally I'd like to discourage the blanket conversion functions, 
though it is likely we'll need to retain a few of them, in favour of 
converting just the required fields at the point of use. This should be 
safe to do given that we have the ability to do compile time type 
checking - and let's try to include that in the tests that are always 
run before check-in, to make sure that we don't end up with any 
mistakes. That would be a good addition to the tests I think,


Steve.




Re: [Cluster-devel] [PATCH] libgfs2: Import gfs2_ondisk.h

2019-04-09 Thread Steven Whitehouse

Hi,

On 09/04/2019 13:18, Andrew Price wrote:

On 09/04/2019 13:03, Christoph Hellwig wrote:

On Tue, Apr 09, 2019 at 10:41:53AM +0100, Andrew Price wrote:

Give gfs2-utils its own copy of gfs2_ondisk.h which uses userspace
types. This allows us to always support the latest ondisk structures 
and

obsoletes a lot of #ifdef GFS2_HAS_ blocks and configure.ac
checks.

gfs2_ondisk.h was changed simply by search-and-replace of the kernel 
int

types with the uintN_t, i.e.:

:%s/__u\(8\|16\|32\|64\)/uint\1_t/g
:%s/__be\(64\|32\|16\|8\)/uint\1_t/g

and the linux/types.h include replaced with stdint.h


Why?


Because I'd like to be able to build gfs2-utils on FreeBSD one day. 
Plus we get the handy stuff in inttypes.h to use, Linux doesn't have 
that.



At least the be types give you really useful type checking with
sparse, which can be trivially wired up in userspace as well.


If you mean the bitwise annotations that only sparse checks, we're 
fairly safe in gfs2-utils in that anything represented by a struct is 
going to have been parsed through one of the libgfs2/ondisk.c 
functions so will be the right endianness. I run sparse over this code 
very rarely anyway.


Those conversion functions are not sensible, that's why we got rid of 
them from the kernel code. It is better to have a set of types that have 
the endianness specified so that we can use sparse. Compile time checking 
is always a good plan where it is possible.






Also
keeping the file 1:1 the same is going to make your life much easier
in the future..


It's really no difficulty to run the above substitutions the next time 
the file changes, but gfs2_ondisk.h changes once in a blue moon anyway 
so the maintenance overhead is going to be tiny.


Andy


That's true, but let's keep the ability to do endianness checks,

Steve.




Re: [Cluster-devel] [GFS2 PATCH v3] gfs2: clean_journal improperly set sd_log_flush_head

2019-04-02 Thread Steven Whitehouse

Hi,

On 28/03/2019 17:10, Bob Peterson wrote:

Hi,

Andreas found some problems with the previous version. Here is version 3.

Ross: Can you please test this one with your scenario? Thanks.

Bob Peterson
---

This patch fixes regressions in 588bff95c94efc05f9e1a0b19015c9408ed7c0ef.
Due to that patch, function clean_journal was setting the value of
sd_log_flush_head, but that's only valid if it is replaying the node's
own journal. If it's replaying another node's journal, that's completely
wrong and will lead to multiple problems. This patch tries to clean up
the mess by passing the value of the logical journal block number into
gfs2_write_log_header so the function can treat non-owned journals
generically. For the local journal, the journal extent map is used for
best performance. For other nodes from other journals, gfs2_extent_map
is called to figure it out.

This patch also tries to establish more consistency when passing journal
block parameters by changing several unsigned int types to a consistent
u32.

Fixes: 588bff95c94e ("GFS2: Reduce code redundancy writing log headers")

Signed-off-by: Bob Peterson 
---
  fs/gfs2/incore.h   |  2 +-
  fs/gfs2/log.c  | 26 +++---
  fs/gfs2/log.h  |  3 ++-
  fs/gfs2/lops.c |  6 +++---
  fs/gfs2/lops.h |  2 +-
  fs/gfs2/recovery.c |  8 
  fs/gfs2/recovery.h |  2 +-
  7 files changed, 31 insertions(+), 18 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index cdf07b408f54..86840a70ee1a 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -535,7 +535,7 @@ struct gfs2_jdesc {
unsigned long jd_flags;
  #define JDF_RECOVERY 1
unsigned int jd_jid;
-   unsigned int jd_blocks;
+   u32 jd_blocks;
int jd_recover_error;
/* Replay stuff */
  
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c

index b8830fda51e8..8a5a19a26582 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -672,13 +672,15 @@ void gfs2_write_revokes(struct gfs2_sbd *sdp)
   * @seq: sequence number
   * @tail: tail of the log
   * @flags: log header flags GFS2_LOG_HEAD_*
+ * @lblock: value for lh_blkno (block number relative to start of journal)
   * @op_flags: flags to pass to the bio
   *
   * Returns: the initialized log buffer descriptor
   */
  
  void gfs2_write_log_header(struct gfs2_sbd *sdp, struct gfs2_jdesc *jd,

-  u64 seq, u32 tail, u32 flags, int op_flags)
+  u64 seq, u32 tail, u32 flags, u32 lblock,
+  int op_flags)
  {
struct gfs2_log_header *lh;
u32 hash, crc;
@@ -686,7 +688,7 @@ void gfs2_write_log_header(struct gfs2_sbd *sdp, struct 
gfs2_jdesc *jd,
	struct gfs2_statfs_change_host *l_sc = &sdp->sd_statfs_local;
struct timespec64 tv;
struct super_block *sb = sdp->sd_vfs;
-   u64 addr;
+   u64 dblock;
  
  	lh = page_address(page);

clear_page(lh);
@@ -699,15 +701,25 @@ void gfs2_write_log_header(struct gfs2_sbd *sdp, struct 
gfs2_jdesc *jd,
lh->lh_sequence = cpu_to_be64(seq);
lh->lh_flags = cpu_to_be32(flags);
lh->lh_tail = cpu_to_be32(tail);
-   lh->lh_blkno = cpu_to_be32(sdp->sd_log_flush_head);
+   lh->lh_blkno = cpu_to_be32(lblock);
hash = ~crc32(~0, lh, LH_V1_SIZE);
lh->lh_hash = cpu_to_be32(hash);
  
  	ktime_get_coarse_real_ts64(&tv);

lh->lh_nsec = cpu_to_be32(tv.tv_nsec);
lh->lh_sec = cpu_to_be64(tv.tv_sec);
-   addr = gfs2_log_bmap(sdp);
-   lh->lh_addr = cpu_to_be64(addr);
+   if (jd->jd_jid == sdp->sd_lockstruct.ls_jid)
+   dblock = gfs2_log_bmap(sdp);
+   else {
+   u32 extlen;
+   int new = 0, error;
+
+   error = gfs2_extent_map(jd->jd_inode, lblock, &new, &dblock,
+   &extlen);


We should not be adding new calls to gfs2_extent_map() here since that 
function is obsolete and deprecated. It looks like perhaps we should 
have a parameter to gfs2_log_bmap() to indicate which journal we need to 
map?


Steve.


+   if (gfs2_assert_withdraw(sdp, error == 0))
+   return;
+   }
+   lh->lh_addr = cpu_to_be64(dblock);
lh->lh_jinode = cpu_to_be64(GFS2_I(jd->jd_inode)->i_no_addr);
  
  	/* We may only write local statfs, quota, etc., when writing to our

@@ -732,7 +744,7 @@ void gfs2_write_log_header(struct gfs2_sbd *sdp, struct 
gfs2_jdesc *jd,
 sb->s_blocksize - LH_V1_SIZE - 4);
lh->lh_crc = cpu_to_be32(crc);
  
-	gfs2_log_write(sdp, page, sb->s_blocksize, 0, addr);

+   gfs2_log_write(sdp, page, sb->s_blocksize, 0, dblock);
 	gfs2_log_submit_bio(&sdp->sd_log_bio, REQ_OP_WRITE, op_flags);
log_flush_wait(sdp);
  }
@@ -761,7 +773,7 @@ static void log_write_header(struct gfs2_sbd *sdp, u32 
flags)
}
sdp->sd_log_idle = (tail == sdp->sd_log_flush_head);
gfs2_write_log_header(sdp, sdp->sd_jdesc, sdp->sd_log_sequence++, tail,
-

Re: [Cluster-devel] [PATCH 0/2] gfs2: Switch to the new mount API

2019-03-18 Thread Steven Whitehouse



Thanks for sorting this out so quickly,

Steve.

On 17/03/2019 17:40, Andrew Price wrote:

These patches convert gfs2 and gfs2meta to use fs_context.

In both cases we still use sget() instead of sget_fc() as there doesn't seem to
be a clear idiomatic way to propagate the bdev currently.

Tested with xfstests -g quick, a bunch of targeted mount commands to exercise
individual options, and gfs2_grow to test the gfs2meta mount.

I'm aiming this at 5.2 so it'll have plenty of soak time.

Thanks to David Howells for providing the method for parsing the complicated
'quota' option.

Andrew Price (2):
   gfs2: Convert gfs2 to fs_context
   gfs2: Convert gfs2meta to fs_context

  fs/gfs2/incore.h |   8 +-
  fs/gfs2/ops_fstype.c | 418 +--
  fs/gfs2/super.c  | 335 +-
  fs/gfs2/super.h  |   3 +-
  4 files changed, 373 insertions(+), 391 deletions(-)





Re: [Cluster-devel] [GFS2 PATCH 8/9] gfs2: Do log_flush in gfs2_ail_empty_gl even if ail list is empty

2019-02-15 Thread Steven Whitehouse



On 13/02/2019 15:21, Bob Peterson wrote:

Before this patch, if gfs2_ail_empty_gl saw there was nothing on
the ail list, it would return and not flush the log. The problem
is that there could still be a revoke for the rgrp sitting on the
sd_log_le_revoke list that's been recently taken off the ail list.
But that revoke still needs to be written, and the rgrp_go_inval
still needs to call log_flush_wait to ensure the revokes are all
properly written to the journal before we relinquish control of
the glock to another node. If we give the glock to another node
before we have this knowledge, the node might crash and its journal
replayed, in which case the missing revoke would allow the journal
replay to replay the rgrp over top of the rgrp we already gave to
another node, thus overwriting its changes and corrupting the
file system.

This patch makes gfs2_ail_empty_gl still call gfs2_log_flush rather
than returning.

Signed-off-by: Bob Peterson 


Yes, that looks like a good solution,

Steve.


---
  fs/gfs2/glops.c | 23 +--
  fs/gfs2/log.c   |  2 +-
  fs/gfs2/log.h   |  1 +
  3 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index 64b8e5e808d8..adae9ecf8311 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -94,8 +94,26 @@ static void gfs2_ail_empty_gl(struct gfs2_glock *gl)
INIT_LIST_HEAD(&tr.tr_databuf);
tr.tr_revokes = atomic_read(&gl->gl_ail_count);
  
-	if (!tr.tr_revokes)

-   return;
+   if (!tr.tr_revokes) {
+   /**
+* We have nothing on the ail, but there could be revokes on
+* the sdp revoke queue, in which case, we still want to flush
+* the log and wait for it to finish.
+*
+* If the sdp revoke list is empty too, we might still have an
+* io outstanding for writing revokes, so we should wait for
+* it before proceeding.
+*
+* If none of these conditions are true, our revokes are all
+* flushed and we can return.
+*/
+   if (!list_empty(&sdp->sd_log_le_revoke))
+   goto flush;
+   else if (atomic_read(&sdp->sd_log_in_flight))
+   log_flush_wait(sdp);
+   else
+   return;
+   }
  
  	/* A shortened, inline version of gfs2_trans_begin()

   * tr->alloced is not set since the transaction structure is
@@ -110,6 +128,7 @@ static void gfs2_ail_empty_gl(struct gfs2_glock *gl)
__gfs2_ail_flush(gl, 0, tr.tr_revokes);
  
  	gfs2_trans_end(sdp);

+flush:
gfs2_log_flush(sdp, NULL, GFS2_LOG_HEAD_FLUSH_NORMAL |
   GFS2_LFC_AIL_EMPTY_GL);
  }
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 0d0dec3231c9..610cd2637dc5 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -515,7 +515,7 @@ static void log_pull_tail(struct gfs2_sbd *sdp, unsigned 
int new_tail)
  }
  
  
-static void log_flush_wait(struct gfs2_sbd *sdp)

+void log_flush_wait(struct gfs2_sbd *sdp)
  {
DEFINE_WAIT(wait);
  
diff --git a/fs/gfs2/log.h b/fs/gfs2/log.h

index 1bc9bd444b28..bd2d08d0f21c 100644
--- a/fs/gfs2/log.h
+++ b/fs/gfs2/log.h
@@ -75,6 +75,7 @@ extern void gfs2_log_flush(struct gfs2_sbd *sdp, struct 
gfs2_glock *gl,
   u32 type);
  extern void gfs2_log_commit(struct gfs2_sbd *sdp, struct gfs2_trans *trans);
  extern void gfs2_ail1_flush(struct gfs2_sbd *sdp, struct writeback_control 
*wbc);
+extern void log_flush_wait(struct gfs2_sbd *sdp);
  
  extern void gfs2_log_shutdown(struct gfs2_sbd *sdp);

  extern int gfs2_logd(void *data);




Re: [Cluster-devel] [GFS2 PATCH 5/9] gfs2: Keep transactions on ail1 list until after issuing revokes

2019-02-15 Thread Steven Whitehouse

Hi,

On 13/02/2019 15:21, Bob Peterson wrote:

Before this patch, function gfs2_write_revokes would call function
gfs2_ail1_empty, then run the ail1 list, issuing revokes. But
gfs2_ail1_empty can move transactions to the ail2 list, and thus,
their revokes were never issued. This patch adds a new parameter to
gfs2_ail1_empty that allows the transactions to remain on the ail1
list until it can issue revokes for them. Then, if they have no more
buffers, they're moved to the ail2 list after the revokes are added.

Signed-off-by: Bob Peterson 


Why would we need to do this?

Steve.


---
  fs/gfs2/log.c | 30 ++
  1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 81550038ace3..0d0dec3231c9 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -217,11 +217,12 @@ static void gfs2_ail1_empty_one(struct gfs2_sbd *sdp, 
struct gfs2_trans *tr)
  /**
   * gfs2_ail1_empty - Try to empty the ail1 lists
   * @sdp: The superblock
+ * @move_empty_to_ail2: true if empty transactions should be moved to ail2
   *
   * Tries to empty the ail1 lists, starting with the oldest first
   */
  
-static int gfs2_ail1_empty(struct gfs2_sbd *sdp)

+static int gfs2_ail1_empty(struct gfs2_sbd *sdp, bool move_empty_to_ail2)
  {
struct gfs2_trans *tr, *s;
int oldest_tr = 1;
@@ -230,10 +231,12 @@ static int gfs2_ail1_empty(struct gfs2_sbd *sdp)
spin_lock(&sdp->sd_ail_lock);
list_for_each_entry_safe_reverse(tr, s, &sdp->sd_ail1_list, tr_list) {
gfs2_ail1_empty_one(sdp, tr);
-   if (list_empty(&tr->tr_ail1_list) && oldest_tr)
-   list_move(&tr->tr_list, &sdp->sd_ail2_list);
-   else
+   if (list_empty(&tr->tr_ail1_list) && oldest_tr) {
+   if (move_empty_to_ail2)
+   list_move(&tr->tr_list, &sdp->sd_ail2_list);
+   } else {
oldest_tr = 0;
+   }
}
ret = list_empty(&sdp->sd_ail1_list);
spin_unlock(&sdp->sd_ail_lock);
@@ -609,12 +612,12 @@ void gfs2_add_revoke(struct gfs2_sbd *sdp, struct 
gfs2_bufdata *bd)
  
  void gfs2_write_revokes(struct gfs2_sbd *sdp)

  {
-   struct gfs2_trans *tr;
+   struct gfs2_trans *tr, *s;
struct gfs2_bufdata *bd, *tmp;
int have_revokes = 0;
int max_revokes = (sdp->sd_sb.sb_bsize - sizeof(struct 
gfs2_log_descriptor)) / sizeof(u64);
  
-	gfs2_ail1_empty(sdp);

+   gfs2_ail1_empty(sdp, false);
spin_lock(&sdp->sd_ail_lock);
list_for_each_entry_reverse(tr, &sdp->sd_ail1_list, tr_list) {
list_for_each_entry(bd, &tr->tr_ail2_list, bd_ail_st_list) {
@@ -640,17 +643,20 @@ void gfs2_write_revokes(struct gfs2_sbd *sdp)
}
gfs2_log_lock(sdp);
spin_lock(&sdp->sd_ail_lock);
-   list_for_each_entry_reverse(tr, &sdp->sd_ail1_list, tr_list) {
+   list_for_each_entry_safe_reverse(tr, s, &sdp->sd_ail1_list, tr_list) {
list_for_each_entry_safe(bd, tmp, &tr->tr_ail2_list, bd_ail_st_list) {
if (max_revokes == 0)
-   goto out_of_blocks;
+   break;
if (!list_empty(&bd->bd_list))
continue;
gfs2_add_revoke(sdp, bd);
max_revokes--;
}
+   if (list_empty(&tr->tr_ail1_list))
+   list_move(&tr->tr_list, &sdp->sd_ail2_list);
+   if (max_revokes == 0)
+   break;
}
-out_of_blocks:
spin_unlock(&sdp->sd_ail_lock);
gfs2_log_unlock(sdp);
  
@@ -842,7 +848,7 @@ void gfs2_log_flush(struct gfs2_sbd *sdp, struct gfs2_glock *gl, u32 flags)

for (;;) {
gfs2_ail1_start(sdp);
gfs2_ail1_wait(sdp);
-   if (gfs2_ail1_empty(sdp))
+   if (gfs2_ail1_empty(sdp, true))
break;
}
atomic_dec(&sdp->sd_log_blks_free); /* Adjust for unreserved buffer */
@@ -1008,7 +1014,7 @@ int gfs2_logd(void *data)
  
  		did_flush = false;

if (gfs2_jrnl_flush_reqd(sdp) || t == 0) {
-   gfs2_ail1_empty(sdp);
+   gfs2_ail1_empty(sdp, true);
if (test_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags))
gfs2_log_flush(sdp, NULL,
   GFS2_LOG_HEAD_FLUSH_NORMAL |
@@ -1019,7 +1025,7 @@ int gfs2_logd(void *data)
if (gfs2_ail_flush_reqd(sdp)) {
gfs2_ail1_start(sdp);
gfs2_ail1_wait(sdp);
-   gfs2_ail1_empty(sdp);
+   gfs2_ail1_empty(sdp, true);
if (test_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags))

Re: [Cluster-devel] [GFS2 PATCH 4/9] gfs2: Force withdraw to replay journals and wait for it to finish

2019-02-15 Thread Steven Whitehouse

Hi,

On 13/02/2019 15:21, Bob Peterson wrote:

When a node withdraws from a file system, it often leaves its journal
in an incomplete state. This is especially true when the withdraw is
caused by io errors writing to the journal. Before this patch, a
withdraw would try to write a "shutdown" record to the journal, tell
dlm it's done with the file system, and none of the other nodes
know about the problem. Later, when the problem is fixed and the
withdrawn node is rebooted, it would then discover that its own
journal was incomplete, and replay it. However, replaying it at this
point is almost guaranteed to introduce corruption because the other
nodes are likely to have used affected resource groups that appeared
in the journal since the time of the withdraw. Replaying the journal
later will overwrite any changes made, and not through any fault of
dlm, which was instructed during the withdraw to release those
resources.

This patch makes file system withdraws seen by the entire cluster.
Withdrawing nodes dequeue their journal glock to allow recovery.

The remaining nodes check all the journals to see if they are
clean or in need of replay. They try to replay dirty journals, but
only the journals of withdrawn nodes will be "not busy" and
therefore available for replay.

Until the journal replay is complete, no i/o related glocks may be
given out, to ensure that the replay does not cause the
aforementioned corruption: We cannot allow any journal replay to
overwrite blocks associated with a glock once it is held. The
glocks not affected by a withdraw are permitted to be passed
around as normal during a withdraw. A new glops flag, called
GLOF_OK_AT_WITHDRAW, indicates glocks that may be passed around
freely while a withdraw is taking place.

One such glock is the "live" glock which is now used to signal when
a withdraw occurs. When a withdraw occurs, the node signals its
withdraw by dequeueing the "live" glock and trying to enqueue it
in EX mode, thus forcing the other nodes to all see a demote
request, by way of a "1CB" (one callback) try lock. The "live"
glock is not granted in EX; the callback is only just used to
indicate a withdraw has occurred.

Note that all nodes in the cluster must wait for the recovering
node to finish replaying the withdrawing node's journal before
continuing. To this end, it checks that the journals are clean
multiple times in a retry loop.

Signed-off-by: Bob Peterson 


This new algorithm seems rather complicated, so it will need a lot of 
careful testing I think. It would be good if there was some way to 
simplify things a bit here.




---
  fs/gfs2/glock.c  |  35 --
  fs/gfs2/glock.h  |   1 +
  fs/gfs2/glops.c  |  61 +-
  fs/gfs2/incore.h |   6 ++
  fs/gfs2/lock_dlm.c   |  32 ++
  fs/gfs2/log.c|  22 +--
  fs/gfs2/meta_io.c|   2 +-
  fs/gfs2/ops_fstype.c |  48 ++
  fs/gfs2/super.c  |  24 ---
  fs/gfs2/super.h  |   1 +
  fs/gfs2/util.c   | 148 ++-
  fs/gfs2/util.h   |   3 +
  12 files changed, 315 insertions(+), 68 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index c6d6e478f5e3..20fb6cdf7829 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -242,7 +242,8 @@ static void __gfs2_glock_put(struct gfs2_glock *gl)
gfs2_glock_remove_from_lru(gl);
spin_unlock(&gl->gl_lockref.lock);
GLOCK_BUG_ON(gl, !list_empty(&gl->gl_holders));
-   GLOCK_BUG_ON(gl, mapping && mapping->nrpages);
+   GLOCK_BUG_ON(gl, mapping && mapping->nrpages &&
+!test_bit(SDF_SHUTDOWN, &sdp->sd_flags));
trace_gfs2_glock_put(gl);
sdp->sd_lockstruct.ls_ops->lm_put_lock(gl);
  }
@@ -543,6 +544,8 @@ __acquires(&gl->gl_lockref.lock)
int ret;
  
  	if (unlikely(withdrawn(sdp)) &&

+   !(glops->go_flags & GLOF_OK_AT_WITHDRAW) &&
+   (gh && !(LM_FLAG_NOEXP & gh->gh_flags)) &&
target != LM_ST_UNLOCKED)
return;
lck_flags &= (LM_FLAG_TRY | LM_FLAG_TRY_1CB | LM_FLAG_NOEXP |
@@ -561,9 +564,10 @@ __acquires(>gl_lockref.lock)
(lck_flags & (LM_FLAG_TRY|LM_FLAG_TRY_1CB)))
clear_bit(GLF_BLOCKING, &gl->gl_flags);
spin_unlock(&gl->gl_lockref.lock);
-   if (glops->go_sync)
+   if (glops->go_sync && !test_bit(SDF_SHUTDOWN, &sdp->sd_flags))
glops->go_sync(gl);
-   if (test_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags))
+   if (test_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags) &&
+   !test_bit(SDF_SHUTDOWN, &sdp->sd_flags))
glops->go_inval(gl, target == LM_ST_DEFERRED ? 0 : DIO_METADATA);
clear_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags);
  
@@ -1091,7 +1095,8 @@ int gfs2_glock_nq(struct gfs2_holder *gh)

struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
int error = 0;
  
-	if (unlikely(withdrawn(sdp)))

+   if (unlikely(withdrawn(sdp) && !(LM_FLAG_NOEXP & gh->gh_flags) &&
+   

Re: [Cluster-devel] [GFS2 PATCH 3/9] gfs2: Empty the ail for the glock when rgrps are invalidated

2019-02-15 Thread Steven Whitehouse

Hi,

On 13/02/2019 15:21, Bob Peterson wrote:

Before this patch, function rgrp_go_inval would not invalidate the
ail list, which meant that there might still be buffers outstanding
on the ail that had revokes still pending. If the revokes had still
not been written when the glock was given to another node, and that
node (with outstanding revokes) died for some reason, the resulting
journal replay would replay the un-revoked rgrps, thus wiping out
changes made by the node who rightfully received the rgrp in EX.
This caused metadata corruption.

Signed-off-by: Bob Peterson 


rgrp_go_sync() has a call to gfs2_ail_empty_gl() so there should be no 
revokes to worry about when rgrp_go_inval is called, because everything 
should have already been written to the log & flushed. Has something 
gone wrong in the logic which somehow allows that step to be skipped in 
some cases?


Steve.


---
  fs/gfs2/glops.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index 9c86c8004ba7..4b0e52bf5825 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -166,6 +166,8 @@ static void rgrp_go_sync(struct gfs2_glock *gl)
error = filemap_fdatawait_range(mapping, gl->gl_vm.start, 
gl->gl_vm.end);
mapping_set_error(mapping, error);
gfs2_ail_empty_gl(gl);
+   gfs2_assert_withdraw(gl->gl_name.ln_sbd,
+gl->gl_name.ln_sbd->sd_log_error == 0);
  
spin_lock(&gl->gl_lockref.lock);

rgd = gl->gl_object;
@@ -196,6 +198,7 @@ static void rgrp_go_inval(struct gfs2_glock *gl, int flags)
WARN_ON_ONCE(!(flags & DIO_METADATA));
gfs2_assert_withdraw(sdp, !atomic_read(&gl->gl_ail_count));
truncate_inode_pages_range(mapping, gl->gl_vm.start, gl->gl_vm.end);
+   gfs2_ail_empty_gl(gl);
  
  	if (rgd)

rgd->rd_flags &= ~GFS2_RDF_UPTODATE;




Re: [Cluster-devel] [GFS2 PATCH 1/9] gfs2: Introduce concept of a pending withdraw

2019-02-15 Thread Steven Whitehouse



On 13/02/2019 15:21, Bob Peterson wrote:

File system withdraws can be delayed when inconsistencies are
discovered when we cannot withdraw immediately, for example, when
critical spin_locks are held. But delaying the withdraw can cause
gfs2 to ignore the error and keep running for a short period of time.
For example, an rgrp glock may be dequeued and demoted while there
are still buffers that haven't been properly revoked, due to io
errors writing to the journal.

This patch introduces a new concept of a delayed withdraw, which
means an inconsistency has been discovered and we need to withdraw
at the earliest possible opportunity. In these cases, we aren't
quite withdrawn yet, but we still need to not dequeue glocks and
other critical things. If we dequeue the glocks and the withdraw
results in our journal being replayed, the replay could overwrite
data that's been modified by a different node that acquired the
glock in the meantime.

Signed-off-by: Bob Peterson 


This withdrawn() wrapper seems like a good plan anyway, since it helps 
make the code more readable,


Steve.



---
  fs/gfs2/aops.c   |  4 ++--
  fs/gfs2/file.c   |  2 +-
  fs/gfs2/glock.c  |  7 +++
  fs/gfs2/glops.c  |  2 +-
  fs/gfs2/incore.h |  1 +
  fs/gfs2/log.c| 20 
  fs/gfs2/meta_io.c|  6 +++---
  fs/gfs2/ops_fstype.c |  3 +--
  fs/gfs2/quota.c  |  2 +-
  fs/gfs2/super.c  |  6 +++---
  fs/gfs2/sys.c|  2 +-
  fs/gfs2/util.c   |  1 +
  fs/gfs2/util.h   |  8 
  13 files changed, 34 insertions(+), 30 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 05dd78f4b2b3..0d3cde8a61cd 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -521,7 +521,7 @@ static int __gfs2_readpage(void *file, struct page *page)
error = mpage_readpage(page, gfs2_block_map);
}
  
-	if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags)))

+   if (unlikely(withdrawn(sdp)))
return -EIO;
  
  	return error;

@@ -638,7 +638,7 @@ static int gfs2_readpages(struct file *file, struct 
address_space *mapping,
gfs2_glock_dq(&gh);
  out_uninit:
gfs2_holder_uninit(&gh);
-   if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags)))
+   if (unlikely(withdrawn(sdp)))
ret = -EIO;
return ret;
  }
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index a2dea5bc0427..4046f6ac7f13 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -1169,7 +1169,7 @@ static int gfs2_lock(struct file *file, int cmd, struct 
file_lock *fl)
cmd = F_SETLK;
fl->fl_type = F_UNLCK;
}
-   if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags))) {
+   if (unlikely(withdrawn(sdp))) {
if (fl->fl_type == F_UNLCK)
locks_lock_file_wait(file, fl);
return -EIO;
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index f66773c71bcd..c6d6e478f5e3 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -542,7 +542,7 @@ __acquires(&gl->gl_lockref.lock)
unsigned int lck_flags = (unsigned int)(gh ? gh->gh_flags : 0);
int ret;
  
-	if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags)) &&

+   if (unlikely(withdrawn(sdp)) &&
target != LM_ST_UNLOCKED)
return;
lck_flags &= (LM_FLAG_TRY | LM_FLAG_TRY_1CB | LM_FLAG_NOEXP |
@@ -579,8 +579,7 @@ __acquires(>gl_lockref.lock)
}
else if (ret) {
fs_err(sdp, "lm_lock ret %d\n", ret);
-   GLOCK_BUG_ON(gl, !test_bit(SDF_SHUTDOWN,
-  &sdp->sd_flags));
+   GLOCK_BUG_ON(gl, !withdrawn(sdp));
}
} else { /* lock_nolock */
finish_xmote(gl, target);
@@ -1092,7 +1091,7 @@ int gfs2_glock_nq(struct gfs2_holder *gh)
struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
int error = 0;
  
-	if (unlikely(test_bit(SDF_SHUTDOWN, &sdp->sd_flags)))

+   if (unlikely(withdrawn(sdp)))
return -EIO;
  
if (test_bit(GLF_LRU, &gl->gl_flags))

diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index f15b4c57c4bd..9c86c8004ba7 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -539,7 +539,7 @@ static int freeze_go_xmote_bh(struct gfs2_glock *gl, struct 
gfs2_holder *gh)
gfs2_consist(sdp);
  
  		/*  Initialize some head of the log stuff  */

-   if (!test_bit(SDF_SHUTDOWN, &sdp->sd_flags)) {
+   if (!withdrawn(sdp)) {
sdp->sd_log_sequence = head.lh_sequence + 1;
gfs2_log_pointers_init(sdp, head.lh_blkno);
}
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index cdf07b408f54..8380d4db8be6 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -621,6 +621,7 @@ enum {
SDF_SKIP_DLM_UNLOCK = 8,
SDF_FORCE_AIL_FLUSH = 9,
SDF_AIL1_IO_ERROR   = 10,
+   SDF_PENDING_WITHDRAW= 11, /* Will 

Re: [Cluster-devel] [PATCH 1/2] gfs2: Fix occasional glock use-after-free

2019-01-31 Thread Steven Whitehouse

Hi,

On 31/01/2019 10:55, Ross Lagerwall wrote:

Each gfs2_bufdata stores a reference to a glock but the reference count
isn't incremented. This causes an occasional use-after-free of the
glock. Fix by taking a reference on the glock during allocation and
dropping it when freeing.


Another good bit of debugging. It would be nice if we can (longer term) 
avoid using the ref count though, since that will have some overhead, 
but for the time being, the correctness is the important thing,


Steve.



Found by KASAN:

BUG: KASAN: use-after-free in revoke_lo_after_commit+0x8e/0xe0 [gfs2]
Write of size 4 at addr 88801aff6134 by task kworker/0:2H/20371

CPU: 0 PID: 20371 Comm: kworker/0:2H Tainted: G O 4.19.0+0 #1
Hardware name: Dell Inc. PowerEdge R805/0D456H, BIOS 4.2.1 04/14/2010
Workqueue: glock_workqueue glock_work_func [gfs2]
Call Trace:
  dump_stack+0x71/0xab
  print_address_description+0x6a/0x270
  kasan_report+0x258/0x380
  ? revoke_lo_after_commit+0x8e/0xe0 [gfs2]
  revoke_lo_after_commit+0x8e/0xe0 [gfs2]
  gfs2_log_flush+0x511/0xa70 [gfs2]
  ? gfs2_log_shutdown+0x1f0/0x1f0 [gfs2]
  ? __brelse+0x48/0x50
  ? gfs2_log_commit+0x4de/0x6e0 [gfs2]
  ? gfs2_trans_end+0x18d/0x340 [gfs2]
  gfs2_ail_empty_gl+0x1ab/0x1c0 [gfs2]
  ? inode_go_dump+0xe0/0xe0 [gfs2]
  ? inode_go_sync+0xe4/0x220 [gfs2]
  inode_go_sync+0xe4/0x220 [gfs2]
  do_xmote+0x12b/0x290 [gfs2]
  glock_work_func+0x6f/0x160 [gfs2]
  process_one_work+0x461/0x790
  worker_thread+0x69/0x6b0
  ? process_one_work+0x790/0x790
  kthread+0x1ae/0x1d0
  ? kthread_create_worker_on_cpu+0xc0/0xc0
  ret_from_fork+0x22/0x40

Allocated by task 20805:
  kasan_kmalloc+0xa0/0xd0
  kmem_cache_alloc+0xb5/0x1b0
  gfs2_glock_get+0x14b/0x620 [gfs2]
  gfs2_inode_lookup+0x20c/0x640 [gfs2]
  gfs2_dir_search+0x150/0x180 [gfs2]
  gfs2_lookupi+0x272/0x360 [gfs2]
  __gfs2_lookup+0x8b/0x1d0 [gfs2]
  gfs2_atomic_open+0x77/0x100 [gfs2]
  path_openat+0x1454/0x1c10
  do_filp_open+0x124/0x1d0
  do_sys_open+0x213/0x2c0
  do_syscall_64+0x69/0x160
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Freed by task 0:
  __kasan_slab_free+0x130/0x180
  kmem_cache_free+0x78/0x1e0
  rcu_process_callbacks+0x2ad/0x6c0
  __do_softirq+0x111/0x38c

The buggy address belongs to the object at 88801aff6040
  which belongs to the cache gfs2_glock(aspace) of size 560
The buggy address is located 244 bytes inside of
  560-byte region [88801aff6040, 88801aff6270)
...

Signed-off-by: Ross Lagerwall 
---
  fs/gfs2/aops.c| 3 +--
  fs/gfs2/lops.c| 2 +-
  fs/gfs2/meta_io.c | 2 +-
  fs/gfs2/trans.c   | 9 -
  fs/gfs2/trans.h   | 2 ++
  5 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 05dd78f4b2b3..8c2b572a7fb1 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -868,9 +868,8 @@ int gfs2_releasepage(struct page *page, gfp_t gfp_mask)
gfs2_assert_warn(sdp, bd->bd_bh == bh);
if (!list_empty(&bd->bd_list))
list_del_init(&bd->bd_list);
-   bd->bd_bh = NULL;
bh->b_private = NULL;
-   kmem_cache_free(gfs2_bufdata_cachep, bd);
+   gfs2_free_bufdata(bd);
}
  
  		bh = bh->b_this_page;

diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 94dcab655bc0..f40be71677d1 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -847,7 +847,7 @@ static void revoke_lo_after_commit(struct gfs2_sbd *sdp, 
struct gfs2_trans *tr)
gl = bd->bd_gl;
atomic_dec(&gl->gl_revokes);
clear_bit(GLF_LFLUSH, &gl->gl_flags);
-   kmem_cache_free(gfs2_bufdata_cachep, bd);
+   gfs2_free_bufdata(bd);
}
  }
  
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c

index be9c0bf697fe..868caa0eb104 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -355,7 +355,7 @@ void gfs2_remove_from_journal(struct buffer_head *bh, int 
meta)
gfs2_trans_add_revoke(sdp, bd);
} else if (was_pinned) {
bh->b_private = NULL;
-   kmem_cache_free(gfs2_bufdata_cachep, bd);
+   gfs2_free_bufdata(bd);
}
spin_unlock(&sdp->sd_ail_lock);
}
diff --git a/fs/gfs2/trans.c b/fs/gfs2/trans.c
index cd9a94a6b5bb..423cbee8fa08 100644
--- a/fs/gfs2/trans.c
+++ b/fs/gfs2/trans.c
@@ -133,9 +133,16 @@ static struct gfs2_bufdata *gfs2_alloc_bufdata(struct 
gfs2_glock *gl,
bd->bd_gl = gl;
INIT_LIST_HEAD(&bd->bd_list);
bh->b_private = bd;
+   gfs2_glock_hold(gl);
return bd;
  }
  
+void gfs2_free_bufdata(struct gfs2_bufdata *bd)

+{
+   gfs2_glock_put(bd->bd_gl);
+   kmem_cache_free(gfs2_bufdata_cachep, bd);
+}
+
  /**
   * gfs2_trans_add_data - Add a databuf to the transaction.
   * @gl: The inode glock associated with the buffer
@@ -265,7 +272,7 @@ void gfs2_trans_add_unrevoke(struct gfs2_sbd *sdp, u64 

Re: [Cluster-devel] [PATCH 2/2] gfs2: Fix lru_count going negative

2019-01-31 Thread Steven Whitehouse

Hi,

On 31/01/2019 10:55, Ross Lagerwall wrote:

Under certain conditions, lru_count may drop below zero resulting in
a large amount of log spam like this:

vmscan: shrink_slab: gfs2_dump_glock+0x3b0/0x630 [gfs2] \
 negative objects to delete nr=-1

This happens as follows:
1) A glock is moved from lru_list to the dispose list and lru_count is
decremented.
2) The dispose function calls cond_resched() and drops the lru lock.
3) Another thread takes the lru lock and tries to add the same glock to
lru_list, checking if the glock is on an lru list.
4) It is on a list (actually the dispose list) and so it avoids
incrementing lru_count.
5) The glock is moved to lru_list.
6) The original thread doesn't dispose it because it has been re-added
to the lru list but the lru_count has still decreased by one.

Fix by checking if the LRU flag is set on the glock rather than checking
if the glock is on some list and rearrange the code so that the LRU flag
is added/removed precisely when the glock is added/removed from lru_list.

Signed-off-by: Ross Lagerwall 


I'm glad we've got to the bottom of that one. Excellent work debugging 
that! Many thanks for the fix,


Steve.



---
  fs/gfs2/glock.c | 16 +---
  1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index b92740edc416..53e6c7e0c1b3 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -185,13 +185,14 @@ void gfs2_glock_add_to_lru(struct gfs2_glock *gl)
  {
spin_lock(&lru_lock);

-	if (!list_empty(&gl->gl_lru))
-   list_del_init(&gl->gl_lru);
-   else
+   list_del(&gl->gl_lru);
+   list_add_tail(&gl->gl_lru, &lru_list);
+
+   if (!test_bit(GLF_LRU, &gl->gl_flags)) {
+   set_bit(GLF_LRU, &gl->gl_flags);
atomic_inc(&lru_count);
+   }

-	list_add_tail(&gl->gl_lru, &lru_list);
-   set_bit(GLF_LRU, &gl->gl_flags);
spin_unlock(&lru_lock);
  }
  
@@ -201,7 +202,7 @@ static void gfs2_glock_remove_from_lru(struct gfs2_glock *gl)

return;
  
spin_lock(&lru_lock);

-   if (!list_empty(&gl->gl_lru)) {
+   if (test_bit(GLF_LRU, &gl->gl_flags)) {
list_del_init(&gl->gl_lru);
atomic_dec(&lru_count);
clear_bit(GLF_LRU, &gl->gl_flags);
@@ -1456,6 +1457,7 @@ __acquires(&lru_lock)
if (!spin_trylock(&gl->gl_lockref.lock)) {
  add_back_to_lru:
list_add(&gl->gl_lru, &lru_list);
+   set_bit(GLF_LRU, &gl->gl_flags);
atomic_inc(&lru_count);
continue;
}
@@ -1463,7 +1465,6 @@ __acquires(_lock)
spin_unlock(&gl->gl_lockref.lock);
goto add_back_to_lru;
}
-   clear_bit(GLF_LRU, &gl->gl_flags);
gl->gl_lockref.count++;
if (demote_ok(gl))
handle_callback(gl, LM_ST_UNLOCKED, 0, false);
@@ -1498,6 +1499,7 @@ static long gfs2_scan_glock_lru(int nr)
if (!test_bit(GLF_LOCK, &gl->gl_flags)) {
list_move(&gl->gl_lru, &dispose);
atomic_dec(&lru_count);
+   clear_bit(GLF_LRU, &gl->gl_flags);
freed++;
continue;
}




Re: [Cluster-devel] kernel BUG at fs/gfs2/inode.h:64

2019-01-09 Thread Steven Whitehouse

Hi,

On 09/01/2019 17:14, Tim Smith wrote:

On Wednesday, 9 January 2019 15:35:05 GMT Andreas Gruenbacher wrote:

On Wed, 9 Jan 2019 at 14:43, Mark Syms  wrote:

We don't yet know how the assert got triggered as we've only seen it once
and in the original form it looks like it would be very hard to trigger in
any normal case (given that in default usage i_blocks should be at least 8
times what any putative value for change could be). So, for the assert to
have triggered we've been asked to remove at least 8 times the number of
blocks currently allocated to the inode. Possible causes could be a double
release or some other higher level bug that will require further
investigation to uncover.

The following change has at least survived xfstests:

--- a/fs/gfs2/inode.h
+++ b/fs/gfs2/inode.h
@@ -61,8 +61,8 @@ static inline u64 gfs2_get_inode_blocks(const struct
inode *inode)

  static inline void gfs2_add_inode_blocks(struct inode *inode, s64 change)
  {
-gfs2_assert(GFS2_SB(inode), (change >= 0 || inode->i_blocks > -change));
-change *= (GFS2_SB(inode)->sd_sb.sb_bsize/GFS2_BASIC_BLOCK);
+change <<= inode->i_blkbits - 9;
+gfs2_assert(GFS2_SB(inode), change >= 0 || inode->i_blocks >= -change);
inode->i_blocks += change;
  }

Andreas

I'll use

change <<= (GFS2_SB(inode)->sd_sb.sb_bsize_shift - GFS2_BASIC_BLOCK_SHIFT);

for consistency with the gfs2_get/set_inode_blocks(). I'll send the patch in a
bit.

Given what it was like before, either i_blocks was already 0 or -change
somehow became stupidly huge. Anything else looks like it would be hard to
reproduce without mkfs.gfs2 -b 512 (so that GFS2_SB(inode)->sd_sb.sb_bsize ==
GFS2_BASIC_BLOCK) which we don't do.

I'll try to work out what could have caused it and see if we can provoke it
again.

Out of curiosity I did a few tests where I created a file on GFS2, copied /
dev/null on top of it, and then ran stat on the file. It seems like GFS2 never
frees the last allocation on truncate; stat always reports 1, 2, 4 or 8 blocks
in use for a zero-length file depending on the underlying filesystem block
size, unlike (say) ext3/4 where it reports 0. I presume this is intentional so
maybe some corner case where it *is* trying to do that is the root of the
problem.

That is because gfs2 uses a block for each inode, so it makes sense to 
account for it in that way. For ext* the inodes are in separate areas of 
the disk, only use up a partial block, and are counted separately. So it 
is a historical design difference I think,


Steve.



Re: [Cluster-devel] [GFS2 PATCH] gfs2: Implement special writepage for ail start operations

2019-01-03 Thread Steven Whitehouse

Hi,

On 02/01/2019 21:12, Bob Peterson wrote:

Hi,

This patch is a working prototype to fix the hangs that result from
doing certain jdata I/O, primarily xfstests/269.

My earlier prototypes used a special "fs_specific" wbc flag that I suspect
the upstream community wouldn't like. So this version is a bit bigger and
more convoluted but accomplishes the same basic thing without the special
wbc flag.

It works by implementing a special new version of writepage that's used
for writing pages from an ail sync operation, as opposed to writes done
on behalf of an inode write operation. So far I've done more than 125
iterations of test 269 which consistently failed on the first iteration
before the patch.

Since jdata and ordered pages can both be put on the ail lists, I had
to add a special check in function start_writepage to see how to handle
the page.

It may not be perfect, but I wanted to get reactions from developers
to see if I'm off base or forgetting something.

Signed-off-by: Bob Peterson 
---
  fs/gfs2/aops.c  | 28 +++-
  fs/gfs2/aops.h  |  4 
  fs/gfs2/log.c   | 37 +++--
  fs/gfs2/log.h   |  3 ++-
  fs/gfs2/super.c |  2 +-
  5 files changed, 65 insertions(+), 9 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 05dd78f4b2b3..bd6cb49f3b9a 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -192,6 +192,32 @@ static int __gfs2_jdata_writepage(struct page *page, 
struct writeback_control *w
return gfs2_write_full_page(page, gfs2_get_block_noalloc, wbc);
  }
  
+/**

+ * gfs2_ail_writepage - Write complete page from the ail list
+ * @page: Page to write
+ * @wbc: The writeback control
+ *
+ * Returns: errno
+ *
+ */
+
+int gfs2_ail_writepage(struct page *page, struct writeback_control *wbc)
+{
+   struct inode *inode = page->mapping->host;
+   struct gfs2_inode *ip = GFS2_I(inode);
+   struct gfs2_sbd *sdp = GFS2_SB(inode);
+
+   if (gfs2_assert_withdraw(sdp, gfs2_glock_is_held_excl(ip->i_gl)))
+   goto out;
+   if (current->journal_info == NULL)
+   return gfs2_write_full_page(page, gfs2_get_block_noalloc, wbc);
+
+   redirty_page_for_writepage(wbc, page);
+out:
+   unlock_page(page);
+   return 0;
+}
+


Since this is only called from the ail flushing code, we cannot be in a 
transaction and in any case we do not want to defer the writeback, so 
this whole thing can be replaced by a simple call to gfs2_write_full_page()




  /**
   * gfs2_jdata_writepage - Write complete page
   * @page: Page to write
@@ -201,7 +227,7 @@ static int __gfs2_jdata_writepage(struct page *page, struct 
writeback_control *w
   *
   */
  
-static int gfs2_jdata_writepage(struct page *page, struct writeback_control *wbc)
+int gfs2_jdata_writepage(struct page *page, struct writeback_control *wbc)
  {
struct inode *inode = page->mapping->host;
struct gfs2_inode *ip = GFS2_I(inode);
diff --git a/fs/gfs2/aops.h b/fs/gfs2/aops.h
index fa8e5d0144dd..cbaaa372edc4 100644
--- a/fs/gfs2/aops.h
+++ b/fs/gfs2/aops.h
@@ -15,5 +15,9 @@ extern int gfs2_stuffed_write_end(struct inode *inode, struct 
buffer_head *dibh,
  extern void adjust_fs_space(struct inode *inode);
  extern void gfs2_page_add_databufs(struct gfs2_inode *ip, struct page *page,
   unsigned int from, unsigned int len);
+extern int gfs2_jdata_writepage(struct page *page,
+   struct writeback_control *wbc);
+extern int gfs2_ail_writepage(struct page *page,
+ struct writeback_control *wbc);
  
  #endif /* __AOPS_DOT_H__ */

diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 5bfaf381921a..1c70471fb07d 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -23,6 +23,7 @@
  #include 
  #include 
  
+#include "aops.h"

  #include "gfs2.h"
  #include "incore.h"
  #include "bmap.h"
@@ -82,18 +83,40 @@ static void gfs2_remove_from_ail(struct gfs2_bufdata *bd)
brelse(bd->bd_bh);
  }
  
+/*
+ * Function used by generic_writepages to call the real writepage
+ * function and set the mapping flags on error
+ */
+static int start_writepage(struct page *page, struct writeback_control *wbc,
+  void *data)
+{
+   struct address_space *mapping = page->mapping;
+   int ail_start = *(int *)data;
+   int ret;
+   int (*writepage) (struct page *page, struct writeback_control *wbc);
+
+   writepage = mapping->a_ops->writepage;
+
+   if (ail_start && writepage == gfs2_jdata_writepage)
+   writepage = gfs2_ail_writepage;
+   ret = writepage(page, wbc);
+   mapping_set_error(mapping, ret);
+   return ret;
+}
+
If you are passing this directly to write_cache_pages() when we need it, 
then why do we need the switch to a different writepage here? In both 
cases we are doing the same thing - writing back in-place data after it 
has been through the journal, so we should be able to use the same 

Re: [Cluster-devel] [GFS2 PATCH] gfs2: Panic when an io error occurs writing

2018-12-19 Thread Steven Whitehouse

Hi,

On 18/12/2018 16:09, Mark Syms wrote:

Thanks Bob,

We believe we have seen these issues from time to time in our automated testing, 
but I suspect that they're indicating a configuration problem with the backing 
storage. For flexibility, a proportion of our purely functional testing will use 
storage provided by a VM running a software iSCSI target, and these tests seem 
to be somewhat susceptible to getting I/O errors, some of which will inevitably 
end up being in the journal. If we start to see a lot we'll need to look at the 
config of the VMs first, I think.

Mark.


I think there are a few things here... firstly Bob is right that in 
general if we are going to retry I/O, then this would be done at the 
block layer, by multipath for example. However, having a way to 
gracefully deal with failure aside from fencing/rebooting a node is useful.


One issue with that is tracking outstanding I/O. For the journal we do 
that anyway, since we count the number of in flight I/Os. In other cases 
this is more difficult, for example where we use the VFS library 
functions for readpages/writepages. If we were able to track all the I/O 
that GFS2 produces and be certain to be able to turn off future I/O (or 
writes at least) internally then we could avoid using the dm based 
solution for withdraw that we currently have. That would be an 
improvement in terms of reliability.


The other issue is the one that Bob has been looking at, namely a way to 
signal that recovery is due, but without requiring fencing. If we can 
solve both of those issues, then that would certainly go a long way 
towards improving this,


Steve.




-Original Message-
From: Bob Peterson 
Sent: 18 December 2018 15:52
To: Mark Syms 
Cc: cluster-devel@redhat.com
Subject: Re: [Cluster-devel] [GFS2 PATCH] gfs2: Panic when an io error occurs 
writing

- Original Message -

Hi Bob,

I agree, it's a hard problem. I'm just trying to understand that we've
done the absolute best we can and that if this condition is hit then
the best solution really is to just kill the node. I guess it's also a
question of how common this actually ends up being. We have now got
customers starting to use GFS2 for VM storage on XenServer so I guess
we'll just have to see how many support calls we get in on it.

Thanks,

Mark.

Hi Mark,

I don't expect the problem to be very common in the real world.
The user has to get IO errors while writing to the GFS2 journal, which is not 
very common. The patch is basically reacting to a phenomenon we recently 
started noticing in which the HBA (qla2xxx) driver shuts down and stops 
accepting requests when you do abnormal reboots (which we sometimes do to test 
node recovery). In these cases, the node doesn't go down right away.
It stays up just long enough to cause IO errors with subsequent withdraws, 
which, we discovered, results in file system corruption.
Normal reboots, "/sbin/reboot -fin", and "echo b > /proc/sysrq-trigger" should 
not have this problem, nor should node fencing, etc.

And like I said, I'm open to suggestions on how to fix it. I wish there was a 
better solution.

As it is, I'd kind of like to get something into this merge window for the 
upstream kernel, but I'll need to submit the pull request for that probably 
tomorrow or Thursday. If we find a better solution, we can always revert these 
changes and implement a new one.

Regards,

Bob Peterson





Re: [Cluster-devel] [GFS2 PATCH] gfs2: Panic when an io error occurs writing to the journal

2018-12-17 Thread Steven Whitehouse

Hi,

On 17/12/2018 09:04, Edwin Török wrote:

On 17/12/2018 13:54, Bob Peterson wrote:

Hi,

Before this patch, gfs2 would try to withdraw when it encountered
io errors writing to its journal. That's incorrect behavior
because if it can't write to the journal, it cannot write revokes
for the metadata it sends down. A withdraw will cause gfs2 to
unmount the file system from dlm, which is a controlled shutdown,
but the io error means it cannot write the UNMOUNT log header
to the journal. The controlled shutdown will cause dlm to release
all its locks, allowing other nodes to update the metadata.
When the node rejoins the cluster and sees no UNMOUNT log header
it will see the journal is dirty and replay it, but after the
other nodes may have changed the metadata, thus corrupting the
file system.

If we get an io error writing to the journal, the only correct
thing to do is to kernel panic.

Hi,

That may be required for correctness, however are we sure there is no
other way to force the DLM recovery (or can another mechanism be
introduced)?
Consider that there might be multiple GFS2 filesystems mounted from
different iSCSI backends, just because one of them encountered an I/O
error the other ones may still be good to continue.
(Also the host might have other filesystems mounted: local, NFS, it
might still be able to perform I/O on those, so bringing the whole host
down would be best avoided).

Best regards,
--Edwin


Indeed. I think the issue here is that we need to ensure that the other 
cluster nodes understand what has happened. At the moment the mechanism 
for that is that the node is fenced; panicking, while not ideal, does at 
least mean that will definitely happen.


I agree though that we want something better longer term,

Steve.


That will force dlm to go through
its full recovery process on the other cluster nodes, freeze all
locks, and make sure the journal is replayed by a node in the
cluster before any other nodes get the affected locks and try to
modify the metadata in the unfinished portion of the journal.

This patch changes the behavior so that io errors encountered
in the journals cause an immediate kernel panic with a message.
However, quota update errors are still allowed to withdraw as
before.

Signed-off-by: Bob Peterson 
---
  fs/gfs2/lops.c | 8 +++-
  1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 94dcab655bc0..44b85f7675d4 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -209,11 +209,9 @@ static void gfs2_end_log_write(struct bio *bio)
struct page *page;
int i;
  
-	if (bio->bi_status) {
-		fs_err(sdp, "Error %d writing to journal, jid=%u\n",
-		       bio->bi_status, sdp->sd_jdesc->jd_jid);
-		wake_up(&sdp->sd_logd_waitq);
-	}
+   if (bio->bi_status)
+   panic("Error %d writing to journal, jid=%u\n", bio->bi_status,
+ sdp->sd_jdesc->jd_jid);
  
 	bio_for_each_segment_all(bvec, bio, i) {
 		page = bvec->bv_page;





Re: [Cluster-devel] [PATCH 0/2] gfs2: improvements to recovery and withdraw process (v2)

2018-11-22 Thread Steven Whitehouse

Hi,


On 21/11/18 18:52, Bob Peterson wrote:

Hi,

This is a second draft of a two-patch set to fix some of the nasty
journal recovery problems I've found lately.

The original post from 08 November had horribly bad and inaccurate
comments, as Steve Whitehouse and Andreas Gruenbacher pointed out.
This version is hopefully better and more accurately describes what
the patches do and how they work. Also, I fixed a superblock flag
that was improperly declared as a glock flag.

Other than the renamed and re-valued superblock flag, the code
remains unchanged from the previous version. It probably needs a bit
more testing, but it seems to work well.
---
The problems have to do with file system corruption caused when recovery
replays a journal after the resource group blocks have been unlocked
by the recovery process. In other words, when no cluster node takes
responsibility to replay the journal of a withdrawing node, then it
gets replayed later on, after the blocks' contents have been changed.

The first patch prevents gfs2 from attempting recovery if the file system
is withdrawn or has journal IO errors. Trying to recover your own journal
from either of these unstable conditions is dangerous and likely to corrupt
the file system.

The second patch is more extensive. When a node withdraws from a file system
it signals all other nodes with the file system mounted to perform recovery
on its journal, since it cannot safely recover its own journal. This is
accomplished by a new non-disk callback glop used exclusively by the
"live" glock, which sets up an lvb in the glock to indicate which
journal(s) need to be recovered.

Regards,

Bob Peterson
---
Bob Peterson (2):
   gfs2: Ignore recovery attempts if gfs2 has io error or is withdrawn
   gfs2: initiate journal recovery as soon as a node withdraws

  fs/gfs2/glock.c|  5 ++-
  fs/gfs2/glops.c| 47 +++
  fs/gfs2/incore.h   |  3 ++
  fs/gfs2/lock_dlm.c | 95 ++
  fs/gfs2/log.c  | 62 --
  fs/gfs2/super.c|  5 ++-
  fs/gfs2/super.h|  1 +
  fs/gfs2/util.c | 84 
  fs/gfs2/util.h | 13 +++
  9 files changed, 282 insertions(+), 33 deletions(-)


Yes, that looks a bit cleaner now,

Steve.



Re: [Cluster-devel] [PATCH 0/2] gfs2: improvements to recovery and withdraw process

2018-11-20 Thread Steven Whitehouse

Hi,


On 08/11/18 20:25, Bob Peterson wrote:

Hi,

This is a first draft of a two-patch set to fix some of the nasty
journal recovery problems I've found lately.

The problems have to do with file system corruption caused when recovery
replays a journal after the resource group blocks have been unlocked
by the recovery process. In other words, when no cluster node takes
responsibility to replay the journal of a withdrawing node, then it
gets replayed later on, after the blocks' contents have been changed.

The first patch prevents gfs2 from attempting recovery if the file system
is withdrawn or has journal IO errors. Trying to recover your own journal
from either of these unstable conditions is dangerous and likely to corrupt
the file system.

That sounds sensible to me.


The second patch is more extensive. When a node withdraws from a file system
it first empties out all outstanding pages in the ail lists, then it
How are we doing this? Since the disk can no longer be written to, there 
are two cases we need to cover. One is for dirty but not yet written 
pages. The other is for pages in flight - these will need to either time 
out or complete somehow.



signals all other nodes with the file system mounted to perform recovery
on its journal since it cannot safely recover its own journal. This is
accomplished by a new non-disk callback glop used exclusively by the
"live" glock, which sets up an lvb in the glock to indicate which
journal(s) need to be replayed. This system makes it necessary to prevent
recursion, since the journal operations themselves (i.e. the ones that
empty out the ail list on withdraw) can also withdraw. Thus, the withdraw
We should ignore any further I/O errors after we have withdrawn I think, 
since we know that no further disk writes can take place anyway. These 
will be completed as EIO by dm. As you say we definitely don't want the 
node that is withdrawing replaying its own journal. That should be done 
by the remaining nodes in the cluster.


The other question is should we just use the "normal" recovery process 
which would fence the withdrawn node, or whether we should have a 
different system which avoids the fencing, since we have effectively 
self-fenced from the storage. Looking at the patch I assume that perhaps 
this implements the latter?


Steve.



system is now separated into "journal" and "non-journal" withdraws.
Also, the "withdraw" flag is now replaced by a superblock bit because
once the file system withdraws in this way, it needs to remember that from
that point on.

Regards,

Bob Peterson
---
Bob Peterson (2):
   gfs2: Ignore recovery attempts if gfs2 has io error or is withdrawn
   gfs2: initiate journal recovery as soon as a node withdraws

  fs/gfs2/glock.c|  5 ++-
  fs/gfs2/glops.c| 47 +++
  fs/gfs2/incore.h   |  3 ++
  fs/gfs2/lock_dlm.c | 95 ++
  fs/gfs2/log.c  | 62 --
  fs/gfs2/super.c|  5 ++-
  fs/gfs2/super.h|  1 +
  fs/gfs2/util.c | 84 
  fs/gfs2/util.h | 13 +++
  9 files changed, 282 insertions(+), 33 deletions(-)





Re: [Cluster-devel] [PATCH 03/13] GFS2: Eliminate a goto in finish_xmote

2018-11-20 Thread Steven Whitehouse

Hi,


On 19/11/18 21:26, Bob Peterson wrote:

Hi,

- Original Message -


On 19/11/18 13:29, Bob Peterson wrote:

This is another baby step toward a better glock state machine.
This patch eliminates a goto in function finish_xmote so we can
begin unraveling the cryptic logic with later patches.

Signed-off-by: Bob Peterson 
---
   fs/gfs2/glock.c | 11 +--
   1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 5f2156f15f05..6e9d53583b73 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -472,11 +472,11 @@ static void finish_xmote(struct gfs2_glock *gl,
unsigned int ret)
				list_move_tail(&gh->gh_list, &gl->gl_holders);
gh = find_first_waiter(gl);
gl->gl_target = gh->gh_state;
-   goto retry;
-   }
-   /* Some error or failed "try lock" - report it */
-   if ((ret & LM_OUT_ERROR) ||
-   (gh->gh_flags & (LM_FLAG_TRY | LM_FLAG_TRY_1CB))) {
+   state = LM_ST_UNLOCKED;

I'm not sure what you are trying to achieve here, but setting the state
to LM_ST_UNLOCKED when it is quite possible that it is not that state,
doesn't really seem to improve anything. Indeed, it looks more confusing
to me, at least it was fairly clear before that the intent was to retry
the operation which has been canceled.

When finish_xmote hits this affected section of code, it's because the dlm
returned a state different from the intended state. Before this patch, it
did "goto retry" which jumps to the label inside the switch state that
handles LM_ST_UNLOCKED, after which it simply unlocks and returns.

Changing local variable "state" merely forces the code to take the same
codepath in which it calls do_xmote, unlocking and returning as it does today,
but without the goto. This makes the function more suitable to the new
autonomous state machine, which is added in a later patch.

The addition of "else if" is needed so it doesn't go down the wrong code path
at the comment: /* Some error or failed "try lock" - report it */
The logic is a bit tricky here, but is preserved from the original.

Most of the subsequent patches aren't quite as mind-bending, I promise. :)

Regards,

Bob Peterson
I can see that it is doing the same thing as before, but it is less 
clear why. The point about the retry label is that it is telling us what 
is going to do. Setting the state to LM_ST_UNLOCKED is more confusing, 
because the state might not be LM_ST_UNLOCKED at this point, and you are 
now forcing that state in order to get the same code path as before. 
There is no real advantage compared with the previous code that I can 
see, except that it is more confusing,


Steve.



Re: [Cluster-devel] [PATCH 02/13] GFS2: Make do_xmote determine its own gh parameter

2018-11-20 Thread Steven Whitehouse

Hi,


On 19/11/18 21:06, Bob Peterson wrote:

Hi Steve,

- Original Message -


On 19/11/18 13:29, Bob Peterson wrote:

This is another baby step toward a better glock state machine.
Before this patch, do_xmote was called with a gh parameter, but
only for promotes, not demotes. This patch allows do_xmote to
determine the gh autonomously.

Signed-off-by: Bob Peterson 

(snip)


Since gh is apparently only used to get the lock flags, it would make
more sense just to pass the lock flags rather than add in an additional
find_first_waiter() call,

Steve.

Perhaps I didn't put enough info into the comments for this patch.

I need to get rid of the gh parameter in order to make the glock
state machine fully autonomous. In other words, function do_xmote will
become a state in the (stand alone) state machine, which itself does not
require a gh parameter and may be called from several places under
several conditions. The state of the glock will determine that it needs
to call do_xmote, but do_xmote needs to figure it out on its own.
A function can't become a state in this sense. The state in this case is 
the content of struct gfs2_glock, and the functions define how you get 
from one state to another,



Before this patch, the caller does indeed know the gh pointer, but in
the future, it will replaced by a generic call to the state machine
which will not know it.

Regards,

Bob Peterson


That is not relevant to the point I was making though. The point is that 
if the flags are passed to do_xmote rather than the gh, then that 
resolves the issue of needing to pass the gh and reduces the amount of 
code, since you can pass 0 flags instead of NULL gh,


Steve.



Re: [Cluster-devel] gfs2: Remove vestigial bd_ops (version 2)

2018-11-20 Thread Steven Whitehouse




On 20/11/18 14:37, Bob Peterson wrote:

Hi,

Here is a new and improved version of the patch I posted on
16 November. Since the field is no longer needed, neither are
the function parameters used to allocate a bd.
---
Field bd_ops was set but never used, so I removed it, and all
code supporting it.

Signed-off-by: Bob Peterson 

Acked-by: Steven Whitehouse 

Steve.


---
  fs/gfs2/incore.h | 1 -
  fs/gfs2/log.c| 1 -
  fs/gfs2/trans.c  | 8 +++-
  3 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 888b62cfd6d1..663759abe60d 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -165,7 +165,6 @@ struct gfs2_bufdata {
u64 bd_blkno;
  
  	struct list_head bd_list;

-   const struct gfs2_log_operations *bd_ops;
  
  	struct gfs2_trans *bd_tr;

struct list_head bd_ail_st_list;
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 4dcd2b48189e..5bfaf381921a 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -605,7 +605,6 @@ void gfs2_add_revoke(struct gfs2_sbd *sdp, struct 
gfs2_bufdata *bd)
bd->bd_blkno = bh->b_blocknr;
gfs2_remove_from_ail(bd); /* drops ref on bh */
bd->bd_bh = NULL;
-	bd->bd_ops = &gfs2_revoke_lops;
sdp->sd_log_num_revoke++;
	atomic_inc(&gl->gl_revokes);
	set_bit(GLF_LFLUSH, &gl->gl_flags);
diff --git a/fs/gfs2/trans.c b/fs/gfs2/trans.c
index 423bc2d03dd8..cd9a94a6b5bb 100644
--- a/fs/gfs2/trans.c
+++ b/fs/gfs2/trans.c
@@ -124,15 +124,13 @@ void gfs2_trans_end(struct gfs2_sbd *sdp)
  }
  
  static struct gfs2_bufdata *gfs2_alloc_bufdata(struct gfs2_glock *gl,

-					       struct buffer_head *bh,
-					       const struct gfs2_log_operations *lops)
+					       struct buffer_head *bh)
  {
struct gfs2_bufdata *bd;
  
  	bd = kmem_cache_zalloc(gfs2_bufdata_cachep, GFP_NOFS | __GFP_NOFAIL);

bd->bd_bh = bh;
bd->bd_gl = gl;
-   bd->bd_ops = lops;
	INIT_LIST_HEAD(&bd->bd_list);
bh->b_private = bd;
return bd;
@@ -169,7 +167,7 @@ void gfs2_trans_add_data(struct gfs2_glock *gl, struct 
buffer_head *bh)
gfs2_log_unlock(sdp);
unlock_buffer(bh);
if (bh->b_private == NULL)
-		bd = gfs2_alloc_bufdata(gl, bh, &gfs2_databuf_lops);
+   bd = gfs2_alloc_bufdata(gl, bh);
else
bd = bh->b_private;
lock_buffer(bh);
@@ -210,7 +208,7 @@ void gfs2_trans_add_meta(struct gfs2_glock *gl, struct 
buffer_head *bh)
unlock_buffer(bh);
lock_page(bh->b_page);
if (bh->b_private == NULL)
-		bd = gfs2_alloc_bufdata(gl, bh, &gfs2_buf_lops);
+   bd = gfs2_alloc_bufdata(gl, bh);
else
bd = bh->b_private;
unlock_page(bh->b_page);





Re: [Cluster-devel] [PATCH 02/13] GFS2: Make do_xmote determine its own gh parameter

2018-11-19 Thread Steven Whitehouse




On 19/11/18 13:29, Bob Peterson wrote:

This is another baby step toward a better glock state machine.
Before this patch, do_xmote was called with a gh parameter, but
only for promotes, not demotes. This patch allows do_xmote to
determine the gh autonomously.

Signed-off-by: Bob Peterson 
---
  fs/gfs2/glock.c | 12 ++--
  1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 692784faa464..5f2156f15f05 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -60,7 +60,7 @@ struct gfs2_glock_iter {
  
  typedef void (*glock_examiner) (struct gfs2_glock * gl);
  
-static void do_xmote(struct gfs2_glock *gl, struct gfs2_holder *gh, unsigned int target);
+static void do_xmote(struct gfs2_glock *gl, unsigned int target);
  
  static struct dentry *gfs2_root;

  static struct workqueue_struct *glock_workqueue;
@@ -486,12 +486,12 @@ static void finish_xmote(struct gfs2_glock *gl, unsigned 
int ret)
/* Unlocked due to conversion deadlock, try again */
case LM_ST_UNLOCKED:
  retry:
-   do_xmote(gl, gh, gl->gl_target);
+   do_xmote(gl, gl->gl_target);
break;
/* Conversion fails, unlock and try again */
case LM_ST_SHARED:
case LM_ST_DEFERRED:
-   do_xmote(gl, gh, LM_ST_UNLOCKED);
+   do_xmote(gl, LM_ST_UNLOCKED);
break;
default: /* Everything else */
fs_err(gl->gl_name.ln_sbd, "wanted %u got %u\n",
@@ -528,17 +528,17 @@ static void finish_xmote(struct gfs2_glock *gl, unsigned 
int ret)
  /**
   * do_xmote - Calls the DLM to change the state of a lock
   * @gl: The lock state
- * @gh: The holder (only for promotes)
   * @target: The target lock state
   *
   */
  
-static void do_xmote(struct gfs2_glock *gl, struct gfs2_holder *gh, unsigned int target)
+static void do_xmote(struct gfs2_glock *gl, unsigned int target)
__releases(&gl->gl_lockref.lock)
__acquires(&gl->gl_lockref.lock)
  {
const struct gfs2_glock_operations *glops = gl->gl_ops;
struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
+   struct gfs2_holder *gh = find_first_waiter(gl);
unsigned int lck_flags = (unsigned int)(gh ? gh->gh_flags : 0);
int ret;
  
@@ -659,7 +659,7 @@ __acquires(&gl->gl_lockref.lock)

if (!(gh->gh_flags & (LM_FLAG_TRY | LM_FLAG_TRY_1CB)))
do_error(gl, 0); /* Fail queued try locks */
}
-   do_xmote(gl, gh, gl->gl_target);
+   do_xmote(gl, gl->gl_target);
return;
  }
  
Since gh is apparently only used to get the lock flags, it would make 
more sense just to pass the lock flags rather than add in an additional 
find_first_waiter() call,


Steve.



Re: [Cluster-devel] [PATCH 03/13] GFS2: Eliminate a goto in finish_xmote

2018-11-19 Thread Steven Whitehouse




On 19/11/18 13:29, Bob Peterson wrote:

This is another baby step toward a better glock state machine.
This patch eliminates a goto in function finish_xmote so we can
begin unraveling the cryptic logic with later patches.

Signed-off-by: Bob Peterson 
---
  fs/gfs2/glock.c | 11 +--
  1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 5f2156f15f05..6e9d53583b73 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -472,11 +472,11 @@ static void finish_xmote(struct gfs2_glock *gl, unsigned 
int ret)
				list_move_tail(&gh->gh_list, &gl->gl_holders);
gh = find_first_waiter(gl);
gl->gl_target = gh->gh_state;
-   goto retry;
-   }
-   /* Some error or failed "try lock" - report it */
-   if ((ret & LM_OUT_ERROR) ||
-   (gh->gh_flags & (LM_FLAG_TRY | LM_FLAG_TRY_1CB))) {
+   state = LM_ST_UNLOCKED;
I'm not sure what you are trying to achieve here, but setting the state 
to LM_ST_UNLOCKED when it is quite possible that it is not that state, 
doesn't really seem to improve anything. Indeed, it looks more confusing 
to me, at least it was fairly clear before that the intent was to retry 
the operation which has been canceled.



+   } else if ((ret & LM_OUT_ERROR) ||
+  (gh->gh_flags & (LM_FLAG_TRY |
+   LM_FLAG_TRY_1CB))) {
+   /* An error or failed "try lock" - report it */
gl->gl_target = gl->gl_state;
do_error(gl, ret);
goto out;
@@ -485,7 +485,6 @@ static void finish_xmote(struct gfs2_glock *gl, unsigned 
int ret)
switch(state) {
/* Unlocked due to conversion deadlock, try again */
case LM_ST_UNLOCKED:
-retry:
do_xmote(gl, gl->gl_target);
break;
/* Conversion fails, unlock and try again */




Re: [Cluster-devel] [DLM PATCH] dlm: Don't swamp the CPU with callbacks queued during recovery

2018-11-09 Thread Steven Whitehouse

Hi,


On 08/11/18 19:04, Bob Peterson wrote:

Hi,

Before this patch, recovery would cause all callbacks to be delayed,
put on a queue, and afterward they were all queued to the callback
work queue. This patch does the same thing, but occasionally takes
a break after 25 of them so it won't swamp the CPU at the expense
of other RT processes like corosync.

Signed-off-by: Bob Peterson 
---
  fs/dlm/ast.c | 10 ++
  1 file changed, 10 insertions(+)

diff --git a/fs/dlm/ast.c b/fs/dlm/ast.c
index 562fa8c3edff..47ee66d70109 100644
--- a/fs/dlm/ast.c
+++ b/fs/dlm/ast.c
@@ -292,6 +292,8 @@ void dlm_callback_suspend(struct dlm_ls *ls)
flush_workqueue(ls->ls_callback_wq);
  }
  
+#define MAX_CB_QUEUE 25
+
  void dlm_callback_resume(struct dlm_ls *ls)
  {
struct dlm_lkb *lkb, *safe;
@@ -302,15 +304,23 @@ void dlm_callback_resume(struct dlm_ls *ls)
if (!ls->ls_callback_wq)
return;
  
+more:
 	mutex_lock(&ls->ls_cb_mutex);
 	list_for_each_entry_safe(lkb, safe, &ls->ls_cb_delay, lkb_cb_list) {
 		list_del_init(&lkb->lkb_cb_list);
 		queue_work(ls->ls_callback_wq, &lkb->lkb_cb_work);
 		count++;
+		if (count == MAX_CB_QUEUE)
+			break;
 	}
 	mutex_unlock(&ls->ls_cb_mutex);
  
 	if (count)
 		log_rinfo(ls, "dlm_callback_resume %d", count);
+   if (count == MAX_CB_QUEUE) {
+   count = 0;
+   cond_resched();
+   goto more;
+   }
  }
  



While that is a good thing to do, it looks like the real culprit here 
might be elsewhere. Look at what this is doing... adding a large number 
of work items under the ls_cb_mutex, and then look at what the work item 
does... first thing is to lock the lkb_cb_mutex, so if we have a 
multi-core system then this is creating a large number of work items all 
of which will be fighting each other (and the thread that is trying to 
add new items) for the lock so no wonder it doesn't work efficiently.


If we called the callbacks directly here, then we would avoid all that 
fighting for the mutex and also remove the need for scheduling the work 
item in the first place too. That should greatly decrease the amount of 
cpu time required and reduce latency and contention on the mutex,


Steve.



Re: [Cluster-devel] [PATCH] gfs2: Put bitmap buffers in put_super

2018-11-06 Thread Steven Whitehouse

Hi,

While that looks like a good fix for now, we should look at this again 
in due course. Why do we have a ref to the rgrp buffers held here in the 
first place? Unless the buffers are pinned in the journal there should 
not be a ref held, otherwise they cannot respond to memory pressure,


Steve.


On 06/11/18 09:39, Andreas Gruenbacher wrote:

gfs2_put_super calls gfs2_clear_rgrpd to destroy the gfs2_rgrpd objects
attached to the resource group glocks.  That function should release the
buffers attached to the gfs2_bitmap objects (bi_bh), but the call to
gfs2_rgrp_brelse for doing that is missing.

When gfs2_releasepage later runs across these buffers which are still
referenced, it refuses to free them.  This causes the pages the buffers
are attached to to remain referenced as well.  With enough mount/unmount
cycles, the system will eventually run out of memory.

Fix this by adding the missing call to gfs2_rgrp_brelse in
gfs2_clear_rgrpd.

(Also fix a gfs2_rgrp_relse -> gfs2_rgrp_brelse typo in a comment.)

Fixes: 39b0f1e92908 ("GFS2: Don't brelse rgrp buffer_heads every allocation")
Signed-off-by: Andreas Gruenbacher 
---
  fs/gfs2/rgrp.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index ffe3032b1043..b08a530433ad 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -733,6 +733,7 @@ void gfs2_clear_rgrpd(struct gfs2_sbd *sdp)
  
  		if (gl) {

glock_clear_object(gl, rgd);
+   gfs2_rgrp_brelse(rgd);
gfs2_glock_put(gl);
}
  
@@ -1174,7 +1175,7 @@ static u32 count_unlinked(struct gfs2_rgrpd *rgd)

   * @rgd: the struct gfs2_rgrpd describing the RG to read in
   *
   * Read in all of a Resource Group's header and bitmap blocks.
- * Caller must eventually call gfs2_rgrp_relse() to free the bitmaps.
+ * Caller must eventually call gfs2_rgrp_brelse() to free the bitmaps.
   *
   * Returns: errno
   */




Re: [Cluster-devel] [GFS2 PATCH 0/4] jhead lookup using bios

2018-10-17 Thread Steven Whitehouse

Hi,

This all looks good to me, so now we just need lots of testing. Are you 
still seeing good speed ups vs the current code?


Steve.


On 16/10/18 05:07, Abhi Das wrote:

This is my latest version of this patchset based on inputs from Andreas
and Steve.
We readahead the journal sequentially in large chunks using bios. Pagecache
pages for the journal inode's mapping are used for the I/O.

There's also some cleanup of the bio functions with this patchset.

xfstests ran to completion with this.

Abhi Das (4):
   gfs2: add more timing info to journal recovery process
   gfs2: changes to gfs2_log_XXX_bio
   gfs2: add a helper function to get_log_header that can be used
 elsewhere
   gfs2: read journal in large chunks to locate the head

  fs/gfs2/bmap.c   |   8 +-
  fs/gfs2/glops.c  |   1 +
  fs/gfs2/log.c|   4 +-
  fs/gfs2/lops.c   | 240 +++
  fs/gfs2/lops.h   |   4 +-
  fs/gfs2/ops_fstype.c |   1 +
  fs/gfs2/recovery.c   | 178 --
  fs/gfs2/recovery.h   |   4 +-
  fs/gfs2/super.c  |   1 +
  9 files changed, 255 insertions(+), 186 deletions(-)





Re: [Cluster-devel] [PATCH 7/9] gfs2: Fix marking bitmaps non-full

2018-10-12 Thread Steven Whitehouse

Hi,


On 12/10/18 13:06, Bob Peterson wrote:

- Original Message -

Hi,
The series looks good I think. This one though looks like a bug fix and
should probably go to -stable too?

Steve.

I concur. So can I add your reviewed-by before I push?

Bob Peterson


Yes, please do,

Steve.



Re: [Cluster-devel] [PATCH 7/9] gfs2: Fix marking bitmaps non-full

2018-10-12 Thread Steven Whitehouse

Hi,


On 11/10/18 20:20, Andreas Gruenbacher wrote:

Reservations in gfs can span multiple gfs2_bitmaps (but they won't span
multiple resource groups).  When removing a reservation, we want to
clear the GBF_FULL flags of all involved gfs2_bitmaps, not just that of
the first bitmap.

Signed-off-by: Andreas Gruenbacher 
---
  fs/gfs2/rgrp.c | 13 +++--
  1 file changed, 11 insertions(+), 2 deletions(-)
The series looks good I think. This one though looks like a bug fix and 
should probably go to -stable too?


Steve.



diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index f47c76d9d9d0..7c5904c49a6a 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -641,7 +641,10 @@ static void __rs_deltree(struct gfs2_blkreserv *rs)
	RB_CLEAR_NODE(&rs->rs_node);

	if (rs->rs_free) {
-		struct gfs2_bitmap *bi = rbm_bi(&rs->rs_rbm);
+		u64 last_block = gfs2_rbm_to_block(&rs->rs_rbm) +
+				 rs->rs_free - 1;
+		struct gfs2_rbm last_rbm = { .rgd = rs->rs_rbm.rgd, };
+		struct gfs2_bitmap *start, *last;

		/* return reserved blocks to the rgrp */
		BUG_ON(rs->rs_rbm.rgd->rd_reserved < rs->rs_free);
@@ -652,7 +655,13 @@ static void __rs_deltree(struct gfs2_blkreserv *rs)
		   it will force the number to be recalculated later. */
		rgd->rd_extfail_pt += rs->rs_free;
		rs->rs_free = 0;
-		clear_bit(GBF_FULL, &bi->bi_flags);
+		if (gfs2_rbm_from_block(&last_rbm, last_block))
+			return;
+		start = rbm_bi(&rs->rs_rbm);
+		last = rbm_bi(&last_rbm);
+		do
+			clear_bit(GBF_FULL, &start->bi_flags);
+		while (start++ != last);
	}
}
  




Re: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find insert glock

2018-10-09 Thread Steven Whitehouse




On 09/10/18 09:13, Mark Syms wrote:

Having swapped the line below around we still see the timeout on schedule fire, 
but only once in
a fairly mega stress test. This is why we weren't worried about the timeout 
being HZ, the situation
is hardly ever hit as having to wait is rare and normally we are woken from 
schedule and without
a timeout on schedule we never wake up so a rare occurrence of waiting a second 
really doesn't
seem too bad.

Mark.
We should still get to the bottom of why the wake up is missing though, 
since without that fix we won't know if there is something else wrong 
somewhere,


Steve.



-Original Message-
From: Tim Smith 
Sent: 08 October 2018 14:27
To: Steven Whitehouse 
Cc: Mark Syms ; cluster-devel@redhat.com; Ross Lagerwall 

Subject: Re: [Cluster-devel] [PATCH 1/2] GFS2: use schedule timeout in find 
insert glock

On Monday, 8 October 2018 14:13:10 BST Steven Whitehouse wrote:

Hi,

On 08/10/18 14:10, Tim Smith wrote:

On Monday, 8 October 2018 14:03:24 BST Steven Whitehouse wrote:

On 08/10/18 13:59, Mark Syms wrote:

That sounds entirely reasonable so long as you are absolutely sure
that nothing is ever going to mess with that glock, we erred on
the side of more caution not knowing whether it would be guaranteed safe or not.

Thanks,

Mark

We should have a look at the history to see how that wait got added.
However the "dead" flag here means "don't touch this glock" and is
there so that we can separate the marking dead from the actual
removal from the list (which simplifies the locking during the
scanning procedures).

You beat me to it :-)

I think there might be a bit of a problem inserting a new entry with
the same name before the old entry has been fully destroyed (or at
least removed), which would be why the schedule() is there.

If the old entry is marked dead, all future lookups should ignore it.
We should only have a single non-dead entry at a time, but that
doesn't seem like it should need us to wait for it.

On the second call we do have the new glock to insert as arg2, so we could try 
to swap them cleanly, yeah.


If we do discover that the wait is really required, then it sounds
like as you mentioned above there is a lost wakeup, and that must
presumably be on a code path that sets the dead flag and then fails to
send a wake up later on. If we can drop the wait in the first place,
that seems like a better plan,

Ooooh, I wonder if these two lines:

wake_up_glock(gl);
call_rcu(>gl_rcu, gfs2_glock_dealloc);

in gfs2_glock_free() are the wrong way round?

--
Tim Smith 






Re: [Cluster-devel] [PATCH 02/11] gfs2: Move rs_{sizehint, rgd_gh} fields into the inode

2018-10-08 Thread Steven Whitehouse

Hi,


On 05/10/18 20:18, Andreas Gruenbacher wrote:

Move the rs_sizehint and rs_rgd_gh fields from struct gfs2_blkreserv
into the inode: they are more closely related to the inode than to a
particular reservation.
Yes, that makes sense I think - these fields have moved around a bit 
during the discussions on getting this code right. The only real issue 
here is that the gh is quite large, which was why it was separate in the 
first place, but it probably makes more sense to do it this way,


Steve.


Signed-off-by: Andreas Gruenbacher 
---
  fs/gfs2/file.c   |  4 ++--
  fs/gfs2/incore.h |  6 ++
  fs/gfs2/main.c   |  2 ++
  fs/gfs2/rgrp.c   | 16 +++-
  4 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 08369c6cd127..e8864ff2ed03 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -347,8 +347,8 @@ static void gfs2_size_hint(struct file *filep, loff_t 
offset, size_t size)
size_t blks = (size + sdp->sd_sb.sb_bsize - 1) >> 
sdp->sd_sb.sb_bsize_shift;
int hint = min_t(size_t, INT_MAX, blks);
  
-	if (hint > atomic_read(&ip->i_res.rs_sizehint))
-		atomic_set(&ip->i_res.rs_sizehint, hint);
+	if (hint > atomic_read(&ip->i_sizehint))
+		atomic_set(&ip->i_sizehint, hint);
  }
  
  /**

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index b96d39c28e17..9d7d9bd8c3a9 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -309,10 +309,6 @@ struct gfs2_qadata { /* quota allocation data */
  */
  
  struct gfs2_blkreserv {

-   /* components used during write (step 1): */
-   atomic_t rs_sizehint; /* hint of the write size */
-
-   struct gfs2_holder rs_rgd_gh; /* Filled in by get_local_rgrp */
struct rb_node rs_node;   /* link to other block reservations */
struct gfs2_rbm rs_rbm;   /* Start of reservation */
u32 rs_free;  /* how many blocks are still free */
@@ -417,8 +413,10 @@ struct gfs2_inode {
struct gfs2_holder i_iopen_gh;
struct gfs2_holder i_gh; /* for prepare/commit_write only */
struct gfs2_qadata *i_qadata; /* quota allocation data */
+   struct gfs2_holder i_rgd_gh;
struct gfs2_blkreserv i_res; /* rgrp multi-block reservation */
u64 i_goal; /* goal block for allocations */
+   atomic_t i_sizehint;  /* hint of the write size */
struct rw_semaphore i_rw_mutex;
struct list_head i_ordered;
struct list_head i_trunc_list;
diff --git a/fs/gfs2/main.c b/fs/gfs2/main.c
index 2d55e2cc..c7603063f861 100644
--- a/fs/gfs2/main.c
+++ b/fs/gfs2/main.c
@@ -39,9 +39,11 @@ static void gfs2_init_inode_once(void *foo)
struct gfs2_inode *ip = foo;
  
	inode_init_once(&ip->i_inode);

+	atomic_set(&ip->i_sizehint, 0);
	init_rwsem(&ip->i_rw_mutex);
	INIT_LIST_HEAD(&ip->i_trunc_list);
	ip->i_qadata = NULL;
+	gfs2_holder_mark_uninitialized(&ip->i_rgd_gh);
	memset(&ip->i_res, 0, sizeof(ip->i_res));
	RB_CLEAR_NODE(&ip->i_res.rs_node);
ip->i_hash_cache = NULL;
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index c9caddc2627c..34122c546576 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1564,7 +1564,7 @@ static void rg_mblk_search(struct gfs2_rgrpd *rgd, struct 
gfs2_inode *ip,
if (S_ISDIR(inode->i_mode))
extlen = 1;
else {
-		extlen = max_t(u32, atomic_read(&rs->rs_sizehint), ap->target);
+		extlen = max_t(u32, atomic_read(&ip->i_sizehint), ap->target);
extlen = clamp(extlen, RGRP_RSRV_MINBLKS, free_blocks);
}
if ((rgd->rd_free_clone < rgd->rd_reserved) || (free_blocks < extlen))
@@ -2076,7 +2076,7 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, struct 
gfs2_alloc_parms *ap)
}
error = gfs2_glock_nq_init(rs->rs_rbm.rgd->rd_gl,
   LM_ST_EXCLUSIVE, flags,
-					   &rs->rs_rgd_gh);
+					   &ip->i_rgd_gh);
if (unlikely(error))
return error;
if (!gfs2_rs_active(rs) && (loops < 2) &&
@@ -2085,7 +2085,7 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, struct 
gfs2_alloc_parms *ap)
if (sdp->sd_args.ar_rgrplvb) {
error = update_rgrp_lvb(rs->rs_rbm.rgd);
if (unlikely(error)) {
-				gfs2_glock_dq_uninit(&rs->rs_rgd_gh);
+				gfs2_glock_dq_uninit(&ip->i_rgd_gh);
return error;
}
}
@@ -2128,7 +2128,7 @@ int gfs2_inplace_reserve(struct gfs2_inode *ip, struct 
gfs2_alloc_parms *ap)
  
  		/* Unlock rgrp if required */

if (!rg_locked)
-   

Re: [Cluster-devel] [PATCH 11/11] gfs2: Add local resource group locking

2018-10-08 Thread Steven Whitehouse

Hi,


On 05/10/18 20:18, Andreas Gruenbacher wrote:

From: Bob Peterson 

Prepare for treating resource group glocks as exclusive among nodes but
shared among all tasks running on a node: introduce another layer of
node-specific locking that the local tasks can use to coordinate their
accesses.

This patch only introduces the local locking changes necessary so that
future patches can introduce resource group glock sharing.  We replace
the resource group spinlock with a mutex; whether that leads to
noticeable additional contention on the resource group mutex remains to
be seen.

Signed-off-by: Andreas Gruenbacher 
---
  fs/gfs2/incore.h |  3 +-
  fs/gfs2/lops.c   |  5 ++-
  fs/gfs2/rgrp.c   | 97 +++-
  fs/gfs2/rgrp.h   |  4 ++
  4 files changed, 81 insertions(+), 28 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 932e63924f7e..2fa47b476eef 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -23,6 +23,7 @@
  #include 
  #include 
  #include 
+#include 
  
  #define DIO_WAIT	0x0010

  #define DIO_METADATA  0x0020
@@ -120,7 +121,7 @@ struct gfs2_rgrpd {
  #define GFS2_RDF_ERROR0x4000 /* error in rg */
  #define GFS2_RDF_PREFERRED0x8000 /* This rgrp is preferred */
  #define GFS2_RDF_MASK 0xf000 /* mask for internal flags */
-   spinlock_t rd_rsspin;   /* protects reservation related vars */
+   struct mutex rd_lock;
I can see why we might need to have additional local rgrp locking, but 
why do we need to make the rd_rsspin into a mutex? I'm wondering if 
these should not be separate locks still?


Steve.


struct rb_root rd_rstree;   /* multi-block reservation tree */
  };
  
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c

index 4c7069b8f3c1..a9e858e01c97 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -76,8 +76,9 @@ static void maybe_release_space(struct gfs2_bufdata *bd)
unsigned int index = bd->bd_bh->b_blocknr - gl->gl_name.ln_number;
struct gfs2_bitmap *bi = rgd->rd_bits + index;
  
+	rgrp_lock_local(rgd);

if (bi->bi_clone == NULL)
-   return;
+   goto out;
if (sdp->sd_args.ar_discard)
gfs2_rgrp_send_discards(sdp, rgd->rd_data0, bd->bd_bh, bi, 1, 
NULL);
memcpy(bi->bi_clone + bi->bi_offset,
@@ -85,6 +86,8 @@ static void maybe_release_space(struct gfs2_bufdata *bd)
	clear_bit(GBF_FULL, &bi->bi_flags);
rgd->rd_free_clone = rgd->rd_free;
rgd->rd_extfail_pt = rgd->rd_free;
+out:
+   rgrp_unlock_local(rgd);
  }
  
  /**

diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 8a6b41f3667c..a89be4782c15 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -702,10 +702,10 @@ void gfs2_rs_deltree(struct gfs2_blkreserv *rs)
  
  	rgd = rs->rs_rgd;

if (rgd) {
-		spin_lock(&rgd->rd_rsspin);
+		rgrp_lock_local(rgd);
		__rs_deltree(rs);
		BUG_ON(rs->rs_free);
-		spin_unlock(&rgd->rd_rsspin);
+		rgrp_unlock_local(rgd);
}
  }
  
@@ -737,12 +737,12 @@ static void return_all_reservations(struct gfs2_rgrpd *rgd)

struct rb_node *n;
struct gfs2_blkreserv *rs;
  
-	spin_lock(&rgd->rd_rsspin);
+	rgrp_lock_local(rgd);
	while ((n = rb_first(&rgd->rd_rstree))) {
		rs = rb_entry(n, struct gfs2_blkreserv, rs_node);
		__rs_deltree(rs);
	}
-	spin_unlock(&rgd->rd_rsspin);
+	rgrp_unlock_local(rgd);
  }
  
  void gfs2_clear_rgrpd(struct gfs2_sbd *sdp)

@@ -948,7 +948,7 @@ static int read_rindex_entry(struct gfs2_inode *ip)
rgd->rd_data0 = be64_to_cpu(buf.ri_data0);
rgd->rd_data = be32_to_cpu(buf.ri_data);
rgd->rd_bitbytes = be32_to_cpu(buf.ri_bitbytes);
-	spin_lock_init(&rgd->rd_rsspin);
+	mutex_init(&rgd->rd_lock);
  
  	error = compute_bitstructs(rgd);

if (error)
@@ -1469,9 +1469,11 @@ int gfs2_fitrim(struct file *filp, void __user *argp)
/* Trim each bitmap in the rgrp */
for (x = 0; x < rgd->rd_length; x++) {
struct gfs2_bitmap *bi = rgd->rd_bits + x;
+   rgrp_lock_local(rgd);
ret = gfs2_rgrp_send_discards(sdp,
rgd->rd_data0, NULL, bi, minlen,
				&amt);
+   rgrp_unlock_local(rgd);
if (ret) {
			gfs2_glock_dq_uninit(&gh);
goto out;
@@ -1483,9 +1485,11 @@ int gfs2_fitrim(struct file *filp, void __user *argp)
ret = gfs2_trans_begin(sdp, RES_RG_HDR, 0);
if (ret == 0) {
bh = rgd->rd_bits[0].bi_bh;
+   rgrp_lock_local(rgd);
  

Re: [Cluster-devel] [PATCH 10/11] gfs2: Pass resource group to rgblk_free

2018-10-08 Thread Steven Whitehouse




On 05/10/18 20:18, Andreas Gruenbacher wrote:

Function rgblk_free can only deal with one resource group at a time, so
pass that resource group in as a parameter.  Several of the callers
already have the resource group at hand, so we only need additional
lookup code in a few places.

Signed-off-by: Andreas Gruenbacher 

That looks like a good optimisation,

Steve.


---
  fs/gfs2/bmap.c  |  4 ++--
  fs/gfs2/dir.c   |  5 -
  fs/gfs2/rgrp.c  | 42 +++---
  fs/gfs2/rgrp.h  |  6 --
  fs/gfs2/xattr.c | 16 +---
  5 files changed, 34 insertions(+), 39 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index c192906bb5f6..55e8ad1a6e13 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -1566,7 +1566,7 @@ static int sweep_bh_for_rgrps(struct gfs2_inode *ip, 
struct gfs2_holder *rd_gh,
continue;
}
if (bstart) {
-   __gfs2_free_blocks(ip, bstart, (u32)blen, meta);
+   __gfs2_free_blocks(ip, rgd, bstart, (u32)blen, meta);
(*btotal) += blen;
			gfs2_add_inode_blocks(&ip->i_inode, -blen);
}
@@ -1574,7 +1574,7 @@ static int sweep_bh_for_rgrps(struct gfs2_inode *ip, 
struct gfs2_holder *rd_gh,
blen = 1;
}
if (bstart) {
-   __gfs2_free_blocks(ip, bstart, (u32)blen, meta);
+   __gfs2_free_blocks(ip, rgd, bstart, (u32)blen, meta);
(*btotal) += blen;
		gfs2_add_inode_blocks(&ip->i_inode, -blen);
}
diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
index 89c601e5e52f..f9c6c7ee89e1 100644
--- a/fs/gfs2/dir.c
+++ b/fs/gfs2/dir.c
@@ -2039,6 +2039,8 @@ static int leaf_dealloc(struct gfs2_inode *dip, u32 
index, u32 len,
bh = leaf_bh;
  
  	for (blk = leaf_no; blk; blk = nblk) {

+   struct gfs2_rgrpd *rgd;
+
if (blk != leaf_no) {
			error = get_leaf(dip, blk, &bh);
if (error)
@@ -2049,7 +2051,8 @@ static int leaf_dealloc(struct gfs2_inode *dip, u32 
index, u32 len,
if (blk != leaf_no)
brelse(bh);
  
-		gfs2_free_meta(dip, blk, 1);

+   rgd = gfs2_blk2rgrpd(sdp, blk, true);
+   gfs2_free_meta(dip, rgd, blk, 1);
		gfs2_add_inode_blocks(&dip->i_inode, -1);
}
  
diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c

index 76a0a8073c11..8a6b41f3667c 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -2245,26 +2245,19 @@ static void gfs2_alloc_extent(const struct gfs2_rbm 
*rbm, bool dinode,
  /**
   * rgblk_free - Change alloc state of given block(s)
   * @sdp: the filesystem
+ * @rgd: the resource group the blocks are in
   * @bstart: the start of a run of blocks to free
   * @blen: the length of the block run (all must lie within ONE RG!)
   * @new_state: GFS2_BLKST_XXX the after-allocation block state
- *
- * Returns:  Resource group containing the block(s)
   */
  
-static struct gfs2_rgrpd *rgblk_free(struct gfs2_sbd *sdp, u64 bstart,

-u32 blen, unsigned char new_state)
+static void rgblk_free(struct gfs2_sbd *sdp, struct gfs2_rgrpd *rgd,
+  u64 bstart, u32 blen, unsigned char new_state)
  {
struct gfs2_rbm rbm;
struct gfs2_bitmap *bi, *bi_prev = NULL;
  
-	rbm.rgd = gfs2_blk2rgrpd(sdp, bstart, 1);

-   if (!rbm.rgd) {
-   if (gfs2_consist(sdp))
-   fs_err(sdp, "block = %llu\n", (unsigned long 
long)bstart);
-   return NULL;
-   }
-
+   rbm.rgd = rgd;
	BUG_ON(gfs2_rbm_from_block(&rbm, bstart));
	while (blen--) {
		bi = rbm_bi(&rbm);
@@ -2282,8 +2275,6 @@ static struct gfs2_rgrpd *rgblk_free(struct gfs2_sbd *sdp, u64 bstart,
		gfs2_setbit(&rbm, false, new_state);
		gfs2_rbm_incr(&rbm);
	}
}
-
-   return rbm.rgd;
  }
  
  /**

@@ -2499,20 +2490,19 @@ int gfs2_alloc_blocks(struct gfs2_inode *ip, u64 *bn, 
unsigned int *nblocks,
  /**
   * __gfs2_free_blocks - free a contiguous run of block(s)
   * @ip: the inode these blocks are being freed from
+ * @rgd: the resource group the blocks are in
   * @bstart: first block of a run of contiguous blocks
   * @blen: the length of the block run
   * @meta: 1 if the blocks represent metadata
   *
   */
  
-void __gfs2_free_blocks(struct gfs2_inode *ip, u64 bstart, u32 blen, int meta)

+void __gfs2_free_blocks(struct gfs2_inode *ip, struct gfs2_rgrpd *rgd,
+   u64 bstart, u32 blen, int meta)
  {
	struct gfs2_sbd *sdp = GFS2_SB(&ip->i_inode);
-   struct gfs2_rgrpd *rgd;
  
-	rgd = rgblk_free(sdp, bstart, blen, GFS2_BLKST_FREE);

-   if (!rgd)
-   return;
+   rgblk_free(sdp, rgd, bstart, blen, GFS2_BLKST_FREE);
trace_gfs2_block_alloc(ip, rgd, bstart, blen, GFS2_BLKST_FREE);
rgd->rd_free += blen;

Re: [Cluster-devel] [PATCH 07/11] gfs2: Fix marking bitmaps non-full

2018-10-08 Thread Steven Whitehouse




On 05/10/18 20:18, Andreas Gruenbacher wrote:

Reservations in gfs can span multiple gfs2_bitmaps (but they won't span
multiple resource groups).  When removing a reservation, we want to
clear the GBF_FULL flags of all involved gfs2_bitmaps, not just that of
the first bitmap.

This looks like a bug fix that we should have anyway,

Steve.


Signed-off-by: Andreas Gruenbacher 
---
  fs/gfs2/rgrp.c | 13 +
  1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index ee6ea7d8cf44..ee981085db33 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -672,7 +672,7 @@ static void __rs_deltree(struct gfs2_blkreserv *rs)
	RB_CLEAR_NODE(&rs->rs_node);

	if (rs->rs_free) {
-		struct gfs2_bitmap *bi;
+		struct gfs2_bitmap *start, *last;

		/* return reserved blocks to the rgrp */
		BUG_ON(rs->rs_rgd->rd_reserved < rs->rs_free);
@@ -682,10 +682,15 @@ static void __rs_deltree(struct gfs2_blkreserv *rs)
		   contiguous with a span of free blocks that follows. Still,
		   it will force the number to be recalculated later. */
		rgd->rd_extfail_pt += rs->rs_free;
+		start = gfs2_block_to_bitmap(rgd, rs->rs_start);
+		last = gfs2_block_to_bitmap(rgd,
+					    rs->rs_start + rs->rs_free - 1);
		rs->rs_free = 0;
-		bi = gfs2_block_to_bitmap(rgd, rs->rs_start);
-		if (bi)
-			clear_bit(GBF_FULL, &bi->bi_flags);
+		if (!start || !last)
+			return;
+		do
+			clear_bit(GBF_FULL, &start->bi_flags);
+		while (start++ != last);
	}
}
  




Re: [Cluster-devel] [PATCH 01/11] gfs2: Always check the result of gfs2_rbm_from_block

2018-10-08 Thread Steven Whitehouse

Hi,


On 05/10/18 20:18, Andreas Gruenbacher wrote:

When gfs2_rbm_from_block fails, the rbm it returns is undefined, so we
always want to make sure gfs2_rbm_from_block has succeeded before
looking at the rbm.

Signed-off-by: Andreas Gruenbacher 
---
  fs/gfs2/rgrp.c | 7 ---
  1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index fc181c81cca2..c9caddc2627c 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -2227,7 +2227,7 @@ static struct gfs2_rgrpd *rgblk_free(struct gfs2_sbd 
*sdp, u64 bstart,
return NULL;
}
  
-	gfs2_rbm_from_block(&rbm, bstart);
+	BUG_ON(gfs2_rbm_from_block(&rbm, bstart));
	while (blen--) {
		bi = rbm_bi(&rbm);
if (bi != bi_prev) {
@@ -2360,7 +2360,7 @@ static void gfs2_set_alloc_start(struct gfs2_rbm *rbm,
else
goal = rbm->rgd->rd_last_alloc + rbm->rgd->rd_data0;
  
-	gfs2_rbm_from_block(rbm, goal);

+   BUG_ON(gfs2_rbm_from_block(rbm, goal));
  }
Could we make this a warn once with a failure path here? It is a bit 
more friendly than BUG_ON,


Steve.

  
  /**

@@ -2569,7 +2569,8 @@ int gfs2_check_blk_type(struct gfs2_sbd *sdp, u64 
no_addr, unsigned int type)
  
  	rbm.rgd = rgd;

	error = gfs2_rbm_from_block(&rbm, no_addr);
-   WARN_ON_ONCE(error != 0);
+   if (WARN_ON_ONCE(error))
+   goto fail;
  
	if (gfs2_testbit(&rbm, false) != type)

error = -ESTALE;




Re: [Cluster-devel] [GFS2 PATCH] gfs2: print fsid when rgrp errors are found

2018-10-04 Thread Steven Whitehouse

Hi,

Presumably the other calls to pr_warn have the same issue? Perhaps we 
should just get rid of pr_warn completely unless there are cases where a 
super block is not available to pass to fs_warn,


Steve.


On 04/10/18 16:18, Bob Peterson wrote:

Hi,

This patch allows gfs2_setbit to provide useful debug information when
bitmap errors are encountered. It now provides the file system id so
you can tell which mount point the error occurred, as well as the
intended block and the bitmap's block number.

Signed-off-by: Bob Peterson 
---
  fs/gfs2/rgrp.c | 14 +-
  1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index fc181c81cca2..506d09d70b8a 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -101,12 +101,16 @@ static inline void gfs2_setbit(const struct gfs2_rbm 
*rbm, bool do_clone,
cur_state = (*byte1 >> bit) & GFS2_BIT_MASK;
  
  	if (unlikely(!valid_change[new_state * 4 + cur_state])) {

-   pr_warn("buf_blk = 0x%x old_state=%d, new_state=%d\n",
+   struct gfs2_sbd *sdp = rbm->rgd->rd_sbd;
+
+   fs_warn(sdp, "buf_blk = 0x%x old_state=%d, new_state=%d\n",
rbm->offset, cur_state, new_state);
-   pr_warn("rgrp=0x%llx bi_start=0x%x\n",
-   (unsigned long long)rbm->rgd->rd_addr, bi->bi_start);
-   pr_warn("bi_offset=0x%x bi_len=0x%x\n",
-   bi->bi_offset, bi->bi_len);
+   fs_warn(sdp, "rgrp=0x%llx bi_start=0x%x biblk: 0x%llx\n",
+   (unsigned long long)rbm->rgd->rd_addr, bi->bi_start,
+   (unsigned long long)bi->bi_bh->b_blocknr);
+   fs_warn(sdp, "bi_offset=0x%x bi_len=0x%x block=0x%llx\n",
+   bi->bi_offset, bi->bi_len,
+   (unsigned long long)gfs2_rbm_to_block(rbm));
dump_stack();
gfs2_consist_rgrpd(rbm->rgd);
return;





Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements

2018-09-28 Thread Steven Whitehouse

Hi,


On 28/09/18 14:43, Tim Smith wrote:

On Friday, 28 September 2018 14:18:59 BST Steven Whitehouse wrote:

Hi,

On 28/09/18 13:50, Mark Syms wrote:

Hi Bob,

The patches look quite good and would seem to help in the intra-node
congestion case, which our first patch was trying to do. We haven't tried
them yet but I'll pull a build together and try to run it over the
weekend.

We don't however, see that they would help in the situation we saw for the
second patch where rgrp glocks would get bounced around between hosts at
high speed and cause lots of state flushing to occur in the process as
the stats don't take any account of anything other than network latency
whereas there is more involved with a rgrp glock when state needs to be
flushed.

Any thoughts on this?

Thanks,

Mark.

There are a few points here... the stats measure the latency of the DLM
requests. Since in order to release a lock, some work has to be done,
and the lock is not released until that work is complete, the stats do
include that in their timings.

I think what's happening for us is that the work that needs to be done to
release an rgrp lock is happening pretty fast and is about the same in all
cases, so the stats are not providing a meaningful distinction. We see the
same lock (or small number of locks) bouncing back and forth between nodes
with neither node seeming to consider them congested enough to avoid, even
though the FS is <50% full and there must be plenty of other non-full rgrps.



It could well be that this is the case. The system was designed to deal with 
inter-node contention on resource group locks. If there is no inter-node 
contention then the times should be similar and the system should have 
little effect. If the contention is all intra-node then we'd prefer a 
solution which increases the parallelism there - it covers more use 
cases than just allocation. Also it will help to keep related blocks 
closer to each other, particularly as the filesystem ages.


It might also be that there is a bug - so worth looking closely at 
the numbers just to make sure that it is working as intended,


Steve.



Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements

2018-09-28 Thread Steven Whitehouse

Hi,


On 28/09/18 13:50, Mark Syms wrote:

Hi Bob,

The patches look quite good and would seem to help in the intra-node congestion 
case, which our first patch was trying to do. We haven't tried them yet but 
I'll pull a build together and try to run it over the weekend.

We don't however, see that they would help in the situation we saw for the 
second patch where rgrp glocks would get bounced around between hosts at high 
speed and cause lots of state flushing to occur in the process as the stats 
don't take any account of anything other than network latency whereas there is 
more involved with a rgrp glock when state needs to be flushed.

Any thoughts on this?

Thanks,

Mark.
There are a few points here... the stats measure the latency of the DLM 
requests. Since in order to release a lock, some work has to be done, 
and the lock is not released until that work is complete, the stats do 
include that in their timings.


There are several parts to the complete picture here:

1. Resource group selection for allocation (which is what the current 
stats based solution tries to do)
 - Note this will not help deallocation, as then there is no choice in 
which resource group we use! So the following two items should address 
deallocation too...
2. Parallelism of resource group usage within a single node (currently 
missing, but we hope to add this feature shortly)
3. Reduction in latency when glocks need to be demoted for use on 
another node (something we plan to address in due course)


All these things are a part of the overall picture, and we need to be 
careful not to try and optimise one at the expense of others. It is 
actually quite easy to get a big improvement in one particular workload, 
but if we are not careful, it may well be at the expense of another that 
we've not taken into account. There will always be a trade off between 
locality and parallelism of course, but we do have to be fairly cautious 
here too.


We are of course very happy to encourage work in this area, since it 
should help us gain a greater insight into the various dependencies 
between these parts, and result in a better overall solution. I hope 
that helps to give a rough idea of our current thoughts and where we 
hope to get to in due course,


Steve.


-Original Message-
From: Mark Syms
Sent: 28 September 2018 13:37
To: 'Bob Peterson' 
Cc: cluster-devel@redhat.com; Tim Smith ; Ross Lagerwall 

Subject: RE: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance 
improvements

Hi Bob,

No, we haven't but it wouldn't be hard for us to replace our patches in our 
internal patchqueue with these and try them. Will let you know what we find.

We have also seen, what we think is an unrelated issue where we get the 
following backtrace in kern.log and our system stalls

Sep 21 21:19:09 cl15-05 kernel: [21389.462707] INFO: task python:15480 blocked for more than 120 seconds.
Sep 21 21:19:09 cl15-05 kernel: [21389.462749]   Tainted: G   O    4.4.0+10 #1
Sep 21 21:19:09 cl15-05 kernel: [21389.462763] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 21 21:19:09 cl15-05 kernel: [21389.462783] python  D 88019628bc90 0 15480  1 0x
Sep 21 21:19:09 cl15-05 kernel: [21389.462790]  88019628bc90 880198f11c00 88005a509c00 88019628c000
Sep 21 21:19:09 cl15-05 kernel: [21389.462795]  c90040226000 88019628bd80 fe58 8801818da418
Sep 21 21:19:09 cl15-05 kernel: [21389.462799]  88019628bca8 815a1cd4 8801818da5c0 88019628bd68
Sep 21 21:19:09 cl15-05 kernel: [21389.462803] Call Trace:
Sep 21 21:19:09 cl15-05 kernel: [21389.462815]  [] schedule+0x64/0x80
Sep 21 21:19:09 cl15-05 kernel: [21389.462877]  [] find_insert_glock+0x4a4/0x530 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462891]  [] ? gfs2_holder_wake+0x20/0x20 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462903]  [] gfs2_glock_get+0x3d/0x330 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462928]  [] do_flock+0xf2/0x210 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462933]  [] ? gfs2_getattr+0xe0/0xf0 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462938]  [] ? cp_new_stat+0x10b/0x120
Sep 21 21:19:09 cl15-05 kernel: [21389.462943]  [] gfs2_flock+0x78/0xa0 [gfs2]
Sep 21 21:19:09 cl15-05 kernel: [21389.462946]  [] SyS_flock+0x129/0x170
Sep 21 21:19:09 cl15-05 kernel: [21389.462948]  [] entry_SYSCALL_64_fastpath+0x12/0x71

We think there is a possibility, given that this code path only gets entered if 
a glock is being destroyed, that there is a time of check, time of use issue 
here where by the time that schedule gets called the thing which we expect to 
be waking us up has completed dying and therefore won't trigger a wakeup for 
us. We only seen this a couple of times in fairly intensive VM stress tests 
where a lot of flocks get used on a small number of lock files (we use them to 
ensure consistent behaviour of disk 

Re: [Cluster-devel] [GFS2 RFC PATCH 3/3] gfs2: introduce bio_pool to readahead journal to find jhead

2018-09-25 Thread Steven Whitehouse

Hi,



On 25/09/18 06:38, Abhi Das wrote:

This patch adds a new data structure called bio_pool. This is
basically a dynamically allocated array of struct bio* and
associated variables to manage this data structure.

The array is used in a circular fashion until the entire array
has bios that are in flight. i.e. they need to be waited on and
consumed upon completion, in order to make room for more. To
locate the journal head, we read the journal sequentially from
the beginning, creating bios and submitting them as necessary.

We wait for these inflight bios in the order we submit them even
though the block layer may complete them out of order. This strict
ordering allows us to determine the journal head without having
to do extra reads.

A tunable allows us to configure the size of the bio_pool.
I'd rather not introduce a new tunable. What size should the pool be? Is 
there any reason that we even need to have the array to keep track of 
the bios? If the pages are in the page cache (i.e. address space of the 
journal inode) then we should be able to simply wait on the pages in 
order I think, without needing a separate list,


Steve.


Signed-off-by: Abhi Das 
---
  fs/gfs2/incore.h |   3 +
  fs/gfs2/lops.c   | 359 +++
  fs/gfs2/lops.h   |   1 +
  fs/gfs2/ops_fstype.c |   2 +
  fs/gfs2/recovery.c   | 116 ++---
  fs/gfs2/sys.c|  27 ++--
  6 files changed, 391 insertions(+), 117 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index b96d39c..424687f 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -542,6 +542,8 @@ struct gfs2_jdesc {
int jd_recover_error;
/* Replay stuff */
  
+	struct gfs2_log_header_host jd_jhead;

+   struct mutex jd_jh_mutex;
unsigned int jd_found_blocks;
unsigned int jd_found_revokes;
unsigned int jd_replayed_blocks;
@@ -610,6 +612,7 @@ struct gfs2_tune {
unsigned int gt_complain_secs;
unsigned int gt_statfs_quantum;
unsigned int gt_statfs_slow;
+   unsigned int gt_bio_pool_size; /* No of bios to use for the bio_pool */
  };
  
  enum {

diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index f2567f9..69fc058 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -18,6 +18,7 @@
  #include 
  #include 
  
+#include "bmap.h"

  #include "dir.h"
  #include "gfs2.h"
  #include "incore.h"
@@ -370,6 +371,364 @@ void gfs2_log_write_page(struct gfs2_sbd *sdp, struct 
page *page)
   gfs2_log_bmap(sdp));
  }
  
+/*

+ * The bio_pool structure is an array of bios of length 'size'.
+ * 'cur' is the index of the next bio to be submitted for I/O.
+ * 'wait' is the index of bio we need to wait on for I/O completion.
+ * 'inflight' is the number of bios submitted, but not yet completed.
+ */
+struct bio_pool {
+   struct bio **bios;
+   unsigned int size;
+   unsigned int cur;
+   unsigned int wait;
+   unsigned int inflight;
+};
+typedef int (search_bio_t) (struct gfs2_jdesc *jd, const void *ptr);
+
+/**
+ * bio_pool_submit_bio - Submit the current bio in the pool
+ *
+ * @pool: The bio pool
+ *
+ * Submit the current bio (pool->bios[pool->cur]) and update internal pool
+ * management variables. If pool->inflight == pool->size, we've maxed out all
+ * the bios in our pool and the caller needs to wait on some bios, process and
+ * free them so new ones can be added.
+ *
+ * Returns: 1 if we maxed out our bios, 0 otherwise
+ */
+
+static int bio_pool_submit_bio(struct bio_pool *pool)
+{
+   int ret = 0;
+   BUG_ON(!pool || !pool->bios || !pool->bios[pool->cur]);
+
+   bio_set_op_attrs(pool->bios[pool->cur], REQ_OP_READ, 0);
+   submit_bio(pool->bios[pool->cur]);
+   pool->cur = pool->cur == pool->size - 1 ? 0 : pool->cur + 1;
+   pool->inflight++;
+   if (pool->inflight == pool->size)
+   ret = 1;
+   return ret;
+}
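The ring-buffer bookkeeping above (cur advances modulo size, inflight counts submitted-but-uncompleted bios, wait tracks the oldest outstanding slot) can be modelled in isolation. The sketch below is hypothetical user-space C for illustration only, not the kernel code; `pool_state`, `pool_advance` and `pool_complete` are invented names mirroring the patch's logic:

```c
#include <assert.h>

/* Hypothetical model of the bio_pool index bookkeeping only. */
struct pool_state {
	unsigned int size;     /* total slots in the ring */
	unsigned int cur;      /* next slot to submit */
	unsigned int wait;     /* oldest slot still in flight */
	unsigned int inflight; /* submitted but not yet completed */
};

/* Mirrors bio_pool_submit_bio(): returns 1 when the pool is saturated. */
static int pool_advance(struct pool_state *p)
{
	p->cur = (p->cur == p->size - 1) ? 0 : p->cur + 1;
	p->inflight++;
	return p->inflight == p->size;
}

/* Mirrors I/O completion: the waiter consumes the oldest in-flight slot. */
static void pool_complete(struct pool_state *p)
{
	p->wait = (p->wait == p->size - 1) ? 0 : p->wait + 1;
	p->inflight--;
}
```

When `pool_advance` returns 1, the caller must drain completions (the `pool_complete` side) before adding more bios, which is exactly the condition bio_pool_submit_bio reports.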
+
+/**
+ * bio_pool_get_cur - Do what's necessary to get a valid bio for the caller.
+ *
+ * @pool: The bio pool
+ * @sdp: The gfs2 superblock
+ * @blkno: The block number we wish to add to a bio
+ * @end_io: The end_io completion callback
+ *
+ * If there's no currently active bio, we allocate one for the blkno and 
return.
+ *
+ * If there's an active bio at pool->bios[pool->cur], we check if the requested
+ * block may be tacked onto it. If yes, we do nothing and return.
+ *
+ * If the block can't be added (non-contiguous), we submit the current bio.
+ * pool->cur, pool->inflight will change and we fall through to allocate a new
+ * bio and return. In this case, it is possible that submitting the current bio
+ * has maxed out our readahead (bio_pool_submit_bio() returns 1). We pass this
+ * return value back to the caller.
+ *
+ * Returns: 1 if bio_pool_submit_bio() maxed readahead, else 0.
+ */
+
+static int bio_pool_get_cur(struct bio_pool *pool, struct gfs2_sbd *sdp,
+   u64 blkno, bio_end_io_t end_io, void 

Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance improvements

2018-09-20 Thread Steven Whitehouse

Hi,


On 20/09/18 18:47, Mark Syms wrote:

Thanks for that Bob, we've been watching with interest the changes going in 
upstream but at the moment we're not really in a position to take advantage of 
them.

Due to hardware vendor support certification requirements XenServer can only 
very occasionally make big kernel bumps that would affect the ABI that the 
driver would see as that would require our hardware partners to recertify. So, 
we're currently on a 4.4.52 base but the gfs2 driver is somewhat newer as it is 
essentially self-contained and therefore we can backport changes more easily. We 
currently have most of the GFS2 and DLM changes that are in 4.15 backported 
into the XenServer 7.6 kernel, but we can't take the ones related to iomap as 
they are more invasive and it looks like a number of the more recent 
performance targeting changes are also predicated on the iomap framework.

As I mentioned in the covering letter, the intra host problem would largely be a 
non-issue if EX glocks were actually a host wide thing with local mutexes used to share 
them within the host. I don't know if this is what your patch set is trying to achieve or 
not. It's not so much that that selection of resource group is "random", just 
that there is a random chance that we won't select the first RG that we test, it probably 
does work out much the same though.
Yes, that is the goal. Those patches shouldn't depend directly on the 
iomap work, but there is likely to be some overlap there.



The inter host problem addressed by the second patch seems to be less amenable to 
avoidance as the hosts don't seem to have a synchronous view of the state of the resource 
group locks (for understandable reasons as I'd expect this to be very expensive to keep 
sync'd). So it seemed reasonable to try to make it "expensive" to request a 
resource that someone else is using and also to avoid immediately grabbing it back if 
we've been asked to relinquish it. It does seem to give a fairer balance to the usage 
without being massively invasive.

We thought we should share these with the community anyway even if they only 
serve as inspiration for more detailed changes and also to describe the 
scenarios where we're seeing issues now that we have completed implementing the 
XenServer support for GFS2 that we discussed back in Nuremburg last year. In 
our testing they certainly make things better. They probably aren’t fully 
optimal as we can't maintain 10g wire speed consistently across the full LUN 
but we're getting about 75% which is certainly better than we were seeing 
before we started looking at this.

Thanks,

Mark.
We are very much open to improvements and we'll definitely take a more 
detailed look at your patches in due course. We are always very happy to 
have more people working on GFS2,


Steve.


-Original Message-
From: Bob Peterson 
Sent: 20 September 2018 18:18
To: Mark Syms 
Cc: cluster-devel@redhat.com; Ross Lagerwall ; Tim Smith 

Subject: Re: [Cluster-devel] [PATCH 0/2] GFS2: inplace_reserve performance 
improvements

- Original Message -

While testing GFS2 as a storage repository for virtual machines we
discovered a number of scenarios where the performance was being
pathologically poor.

The scenarios are simplified to the following -

   * On a single host in the cluster grow a number of files to a
 significant proportion of the filesystems LUN size, exceeding the
 hosts preferred resource group allocation. This can be replicated
 by using fio and writing to 20 different files with a script like

Hi Mark, Tim and all,

The performance problems with rgrp contention are well known, and have been for 
a very long time.

In rhel6 it's not as big of a problem because rhel6 gfs2 uses "try locks"
which distributes different processes to unique rgrps, thus keeping them from 
contending. However, it results in file system fragmentation that tends to 
catch up with you later.

I posted a different patch set to solve the problem a different way by trying 
to keep track of both Inter-node and Intra-node contention, and redistributed 
rgrps accordingly. It was similar to your first patch, but used a more 
predictable distribution, whereas yours is random.
It worked very well, but it ultimately got rejected by Steve Whitehouse in 
favor of a better approach:

Our current plan is to allow rgrps to be shared among many processes on a single node. 
This alleviates the contention, improves throughput and performance, and fixes the 
"favoritism" problems gfs2 has today.
In other words, it's better than just redistributing the rgrps.

I did a proof-of-concept set of patches and saw pretty good performance numbers and 
"fairness" among simultaneous writers. I posted that a few months ago.

Your patch would certainly work, and random distribution of rgrps would 
definitely gain performance, just as the Orlov algorithm does, however, I still 
want to pursue what Steve suggested.

My patch set for 

Re: [Cluster-devel] Sending patches for GFS2

2018-09-20 Thread Steven Whitehouse

Hi,


On 20/09/18 14:53, Mark Syms wrote:


We have a couple of patches for GFS2 which address some performance 
issues we’ve observed in our testing. What branch would you like the 
patches to be based on top off (we have a custom patched build so 
they’ll need to be realigned first).


Thanks,

    Mark.



The best thing is to base them off the latest upstream Linus kernel or 
from the gfs2 development tree. They are pretty similar and we can fix 
up any conflicts in due course, but it probably won't be necessary to do 
any additional changes,


Steve.



Re: [Cluster-devel] [GFS2 PATCH 0/4] Speed up journal head lookup

2018-09-07 Thread Steven Whitehouse

Hi,


On 06/09/18 18:02, Abhi Das wrote:

This is the upstream version of the rhel7 patchset I'd
posted earlier for review.

It is slightly different in parts owing to some bits
already being present and the hash/crc computation code
being different due to the updated log header structure.

Cheers!
--Abhi
Looks good. Thanks for sorting this out, it should be a good base on 
which to build future improvements. Lets make sure it gets lots of testing,


Steve.



*** BLURB HERE ***

Abhi Das (4):
   gfs2: add timing info to map_journal_extents
   gfs2: changes to gfs2_log_XXX_bio
   gfs2: add a helper function to get_log_header that can be used
 elsewhere
   gfs2: read journal in large chunks to locate the head

  fs/gfs2/bmap.c   |   8 ++-
  fs/gfs2/incore.h |   8 ++-
  fs/gfs2/log.c|   4 +-
  fs/gfs2/lops.c   | 142 ---
  fs/gfs2/lops.h   |   3 +-
  fs/gfs2/ops_fstype.c |   1 +
  fs/gfs2/recovery.c   | 168 +--
  fs/gfs2/recovery.h   |   2 +
  8 files changed, 184 insertions(+), 152 deletions(-)





Re: [Cluster-devel] [GFS2 PATCH] gfs2: don't hold the sd_jindex_spin during recovery

2018-08-20 Thread Steven Whitehouse

Hi,


On 17/08/18 16:53, Bob Peterson wrote:

Hi,

The sd_jindex_spin is used to serialize access to the sd_jindex_list.
Before this patch function gfs2_recover_set would hold the
spin_lock while recovery is running. Since recovery may take a very
long time, other processes needing to use the list would
monopolize a CPU for a very long time, spinning. This patch allows
it to unlock the spin_lock before calling gfs2_recover_journal.
The test_and_set_bit there should prevent multiple processes from
trying to recover the same journal.

This is only a problem when multiple processes attempt recovery,
which is possible via (1) a uevent kicking a 1 into the sysfs file
/sys/fs/gfs2//lock_module/recover, while the gfs2_control_func
in lock_dlm also calls gfs2_recover_set().

Signed-off-by: Bob Peterson 
---
  fs/gfs2/sys.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/gfs2/sys.c b/fs/gfs2/sys.c
index 0c2a60fa66d7f..9fcb66d882b45 100644
--- a/fs/gfs2/sys.c
+++ b/fs/gfs2/sys.c
@@ -424,8 +424,8 @@ int gfs2_recover_set(struct gfs2_sbd *sdp, unsigned jid)
list_for_each_entry(jd, &sdp->sd_jindex_list, jd_list) {
if (jd->jd_jid != jid && !sdp->sd_args.ar_spectator)
continue;
-   rv = gfs2_recover_journal(jd, false);
-   break;
+   spin_unlock(&sdp->sd_jindex_spin);
+   return gfs2_recover_journal(jd, false);
Since the wait parameter is false here, all gfs2_recover_journal does is 
queue some work, and that should not block. Also it breaks the locking 
between the JDF_RECOVERY flag and the queuing of the work too,


Steve.


}
  out:
spin_unlock(&sdp->sd_jindex_spin);
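The pattern under discussion — take a bit with test-and-set so the slow work is guarded, then drop the spinlock before that work runs — can be sketched outside the kernel. The following is a hypothetical user-space model using C11 `atomic_flag` in place of the JDF_RECOVERY bit; `try_recover`, `recovery_done` and `recoveries_started` are invented names, not gfs2 APIs:

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical model of the guard discussed above: a JDF_RECOVERY-style
 * bit taken with test_and_set before the (slow) recovery work, so the
 * lock protecting the journal list can be dropped safely. */
static atomic_flag recovery_busy = ATOMIC_FLAG_INIT;
static int recoveries_started;

static int try_recover(void)
{
	/* test_and_set: only the first caller wins; others back off. */
	if (atomic_flag_test_and_set(&recovery_busy))
		return -1;        /* someone else is already recovering */
	recoveries_started++;     /* stand-in for the long recovery work */
	return 0;
}

static void recovery_done(void)
{
	atomic_flag_clear(&recovery_busy);
}
```

Steve's objection above is precisely that in the real code the flag and the work-queueing must stay under one lock; this sketch only shows why the duplicate-recovery guard by itself is not enough to justify dropping the lock early.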





Re: [Cluster-devel] RFC: growing files more than 4k at the time while writing

2018-08-16 Thread Steven Whitehouse

Hi,


On 16/08/18 16:05, Stefano Panella wrote:

Hi Steven,

Thanks for your reply,

unfortunately for us we need to wait a bit until we can take all the iomap 
changes because we are still on a 4.4 kernel and to backport the iomap 
framework would be huge.

Regarding fixing the application to do fallocate 16MBs at the time, the problem 
is that the fallocate needs to write zeros (which my patch had removed for the 
internal fallocate) and we would be double writing all the data to the filer 
compared to what we do today.
Yes, that is a consequence of GFS2's metadata not being able to specify 
unwritten extents.




Can you suggest anything we could do from gfs2_write_iter where we know all the 
size that need to be allocated to allocate the size in one go without zeroing 
the data since we are just about to write it instead of calling 
gfs2_inplace_reserve for every 4k?

Not without either iomap or fallocate



Also, do you think that because I am skipping the sb_issue_zeroout() when 
calling fallocate from write_iter I am possibly introducing filesystem/vfs 
corruption?
It will not corrupt anything to do this. You may however find that if 
there is a power cut, any file that was being fallocated or written at 
the time now has data that relates to some other file, that has been 
deleted in the past, rather than what was being written at the time of 
the power cut. So it is more a security thing really. Still, it is 
definitely not recommended that you do this, even if it is faster,


Steve.



Thanks a lot for your help!

Stefano

From: Steven Whitehouse 
Sent: Thursday, August 16, 2018 3:03 PM
To: Stefano Panella; cluster-devel@redhat.com; rpete...@redhat.com; 
agrue...@redhat.com
Cc: Edvin Torok; Tim Smith; Mark Syms; Ross Lagerwall
Subject: Re: RFC: growing files more than 4k at the time while writing

Hi,


On 16/08/18 13:56, Stefano Panella wrote:

Hi,

I am looking at gfs2 performance when there is Virtual Machine workload and 
have some questions:

If we have a system where every host is writing sequentially, o_direct, 512k  
at the time, 20 files or more each host, we can see very high CPU usage and lot 
of contention on resource groups.
In particular we can see that for every gfs2_file_write_iter (512k) there are 
many gfs2_write_begin and many gfs2_inplace_reserve (4k each).

When you extend a file with o_direct, then gfs2 uses a buffered write to
complete that I/O. It is only truly o_direct for in-place writes. This
is fairly common for filesystems, and normally you'd call fallocate to
extend the file and then write into it with o_direct once it has been
extended.


I have attempted to mitigate this problem with the following patch and I would 
like to know your opinion.

Does the patch look correct?
Is there any more lock to be taken?
Is it fundamentally wrong calling fallocate from write_iter?

This is not the right solution. You can call fallocate from your
application if that is what is required. There is no need to call it
from the kernel.


When the following patch is applied with allocation_quantum = 16 MB basically 
we can max out few 10Gb links when writing and growing many files from 
different hosts so a similar mechanism would be very useful to improve 
performance but I am not sure it has been implemented in the best way (probably 
not).

Thanks a lot for all your help,

Since you are doing streaming writes to these files, you may see
significant improvement in performance with the iomap changes that have
just been merged upstream in the current merge window,

Steve.


Stefano

commit cf7824f08a431ad5a2e1e2d20499734f0632b12d
Author: Stefano Panella 
Date:   Wed Aug 15 15:32:52 2018 +

  Add allocation_quantum to gfs2_write_iter

  On gfs2_write_iter we know the size of the write and how much
  more space we would need to complete the write but this information
  is not used and instead all the space needed will be allocated 4kB
  at a time from gfs2_write_begin. This behaviour is causing massive
  contention while growing hundreds of files from different nodes.

  In an attempt to mitigate this problem a module parameter has been
  added to configure a different allocation behaviour in gfs2_write_iter.

  The module parameter is called gfs2_allocation_quantum and it has got
  the following semantic:

-1: will not attempt to fallocate and not change the existing behaviour
 0: (default) will only fallocate without zeroing the part which is
going to be written any way
>0: same as zero but will round up the allocation size by this value
interpreted as kBytes. This can substantially reduce the cost
of growing files at the expense of wasting more storage and having
part of the fallocated region not initialised. This option is meant
to help the use case where every file is backing up a Virtual 
Mach

Re: [Cluster-devel] [GFS2 PATCH] gfs2: improve debug information when lvb mismatches are found

2018-08-16 Thread Steven Whitehouse

Hi,


On 15/08/18 18:13, Bob Peterson wrote:

Hi,

Before this patch, gfs2_rgrp_bh_get would check for lvb mismatches,
but it wouldn't tell you what was actually wrong. This patch adds
more information to help us debug it. It also makes rgrp consistency
checks dump any bad rgrps, and the rgrp dump code dump any lvbs
as well as the rgrp itself.

Yes, that is a good plan.
Acked-by: Steven Whitehouse 

Steve.


Signed-off-by: Bob Peterson 
---
  fs/gfs2/rgrp.c | 41 -
  fs/gfs2/util.c |  3 +++
  2 files changed, 39 insertions(+), 5 deletions(-)

diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 6eec634eae2d..4bb846af04e7 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1103,12 +1103,35 @@ static int gfs2_rgrp_lvb_valid(struct gfs2_rgrpd *rgd)
  {
struct gfs2_rgrp_lvb *rgl = rgd->rd_rgl;
struct gfs2_rgrp *str = (struct gfs2_rgrp 
*)rgd->rd_bits[0].bi_bh->b_data;
+   int valid = 1;
  
-	if (rgl->rl_flags != str->rg_flags || rgl->rl_free != str->rg_free ||

-   rgl->rl_dinodes != str->rg_dinodes ||
-   rgl->rl_igeneration != str->rg_igeneration)
-   return 0;
-   return 1;
+   if (rgl->rl_flags != str->rg_flags) {
+   printk(KERN_WARNING "GFS2: rgd: %llu lvb flag mismatch %u/%u",
+  (unsigned long long)rgd->rd_addr,
+  be32_to_cpu(rgl->rl_flags), be32_to_cpu(str->rg_flags));
+   valid = 0;
+   }
+   if (rgl->rl_free != str->rg_free) {
+   printk(KERN_WARNING "GFS2: rgd: %llu lvb free mismatch %u/%u",
+  (unsigned long long)rgd->rd_addr,
+  be32_to_cpu(rgl->rl_free), be32_to_cpu(str->rg_free));
+   valid = 0;
+   }
+   if (rgl->rl_dinodes != str->rg_dinodes) {
+   printk(KERN_WARNING "GFS2: rgd: %llu lvb dinode mismatch %u/%u",
+  (unsigned long long)rgd->rd_addr,
+  be32_to_cpu(rgl->rl_dinodes),
+  be32_to_cpu(str->rg_dinodes));
+   valid = 0;
+   }
+   if (rgl->rl_igeneration != str->rg_igeneration) {
+   printk(KERN_WARNING "GFS2: rgd: %llu lvb igen mismatch "
+  "%llu/%llu", (unsigned long long)rgd->rd_addr,
+  (unsigned long long)be64_to_cpu(rgl->rl_igeneration),
+  (unsigned long long)be64_to_cpu(str->rg_igeneration));
+   valid = 0;
+   }
+   return valid;
  }
  
  static u32 count_unlinked(struct gfs2_rgrpd *rgd)

@@ -2243,6 +2266,14 @@ void gfs2_rgrp_dump(struct seq_file *seq, const struct 
gfs2_glock *gl)
   (unsigned long long)rgd->rd_addr, rgd->rd_flags,
   rgd->rd_free, rgd->rd_free_clone, rgd->rd_dinodes,
   rgd->rd_reserved, rgd->rd_extfail_pt);
+   if (rgd->rd_sbd->sd_args.ar_rgrplvb) {
+   struct gfs2_rgrp_lvb *rgl = rgd->rd_rgl;
+
+   gfs2_print_dbg(seq, "  L: f:%02x b:%u i:%u\n",
+  be32_to_cpu(rgl->rl_flags),
+  be32_to_cpu(rgl->rl_free),
+  be32_to_cpu(rgl->rl_dinodes));
+   }
spin_lock(&rgd->rd_rsspin);
for (n = rb_first(&rgd->rd_rstree); n; n = rb_next(&trs->rs_node)) {
trs = rb_entry(n, struct gfs2_blkreserv, rs_node);
diff --git a/fs/gfs2/util.c b/fs/gfs2/util.c
index 59c811de0dc7..b072b10fb635 100644
--- a/fs/gfs2/util.c
+++ b/fs/gfs2/util.c
@@ -19,6 +19,7 @@
  #include "gfs2.h"
  #include "incore.h"
  #include "glock.h"
+#include "rgrp.h"
  #include "util.h"
  
  struct kmem_cache *gfs2_glock_cachep __read_mostly;

@@ -181,6 +182,8 @@ int gfs2_consist_rgrpd_i(struct gfs2_rgrpd *rgd, int 
cluster_wide,
  {
struct gfs2_sbd *sdp = rgd->rd_sbd;
int rv;
+
+   gfs2_rgrp_dump(NULL, rgd->rd_gl);
rv = gfs2_lm_withdraw(sdp,
  "fatal: filesystem consistency error\n"
  "  RG = %llu\n"





Re: [Cluster-devel] RFC: growing files more than 4k at the time while writing

2018-08-16 Thread Steven Whitehouse

Hi,


On 16/08/18 13:56, Stefano Panella wrote:

Hi,

I am looking at gfs2 performance when there is Virtual Machine workload and 
have some questions:

If we have a system where every host is writing sequentially, o_direct, 512k  
at the time, 20 files or more each host, we can see very high CPU usage and lot 
of contention on resource groups.
In particular we can see that for every gfs2_file_write_iter (512k) there are 
many gfs2_write_begin and many gfs2_inplace_reserve (4k each).
When you extend a file with o_direct, then gfs2 uses a buffered write to 
complete that I/O. It is only truly o_direct for in-place writes. This 
is fairly common for filesystems, and normally you'd call fallocate to 
extend the file and then write into it with o_direct once it has been 
extended.




I have attempted to mitigate this problem with the following patch and I would 
like to know your opinion.

Does the patch look correct?
Is there any more lock to be taken?
Is it fundamentally wrong calling fallocate from write_iter?
This is not the right solution. You can call fallocate from your 
application if that is what is required. There is no need to call it 
from the kernel.




When the following patch is applied with allocation_quantum = 16 MB basically 
we can max out few 10Gb links when writing and growing many files from 
different hosts so a similar mechanism would be very useful to improve 
performance but I am not sure it has been implemented in the best way (probably 
not).

Thanks a lot for all your help,
Since you are doing streaming writes to these files, you may see 
significant improvement in performance with the iomap changes that have 
just been merged upstream in the current merge window,


Steve.



Stefano

commit cf7824f08a431ad5a2e1e2d20499734f0632b12d
Author: Stefano Panella 
Date:   Wed Aug 15 15:32:52 2018 +

 Add allocation_quantum to gfs2_write_iter

 On gfs2_write_iter we know the size of the write and how much
 more space we would need to complete the write but this information
  is not used and instead all the space needed will be allocated 4kB
  at a time from gfs2_write_begin. This behaviour is causing massive
  contention while growing hundreds of files from different nodes.

 In an attempt to mitigate this problem a module parameter has been
 added to configure a different allocation behaviour in gfs2_write_iter.

 The module parameter is called gfs2_allocation_quantum and it has got
 the following semantic:

   -1: will not attempt to fallocate and not change the existing behaviour
0: (default) will only fallocate without zeroing the part which is
   going to be written any way
   >0: same as zero but will round up the allocation size by this value
   interpreted as kBytes. This can substantially reduce the cost
   of growing files at the expense of wasting more storage and having
   part of the fallocated region not initialised. This option is meant
   to help the use case where every file is backing up a Virtual Machine
   qcow2 image for example where the file will grow linearly all the
   time and is potentially going to be huge. For the fact that the newly
   allocated region is uninitialised the image format wil make sure that
   the guest will never see that.
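The -1/0/>0 semantics listed above amount to a small rounding rule. The helper below is a hypothetical user-space restatement for clarity, not part of the patch; `quantum_bytes` is an invented name:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical restatement of the allocation_quantum rule: how many
 * bytes to fallocate for a write of 'len' bytes, or 0 when
 * quantum_kb == -1 (no fallocate, existing behaviour). */
static uint64_t quantum_bytes(int quantum_kb, uint64_t len)
{
	uint64_t q;

	if (quantum_kb < 0)
		return 0;                    /* -1: keep existing behaviour */
	if (quantum_kb == 0)
		return len;                  /* 0: fallocate exactly the write */
	q = (uint64_t)quantum_kb * 1024; /* >0: round up to the quantum */
	return ((len + q - 1) / q) * q;
}
```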

 The way the change has been implemented was to refactor the gfs2_fallocate
 function to get also a flags parameter which can be set to include
 GFS2_FALLOCATE_NO_ZERO in order to avoid writing zeros to the allocated
 region. In case of the fallocate called from gfs2_write_iter the flag is
 set but is not set otherwise so the behaviour of a normal fallocate will
 be to still zero the range.

 The performance improvement of the use case of many concurrent writes from
 different nodes of very big files grown linearly is massive.

 I am including the fio job which we have run on every host concurrently but
 on different set of files

 [ten-files]
 directory=a0:a1:a2:a3:a4:a5:a6:a7:a8:a9
 nrfiles=1
 size=22G
 bs=256k
 rw=write
 buffered=0
 ioengine=libaio
 fallocate=none
 overwrite=1
 numjobs=10

 Signed-off-by: Stefano Panella 

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 4c0ebff..9db9105 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -26,6 +26,7 @@
  #include 
  #include 
  #include 
+#include 


  #include "gfs2.h"
  #include "incore.h"
@@ -41,6 +42,17 @@
  #include "trans.h"
  #include "util.h"

+static int gfs2_allocation_quantum __read_mostly = 0;
+module_param_named(allocation_quantum, gfs2_allocation_quantum, int, 0644);
+MODULE_PARM_DESC(allocation_quantum, "Allocation quantum for gfs2 writes in kBytes, 
"
+"-1 will not attempt to fallocate, "
+"0 will only fallocate without zeroing the part which is going to 
be written any way, "
+

Re: [Cluster-devel] [GFS2 PATCH] gfs2: Always update the in-core rgrp when the buffer_head is read

2018-08-16 Thread Steven Whitehouse

Hi,


On 15/08/18 18:04, Bob Peterson wrote:

Hi,

Before this patch, function gfs2_rgrp_bh_get would only update the
rgrp info from the buffer if the GFS2_RDF_UPTODATE flag was clear.
This went back to the days when gfs2 kept rgrp version numbers, and
re-read the buffer_heads constantly, not just when needed.

The problem is, RDF_UPTODATE is a local flag, but lvbs are changed
dynamically throughout the cluster. This is a serious problem when
using the rgrplvb mount option because of scenarios like this:

1. Node A mounts the file system, sets RDF_UPTODATE for rgrp X.
2. Node B mounts the file system, sets RDF_UPTODATE for rgrp X.
3. Node A deletes a large file, freeing up lots of blocks,
so the lvb gets updated.
At this point Node B must have invalidated its copy of the rgrp, since 
Node A must have an exclusive lock. That should have cleared the 
GFS2_RDF_UPTODATE flag. So why did that not happen?

4. Node B now re-reads the rgrp, but because it's marked UPTODATE,
it decides not to update its in-core copy of rgrp X.
Why is it marked up to date when rgrp_go_inval() should have cleared it 
when the rgrp lock was demoted to allow Node A its exclusive lock?




At this point, Node B will have the wrong value for rgd->rd_free,
the amount of free space in the rgrp.

But there's no good reason not to grab the most recent values from
the buffer: it only costs us a few cpu cycles to read them.

This patch removes the UPTODATE check in favor of just always
reading the rgrp values in from the buffer we just read.

Signed-off-by: Bob Peterson 
---
  fs/gfs2/rgrp.c | 17 -
  1 file changed, 8 insertions(+), 9 deletions(-)
I think we should understand why the flag is not set correctly. If it is 
not working correctly for this case, then why can we still trust it for 
the other check in update_rgrp_lvb() ?


Steve.


diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 1ad3256b9cbc..6eec634eae2d 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1177,15 +1177,14 @@ static int gfs2_rgrp_bh_get(struct gfs2_rgrpd *rgd)
}
}
  
-	if (!(rgd->rd_flags & GFS2_RDF_UPTODATE)) {

-   for (x = 0; x < length; x++)
-   clear_bit(GBF_FULL, &rgd->rd_bits[x].bi_flags);
-   gfs2_rgrp_in(rgd, (rgd->rd_bits[0].bi_bh)->b_data);
-   rgd->rd_flags |= (GFS2_RDF_UPTODATE | GFS2_RDF_CHECK);
-   rgd->rd_free_clone = rgd->rd_free;
-   /* max out the rgrp allocation failure point */
-   rgd->rd_extfail_pt = rgd->rd_free;
-   }
+   for (x = 0; x < length; x++)
+   clear_bit(GBF_FULL, &rgd->rd_bits[x].bi_flags);
+   gfs2_rgrp_in(rgd, (rgd->rd_bits[0].bi_bh)->b_data);
+   rgd->rd_flags |= (GFS2_RDF_UPTODATE | GFS2_RDF_CHECK);
+   rgd->rd_free_clone = rgd->rd_free;
+   /* max out the rgrp allocation failure point */
+   rgd->rd_extfail_pt = rgd->rd_free;
+
if (cpu_to_be32(GFS2_MAGIC) != rgd->rd_rgl->rl_magic) {
rgd->rd_rgl->rl_unlinked = cpu_to_be32(count_unlinked(rgd));
gfs2_rgrp_ondisk2lvb(rgd->rd_rgl,





Re: [Cluster-devel] [GFS2 PATCH] GFS2: Simplify iterative add loop in foreach_descriptor

2018-08-10 Thread Steven Whitehouse

Hi,


On 10/08/18 13:13, Andreas Gruenbacher wrote:

On 9 August 2018 at 11:35, Steven Whitehouse  wrote:

Hi,



On 08/08/18 19:52, Bob Peterson wrote:

Hi,

Before this patch, function foreach_descriptor repeatedly called
function gfs2_replay_incr_blk which just incremented the value while
decrementing another, and checked for wrap. This is a waste of time.
This patch just adds the value and adjusts it if a wrap occurred.

Signed-off-by: Bob Peterson 
---
   fs/gfs2/recovery.c | 5 +++--
   1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/gfs2/recovery.c b/fs/gfs2/recovery.c
index 0f501f938d1c..6c6b19263b82 100644
--- a/fs/gfs2/recovery.c
+++ b/fs/gfs2/recovery.c
@@ -354,8 +354,9 @@ static int foreach_descriptor(struct gfs2_jdesc *jd,
unsigned int start,
 return error;
 }
   - while (length--)
-   gfs2_replay_incr_blk(jd, &start);
+   start += length;
+   if (start >= jd->jd_blocks)
+   start -= jd->jd_blocks;
 brelse(bh);
 }


Now you've hidden the increment of the replay block. Please don't open code
this, but just add an argument to gfs2_replay_incr_blk() such that you can
tell it how many blocks to increment, rather than just assuming a single
block as it does at the moment. Otherwise this can easily get missed when
someone looks at the code in future, and expects gfs2_replay_incr_blk to be
the only thing that changes the position during recovery,

If we really want to encapsulate "add modulo jd->jd_blocks", it's also
open-coded in find_good_lh and jhead_scan.

Andreas


I wonder if those will go away with Abhi's patch set in due course?

Steve.
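A generalized gfs2_replay_incr_blk along the lines Steve suggests — taking a block count and wrapping modulo jd->jd_blocks — might look like the hypothetical sketch below, modelled in user space with a minimal stand-in struct (`jdesc_model` and `replay_incr_blk` are invented names):

```c
#include <assert.h>

/* Minimal stand-in for the one field the helper needs. */
struct jdesc_model {
	unsigned int jd_blocks; /* journal length in blocks */
};

/* Hypothetical generalized helper: advance *blk by 'delta' blocks,
 * wrapping modulo jd_blocks, as suggested in the review above.
 * Assumes delta < jd_blocks, which holds for journal descriptors. */
static void replay_incr_blk(const struct jdesc_model *jd,
			    unsigned int *blk, unsigned int delta)
{
	*blk += delta;
	if (*blk >= jd->jd_blocks)
		*blk -= jd->jd_blocks;
}
```

With a helper like this, the open-coded wrap in foreach_descriptor (and in find_good_lh and jhead_scan, as Andreas notes) could all funnel through one function.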


