Re: [Cluster-devel] [PATCH V13 00/19] block: support multi-page bvec

2019-01-14 Thread Jens Axboe
On 1/11/19 4:01 AM, Ming Lei wrote:
> Hi,
> 
> This patchset brings multi-page bvec into block layer:
> 
> 1) what is multi-page bvec?
> 
> Multi-page bvecs mean that one 'struct bio_vec' can hold multiple pages
> which are physically contiguous, instead of the single page that the Linux
> kernel has used for a long time.
> 
> 2) why is multi-page bvec introduced?
> 
> Kent proposed the idea[1] first. 
> 
> As systems' RAM has become much bigger than before, and huge pages, transparent
> huge pages, and memory compaction are widely used, it is now fairly common to
> see physically contiguous pages coming from filesystems in I/O. On the other
> hand, from the block layer's view, it isn't necessary to store the intermediate
> pages in a bvec; it is enough to store just the physically contiguous 'segment'
> in each io vector.
> 
> Also, huge pages are being brought to filesystems and swap [2][6], and we can
> do I/O on a huge page at a time [3], which requires that one bio can transfer
> at least one huge page in one go. It turns out that simply changing
> BIO_MAX_PAGES is not flexible [3][5]. Multi-page bvecs fit this case very well.
> As we saw, if CONFIG_THP_SWAP is enabled, BIO_MAX_PAGES is configured much
> bigger, such as 512, which requires at least two 4K pages just to hold the
> bvec table.
> 
> With multi-page bvec:
> 
> - Inside the block layer, both bio splitting and sg mapping can become more
> efficient than before by traversing the physically contiguous 'segment'
> instead of each individual page.
> 
> - Segment handling in the block layer can be improved much further in the
> future, since converting a multi-page bvec into a segment becomes quite easy.
> For example, we might store the segment in each bvec directly in the future.
> 
> - The bio size can be increased, which should improve some high-bandwidth I/O
> cases in theory [4].
> 
> - There is an opportunity to reduce the memory footprint of bvecs in the future.
> 
> 3) how is multi-page bvec implemented in this patchset?
> 
> Patches 1-4 prepare for supporting multi-page bvecs.
> 
> Patches 5-15 implement multi-page bvecs in the block layer:
> 
>   - put all the tricks into the bvec/bio/rq iterators; as long as drivers
>   and filesystems use these standard iterators, they are happy with
>   multi-page bvecs
> 
>   - introduce bio_for_each_bvec() to iterate over multi-page bvecs for
>   splitting bios and mapping sg
> 
>   - keep the current bio_for_each_segment*() helpers to iterate over
>   single-page bvecs and make sure current users won't be broken; especially,
>   convert to the new helper prototype in a single patch (patch 21) given it
>   is basically a mechanical conversion
> 
>   - deal with iomap & xfs's sub-pagesize io vecs in patch 13
> 
>   - enable multi-page bvecs in patch 14
> 
> Patch 16 redefines BIO_MAX_PAGES as 256.
> 
> Patch 17 documents usages of bio iterator helpers.
> 
> Patches 18-19 kill NO_SG_MERGE.
> 
> These patches can be found in the following git tree:
> 
>   git:  https://github.com/ming1/linux.git  for-4.21-block-mp-bvec-V12
> 
> Lots of tests (blktests, xfstests, LTP I/O, ...) have been run with this
> patchset, and no regressions have been seen.
> 
> Thanks to Christoph for reviewing the early versions and providing very good
> suggestions, such as introducing bio_init_with_vec_table(), removing other
> unnecessary helpers for cleanup, and so on.
> 
> Thanks to Christoph and Omar for reviewing V10/V11/V12 and providing lots of
> helpful comments.

Thanks for persisting in this endeavor, Ming. I've applied this for 5.1.

-- 
Jens Axboe
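
For context, a minimal sketch of the two iterator families discussed in the
cover letter above. The macro names bio_for_each_segment() and
bio_for_each_bvec() are the block layer's own; the counting functions are
hypothetical and only illustrate the difference in iteration granularity.

#include <linux/bio.h>

/*
 * Sketch only.  bio_for_each_segment() yields single-page (or smaller)
 * vectors, so existing drivers and filesystems keep working unchanged;
 * bio_for_each_bvec() yields whole physically contiguous multi-page
 * segments, which is what bio splitting and sg mapping want to walk.
 */
static unsigned int count_bytes_per_page(struct bio *bio)
{
        struct bio_vec bv;
        struct bvec_iter iter;
        unsigned int bytes = 0;

        bio_for_each_segment(bv, bio, iter)     /* at most one page each */
                bytes += bv.bv_len;
        return bytes;
}

static unsigned int count_bytes_per_segment(struct bio *bio)
{
        struct bio_vec bv;
        struct bvec_iter iter;
        unsigned int bytes = 0;

        bio_for_each_bvec(bv, bio, iter)        /* may span several pages */
                bytes += bv.bv_len;
        return bytes;
}

Both loops return the same total; the second simply takes fewer iterations
when the pages backing the bio are physically contiguous.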



Re: [Cluster-devel] [PATCH 09/11] gfs2: Read bitmap buffers on demand

2019-01-14 Thread Andreas Gruenbacher
On Mon, 14 Jan 2019 at 19:41, Andreas Gruenbacher  wrote:
>
> Bob,
>
> On Mon, 14 Jan 2019 at 18:12, Bob Peterson  wrote:
> > - Original Message -
> > > Before this patch, when locking a resource group, gfs2 would read in the
> > > resource group header and all the bitmap buffers of the resource group.
> > > Those buffers would then be locked into memory until the resource group
> > > is unlocked, which will happen when the filesystem is unmounted or when
> > > transferring the resource group lock to another node, but not due to
> > > memory pressure.  Larger resource groups lock more buffers into memory,
> > > and cause more unnecessary I/O when resource group locks are transferred
> > > between nodes.
> > >
> > > With this patch, when locking a resource group, only the resource group
> > > header is read in.  The other bitmap buffers (the resource group header
> > > contains part of the bitmap) are only read in on demand.
> > >
> > > It would probably make sense to also only read in the resource group
> > > header on demand, when the resource group is modified, but since the
> > > header contains the number of free blocks in the resource group, there
> > > is a higher incentive to keep the header locked in memory after that.
> > >
> > > Signed-off-by: Andreas Gruenbacher 
> > > ---
> >
> > Hi,
> >
> > This patch looks good. However, I'm concerned about its performance,
> > at least until we get some bitmap read-ahead in place.
> > (see "/* FIXME: Might want to do read-ahead here. */")
> >
> > In particular, I'm concerned that it does the exact opposite of this
> > performance enhancement:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git/commit/fs/gfs2?id=39b0f1e9290880a6c905f639e7db6b646e302a4f
> >
> > So we need to test that 2GB rgrps (33 bitmap blocks) that are fragmented
> > don't have a disproportionate slowdown compared to, say, 128MB rgrps
> > because we end up spending so much time fetching pages out of page cache,
> > only to find them inadequate and doing brelse() again. And by fragmented,
> > I mean all 33 bitmap blocks have a small number of free blocks, thus
> > disqualifying them from having "full" status but still having a large
> > enough number of free blocks that we might consider searching them.
> > Some of the patches we did after that one may have mitigated the problem
> > to some extent, but we need to be sure we're not regressing performance.
>
> Yes, there are a few ways this patch can still be tuned depending on
> the performance impact. This should speed up "normal" workloads by
> requiring fewer reads from disk when acquiring a resource group glock
> and by pinning less memory for filesystem metadata, but we do end up
> with more page cache lookups during allocations and, depending on
> read-ahead, possibly more and smaller reads at inconvenient times.
>
> rd_last_alloc should help limit the performance impact when bitmaps
> become mostly full. (Forms of rd_last_alloc have been in gfs2 since
> the mainline merge.)

Hmm, rd_extfail_pt actually, not rd_last_alloc, and that predates the
optimization in commit 39b0f1e929 by less than two years.

> One way to further reduce the number of page cache lookups would be to
> cache the buffer head of the last allocation, but it's unclear if this
> is even an issue, so let's do some performance testing first.

Andreas
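
A rough sketch of the "cache the buffer head of the last allocation" idea
mentioned above, in case it helps the discussion. The rd_cached_bh field and
the helper are hypothetical, and the lookup goes through the generic
buffer-head API rather than gfs2's own metadata read path.

#include <linux/buffer_head.h>

/*
 * Hypothetical sketch: keep one extra reference on the bitmap buffer used
 * by the last allocation so the next allocation hitting the same bitmap
 * block can skip the page cache lookup entirely.
 */
struct rgrp_bh_cache {
        struct buffer_head *rd_cached_bh;       /* holds its own reference */
};

static struct buffer_head *rgrp_bitmap_bh(struct rgrp_bh_cache *c,
                                          struct super_block *sb,
                                          sector_t blkno)
{
        struct buffer_head *bh = c->rd_cached_bh;

        if (bh && bh->b_blocknr == blkno)
                return bh;                      /* cache hit, no lookup */

        bh = sb_bread(sb, blkno);               /* read on demand */
        if (!bh)
                return NULL;
        if (c->rd_cached_bh)
                brelse(c->rd_cached_bh);        /* drop the old cached reference */
        c->rd_cached_bh = bh;
        return bh;
}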



Re: [Cluster-devel] [PATCH 09/11] gfs2: Read bitmap buffers on demand

2019-01-14 Thread Andreas Gruenbacher
Bob,

On Mon, 14 Jan 2019 at 18:12, Bob Peterson  wrote:
> - Original Message -
> > Before this patch, when locking a resource group, gfs2 would read in the
> > resource group header and all the bitmap buffers of the resource group.
> > Those buffers would then be locked into memory until the resource group
> > is unlocked, which will happen when the filesystem is unmounted or when
> > transferring the resource group lock to another node, but not due to
> > memory pressure.  Larger resource groups lock more buffers into memory,
> > and cause more unnecessary I/O when resource group locks are transferred
> > between nodes.
> >
> > With this patch, when locking a resource group, only the resource group
> > header is read in.  The other bitmap buffers (the resource group header
> > contains part of the bitmap) are only read in on demand.
> >
> > It would probably make sense to also only read in the resource group
> > header on demand, when the resource group is modified, but since the
> > header contains the number of free blocks in the resource group, there
> > is a higher incentive to keep the header locked in memory after that.
> >
> > Signed-off-by: Andreas Gruenbacher 
> > ---
>
> Hi,
>
> This patch looks good. However, I'm concerned about its performance,
> at least until we get some bitmap read-ahead in place.
> (see "/* FIXME: Might want to do read-ahead here. */")
>
> In particular, I'm concerned that it does the exact opposite of this
> performance enhancement:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git/commit/fs/gfs2?id=39b0f1e9290880a6c905f639e7db6b646e302a4f
>
> So we need to test that 2GB rgrps (33 bitmap blocks) that are fragmented
> don't have a disproportionate slowdown compared to, say, 128MB rgrps
> because we end up spending so much time fetching pages out of page cache,
> only to find them inadequate and doing brelse() again. And by fragmented,
> I mean all 33 bitmap blocks have a small number of free blocks, thus
> disqualifying them from having "full" status but still having a large
> enough number of free blocks that we might consider searching them.
> Some of the patches we did after that one may have mitigated the problem
> to some extent, but we need to be sure we're not regressing performance.

Yes, there are a few ways this patch can still be tuned depending on
the performance impact. This should speed up "normal" workloads by
requiring fewer reads from disk when acquiring a resource group glock
and by pinning less memory for filesystem metadata, but we do end up
with more page cache lookups during allocations and, depending on
read-ahead, possibly more and smaller reads at inconvenient times.

rd_last_alloc should help limit the performance impact when bitmaps
become mostly full. (Forms of rd_last_alloc have been in gfs2 since
the mainline merge.)

One way to further reduce the number of page cache lookups would be to
cache the buffer head of the last allocation, but it's unclear if this
is even an issue, so let's do some performance testing first.

Thanks,
Andreas
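
For reference, a minimal sketch of the on-demand read pattern with the
optional read-ahead the FIXME alludes to. This uses the generic buffer-head
API (sb_getblk/sb_bread/ll_rw_block) rather than gfs2's own metadata read
path, and the function name is made up.

#include <linux/blk_types.h>
#include <linux/buffer_head.h>

/*
 * Sketch only: synchronously read the bitmap block we need now and start
 * asynchronous read-ahead for the blocks we will likely need next.
 */
static struct buffer_head *read_bitmap_block(struct super_block *sb,
                                             sector_t blkno,
                                             unsigned int readahead)
{
        unsigned int i;

        for (i = 1; i <= readahead; i++) {
                struct buffer_head *ra = sb_getblk(sb, blkno + i);

                if (!ra)
                        continue;
                if (!buffer_uptodate(ra))
                        ll_rw_block(REQ_OP_READ, REQ_RAHEAD, 1, &ra);
                brelse(ra);
        }

        /* The block we actually need, read synchronously. */
        return sb_bread(sb, blkno);
}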



Re: [Cluster-devel] [PATCH 09/11] gfs2: Read bitmap buffers on demand

2019-01-14 Thread Bob Peterson
- Original Message -
> Before this patch, when locking a resource group, gfs2 would read in the
> resource group header and all the bitmap buffers of the resource group.
> Those buffers would then be locked into memory until the resource group
> is unlocked, which will happen when the filesystem is unmounted or when
> transferring the resource group lock to another node, but not due to
> memory pressure.  Larger resource groups lock more buffers into memory,
> and cause more unnecessary I/O when resource group locks are transferred
> between nodes.
> 
> With this patch, when locking a resource group, only the resource group
> header is read in.  The other bitmap buffers (the resource group header
> contains part of the bitmap) are only read in on demand.
> 
> It would probably make sense to also only read in the resource group
> header on demand, when the resource group is modified, but since the
> header contains the number of free blocks in the resource group, there
> is a higher incentive to keep the header locked in memory after that.
> 
> Signed-off-by: Andreas Gruenbacher 
> ---

Hi,

This patch looks good. However, I'm concerned about its performance,
at least until we get some bitmap read-ahead in place.
(see "/* FIXME: Might want to do read-ahead here. */")

In particular, I'm concerned that it does the exact opposite of this
performance enhancement:

https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git/commit/fs/gfs2?id=39b0f1e9290880a6c905f639e7db6b646e302a4f

So we need to test that 2GB rgrps (33 bitmap blocks) that are fragmented
don't have a disproportionate slowdown compared to, say, 128MB rgrps
because we end up spending so much time fetching pages out of page cache,
only to find them inadequate and doing brelse() again. And by fragmented,
I mean all 33 bitmap blocks have a small number of free blocks, thus
disqualifying them from having "full" status but still having a large
enough number of free blocks that we might consider searching them.
Some of the patches we did after that one may have mitigated the problem
to some extent, but we need to be sure we're not regressing performance.

Regards,

Bob Peterson
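
To make the concern concrete, a hypothetical sketch of the worst case
described above: every bitmap block in a fragmented resource group gets
looked up, scanned, and released again, so the allocator pays the page
cache lookup cost for all 33 bitmaps without finding a usable extent.
Names and structure below are illustrative only.

#include <linux/buffer_head.h>
#include <linux/errno.h>

/*
 * Illustration of the pathological case: each bitmap block has a few free
 * blocks (so it is not marked "full") but nothing usable, and we pay a
 * page cache lookup plus brelse() for every one of them.
 */
static int search_fragmented_rgrp(struct super_block *sb,
                                  sector_t first_bitmap,
                                  unsigned int nr_bitmaps)
{
        unsigned int i;

        for (i = 0; i < nr_bitmaps; i++) {
                struct buffer_head *bh = sb_bread(sb, first_bitmap + i);

                if (!bh)
                        return -EIO;
                /* ... scan the bitmap; assume no large enough extent ... */
                brelse(bh);     /* lookup work wasted, move on */
        }
        return -ENOSPC;
}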



Re: [Cluster-devel] [PATCH 11/11] gfs2: Skip gfs2_metatype_check for cached buffers

2019-01-14 Thread Bob Peterson
- Original Message -
> When reading in buffers from disk, set a new BH_Verify buffer head flag.  In
> gfs2_metatype_check, skip the check if BH_Verify is cleared and clear the flag
> when checking.  That way, we'll only check the metatype once when reading
> buffers from disk, and not when the buffers were already in the page cache.
> 
> While touching this code, convert two 'be32_to_cpu(magic) == GFS2_MAGIC'
> checks into 'magic == cpu_to_be32(GFS2_MAGIC)'.
> 
> Signed-off-by: Andreas Gruenbacher 
> ---
(snip)
>  static inline int gfs2_metatype_check(struct gfs2_sbd *sdp,
> -                                      struct buffer_head *bh,
> -                                      u16 type)
> +                                      struct buffer_head *bh, u16 type)
>  {
> -       struct gfs2_meta_header *mh = (struct gfs2_meta_header *)bh->b_data;
> -       u32 magic = be32_to_cpu(mh->mh_magic);
> -       u16 t = be32_to_cpu(mh->mh_type);
> -       if (unlikely(magic != GFS2_MAGIC))
> -               return gfs2_meta_check_ii(sdp, bh, "magic number", (void *)_RET_IP_);
> -       if (unlikely(t != type))
> -               return gfs2_metatype_check_ii(sdp, bh, type, t,
> -                                             (void *)_RET_IP_);
> -       return 0;
> +       if (!buffer_verify(bh))
> +               return 0;
> +       return gfs2_metatype_check_ii(sdp, bh, type, (void *)_RET_IP_);

May I suggest simplifying with:

        if (!test_clear_buffer_verify(bh))
                return 0;

Regards,

Bob Peterson
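
For completeness, a sketch of how the flag and the suggested helper could fit
together. BH_Verify comes from the patch under review, while the exact bit
assignment, the BUFFER_FNS()/TAS_BUFFER_FNS() placement, and the function name
below are assumptions for illustration; struct gfs2_sbd and
gfs2_metatype_check_ii() would come from gfs2's internal headers.

#include <linux/buffer_head.h>

/*
 * Sketch only.  BUFFER_FNS() generates buffer_verify(), set_buffer_verify()
 * and clear_buffer_verify(); TAS_BUFFER_FNS() generates the atomic
 * test_set_buffer_verify()/test_clear_buffer_verify() pair used below.
 * The actual bit value must not collide with gfs2's other private BH flags.
 */
enum {
        BH_Verify = BH_PrivateStart,    /* metatype not verified yet */
};
BUFFER_FNS(Verify, verify)
TAS_BUFFER_FNS(Verify, verify)

static inline int gfs2_metatype_check_sketch(struct gfs2_sbd *sdp,
                                             struct buffer_head *bh,
                                             u16 type)
{
        /* Test and clear in a single atomic step, as suggested above. */
        if (!test_clear_buffer_verify(bh))
                return 0;
        return gfs2_metatype_check_ii(sdp, bh, type, (void *)_RET_IP_);
}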



Re: [Cluster-devel] [PATCH 10/11] gfs2: Clean up assertion, consistency check, and error reporting functions

2019-01-14 Thread Bob Peterson
- Original Message -
> Instead of passing the __func__, __FILE__, and __LINE__ pre-processor macros
> to each of those functions, print the location of the caller via:
> 
>   printk("%pS", (void *)_RET_IP_).
> 
> This gives enough context information to locate where in the code an error
> occurred, and it reduces the code size by about 2 percent.
> 
> Signed-off-by: Andreas Gruenbacher 
> ---

Hi,

Sorry, but I don't like this patch at all.

With all the different versions floating around, and further obscured by static
function optimization by the compiler, this patch just makes debugging harder.
Not impossible, but harder.

In other words, in my opinion, the benefit is not worth the cost.

Plus it has nothing to do with making the rgrp bitmaps read in on-demand.

Regards,

Bob Peterson
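
For context on the mechanism being debated: _RET_IP_ expands to the caller's
return address, and the %pS printk format resolves an address to
symbol+offset. A minimal sketch of the reporting pattern (the helper name and
message are made up):

#include <linux/kernel.h>
#include <linux/printk.h>

/*
 * Sketch only: report an error and identify the caller by its return
 * address instead of threading __func__/__FILE__/__LINE__ through every
 * call.  %pS prints the address as symbol+offset/size.
 */
static void report_consistency_error(const char *what)
{
        printk(KERN_ERR "gfs2: consistency error (%s) at %pS\n",
               what, (void *)_RET_IP_);
}

The downside Bob points out still stands: if the caller was a static function
that the compiler inlined or merged, the symbol %pS reports may not be the one
the source code suggests.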