Re: [Cluster-devel] [PATCH V15 14/18] block: enable multipage bvecs

2019-02-27 Thread Ming Lei
On Wed, Feb 27, 2019 at 08:47:09PM +, Jon Hunter wrote:
> 
> On 21/02/2019 08:42, Marek Szyprowski wrote:
> > Dear All,
> > 
> > On 2019-02-15 12:13, Ming Lei wrote:
> >> This patch pulls the trigger for multi-page bvecs.
> >>
> >> Reviewed-by: Omar Sandoval 
> >> Signed-off-by: Ming Lei 
> > 
> > Since Linux next-20190218 I've observed problems with the block layer on one
> > of my test devices (Odroid U3 with EXT4 rootfs on SD card). Bisecting
> > this issue led me to this change. This is also the first linux-next
> > release with this change merged. The issue is fully reproducible and can
> > be observed in the following kernel log:
> > 
> > sdhci: Secure Digital Host Controller Interface driver
> > sdhci: Copyright(c) Pierre Ossman
> > s3c-sdhci 1253.sdhci: clock source 2: mmc_busclk.2 (1 Hz)
> > s3c-sdhci 1253.sdhci: Got CD GPIO
> > mmc0: SDHCI controller on samsung-hsmmc [1253.sdhci] using ADMA
> > mmc0: new high speed SDHC card at address 
> > mmcblk0: mmc0: SL16G 14.8 GiB
> I have also noticed some failures when writing to an eMMC device on one
> of our Tegra boards. We have a simple eMMC write/read test and it is
> currently failing because the data written does not match the source.
> 
> I did not see the same crash as reported here; in our case, though, the
> rootfs is NFS mounted and so probably would not hit it. However, the bisect
> points to this commit and reverting it on top of -next fixes the issues.

It is sdhci, probably related to the max segment size. Could you test the
following patch:

https://marc.info/?l=linux-mmc&m=155128334122951&w=2

Thanks,
Ming
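
For context, a hedged sketch of where the limit in question comes from,
assuming the mainline mmc core of that era (background only, not the fix
referenced above): the host driver fills in max_segs/max_seg_size, and the
core propagates them to the request queue. If that limit disagrees with what
the controller's ADMA engine actually accepts, a multi-page segment can
overrun it.

	/*
	 * Sketch (assumed mmc core behaviour, drivers/mmc/core/queue.c):
	 * the block layer must split bvecs against exactly these limits.
	 */
	blk_queue_max_segments(mq->queue, host->max_segs);
	blk_queue_max_segment_size(mq->queue, host->max_seg_size);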



Re: [Cluster-devel] [PATCH V15 14/18] block: enable multipage bvecs

2019-02-22 Thread Ming Lei
On Thu, Feb 21, 2019 at 11:22:39AM +0100, Marek Szyprowski wrote:
> Hi Ming,
> 
> On 2019-02-21 11:16, Ming Lei wrote:
> > On Thu, Feb 21, 2019 at 11:08:19AM +0100, Marek Szyprowski wrote:
> >> On 2019-02-21 10:57, Ming Lei wrote:
> >>> On Thu, Feb 21, 2019 at 09:42:59AM +0100, Marek Szyprowski wrote:
> >>>> On 2019-02-15 12:13, Ming Lei wrote:
> >>>>> This patch pulls the trigger for multi-page bvecs.
> >>>>>
> >>>>> Reviewed-by: Omar Sandoval 
> >>>>> Signed-off-by: Ming Lei 
> >>>> Since Linux next-20190218 I've observed problems with the block layer on one
> >>>> of my test devices (Odroid U3 with EXT4 rootfs on SD card). Bisecting
> >>>> this issue led me to this change. This is also the first linux-next
> >>>> release with this change merged. The issue is fully reproducible and can
> >>>> be observed in the following kernel log:
> >>>>
> >>>> sdhci: Secure Digital Host Controller Interface driver
> >>>> sdhci: Copyright(c) Pierre Ossman
> >>>> s3c-sdhci 1253.sdhci: clock source 2: mmc_busclk.2 (1 Hz)
> >>>> s3c-sdhci 1253.sdhci: Got CD GPIO
> >>>> mmc0: SDHCI controller on samsung-hsmmc [1253.sdhci] using ADMA
> >>>> mmc0: new high speed SDHC card at address 
> >>>> mmcblk0: mmc0: SL16G 14.8 GiB
> >>>>
> >>>> ...
> >>>>
> >>>> EXT4-fs (mmcblk0p2): INFO: recovery required on readonly filesystem
> >>>> EXT4-fs (mmcblk0p2): write access will be enabled during recovery
> >>>> EXT4-fs (mmcblk0p2): recovery complete
> >>>> EXT4-fs (mmcblk0p2): mounted filesystem with ordered data mode. Opts: 
> >>>> (null)
> >>>> VFS: Mounted root (ext4 filesystem) readonly on device 179:2.
> >>>> devtmpfs: mounted
> >>>> Freeing unused kernel memory: 1024K
> >>>> hub 1-3:1.0: USB hub found
> >>>> Run /sbin/init as init process
> >>>> hub 1-3:1.0: 3 ports detected
> >>>> *** stack smashing detected ***:  terminated
> >>>> Kernel panic - not syncing: Attempted to kill init! exitcode=0x0004
> >>>> CPU: 1 PID: 1 Comm: init Not tainted 5.0.0-rc6-next-20190218 #1546
> >>>> Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
> >>>> [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
> >>>> [] (show_stack) from [] (dump_stack+0x90/0xc8)
> >>>> [] (dump_stack) from [] (panic+0xfc/0x304)
> >>>> [] (panic) from [] (do_exit+0xabc/0xc6c)
> >>>> [] (do_exit) from [] (do_group_exit+0x3c/0xbc)
> >>>> [] (do_group_exit) from [] (get_signal+0x130/0xbf4)
> >>>> [] (get_signal) from [] (do_work_pending+0x130/0x618)
> >>>> [] (do_work_pending) from []
> >>>> (slow_work_pending+0xc/0x20)
> >>>> Exception stack(0xe88c3fb0 to 0xe88c3ff8)
> >>>> 3fa0:  bea7787c 0005
> >>>> b6e8d0b8
> >>>> 3fc0: bea77a18 b6f92010 b6e8d0b8 0001 b6e8d0c8 0001 b6e8c000
> >>>> bea77b60
> >>>> 3fe0: 0020 bea77998  b6d52368 6050 
> >>>> CPU3: stopping
> >>>>
> >>>> I would like to help debug and fix this issue, but I don't really
> >>>> have any idea where to start. Here is some more detailed information about
> >>>> my test system:
> >>>>
> >>>> 1. Board: ARM 32bit Samsung Exynos4412-based Odroid U3 (device tree
> >>>> source: arch/arm/boot/dts/exynos4412-odroidu3.dts)
> >>>>
> >>>> 2. Block device: MMC/SDHCI/SDHCI-S3C with SD card
> >>>> (drivers/mmc/host/sdhci-s3c.c driver, sdhci_2 device node in the device
> >>>> tree)
> >>>>
> >>>> 3. Rootfs: Ext4
> >>>>
> >>>> 4. Kernel config: arch/arm/configs/exynos_defconfig
> >>>>
> >>>> I can gather more logs if needed, just let me know which kernel options to
> >>>> enable. Reverting this commit on top of next-20190218 as well as current
> >>>> linux-next (tested with next-20190221) fixes this issue and makes the
> >>>> system bootable again.
> >>> Could you test the patch in the following link and see if it can make a
> >>> difference?
> >>>
> >>> https://marc.info/?l=linux-aio&m=155070355614541&w=2

Re: [Cluster-devel] [PATCH V15 14/18] block: enable multipage bvecs

2019-02-21 Thread Ming Lei
On Thu, Feb 21, 2019 at 11:08:19AM +0100, Marek Szyprowski wrote:
> Hi Ming,
> 
> On 2019-02-21 10:57, Ming Lei wrote:
> > On Thu, Feb 21, 2019 at 09:42:59AM +0100, Marek Szyprowski wrote:
> >> On 2019-02-15 12:13, Ming Lei wrote:
> >>> This patch pulls the trigger for multi-page bvecs.
> >>>
> >>> Reviewed-by: Omar Sandoval 
> >>> Signed-off-by: Ming Lei 
> >> Since Linux next-20190218 I've observed problems with the block layer on one
> >> of my test devices (Odroid U3 with EXT4 rootfs on SD card). Bisecting
> >> this issue led me to this change. This is also the first linux-next
> >> release with this change merged. The issue is fully reproducible and can
> >> be observed in the following kernel log:
> >>
> >> sdhci: Secure Digital Host Controller Interface driver
> >> sdhci: Copyright(c) Pierre Ossman
> >> s3c-sdhci 1253.sdhci: clock source 2: mmc_busclk.2 (1 Hz)
> >> s3c-sdhci 1253.sdhci: Got CD GPIO
> >> mmc0: SDHCI controller on samsung-hsmmc [1253.sdhci] using ADMA
> >> mmc0: new high speed SDHC card at address 
> >> mmcblk0: mmc0: SL16G 14.8 GiB
> >>
> >> ...
> >>
> >> EXT4-fs (mmcblk0p2): INFO: recovery required on readonly filesystem
> >> EXT4-fs (mmcblk0p2): write access will be enabled during recovery
> >> EXT4-fs (mmcblk0p2): recovery complete
> >> EXT4-fs (mmcblk0p2): mounted filesystem with ordered data mode. Opts: 
> >> (null)
> >> VFS: Mounted root (ext4 filesystem) readonly on device 179:2.
> >> devtmpfs: mounted
> >> Freeing unused kernel memory: 1024K
> >> hub 1-3:1.0: USB hub found
> >> Run /sbin/init as init process
> >> hub 1-3:1.0: 3 ports detected
> >> *** stack smashing detected ***:  terminated
> >> Kernel panic - not syncing: Attempted to kill init! exitcode=0x0004
> >> CPU: 1 PID: 1 Comm: init Not tainted 5.0.0-rc6-next-20190218 #1546
> >> Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
> >> [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
> >> [] (show_stack) from [] (dump_stack+0x90/0xc8)
> >> [] (dump_stack) from [] (panic+0xfc/0x304)
> >> [] (panic) from [] (do_exit+0xabc/0xc6c)
> >> [] (do_exit) from [] (do_group_exit+0x3c/0xbc)
> >> [] (do_group_exit) from [] (get_signal+0x130/0xbf4)
> >> [] (get_signal) from [] (do_work_pending+0x130/0x618)
> >> [] (do_work_pending) from []
> >> (slow_work_pending+0xc/0x20)
> >> Exception stack(0xe88c3fb0 to 0xe88c3ff8)
> >> 3fa0:  bea7787c 0005
> >> b6e8d0b8
> >> 3fc0: bea77a18 b6f92010 b6e8d0b8 0001 b6e8d0c8 0001 b6e8c000
> >> bea77b60
> >> 3fe0: 0020 bea77998  b6d52368 6050 
> >> CPU3: stopping
> >>
> >> I would like to help debug and fix this issue, but I don't really
> >> have any idea where to start. Here is some more detailed information about
> >> my test system:
> >>
> >> 1. Board: ARM 32bit Samsung Exynos4412-based Odroid U3 (device tree
> >> source: arch/arm/boot/dts/exynos4412-odroidu3.dts)
> >>
> >> 2. Block device: MMC/SDHCI/SDHCI-S3C with SD card
> >> (drivers/mmc/host/sdhci-s3c.c driver, sdhci_2 device node in the device
> >> tree)
> >>
> >> 3. Rootfs: Ext4
> >>
> >> 4. Kernel config: arch/arm/configs/exynos_defconfig
> >>
> >> I can gather more logs if needed, just let me know which kernel options to
> >> enable. Reverting this commit on top of next-20190218 as well as current
> >> linux-next (tested with next-20190221) fixes this issue and makes the
> >> system bootable again.
> > Could you test the patch in the following link and see if it can make a
> > difference?
> >
> > https://marc.info/?l=linux-aio&m=155070355614541&w=2
> 
> I've tested that patch, but it doesn't make any difference on the test
> system. In the log I see no warning added by it.

I guess it might be related to memory corruption. Could you enable the
following debug options and post the dmesg log?

CONFIG_DEBUG_STACKOVERFLOW=y
CONFIG_KASAN=y

Thanks,
Ming



Re: [Cluster-devel] [PATCH V15 14/18] block: enable multipage bvecs

2019-02-21 Thread Ming Lei
On Thu, Feb 21, 2019 at 09:42:59AM +0100, Marek Szyprowski wrote:
> Dear All,
> 
> On 2019-02-15 12:13, Ming Lei wrote:
> > This patch pulls the trigger for multi-page bvecs.
> >
> > Reviewed-by: Omar Sandoval 
> > Signed-off-by: Ming Lei 
> 
> Since Linux next-20190218 I've observed problems with the block layer on one
> of my test devices (Odroid U3 with EXT4 rootfs on SD card). Bisecting
> this issue led me to this change. This is also the first linux-next
> release with this change merged. The issue is fully reproducible and can
> be observed in the following kernel log:
> 
> sdhci: Secure Digital Host Controller Interface driver
> sdhci: Copyright(c) Pierre Ossman
> s3c-sdhci 1253.sdhci: clock source 2: mmc_busclk.2 (1 Hz)
> s3c-sdhci 1253.sdhci: Got CD GPIO
> mmc0: SDHCI controller on samsung-hsmmc [1253.sdhci] using ADMA
> mmc0: new high speed SDHC card at address 
> mmcblk0: mmc0: SL16G 14.8 GiB
> 
> ...
> 
> EXT4-fs (mmcblk0p2): INFO: recovery required on readonly filesystem
> EXT4-fs (mmcblk0p2): write access will be enabled during recovery
> EXT4-fs (mmcblk0p2): recovery complete
> EXT4-fs (mmcblk0p2): mounted filesystem with ordered data mode. Opts: (null)
> VFS: Mounted root (ext4 filesystem) readonly on device 179:2.
> devtmpfs: mounted
> Freeing unused kernel memory: 1024K
> hub 1-3:1.0: USB hub found
> Run /sbin/init as init process
> hub 1-3:1.0: 3 ports detected
> *** stack smashing detected ***:  terminated
> Kernel panic - not syncing: Attempted to kill init! exitcode=0x0004
> CPU: 1 PID: 1 Comm: init Not tainted 5.0.0-rc6-next-20190218 #1546
> Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
> [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
> [] (show_stack) from [] (dump_stack+0x90/0xc8)
> [] (dump_stack) from [] (panic+0xfc/0x304)
> [] (panic) from [] (do_exit+0xabc/0xc6c)
> [] (do_exit) from [] (do_group_exit+0x3c/0xbc)
> [] (do_group_exit) from [] (get_signal+0x130/0xbf4)
> [] (get_signal) from [] (do_work_pending+0x130/0x618)
> [] (do_work_pending) from []
> (slow_work_pending+0xc/0x20)
> Exception stack(0xe88c3fb0 to 0xe88c3ff8)
> 3fa0:  bea7787c 0005
> b6e8d0b8
> 3fc0: bea77a18 b6f92010 b6e8d0b8 0001 b6e8d0c8 0001 b6e8c000
> bea77b60
> 3fe0: 0020 bea77998  b6d52368 6050 
> CPU3: stopping
> 
> I would like to help debug and fix this issue, but I don't really
> have any idea where to start. Here is some more detailed information about
> my test system:
> 
> 1. Board: ARM 32bit Samsung Exynos4412-based Odroid U3 (device tree
> source: arch/arm/boot/dts/exynos4412-odroidu3.dts)
> 
> 2. Block device: MMC/SDHCI/SDHCI-S3C with SD card
> (drivers/mmc/host/sdhci-s3c.c driver, sdhci_2 device node in the device
> tree)
> 
> 3. Rootfs: Ext4
> 
> 4. Kernel config: arch/arm/configs/exynos_defconfig
> 
> I can gather more logs if needed, just let me know which kernel options to
> enable. Reverting this commit on top of next-20190218 as well as current
> linux-next (tested with next-20190221) fixes this issue and makes the
> system bootable again.

Could you test the patch in the following link and see if it can make a difference?

https://marc.info/?l=linux-aio&m=155070355614541&w=2

Thanks,
Ming



Re: [Cluster-devel] [dm-devel] [PATCH V15 00/18] block: support multi-page bvec

2019-02-19 Thread Ming Lei
On Tue, Feb 19, 2019 at 08:28:19AM -0800, Bart Van Assche wrote:
> On Sun, 2019-02-17 at 21:11 +0800, Ming Lei wrote:
> > The following patch should fix this issue:
> > 
> > 
> > diff --git a/block/blk-merge.c b/block/blk-merge.c
> > index bed065904677..066b66430523 100644
> > --- a/block/blk-merge.c
> > +++ b/block/blk-merge.c
> > @@ -363,13 +363,15 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
> > struct bio_vec bv, bvprv = { NULL };
> > int prev = 0;
> > unsigned int seg_size, nr_phys_segs;
> > -   unsigned front_seg_size = bio->bi_seg_front_size;
> > +   unsigned front_seg_size;
> > struct bio *fbio, *bbio;
> > struct bvec_iter iter;
> >  
> > if (!bio)
> > return 0;
> >  
> > +   front_seg_size = bio->bi_seg_front_size;
> > +
> > switch (bio_op(bio)) {
> > case REQ_OP_DISCARD:
> > case REQ_OP_SECURE_ERASE:
> 
> Hi Ming,
> 
> With this patch applied test nvmeof-mp/002 fails as follows:
> 
> [  694.700400] kernel BUG at lib/sg_pool.c:103!
> [  694.705932] invalid opcode:  [#1] PREEMPT SMP KASAN
> [  694.708297] CPU: 2 PID: 349 Comm: kworker/2:1H Tainted: GB 5.0.0-rc6-dbg+ #2
> [  694.711730] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> [  694.715113] Workqueue: kblockd blk_mq_run_work_fn
> [  694.716894] RIP: 0010:sg_alloc_table_chained+0xe5/0xf0
> [  694.758222] Call Trace:
> [  694.759645]  nvme_rdma_queue_rq+0x2aa/0xcc0 [nvme_rdma]
> [  694.764915]  blk_mq_try_issue_directly+0x2a5/0x4b0
> [  694.771779]  blk_insert_cloned_request+0x11e/0x1c0
> [  694.778417]  dm_mq_queue_rq+0x3d1/0x770
> [  694.793400]  blk_mq_dispatch_rq_list+0x5fc/0xb10
> [  694.798386]  blk_mq_sched_dispatch_requests+0x2f7/0x300
> [  694.803180]  __blk_mq_run_hw_queue+0xd6/0x180
> [  694.808933]  blk_mq_run_work_fn+0x27/0x30
> [  694.810315]  process_one_work+0x4f1/0xa40
> [  694.813178]  worker_thread+0x67/0x5b0
> [  694.814487]  kthread+0x1cf/0x1f0
> [  694.819134]  ret_from_fork+0x24/0x30
> 
> The code in sg_pool.c that triggers the BUG() statement is as follows:
> 
> int sg_alloc_table_chained(struct sg_table *table, int nents,
>   struct scatterlist *first_chunk)
> {
>   int ret;
> 
>   BUG_ON(!nents);
> [ ... ]
> 
> Bart.

I can reproduce this issue ("kernel BUG at lib/sg_pool.c:103") without the
mp-bvec patches, so it looks like it isn't the fault of this patchset.

Thanks,
Ming



Re: [Cluster-devel] [dm-devel] [PATCH V15 00/18] block: support multi-page bvec

2019-02-17 Thread Ming Lei
On Sun, Feb 17, 2019 at 09:13:32PM +0800, Ming Lei wrote:
> On Fri, Feb 15, 2019 at 10:59:47AM -0700, Jens Axboe wrote:
> > On 2/15/19 10:14 AM, Bart Van Assche wrote:
> > > On Fri, 2019-02-15 at 08:49 -0700, Jens Axboe wrote:
> > >> On 2/15/19 4:13 AM, Ming Lei wrote:
> > >>> This patchset brings multi-page bvec into block layer:
> > >>
> > >> Applied, thanks Ming. Let's hope it sticks!
> > > 
> > > Hi Jens and Ming,
> > > 
> > > Test nvmeof-mp/002 fails with Jens' for-next branch from this morning.
> > > I have not yet tried to figure out which patch introduced the failure.
> > > Anyway, this is what I see in the kernel log for test nvmeof-mp/002:
> > > 
> > > [  475.611363] BUG: unable to handle kernel NULL pointer dereference at 0020
> > > [  475.621188] #PF error: [normal kernel read fault]
> > > [  475.623148] PGD 0 P4D 0  
> > > [  475.624737] Oops:  [#1] PREEMPT SMP KASAN
> > > [  475.626628] CPU: 1 PID: 277 Comm: kworker/1:1H Tainted: GB 5.0.0-rc6-dbg+ #1
> > > [  475.630232] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> > > [  475.633855] Workqueue: kblockd blk_mq_requeue_work
> > > [  475.635777] RIP: 0010:__blk_recalc_rq_segments+0xbe/0x590
> > > [  475.670948] Call Trace:
> > > [  475.693515]  blk_recalc_rq_segments+0x2f/0x50
> > > [  475.695081]  blk_insert_cloned_request+0xbb/0x1c0
> > > [  475.701142]  dm_mq_queue_rq+0x3d1/0x770
> > > [  475.707225]  blk_mq_dispatch_rq_list+0x5fc/0xb10
> > > [  475.717137]  blk_mq_sched_dispatch_requests+0x256/0x300
> > > [  475.721767]  __blk_mq_run_hw_queue+0xd6/0x180
> > > [  475.725920]  __blk_mq_delay_run_hw_queue+0x25c/0x290
> > > [  475.727480]  blk_mq_run_hw_queue+0x119/0x1b0
> > > [  475.732019]  blk_mq_run_hw_queues+0x7b/0xa0
> > > [  475.733468]  blk_mq_requeue_work+0x2cb/0x300
> > > [  475.736473]  process_one_work+0x4f1/0xa40
> > > [  475.739424]  worker_thread+0x67/0x5b0
> > > [  475.741751]  kthread+0x1cf/0x1f0
> > > [  475.746034]  ret_from_fork+0x24/0x30
> > > 
> > > (gdb) list *(__blk_recalc_rq_segments+0xbe)
> > > 0x816a152e is in __blk_recalc_rq_segments (block/blk-merge.c:366).
> > > 361  struct bio *bio)
> > > 362 {
> > > 363 struct bio_vec bv, bvprv = { NULL };
> > > 364 int prev = 0;
> > > 365 unsigned int seg_size, nr_phys_segs;
> > > 366 unsigned front_seg_size = bio->bi_seg_front_size;
> > > 367 struct bio *fbio, *bbio;
> > > 368 struct bvec_iter iter;
> > > 369
> > > 370 if (!bio)
> > 
> > Just ran a few tests, and it also seems to cause about a 5% regression
> > in per-core IOPS throughput. Prior to this work, I could get 1620K 4k
> > rand read IOPS out of a core; now I'm at ~1535K. The cycle stealers seem
> > to be blk_queue_split() and blk_rq_map_sg().
> 
> Could you share us your test setting?
> 
> I will run null_blk first and see if it can be reproduced.

This performance drop doesn't reproduce for me on null_blk with the following
setup:

- modprobe null_blk nr_devices=4 submit_queues=48
- test machine: dual socket, two NUMA nodes, 24 cores/socket
- fio script:
fio --direct=1 --size=128G --bsrange=4k-4k --runtime=40 --numjobs=48 \
    --ioengine=libaio --iodepth=64 --group_reporting=1 --filename=/dev/nullb0 \
    --name=randread --rw=randread

result: 10.7M IOPS (base kernel), 10.6M IOPS (patched kernel)

And if 'bs' is increased to 256k, 512k, or 1024k, the IOPS improvement can be
~8% with the multi-page bvec patches in the above test.

BTW, there is no cost added to bio_for_each_bvec(), so blk_queue_split() and
blk_rq_map_sg() should be fine. However, bio_for_each_segment_all()
may not be as quick as before.


Thanks,
Ming



Re: [Cluster-devel] [dm-devel] [PATCH V15 00/18] block: support multi-page bvec

2019-02-17 Thread Ming Lei
On Fri, Feb 15, 2019 at 10:59:47AM -0700, Jens Axboe wrote:
> On 2/15/19 10:14 AM, Bart Van Assche wrote:
> > On Fri, 2019-02-15 at 08:49 -0700, Jens Axboe wrote:
> >> On 2/15/19 4:13 AM, Ming Lei wrote:
> >>> This patchset brings multi-page bvec into block layer:
> >>
> >> Applied, thanks Ming. Let's hope it sticks!
> > 
> > Hi Jens and Ming,
> > 
> > Test nvmeof-mp/002 fails with Jens' for-next branch from this morning.
> > I have not yet tried to figure out which patch introduced the failure.
> > Anyway, this is what I see in the kernel log for test nvmeof-mp/002:
> > 
> > [  475.611363] BUG: unable to handle kernel NULL pointer dereference at 0020
> > [  475.621188] #PF error: [normal kernel read fault]
> > [  475.623148] PGD 0 P4D 0  
> > [  475.624737] Oops:  [#1] PREEMPT SMP KASAN
> > [  475.626628] CPU: 1 PID: 277 Comm: kworker/1:1H Tainted: GB 5.0.0-rc6-dbg+ #1
> > [  475.630232] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> > [  475.633855] Workqueue: kblockd blk_mq_requeue_work
> > [  475.635777] RIP: 0010:__blk_recalc_rq_segments+0xbe/0x590
> > [  475.670948] Call Trace:
> > [  475.693515]  blk_recalc_rq_segments+0x2f/0x50
> > [  475.695081]  blk_insert_cloned_request+0xbb/0x1c0
> > [  475.701142]  dm_mq_queue_rq+0x3d1/0x770
> > [  475.707225]  blk_mq_dispatch_rq_list+0x5fc/0xb10
> > [  475.717137]  blk_mq_sched_dispatch_requests+0x256/0x300
> > [  475.721767]  __blk_mq_run_hw_queue+0xd6/0x180
> > [  475.725920]  __blk_mq_delay_run_hw_queue+0x25c/0x290
> > [  475.727480]  blk_mq_run_hw_queue+0x119/0x1b0
> > [  475.732019]  blk_mq_run_hw_queues+0x7b/0xa0
> > [  475.733468]  blk_mq_requeue_work+0x2cb/0x300
> > [  475.736473]  process_one_work+0x4f1/0xa40
> > [  475.739424]  worker_thread+0x67/0x5b0
> > [  475.741751]  kthread+0x1cf/0x1f0
> > [  475.746034]  ret_from_fork+0x24/0x30
> > 
> > (gdb) list *(__blk_recalc_rq_segments+0xbe)
> > 0x816a152e is in __blk_recalc_rq_segments (block/blk-merge.c:366).
> > 361  struct bio *bio)
> > 362 {
> > 363 struct bio_vec bv, bvprv = { NULL };
> > 364 int prev = 0;
> > 365 unsigned int seg_size, nr_phys_segs;
> > 366 unsigned front_seg_size = bio->bi_seg_front_size;
> > 367 struct bio *fbio, *bbio;
> > 368 struct bvec_iter iter;
> > 369
> > 370 if (!bio)
> 
> Just ran a few tests, and it also seems to cause about a 5% regression
> in per-core IOPS throughput. Prior to this work, I could get 1620K 4k
> rand read IOPS out of a core; now I'm at ~1535K. The cycle stealers seem
> to be blk_queue_split() and blk_rq_map_sg().

Could you share us your test setting?

I will run null_blk first and see if it can be reproduced.

Thanks,
Ming



[Cluster-devel] [PATCH V15 18/18] block: kill BLK_MQ_F_SG_MERGE

2019-02-15 Thread Ming Lei
QUEUE_FLAG_NO_SG_MERGE has been killed, so kill BLK_MQ_F_SG_MERGE too.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 block/blk-mq-debugfs.c   | 1 -
 drivers/block/loop.c | 2 +-
 drivers/block/nbd.c  | 2 +-
 drivers/block/rbd.c  | 2 +-
 drivers/block/skd_main.c | 1 -
 drivers/block/xen-blkfront.c | 2 +-
 drivers/md/dm-rq.c   | 2 +-
 drivers/mmc/core/queue.c | 3 +--
 drivers/scsi/scsi_lib.c  | 2 +-
 include/linux/blk-mq.h   | 1 -
 10 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 697d6213c82b..c39247c5ddb6 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -249,7 +249,6 @@ static const char *const alloc_policy_name[] = {
 static const char *const hctx_flag_name[] = {
HCTX_FLAG_NAME(SHOULD_MERGE),
HCTX_FLAG_NAME(TAG_SHARED),
-   HCTX_FLAG_NAME(SG_MERGE),
HCTX_FLAG_NAME(BLOCKING),
HCTX_FLAG_NAME(NO_SCHED),
 };
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 8ef583197414..3d63ad036398 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1937,7 +1937,7 @@ static int loop_add(struct loop_device **l, int i)
lo->tag_set.queue_depth = 128;
lo->tag_set.numa_node = NUMA_NO_NODE;
lo->tag_set.cmd_size = sizeof(struct loop_cmd);
-   lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+   lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
lo->tag_set.driver_data = lo;
 
err = blk_mq_alloc_tag_set(&lo->tag_set);
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 7c9a949e876b..32a7ba1674b7 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -1571,7 +1571,7 @@ static int nbd_dev_add(int index)
nbd->tag_set.numa_node = NUMA_NO_NODE;
nbd->tag_set.cmd_size = sizeof(struct nbd_cmd);
nbd->tag_set.flags = BLK_MQ_F_SHOULD_MERGE |
-   BLK_MQ_F_SG_MERGE | BLK_MQ_F_BLOCKING;
+   BLK_MQ_F_BLOCKING;
nbd->tag_set.driver_data = nbd;
 
err = blk_mq_alloc_tag_set(&nbd->tag_set);
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 1e92b61d0bd5..abe9e1c89227 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3988,7 +3988,7 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
rbd_dev->tag_set.ops = _mq_ops;
rbd_dev->tag_set.queue_depth = rbd_dev->opts->queue_depth;
rbd_dev->tag_set.numa_node = NUMA_NO_NODE;
-   rbd_dev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+   rbd_dev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
rbd_dev->tag_set.nr_hw_queues = 1;
rbd_dev->tag_set.cmd_size = sizeof(struct work_struct);
 
diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c
index ab893a7571a2..7d3ad6c22ee5 100644
--- a/drivers/block/skd_main.c
+++ b/drivers/block/skd_main.c
@@ -2843,7 +2843,6 @@ static int skd_cons_disk(struct skd_device *skdev)
skdev->sgs_per_request * sizeof(struct scatterlist);
skdev->tag_set.numa_node = NUMA_NO_NODE;
skdev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE |
-   BLK_MQ_F_SG_MERGE |
BLK_ALLOC_POLICY_TO_MQ_FLAG(BLK_TAG_ALLOC_FIFO);
skdev->tag_set.driver_data = skdev;
rc = blk_mq_alloc_tag_set(&skdev->tag_set);
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 0ed4b200fa58..d43a5677ccbc 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -977,7 +977,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
} else
info->tag_set.queue_depth = BLK_RING_SIZE(info);
info->tag_set.numa_node = NUMA_NO_NODE;
-   info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+   info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
info->tag_set.cmd_size = sizeof(struct blkif_req);
info->tag_set.driver_data = info;
 
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 4eb5f8c56535..b2f8eb2365ee 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -527,7 +527,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
md->tag_set->ops = _mq_ops;
md->tag_set->queue_depth = dm_get_blk_mq_queue_depth();
md->tag_set->numa_node = md->numa_node_id;
-   md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+   md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE;
md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
md->tag_set->driver_data = md;
 
diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index 35cc138b096d..cc19e71c71d4 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -410,8 +410,7 @@ int mmc_init_queue(struct mmc_queue *mq,

[Cluster-devel] [PATCH V15 15/18] block: always define BIO_MAX_PAGES as 256

2019-02-15 Thread Ming Lei
Now multi-page bvecs can cover CONFIG_THP_SWAP, so we don't need to
increase BIO_MAX_PAGES for it.

CONFIG_THP_SWAP needs to split one THP into normal pages and add
them all to one bio. With multi-page bvecs, it just takes one bvec to
hold them all.

Reviewed-by: Omar Sandoval 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 include/linux/bio.h | 8 
 1 file changed, 8 deletions(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 9f77adcfde82..bdd11d4c2f05 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -34,15 +34,7 @@
 #define BIO_BUG_ON
 #endif
 
-#ifdef CONFIG_THP_SWAP
-#if HPAGE_PMD_NR > 256
-#define BIO_MAX_PAGES  HPAGE_PMD_NR
-#else
 #define BIO_MAX_PAGES  256
-#endif
-#else
-#define BIO_MAX_PAGES  256
-#endif
 
 #define bio_prio(bio)  (bio)->bi_ioprio
 #define bio_set_prio(bio, prio)((bio)->bi_ioprio = prio)
-- 
2.9.5
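
To make the CONFIG_THP_SWAP point above concrete, a hedged sketch (not part
of the series): because the subpages of a THP are physically contiguous,
bio_add_page() keeps merging them into the bio's last bvec, so the bio never
comes close to needing BIO_MAX_PAGES slots.

	/*
	 * Hedged sketch: add all subpages of a 2MB THP (HPAGE_PMD_NR ==
	 * 512 on x86-64) to one bio.  Each subpage is physically
	 * contiguous with the previous one, so __bio_try_merge_page()
	 * extends the last bvec instead of consuming a new slot; the bio
	 * ends up with bi_vcnt == 1, well under BIO_MAX_PAGES == 256.
	 */
	int i;

	for (i = 0; i < HPAGE_PMD_NR; i++)
		if (!bio_add_page(bio, page + i, PAGE_SIZE, 0))
			break;	/* bio full: submit and chain a new bio */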



[Cluster-devel] [PATCH V15 17/18] block: kill QUEUE_FLAG_NO_SG_MERGE

2019-02-15 Thread Ming Lei
Since bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting"),
the physical segment number is mainly figured out in blk_queue_split() for
the fast path, and the BIO_SEG_VALID flag is set there too.

Now only blk_recount_segments() and blk_recalc_rq_segments() use this
flag.

Basically blk_recount_segments() is bypassed in the fast path, given that
BIO_SEG_VALID is set in blk_queue_split().

As for the other user of blk_recalc_rq_segments():

- it runs in the partial-completion branch of blk_update_request(), which is
an unusual case

- it runs in blk_cloned_rq_check_limits(), which is still not a big problem
if the flag is killed, since dm-rq is the only user.

Multi-page bvecs are enabled now; not doing S/G merging is rather pointless
with the current setup of the I/O path, as it isn't going to save a
significant amount of cycles.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 block/blk-merge.c  | 31 ++-
 block/blk-mq-debugfs.c |  1 -
 block/blk-mq.c |  3 ---
 drivers/md/dm-table.c  | 13 -
 include/linux/blkdev.h |  1 -
 5 files changed, 6 insertions(+), 43 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 1912499b08b7..bed065904677 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -358,8 +358,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
 EXPORT_SYMBOL(blk_queue_split);
 
 static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
-struct bio *bio,
-bool no_sg_merge)
+struct bio *bio)
 {
struct bio_vec bv, bvprv = { NULL };
int prev = 0;
@@ -385,13 +384,6 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
nr_phys_segs = 0;
for_each_bio(bio) {
bio_for_each_bvec(bv, bio, iter) {
-   /*
-* If SG merging is disabled, each bio vector is
-* a segment
-*/
-   if (no_sg_merge)
-   goto new_segment;
-
if (prev) {
if (seg_size + bv.bv_len
> queue_max_segment_size(q))
@@ -421,27 +413,16 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 
 void blk_recalc_rq_segments(struct request *rq)
 {
-   bool no_sg_merge = !!test_bit(QUEUE_FLAG_NO_SG_MERGE,
-   &rq->q->queue_flags);
-
-   rq->nr_phys_segments = __blk_recalc_rq_segments(rq->q, rq->bio,
-   no_sg_merge);
+   rq->nr_phys_segments = __blk_recalc_rq_segments(rq->q, rq->bio);
 }
 
 void blk_recount_segments(struct request_queue *q, struct bio *bio)
 {
-   unsigned short seg_cnt = bio_segments(bio);
-
-   if (test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags) &&
-   (seg_cnt < queue_max_segments(q)))
-   bio->bi_phys_segments = seg_cnt;
-   else {
-   struct bio *nxt = bio->bi_next;
+   struct bio *nxt = bio->bi_next;
 
-   bio->bi_next = NULL;
-   bio->bi_phys_segments = __blk_recalc_rq_segments(q, bio, false);
-   bio->bi_next = nxt;
-   }
+   bio->bi_next = NULL;
+   bio->bi_phys_segments = __blk_recalc_rq_segments(q, bio);
+   bio->bi_next = nxt;
 
bio_set_flag(bio, BIO_SEG_VALID);
 }
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index c782e81db627..697d6213c82b 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -128,7 +128,6 @@ static const char *const blk_queue_flag_name[] = {
QUEUE_FLAG_NAME(SAME_FORCE),
QUEUE_FLAG_NAME(DEAD),
QUEUE_FLAG_NAME(INIT_DONE),
-   QUEUE_FLAG_NAME(NO_SG_MERGE),
QUEUE_FLAG_NAME(POLL),
QUEUE_FLAG_NAME(WC),
QUEUE_FLAG_NAME(FUA),
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 44d471ff8754..fa508ee31742 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2837,9 +2837,6 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
	if (set->nr_maps > HCTX_TYPE_POLL &&
	    set->map[HCTX_TYPE_POLL].nr_queues)
blk_queue_flag_set(QUEUE_FLAG_POLL, q);
 
-   if (!(set->flags & BLK_MQ_F_SG_MERGE))
-   blk_queue_flag_set(QUEUE_FLAG_NO_SG_MERGE, q);
-
q->sg_reserved_size = INT_MAX;
 
INIT_DELAYED_WORK(&q->requeue_work, blk_mq_requeue_work);
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 4b1be754cc41..ba9481f1bf3c 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1698,14 +1698,6 @@ static int device_is_not_random(struct dm_target *ti, struct dm_dev *dev,
return q && !blk_queue_add_random(q);
 }
 
-static int queue_supports_sg_merge(str

[Cluster-devel] [PATCH V15 16/18] block: document usage of bio iterator helpers

2019-02-15 Thread Ming Lei
Now that multi-page bvecs are supported, some helpers may return data page
by page, while others may return it segment by segment; this patch
documents the usage.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 Documentation/block/biovecs.txt | 25 +
 1 file changed, 25 insertions(+)

diff --git a/Documentation/block/biovecs.txt b/Documentation/block/biovecs.txt
index 25689584e6e0..ce6eccaf5df7 100644
--- a/Documentation/block/biovecs.txt
+++ b/Documentation/block/biovecs.txt
@@ -117,3 +117,28 @@ Other implications:
size limitations and the limitations of the underlying devices. Thus
there's no need to define ->merge_bvec_fn() callbacks for individual block
drivers.
+
+Usage of helpers:
+=
+
+* The following helpers whose names have the suffix of "_all" can only be used
+on non-BIO_CLONED bio. They are usually used by filesystem code. Drivers
+shouldn't use them because the bio may have been split before it reached the
+driver.
+
+   bio_for_each_segment_all()
+   bio_first_bvec_all()
+   bio_first_page_all()
+   bio_last_bvec_all()
+
+* The following helpers iterate over single-page segment. The passed 'struct
+bio_vec' will contain a single-page IO vector during the iteration
+
+   bio_for_each_segment()
+   bio_for_each_segment_all()
+
+* The following helpers iterate over multi-page bvec. The passed 'struct
+bio_vec' will contain a multi-page IO vector during the iteration
+
+   bio_for_each_bvec()
+   rq_for_each_bvec()
-- 
2.9.5
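
A hedged usage sketch contrasting the two iteration granularities documented
above (an assumed example, not from the patch):

	/*
	 * Walk the same bio at both granularities.  With multi-page
	 * bvecs, 'pages' can be larger than 'bvecs' for the same data.
	 */
	static void count_units(struct bio *bio)
	{
		struct bio_vec bv;
		struct bvec_iter iter;
		unsigned int pages = 0, bvecs = 0;

		bio_for_each_segment(bv, bio, iter)	/* single-page bv */
			pages++;

		bio_for_each_bvec(bv, bio, iter)	/* multi-page bv */
			bvecs++;

		pr_info("%u single-page segments in %u bvecs\n", pages, bvecs);
	}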



[Cluster-devel] [PATCH V15 14/18] block: enable multipage bvecs

2019-02-15 Thread Ming Lei
This patch pulls the trigger for multi-page bvecs.

Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 block/bio.c | 22 +++---
 fs/iomap.c  |  4 ++--
 fs/xfs/xfs_aops.c   |  4 ++--
 include/linux/bio.h |  2 +-
 4 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 968b12fea564..83a2dfa417ca 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -753,6 +753,8 @@ EXPORT_SYMBOL(bio_add_pc_page);
  * @page: page to add
  * @len: length of the data to add
  * @off: offset of the data in @page
+ * @same_page: if %true only merge if the new data is in the same physical
+ * page as the last segment of the bio.
  *
  * Try to add the data at @page + @off to the last bvec of @bio.  This is a
 * useful optimisation for file systems with a block size smaller than the
@@ -761,19 +763,25 @@ EXPORT_SYMBOL(bio_add_pc_page);
  * Return %true on success or %false on failure.
  */
 bool __bio_try_merge_page(struct bio *bio, struct page *page,
-   unsigned int len, unsigned int off)
+   unsigned int len, unsigned int off, bool same_page)
 {
if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
return false;
 
if (bio->bi_vcnt > 0) {
struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+   phys_addr_t vec_end_addr = page_to_phys(bv->bv_page) +
+   bv->bv_offset + bv->bv_len - 1;
+   phys_addr_t page_addr = page_to_phys(page);
 
-   if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len) {
-   bv->bv_len += len;
-   bio->bi_iter.bi_size += len;
-   return true;
-   }
+   if (vec_end_addr + 1 != page_addr + off)
+   return false;
+   if (same_page && (vec_end_addr & PAGE_MASK) != page_addr)
+   return false;
+
+   bv->bv_len += len;
+   bio->bi_iter.bi_size += len;
+   return true;
}
return false;
 }
@@ -819,7 +827,7 @@ EXPORT_SYMBOL_GPL(__bio_add_page);
 int bio_add_page(struct bio *bio, struct page *page,
 unsigned int len, unsigned int offset)
 {
-   if (!__bio_try_merge_page(bio, page, len, offset)) {
+   if (!__bio_try_merge_page(bio, page, len, offset, false)) {
if (bio_full(bio))
return 0;
__bio_add_page(bio, page, len, offset);
diff --git a/fs/iomap.c b/fs/iomap.c
index af736acd9006..0c350e658b7f 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -318,7 +318,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 */
sector = iomap_sector(iomap, pos);
if (ctx->bio && bio_end_sector(ctx->bio) == sector) {
-   if (__bio_try_merge_page(ctx->bio, page, plen, poff))
+   if (__bio_try_merge_page(ctx->bio, page, plen, poff, true))
goto done;
is_contig = true;
}
@@ -349,7 +349,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
ctx->bio->bi_end_io = iomap_read_end_io;
}
 
-   __bio_add_page(ctx->bio, page, plen, poff);
+   bio_add_page(ctx->bio, page, plen, poff);
 done:
/*
 * Move the caller beyond our range so that it keeps making progress.
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 1f1829e506e8..b9fd44168f61 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -616,12 +616,12 @@ xfs_add_to_ioend(
bdev, sector);
}
 
-   if (!__bio_try_merge_page(wpc->ioend->io_bio, page, len, poff)) {
+   if (!__bio_try_merge_page(wpc->ioend->io_bio, page, len, poff, true)) {
if (iop)
atomic_inc(>write_count);
if (bio_full(wpc->ioend->io_bio))
xfs_chain_bio(wpc->ioend, wbc, bdev, sector);
-   __bio_add_page(wpc->ioend->io_bio, page, len, poff);
+   bio_add_page(wpc->ioend->io_bio, page, len, poff);
}
 
wpc->ioend->io_size += len;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 089370eb84d9..9f77adcfde82 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -441,7 +441,7 @@ extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int);
 extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
   unsigned int, unsigned int);
 bool __bio_try_merge_page(struct bio *bio, struct page *page,
-   unsigned int len, unsigned int off);
+   unsigned int len, unsigned int off, bool same_page);
 void __bio_add_page(struct bio *bio, struct page *page,
unsigned int len, unsigned int off);
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter);
-- 
2.9.5
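
The merge condition in __bio_try_merge_page() above is pure physical-address
arithmetic; a hedged worked example (4KB pages assumed):

	/*
	 * last bvec: bv_page = P (phys 0x10000000), bv_offset = 0, bv_len = 4096
	 * new data:  page = P+1 (phys 0x10001000), off = 0, len = 4096
	 *
	 * vec_end_addr = 0x10000000 + 0 + 4096 - 1 = 0x10000fff
	 * page_addr    = 0x10001000
	 * vec_end_addr + 1 == page_addr + off, so the data merges and
	 * bv_len becomes 8192: the bvec now spans two pages.
	 *
	 * With same_page == true (the iomap/xfs callers), the extra check
	 * (vec_end_addr & PAGE_MASK) != page_addr rejects this cross-page
	 * merge, preserving the old single-page behaviour there.
	 */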



[Cluster-devel] [PATCH V15 11/18] block: loop: pass multi-page bvec to iov_iter

2019-02-15 Thread Ming Lei
iov_iter is implemented on bvec iterator helpers, so it is safe to pass
a multi-page bvec to it, and this way is much more efficient than passing one
page in each bvec.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 drivers/block/loop.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index cf5538942834..8ef583197414 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -511,21 +511,22 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
 loff_t pos, bool rw)
 {
struct iov_iter iter;
+   struct req_iterator rq_iter;
struct bio_vec *bvec;
struct request *rq = blk_mq_rq_from_pdu(cmd);
struct bio *bio = rq->bio;
struct file *file = lo->lo_backing_file;
+   struct bio_vec tmp;
unsigned int offset;
-   int segments = 0;
+   int nr_bvec = 0;
int ret;
 
+   rq_for_each_bvec(tmp, rq, rq_iter)
+   nr_bvec++;
+
if (rq->bio != rq->biotail) {
-   struct req_iterator iter;
-   struct bio_vec tmp;
 
-   __rq_for_each_bio(bio, rq)
-   segments += bio_segments(bio);
-   bvec = kmalloc_array(segments, sizeof(struct bio_vec),
+   bvec = kmalloc_array(nr_bvec, sizeof(struct bio_vec),
 GFP_NOIO);
if (!bvec)
return -EIO;
@@ -534,10 +535,10 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
/*
 * The bios of the request may be started from the middle of
 * the 'bvec' because of bio splitting, so we can't directly
-* copy bio->bi_iov_vec to new bvec. The rq_for_each_segment
+* copy bio->bi_iov_vec to new bvec. The rq_for_each_bvec
 * API will take care of all details for us.
 */
-   rq_for_each_segment(tmp, rq, iter) {
+   rq_for_each_bvec(tmp, rq, rq_iter) {
*bvec = tmp;
bvec++;
}
@@ -551,11 +552,10 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
 */
offset = bio->bi_iter.bi_bvec_done;
bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
-   segments = bio_segments(bio);
}
atomic_set(&cmd->ref, 2);
 
-   iov_iter_bvec(&iter, rw, bvec, segments, blk_rq_bytes(rq));
+   iov_iter_bvec(&iter, rw, bvec, nr_bvec, blk_rq_bytes(rq));
iter.iov_offset = offset;
 
cmd->iocb.ki_pos = pos;
-- 
2.9.5



[Cluster-devel] [PATCH V15 10/18] btrfs: use mp_bvec_last_segment to get bio's last page

2019-02-15 Thread Ming Lei
Preparing for supporting multi-page bvec.

Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 fs/btrfs/extent_io.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index dc8ba3ee515d..986ef49b0269 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2697,11 +2697,12 @@ static int __must_check submit_one_bio(struct bio *bio, int mirror_num,
 {
blk_status_t ret = 0;
struct bio_vec *bvec = bio_last_bvec_all(bio);
-   struct page *page = bvec->bv_page;
+   struct bio_vec bv;
struct extent_io_tree *tree = bio->bi_private;
u64 start;
 
-   start = page_offset(page) + bvec->bv_offset;
+   mp_bvec_last_segment(bvec, &bv);
+   start = page_offset(bv.bv_page) + bv.bv_offset;
 
bio->bi_private = NULL;
 
-- 
2.9.5



[Cluster-devel] [PATCH V15 13/18] block: allow bio_for_each_segment_all() to iterate over multi-page bvec

2019-02-15 Thread Ming Lei
This patch introduces one extra iterator variable to bio_for_each_segment_all(),
so that we can allow bio_for_each_segment_all() to iterate over multi-page bvecs.

Given it is just a mechanical & simple change for all bio_for_each_segment_all()
users, this patch does the tree-wide change in one single patch, so that we can
avoid using a temporary helper for this conversion.

Reviewed-by: Omar Sandoval 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 block/bio.c   | 27 ++-
 block/bounce.c|  6 --
 drivers/md/bcache/btree.c |  3 ++-
 drivers/md/dm-crypt.c |  3 ++-
 drivers/md/raid1.c|  3 ++-
 drivers/staging/erofs/data.c  |  3 ++-
 drivers/staging/erofs/unzip_vle.c |  3 ++-
 fs/block_dev.c|  6 --
 fs/btrfs/compression.c|  3 ++-
 fs/btrfs/disk-io.c|  3 ++-
 fs/btrfs/extent_io.c  |  9 ++---
 fs/btrfs/inode.c  |  6 --
 fs/btrfs/raid56.c |  3 ++-
 fs/crypto/bio.c   |  3 ++-
 fs/direct-io.c|  4 +++-
 fs/exofs/ore.c|  3 ++-
 fs/exofs/ore_raid.c   |  3 ++-
 fs/ext4/page-io.c |  3 ++-
 fs/ext4/readpage.c|  3 ++-
 fs/f2fs/data.c|  9 ++---
 fs/gfs2/lops.c|  9 ++---
 fs/gfs2/meta_io.c |  3 ++-
 fs/iomap.c|  6 --
 fs/mpage.c|  3 ++-
 fs/xfs/xfs_aops.c |  5 +++--
 include/linux/bio.h   | 11 +--
 include/linux/bvec.h  | 30 ++
 27 files changed, 127 insertions(+), 46 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 4db1008309ed..968b12fea564 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1072,8 +1072,9 @@ static int bio_copy_from_iter(struct bio *bio, struct iov_iter *iter)
 {
int i;
struct bio_vec *bvec;
+   struct bvec_iter_all iter_all;
 
-   bio_for_each_segment_all(bvec, bio, i) {
+   bio_for_each_segment_all(bvec, bio, i, iter_all) {
ssize_t ret;
 
ret = copy_page_from_iter(bvec->bv_page,
@@ -1103,8 +1104,9 @@ static int bio_copy_to_iter(struct bio *bio, struct iov_iter iter)
 {
int i;
struct bio_vec *bvec;
+   struct bvec_iter_all iter_all;
 
-   bio_for_each_segment_all(bvec, bio, i) {
+   bio_for_each_segment_all(bvec, bio, i, iter_all) {
ssize_t ret;
 
ret = copy_page_to_iter(bvec->bv_page,
@@ -1126,8 +1128,9 @@ void bio_free_pages(struct bio *bio)
 {
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all iter_all;
 
-   bio_for_each_segment_all(bvec, bio, i)
+   bio_for_each_segment_all(bvec, bio, i, iter_all)
__free_page(bvec->bv_page);
 }
 EXPORT_SYMBOL(bio_free_pages);
@@ -1295,6 +1298,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
struct bio *bio;
int ret;
struct bio_vec *bvec;
+   struct bvec_iter_all iter_all;
 
if (!iov_iter_count(iter))
return ERR_PTR(-EINVAL);
@@ -1368,7 +1372,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
return bio;
 
  out_unmap:
-   bio_for_each_segment_all(bvec, bio, j) {
+   bio_for_each_segment_all(bvec, bio, j, iter_all) {
put_page(bvec->bv_page);
}
bio_put(bio);
@@ -1379,11 +1383,12 @@ static void __bio_unmap_user(struct bio *bio)
 {
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all iter_all;
 
/*
 * make sure we dirty pages we wrote to
 */
-   bio_for_each_segment_all(bvec, bio, i) {
+   bio_for_each_segment_all(bvec, bio, i, iter_all) {
if (bio_data_dir(bio) == READ)
set_page_dirty_lock(bvec->bv_page);
 
@@ -1475,8 +1480,9 @@ static void bio_copy_kern_endio_read(struct bio *bio)
char *p = bio->bi_private;
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all iter_all;
 
-   bio_for_each_segment_all(bvec, bio, i) {
+   bio_for_each_segment_all(bvec, bio, i, iter_all) {
memcpy(p, page_address(bvec->bv_page), bvec->bv_len);
p += bvec->bv_len;
}
@@ -1585,8 +1591,9 @@ void bio_set_pages_dirty(struct bio *bio)
 {
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all iter_all;
 
-   bio_for_each_segment_all(bvec, bio, i) {
+   bio_for_each_segment_all(bvec, bio, i, iter_all) {
if (!PageCompound(bvec->bv_page))
set_page_dirty_lock(bvec->bv_page);
}
@@ -1596,8 +1603,9 @@ static void bio_release_pages(struct bio *bio)
 {
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all iter_all;
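
Distilled from the hunks above, a minimal sketch of the new calling
convention (the pattern every converted caller follows):

	/*
	 * Callers now declare a struct bvec_iter_all and pass it as the
	 * extra argument; each iteration still yields a single-page bvec.
	 */
	struct bio_vec *bvec;
	int i;
	struct bvec_iter_all iter_all;

	bio_for_each_segment_all(bvec, bio, i, iter_all)
		put_page(bvec->bv_page);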
 

[Cluster-devel] [PATCH V15 06/18] block: use bio_for_each_bvec() to compute multi-page bvec count

2019-02-15 Thread Ming Lei
First, it is more efficient to use bio_for_each_bvec() in both
blk_bio_segment_split() and __blk_recalc_rq_segments() to compute how
many multi-page bvecs there are in the bio.

Secondly, once bio_for_each_bvec() is used, the bvec may need to be
split, because its length can be much longer than the max segment size,
so we have to split the big bvec into several segments.

Thirdly, when splitting a multi-page bvec into segments, the max segment
limit may be reached, so bio splitting needs to be considered in this
situation too.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 block/blk-merge.c | 103 +++---
 1 file changed, 83 insertions(+), 20 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index f85d878f313d..4ef56b2d2aa5 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -161,6 +161,73 @@ static inline unsigned get_max_io_size(struct request_queue *q,
return sectors;
 }
 
+static unsigned get_max_segment_size(struct request_queue *q,
+unsigned offset)
+{
+   unsigned long mask = queue_segment_boundary(q);
+
+   /* default segment boundary mask means no boundary limit */
+   if (mask == BLK_SEG_BOUNDARY_MASK)
+   return queue_max_segment_size(q);
+
+   return min_t(unsigned long, mask - (mask & offset) + 1,
+queue_max_segment_size(q));
+}
+
+/*
+ * Split the bvec @bv into segments, and update all kinds of
+ * variables.
+ */
+static bool bvec_split_segs(struct request_queue *q, struct bio_vec *bv,
+   unsigned *nsegs, unsigned *last_seg_size,
+   unsigned *front_seg_size, unsigned *sectors)
+{
+   unsigned len = bv->bv_len;
+   unsigned total_len = 0;
+   unsigned new_nsegs = 0, seg_size = 0;
+
+   /*
+* Multi-page bvec may be too big to hold in one segment, so the
+* current bvec has to be split into multiple segments.
+*/
+   while (len && new_nsegs + *nsegs < queue_max_segments(q)) {
+   seg_size = get_max_segment_size(q, bv->bv_offset + total_len);
+   seg_size = min(seg_size, len);
+
+   new_nsegs++;
+   total_len += seg_size;
+   len -= seg_size;
+
+   if ((bv->bv_offset + total_len) & queue_virt_boundary(q))
+   break;
+   }
+
+   if (!new_nsegs)
+   return !!len;
+
+   /* update front segment size */
+   if (!*nsegs) {
+   unsigned first_seg_size;
+
+   if (new_nsegs == 1)
+   first_seg_size = get_max_segment_size(q, bv->bv_offset);
+   else
+   first_seg_size = queue_max_segment_size(q);
+
+   if (*front_seg_size < first_seg_size)
+   *front_seg_size = first_seg_size;
+   }
+
+   /* update other variables */
+   *last_seg_size = seg_size;
+   *nsegs += new_nsegs;
+   if (sectors)
+   *sectors += total_len >> 9;
+
+   /* split in the middle of the bvec if len != 0 */
+   return !!len;
+}
+
 static struct bio *blk_bio_segment_split(struct request_queue *q,
 struct bio *bio,
 struct bio_set *bs,
@@ -174,7 +241,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
struct bio *new = NULL;
const unsigned max_sectors = get_max_io_size(q, bio);
 
-   bio_for_each_segment(bv, bio, iter) {
+   bio_for_each_bvec(bv, bio, iter) {
/*
 * If the queue doesn't support SG gaps and adding this
 * offset would create a gap, disallow it.
@@ -189,8 +256,12 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 */
if (nsegs < queue_max_segments(q) &&
sectors < max_sectors) {
-   nsegs++;
-   sectors = max_sectors;
+   /* split in the middle of bvec */
+   bv.bv_len = (max_sectors - sectors) << 9;
+   bvec_split_segs(q, &bv, &nsegs,
+   &seg_size,
+   &front_seg_size,
+   &sectors);
}
goto split;
}
@@ -212,14 +283,12 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
if (nsegs == queue_max_segments(q))
goto split;
 
-   if (nsegs == 1 && seg_size > front_seg_size)
-   front_seg_size = seg_size;
-
-   nsegs++;
bvprv = bv;

[Cluster-devel] [PATCH V15 12/18] bcache: avoid to use bio_for_each_segment_all() in bch_bio_alloc_pages()

2019-02-15 Thread Ming Lei
bch_bio_alloc_pages() is always called on a new bio, so it is safe
to access the bvec table directly. Given it is the only case of this
kind, open-code the bvec table access, since bio_for_each_segment_all()
will be changed to support iterating over multi-page bvecs.

Acked-by: Coly Li 
Reviewed-by: Omar Sandoval 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 drivers/md/bcache/util.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
index 20eddeac1531..62fb917f7a4f 100644
--- a/drivers/md/bcache/util.c
+++ b/drivers/md/bcache/util.c
@@ -270,7 +270,11 @@ int bch_bio_alloc_pages(struct bio *bio, gfp_t gfp_mask)
int i;
struct bio_vec *bv;
 
-   bio_for_each_segment_all(bv, bio, i) {
+   /*
+* This is called on freshly new bio, so it is safe to access the
+* bvec table directly.
+*/
+   for (i = 0, bv = bio->bi_io_vec; i < bio->bi_vcnt; bv++, i++) {
bv->bv_page = alloc_page(gfp_mask);
if (!bv->bv_page) {
while (--bv >= bio->bi_io_vec)
-- 
2.9.5



[Cluster-devel] [PATCH V15 09/18] fs/buffer.c: use bvec iterator to truncate the bio

2019-02-15 Thread Ming Lei
Once multi-page bvecs are enabled, the last bvec may include more than one
page, so this patch uses mp_bvec_last_segment() to truncate the bio.

Reviewed-by: Omar Sandoval 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 fs/buffer.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 52d024bfdbc1..817871274c77 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3032,7 +3032,10 @@ void guard_bio_eod(int op, struct bio *bio)
 
/* ..and clear the end of the buffer for reads */
if (op == REQ_OP_READ) {
-   zero_user(bvec->bv_page, bvec->bv_offset + bvec->bv_len,
+   struct bio_vec bv;
+
+   mp_bvec_last_segment(bvec, &bv);
+   zero_user(bv.bv_page, bv.bv_offset + bv.bv_len,
truncated_bytes);
}
 }
-- 
2.9.5



[Cluster-devel] [PATCH V15 07/18] block: use bio_for_each_bvec() to map sg

2019-02-15 Thread Ming Lei
It is more efficient to use bio_for_each_bvec() to map sg; meanwhile
we have to consider splitting the multi-page bvec as done in blk_bio_segment_split().

Reviewed-by: Omar Sandoval 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 block/blk-merge.c | 70 +++
 1 file changed, 50 insertions(+), 20 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 4ef56b2d2aa5..1912499b08b7 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -464,6 +464,54 @@ static int blk_phys_contig_segment(struct request_queue *q, struct bio *bio,
return biovec_phys_mergeable(q, &end_bv, &nxt_bv);
 }
 
+static struct scatterlist *blk_next_sg(struct scatterlist **sg,
+   struct scatterlist *sglist)
+{
+   if (!*sg)
+   return sglist;
+
+   /*
+* If the driver previously mapped a shorter list, we could see a
+* termination bit prematurely unless it fully inits the sg table
+* on each mapping. We KNOW that there must be more entries here
+* or the driver would be buggy, so force clear the termination bit
+* to avoid doing a full sg_init_table() in drivers for each command.
+*/
+   sg_unmark_end(*sg);
+   return sg_next(*sg);
+}
+
+static unsigned blk_bvec_map_sg(struct request_queue *q,
+   struct bio_vec *bvec, struct scatterlist *sglist,
+   struct scatterlist **sg)
+{
+   unsigned nbytes = bvec->bv_len;
+   unsigned nsegs = 0, total = 0, offset = 0;
+
+   while (nbytes > 0) {
+   unsigned seg_size;
+   struct page *pg;
+   unsigned idx;
+
+   *sg = blk_next_sg(sg, sglist);
+
+   seg_size = get_max_segment_size(q, bvec->bv_offset + total);
+   seg_size = min(nbytes, seg_size);
+
+   offset = (total + bvec->bv_offset) % PAGE_SIZE;
+   idx = (total + bvec->bv_offset) / PAGE_SIZE;
+   pg = nth_page(bvec->bv_page, idx);
+
+   sg_set_page(*sg, pg, seg_size, offset);
+
+   total += seg_size;
+   nbytes -= seg_size;
+   nsegs++;
+   }
+
+   return nsegs;
+}
+
 static inline void
 __blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
 struct scatterlist *sglist, struct bio_vec *bvprv,
@@ -481,25 +529,7 @@ __blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
(*sg)->length += nbytes;
} else {
 new_segment:
-   if (!*sg)
-   *sg = sglist;
-   else {
-   /*
-* If the driver previously mapped a shorter
-* list, we could see a termination bit
-* prematurely unless it fully inits the sg
-* table on each mapping. We KNOW that there
-* must be more entries here or the driver
-* would be buggy, so force clear the
-* termination bit to avoid doing a full
-* sg_init_table() in drivers for each command.
-*/
-   sg_unmark_end(*sg);
-   *sg = sg_next(*sg);
-   }
-
-   sg_set_page(*sg, bvec->bv_page, nbytes, bvec->bv_offset);
-   (*nsegs)++;
+   (*nsegs) += blk_bvec_map_sg(q, bvec, sglist, sg);
}
*bvprv = *bvec;
 }
@@ -521,7 +551,7 @@ static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
int nsegs = 0;
 
for_each_bio(bio)
-   bio_for_each_segment(bvec, bio, iter)
+   bio_for_each_bvec(bvec, bio, iter)
__blk_segment_map_sg(q, &bvec, sglist, &bvprv, sg,
 &nsegs);
 
-- 
2.9.5
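
A hedged worked example for blk_bvec_map_sg() above (assumed values: 4KB
pages, queue_max_segment_size() = 8192, no boundary limit):

	/*
	 * bvec: bv_page = P, bv_offset = 2048, bv_len = 12288
	 *
	 * 1st pass: seg_size = min(12288, 8192) = 8192
	 *           idx = 2048 / 4096 = 0, offset = 2048
	 *           sg_set_page(sg0, P, 8192, 2048)
	 * 2nd pass: total = 8192, seg_size = min(4096, 8192) = 4096
	 *           idx = 10240 / 4096 = 2, offset = 10240 % 4096 = 2048
	 *           sg_set_page(sg1, nth_page(P, 2), 4096, 2048)
	 *
	 * One multi-page bvec becomes two sg entries; nsegs = 2.
	 */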



[Cluster-devel] [PATCH V15 08/18] block: introduce mp_bvec_last_segment()

2019-02-15 Thread Ming Lei
BTRFS and guard_bio_eod() need to get the last single-page segment
from one multi-page bvec, so introduce this helper to make them happy.

Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 include/linux/bvec.h | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 0ae729b1c9fe..21f76bad7be2 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -131,4 +131,26 @@ static inline bool bvec_iter_advance(const struct bio_vec *bv,
.bi_bvec_done   = 0,\
 }
 
+/*
+ * Get the last single-page segment from the multi-page bvec and store it
+ * in @seg
+ */
+static inline void mp_bvec_last_segment(const struct bio_vec *bvec,
+   struct bio_vec *seg)
+{
+   unsigned total = bvec->bv_offset + bvec->bv_len;
+   unsigned last_page = (total - 1) / PAGE_SIZE;
+
+   seg->bv_page = nth_page(bvec->bv_page, last_page);
+
+   /* the whole segment is inside the last page */
+   if (bvec->bv_offset >= last_page * PAGE_SIZE) {
+   seg->bv_offset = bvec->bv_offset % PAGE_SIZE;
+   seg->bv_len = bvec->bv_len;
+   } else {
+   seg->bv_offset = 0;
+   seg->bv_len = total - last_page * PAGE_SIZE;
+   }
+}
+
 #endif /* __LINUX_BVEC_ITER_H */
-- 
2.9.5
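
A hedged worked example of the arithmetic above (4KB pages assumed):

	/*
	 * bvec: bv_page = P, bv_offset = 3072, bv_len = 6144
	 * total = 3072 + 6144 = 9216, last_page = (9216 - 1) / 4096 = 2
	 *
	 * seg->bv_page = nth_page(P, 2)
	 * bv_offset (3072) < last_page * PAGE_SIZE (8192), so the else
	 * branch runs: seg->bv_offset = 0, seg->bv_len = 9216 - 8192 = 1024
	 *
	 * i.e. the last single-page segment is the first 1024 bytes of
	 * the bvec's third page.
	 */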



[Cluster-devel] [PATCH V15 05/18] block: introduce bio_for_each_bvec() and rq_for_each_bvec()

2019-02-15 Thread Ming Lei
bio_for_each_bvec() is used for iterating over multi-page bvecs in the bio
split & merge code.

rq_for_each_bvec() can be used by drivers that handle multi-page bvecs
directly; so far, loop is one perfect use case.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 include/linux/bio.h| 10 ++
 include/linux/blkdev.h |  4 
 2 files changed, 14 insertions(+)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 72b4f7be2106..7ef8a7505c0a 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -156,6 +156,16 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
 #define bio_for_each_segment(bvl, bio, iter)   \
__bio_for_each_segment(bvl, bio, iter, (bio)->bi_iter)
 
+#define __bio_for_each_bvec(bvl, bio, iter, start) \
+   for (iter = (start);\
+(iter).bi_size &&  \
+   ((bvl = mp_bvec_iter_bvec((bio)->bi_io_vec, (iter))), 1); \
+bio_advance_iter((bio), &(iter), (bvl).bv_len))
+
+/* iterate over multi-page bvec */
+#define bio_for_each_bvec(bvl, bio, iter)  \
+   __bio_for_each_bvec(bvl, bio, iter, (bio)->bi_iter)
+
 #define bio_iter_last(bvec, iter) ((iter).bi_size == (bvec).bv_len)
 
 static inline unsigned bio_segments(struct bio *bio)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 3603270cb82d..b6292d469ea4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -792,6 +792,10 @@ struct req_iterator {
__rq_for_each_bio(_iter.bio, _rq)   \
bio_for_each_segment(bvl, _iter.bio, _iter.iter)
 
+#define rq_for_each_bvec(bvl, _rq, _iter)  \
+   __rq_for_each_bio(_iter.bio, _rq)   \
+   bio_for_each_bvec(bvl, _iter.bio, _iter.iter)
+
 #define rq_iter_last(bvec, _iter)  \
(_iter.bio->bi_next == NULL &&  \
 bio_iter_last(bvec, _iter.iter))
-- 
2.9.5
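
A hedged usage sketch (this is the pattern the loop driver adopts later in
the series): count the multi-page bvecs of a request before building an
iov_iter over them.

	struct req_iterator rq_iter;
	struct bio_vec bv;
	int nr_bvec = 0;

	rq_for_each_bvec(bv, rq, rq_iter)	/* one multi-page bvec each */
		nr_bvec++;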



[Cluster-devel] [PATCH V15 04/18] block: introduce multi-page bvec helpers

2019-02-15 Thread Ming Lei
This patch introduces helpers of 'mp_bvec_iter_*' for multi-page bvec
support.

The introduced helpers treat one bvec as a real multi-page segment,
which may include more than one page.

The existing bvec_iter_* helpers remain the interfaces for the current
bvec iterator, which drivers, fs, dm and so on assume to be single-page.
These helpers are redefined to build single-page bvecs in flight, so this
change won't break current bio/bvec users, which need no changes.

Follows some multi-page bvec background:

- bvecs stored in bio->bi_io_vec are always multi-page style

- bvec(struct bio_vec) represents one physically contiguous I/O
  buffer; now the buffer may include more than one page after
  multi-page bvec is supported, and all the pages represented
  by one bvec are physically contiguous. Before multi-page bvec
  support, at most one page was included in one bvec; we call that a
  single-page bvec.

- .bv_page of the bvec points to the 1st page in the multi-page bvec

- .bv_offset of the bvec is the offset of the buffer in the bvec

The effect on the current drivers/filesystem/dm/bcache/...:

- almost everyone supposes that one bvec only includes one single
  page, so we keep the sp (single-page) interface unchanged; for
  example, bio_for_each_segment() still returns single-page bvecs

- bio_for_each_segment_all() will return single-page bvecs too

- during iterating, the iterator variable (struct bvec_iter) is always
  updated in multi-page bvec style, and bvec_iter_advance() is kept
  unchanged

- the returned (copied) single-page bvec is built in flight by the bvec
  helpers from the stored multi-page bvec
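
As a concrete illustration of the last two points, here is a standalone
userspace model (assuming 4 KiB pages; field names only mirror the
kernel's, this is not kernel code) of how a single-page bvec is built in
flight from the stored multi-page bvec and the iterator's bi_bvec_done:

	#include <stdio.h>

	#define PAGE_SIZE 4096u

	struct bvec { unsigned page_idx, bv_offset, bv_len; };

	int main(void)
	{
		/* stored multi-page bvec: 10 KiB starting at offset 1024 */
		struct bvec mp = { 0, 1024, 10240 };
		unsigned done = 5000;	/* bi_bvec_done: bytes consumed */

		/* mp_bvec_iter_*: the multi-page view */
		unsigned mp_off = mp.bv_offset + done;
		unsigned mp_len = mp.bv_len - done;

		/* bvec_iter_*: the single-page view built in flight */
		unsigned page = mp.page_idx + mp_off / PAGE_SIZE; /* nth_page() */
		unsigned off = mp_off % PAGE_SIZE;
		unsigned room = PAGE_SIZE - off;
		unsigned len = mp_len < room ? mp_len : room;

		/* prints: sp bvec: page 1, offset 1928, len 2168 */
		printf("sp bvec: page %u, offset %u, len %u\n", page, off, len);
		return 0;
	}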

Reviewed-by: Christoph Hellwig 
Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 include/linux/bvec.h | 30 +++---
 1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index ba0ae40e77c9..0ae729b1c9fe 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * was unsigned short, but we might as well be ready for > 64kB I/O pages
@@ -50,16 +51,39 @@ struct bvec_iter {
  */
 #define __bvec_iter_bvec(bvec, iter)   (&(bvec)[(iter).bi_idx])
 
-#define bvec_iter_page(bvec, iter) \
+/* multi-page (mp_bvec) helpers */
+#define mp_bvec_iter_page(bvec, iter)  \
(__bvec_iter_bvec((bvec), (iter))->bv_page)
 
-#define bvec_iter_len(bvec, iter)  \
+#define mp_bvec_iter_len(bvec, iter)   \
min((iter).bi_size, \
__bvec_iter_bvec((bvec), (iter))->bv_len - (iter).bi_bvec_done)
 
-#define bvec_iter_offset(bvec, iter)   \
+#define mp_bvec_iter_offset(bvec, iter)\
(__bvec_iter_bvec((bvec), (iter))->bv_offset + (iter).bi_bvec_done)
 
+#define mp_bvec_iter_page_idx(bvec, iter)  \
+   (mp_bvec_iter_offset((bvec), (iter)) / PAGE_SIZE)
+
+#define mp_bvec_iter_bvec(bvec, iter)  \
+((struct bio_vec) {\
+   .bv_page= mp_bvec_iter_page((bvec), (iter)),\
+   .bv_len = mp_bvec_iter_len((bvec), (iter)), \
+   .bv_offset  = mp_bvec_iter_offset((bvec), (iter)),  \
+})
+
+/* For building single-page bvec in flight */
+ #define bvec_iter_offset(bvec, iter)  \
+   (mp_bvec_iter_offset((bvec), (iter)) % PAGE_SIZE)
+
+#define bvec_iter_len(bvec, iter)  \
+   min_t(unsigned, mp_bvec_iter_len((bvec), (iter)),   \
+ PAGE_SIZE - bvec_iter_offset((bvec), (iter)))
+
+#define bvec_iter_page(bvec, iter) \
+   nth_page(mp_bvec_iter_page((bvec), (iter)), \
+mp_bvec_iter_page_idx((bvec), (iter)))
+
 #define bvec_iter_bvec(bvec, iter) \
 ((struct bio_vec) {\
.bv_page= bvec_iter_page((bvec), (iter)),   \
-- 
2.9.5



[Cluster-devel] [PATCH V15 03/18] block: remove bvec_iter_rewind()

2019-02-15 Thread Ming Lei
Commit 7759eb23fd980 ("block: remove bio_rewind_iter()") removes
bio_rewind_iter(), then no one uses bvec_iter_rewind() any more,
so remove it.

Reviewed-by: Omar Sandoval 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 include/linux/bvec.h | 24 
 1 file changed, 24 deletions(-)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 02c73c6aa805..ba0ae40e77c9 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -92,30 +92,6 @@ static inline bool bvec_iter_advance(const struct bio_vec *bv,
return true;
 }
 
-static inline bool bvec_iter_rewind(const struct bio_vec *bv,
-struct bvec_iter *iter,
-unsigned int bytes)
-{
-   while (bytes) {
-   unsigned len = min(bytes, iter->bi_bvec_done);
-
-   if (iter->bi_bvec_done == 0) {
-   if (WARN_ONCE(iter->bi_idx == 0,
- "Attempted to rewind iter beyond "
- "bvec's boundaries\n")) {
-   return false;
-   }
-   iter->bi_idx--;
-   iter->bi_bvec_done = __bvec_iter_bvec(bv, *iter)->bv_len;
-   continue;
-   }
-   bytes -= len;
-   iter->bi_size += len;
-   iter->bi_bvec_done -= len;
-   }
-   return true;
-}
-
 #define for_each_bvec(bvl, bio_vec, iter, start)   \
for (iter = (start);\
 (iter).bi_size &&  \
-- 
2.9.5



[Cluster-devel] [PATCH V15 00/18] block: support multi-page bvec

2019-02-15 Thread Ming Lei
 CC list,
as suggested by Christoph and Dave Chinner
V9:
- fix regression on iomap's sub-pagesize io vec, covered by patch 13
V8:
- remove prepare patches which all are merged to linus tree
- rebase on for-4.21/block
- address comments on V7
- add patches of killing NO_SG_MERGE

V7:
- include Christoph and Mike's bio_clone_bioset() patches, which is
  actually prepare patches for multipage bvec
- address Christoph's comments

V6:
- avoid to introduce lots of renaming, follow Jen's suggestion of
using the name of chunk for multipage io vector
- include Christoph's three prepare patches
- decrease stack usage for using bio_for_each_chunk_segment_all()
- address Kent's comment

V5:
- remove some of prepare patches, which have been merged already
- add bio_clone_seg_bioset() to fix DM's bio clone, which
is introduced by 18a25da84354c6b (dm: ensure bio submission follows
a depth-first tree walk)
- rebase on the latest block for-v4.18

V4:
- rename bio_for_each_segment*() as bio_for_each_page*(), rename
bio_segments() as bio_pages(), rename rq_for_each_segment() as
rq_for_each_pages(), because these helpers never return real
segment, and they always return single page bvec

- introducing segment_for_each_page_all()

- introduce new 
bio_for_each_segment*()/rq_for_each_segment()/bio_segments()
for returning real multipage segment

- rewrite segment_last_page()

- rename bvec iterator helper as suggested by Christoph

- replace comment with applying bio helpers as suggested by Christoph

- document usage of bio iterator helpers

- redefine BIO_MAX_PAGES as 256 to make the biggest bvec table
accommodated in 4K page

- move bio_alloc_pages() into bcache as suggested by Christoph

V3:
- rebase on v4.13-rc3 with for-next of block tree
- run more xfstests: xfs/ext4 over NVMe, Sata, DM(linear),
MD(raid1), and not see regressions triggered
- add Reviewed-by on some btrfs patches
- remove two MD patches because both are merged to linus tree
  already

V2:
- bvec table direct access in raid has been cleaned, so NO_MP
flag is dropped
- rebase on recent Neil Brown's change on bio and bounce code
- reorganize the patchset

V1:
- against v4.10-rc1 and some cleanup in V0 are in -linus already
- handle queue_virt_boundary() in mp bvec change and make NVMe happy
- further BTRFS cleanup
- remove QUEUE_FLAG_SPLIT_MP
- rename for two new helpers of bio_for_each_segment_all()
- fix bounce convertion
- address comments in V0

[1], http://marc.info/?l=linux-kernel&m=141680246629547&w=2
[2], https://patchwork.kernel.org/patch/9451523/
[3], http://marc.info/?t=14773544711&r=1&w=2
[4], http://marc.info/?l=linux-mm&m=147745525801433&w=2
[5], http://marc.info/?t=14956948457&r=1&w=2
[6], http://marc.info/?t=14982021534&r=1&w=2



Christoph Hellwig (1):
  btrfs: look at bi_size for repair decisions

Ming Lei (17):
  block: don't use bio->bi_vcnt to figure out segment number
  block: remove bvec_iter_rewind()
  block: introduce multi-page bvec helpers
  block: introduce bio_for_each_bvec() and rq_for_each_bvec()
  block: use bio_for_each_bvec() to compute multi-page bvec count
  block: use bio_for_each_bvec() to map sg
  block: introduce mp_bvec_last_segment()
  fs/buffer.c: use bvec iterator to truncate the bio
  btrfs: use mp_bvec_last_segment to get bio's last page
  block: loop: pass multi-page bvec to iov_iter
  bcache: avoid to use bio_for_each_segment_all() in
bch_bio_alloc_pages()
  block: allow bio_for_each_segment_all() to iterate over multi-page
bvec
  block: enable multipage bvecs
  block: always define BIO_MAX_PAGES as 256
  block: document usage of bio iterator helpers
  block: kill QUEUE_FLAG_NO_SG_MERGE
  block: kill BLK_MQ_F_SG_MERGE

 Documentation/block/biovecs.txt   |  25 +
 block/bio.c   |  49 ++---
 block/blk-merge.c | 210 +-
 block/blk-mq-debugfs.c|   2 -
 block/blk-mq.c|   3 -
 block/bounce.c|   6 +-
 drivers/block/loop.c  |  22 ++--
 drivers/block/nbd.c   |   2 +-
 drivers/block/rbd.c   |   2 +-
 drivers/block/skd_main.c  |   1 -
 drivers/block/xen-blkfront.c  |   2 +-
 drivers/md/bcache/btree.c |   3 +-
 drivers/md/bcache/util.c  |   6 +-
 drivers/md/dm-crypt.c |   3 +-
 drivers/md/dm-rq.c|   2 +-
 drivers/md/dm-table.c |  13 ---
 drivers/md/raid1.c|   3 +-
 drivers/mmc/core/queue.c  |   3 +-
 drivers/scsi/scsi_lib.c   |   2 +-
 drivers/staging/erofs/data.c

[Cluster-devel] [PATCH V15 01/18] btrfs: look at bi_size for repair decisions

2019-02-15 Thread Ming Lei
From: Christoph Hellwig 

bio_readpage_error currently uses bi_vcnt to decide if it is worth
retrying an I/O.  But the vector count is mostly an implementation
artifact - it really should figure out if there is more than a
single sector worth retrying.  Use bi_size for that and shift by
PAGE_SHIFT.  This really should be blocks/sectors, but given that
btrfs doesn't support a sector size different from the PAGE_SIZE
using the page size keeps the changes to a minimum.
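
For illustration, with 4 KiB pages (PAGE_SHIFT == 12) the new check works
out as below; the numbers are hypothetical:

	#include <stdio.h>

	#define PAGE_SHIFT 12	/* 4 KiB pages */

	int main(void)
	{
		unsigned bi_size = 16384;	/* a 16 KiB failed read */

		/* the new check: how many pages worth of data failed */
		unsigned failed_bio_pages = bi_size >> PAGE_SHIFT;

		/* bi_vcnt could be anything from 1 (one 16 KiB multi-page
		 * bvec) to 4 (four single-page bvecs), so it no longer
		 * reflects the amount of data */
		printf("failed_bio_pages = %u\n", failed_bio_pages); /* 4 */
		return 0;
	}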

Reviewed-by: Omar Sandoval 
Reviewed-by: David Sterba 
Signed-off-by: Christoph Hellwig 
---
 fs/btrfs/extent_io.c | 2 +-
 include/linux/bio.h  | 6 --
 2 files changed, 1 insertion(+), 7 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 52abe4082680..dc8ba3ee515d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2350,7 +2350,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
int read_mode = 0;
blk_status_t status;
int ret;
-   unsigned failed_bio_pages = bio_pages_all(failed_bio);
+   unsigned failed_bio_pages = failed_bio->bi_iter.bi_size >> PAGE_SHIFT;
 
BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
 
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7380b094dcca..72b4f7be2106 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -263,12 +263,6 @@ static inline void bio_get_last_bvec(struct bio *bio, struct bio_vec *bv)
bv->bv_len = iter.bi_bvec_done;
 }
 
-static inline unsigned bio_pages_all(struct bio *bio)
-{
-   WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
-   return bio->bi_vcnt;
-}
-
 static inline struct bio_vec *bio_first_bvec_all(struct bio *bio)
 {
WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
-- 
2.9.5



[Cluster-devel] [PATCH V15 02/18] block: don't use bio->bi_vcnt to figure out segment number

2019-02-15 Thread Ming Lei
It is wrong to use bio->bi_vcnt to figure out how many segments
there are in the bio even when the CLONED flag isn't set on the bio,
because the bio may have been split or advanced.

So always use bio_segments() in blk_recount_segments(); it shouldn't
cause any performance loss now because the physical segment number is
figured out in blk_queue_split() and BIO_SEG_VALID is set there too, since
bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting").
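
A toy model of the failure mode (standalone C; it ignores bi_bvec_done
and multi-page bvec lengths, so it only sketches the idea):

	#include <stdio.h>

	struct bio_model { unsigned bi_vcnt, bi_idx; };

	int main(void)
	{
		/* a bio advanced past its first two single-page vectors,
		 * e.g. after a split */
		struct bio_model bio = { .bi_vcnt = 4, .bi_idx = 2 };

		unsigned by_vcnt = bio.bi_vcnt;			/* stale: 4 */
		unsigned by_iter = bio.bi_vcnt - bio.bi_idx;	/* left: 2 */

		/* prints: bi_vcnt says 4, iterating from bi_iter says 2 */
		printf("bi_vcnt says %u, iterating from bi_iter says %u\n",
		       by_vcnt, by_iter);
		return 0;
	}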

Reviewed-by: Omar Sandoval 
Reviewed-by: Christoph Hellwig 
Fixes: 76d8137a3113 ("blk-merge: recaculate segment if it isn't less than max 
segments")
Signed-off-by: Ming Lei 
---
 block/blk-merge.c | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 71e9ac03f621..f85d878f313d 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -367,13 +367,7 @@ void blk_recalc_rq_segments(struct request *rq)
 
 void blk_recount_segments(struct request_queue *q, struct bio *bio)
 {
-   unsigned short seg_cnt;
-
-   /* estimate segment number by bi_vcnt for non-cloned bio */
-   if (bio_flagged(bio, BIO_CLONED))
-   seg_cnt = bio_segments(bio);
-   else
-   seg_cnt = bio->bi_vcnt;
+   unsigned short seg_cnt = bio_segments(bio);
 
if (test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags) &&
(seg_cnt < queue_max_segments(q)))
-- 
2.9.5



Re: [Cluster-devel] [PATCH V14 00/18] block: support multi-page bvec

2019-01-21 Thread Ming Lei
On Mon, Jan 21, 2019 at 01:43:21AM -0800, Sagi Grimberg wrote:
> 
> > V14:
> > - drop patch(patch 4 in V13) for renaming bvec helpers, as suggested by 
> > Jens
> > - use mp_bvec_* as multi-page bvec helper name
> > - fix one build issue, which is caused by missing one converion of
> > bio_for_each_segment_all in fs/gfs2
> > - fix one 32bit ARCH specific issue caused by segment boundary mask
> > overflow
> 
> Hey Ming,
> 
> So is nvme-tcp also affected here? The only point where I see nvme-tcp
> can be affected is when initializing a bvec iter using bio_segments() as
> everywhere else we use iters which should transparently work..
> 
> I see that loop was converted, does it mean that nvme-tcp needs to
> call something like?
> --
>   bio_for_each_mp_bvec(bv, bio, iter)
>   nr_bvecs++;

bio_for_each_segment()/bio_segments() still work, just not as efficiently
as bio_for_each_mp_bvec(), given each multi-page bvec (very similar to a
scatterlist entry) is returned in each loop iteration.

I don't look at nvme-tcp code yet. But if nvme-tcp supports this way,
it can benefit from bio_for_each_mp_bvec().

Thanks,
Ming



Re: [Cluster-devel] [PATCH V14 00/18] block: support multi-page bvec

2019-01-21 Thread Ming Lei
On Mon, Jan 21, 2019 at 09:38:10AM +0100, Christoph Hellwig wrote:
> On Mon, Jan 21, 2019 at 04:37:12PM +0800, Ming Lei wrote:
> > On Mon, Jan 21, 2019 at 09:22:46AM +0100, Christoph Hellwig wrote:
> > > On Mon, Jan 21, 2019 at 04:17:47PM +0800, Ming Lei wrote:
> > > > V14:
> > > > - drop patch(patch 4 in V13) for renaming bvec helpers, as 
> > > > suggested by Jens
> > > > - use mp_bvec_* as multi-page bvec helper name
> > > 
> > > WTF?  Where is this coming from?  mp is just a nightmare of a name,
> > > and I also didn't see any comments like that.
> > 
> > You should see the recent discussion in which Jens doesn't agree on
> > renaming bvec helper name, so the previous patch of 'block: rename bvec 
> > helpers'
> 
> Where is that discussion?

https://marc.info/?l=linux-next&m=154777954232109&w=2


Thanks,
Ming



[Cluster-devel] [PATCH V14 17/18] block: kill QUEUE_FLAG_NO_SG_MERGE

2019-01-21 Thread Ming Lei
Since bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting"),
the physical segment number is mainly figured out in blk_queue_split() for
the fast path, and the BIO_SEG_VALID flag is set there too.

Now only blk_recount_segments() and blk_recalc_rq_segments() use this
flag.

Basically blk_recount_segments() is bypassed in the fast path given
BIO_SEG_VALID is set in blk_queue_split().

For the other user, blk_recalc_rq_segments():

- it runs in the partial completion branch of blk_update_request(), which
is an unusual case

- it runs in blk_cloned_rq_check_limits(), still not a big problem if the
flag is killed, since dm-rq is the only user.

Multi-page bvec is enabled now; not doing S/G merging is rather pointless
with the current setup of the I/O path, as it isn't going to save a
significant amount of cycles.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 block/blk-merge.c  | 31 ++-
 block/blk-mq-debugfs.c |  1 -
 block/blk-mq.c |  3 ---
 drivers/md/dm-table.c  | 13 -
 include/linux/blkdev.h |  1 -
 5 files changed, 6 insertions(+), 43 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 8a498f29636f..b990853f6de7 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -358,8 +358,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
 EXPORT_SYMBOL(blk_queue_split);
 
 static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
-struct bio *bio,
-bool no_sg_merge)
+struct bio *bio)
 {
struct bio_vec bv, bvprv = { NULL };
int prev = 0;
@@ -385,13 +384,6 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
nr_phys_segs = 0;
for_each_bio(bio) {
bio_for_each_mp_bvec(bv, bio, iter) {
-   /*
-* If SG merging is disabled, each bio vector is
-* a segment
-*/
-   if (no_sg_merge)
-   goto new_segment;
-
if (prev) {
if (seg_size + bv.bv_len
> queue_max_segment_size(q))
@@ -421,27 +413,16 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 
 void blk_recalc_rq_segments(struct request *rq)
 {
-   bool no_sg_merge = !!test_bit(QUEUE_FLAG_NO_SG_MERGE,
-   &rq->q->queue_flags);
-
-   rq->nr_phys_segments = __blk_recalc_rq_segments(rq->q, rq->bio,
-   no_sg_merge);
+   rq->nr_phys_segments = __blk_recalc_rq_segments(rq->q, rq->bio);
 }
 
 void blk_recount_segments(struct request_queue *q, struct bio *bio)
 {
-   unsigned short seg_cnt = bio_segments(bio);
-
-   if (test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags) &&
-   (seg_cnt < queue_max_segments(q)))
-   bio->bi_phys_segments = seg_cnt;
-   else {
-   struct bio *nxt = bio->bi_next;
+   struct bio *nxt = bio->bi_next;
 
-   bio->bi_next = NULL;
-   bio->bi_phys_segments = __blk_recalc_rq_segments(q, bio, false);
-   bio->bi_next = nxt;
-   }
+   bio->bi_next = NULL;
+   bio->bi_phys_segments = __blk_recalc_rq_segments(q, bio);
+   bio->bi_next = nxt;
 
bio_set_flag(bio, BIO_SEG_VALID);
 }
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 90d68760af08..2f9a11ef5bad 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -128,7 +128,6 @@ static const char *const blk_queue_flag_name[] = {
QUEUE_FLAG_NAME(SAME_FORCE),
QUEUE_FLAG_NAME(DEAD),
QUEUE_FLAG_NAME(INIT_DONE),
-   QUEUE_FLAG_NAME(NO_SG_MERGE),
QUEUE_FLAG_NAME(POLL),
QUEUE_FLAG_NAME(WC),
QUEUE_FLAG_NAME(FUA),
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3ba37b9e15e9..fa45817a7e62 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2829,9 +2829,6 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
set->map[HCTX_TYPE_POLL].nr_queues)
blk_queue_flag_set(QUEUE_FLAG_POLL, q);
 
-   if (!(set->flags & BLK_MQ_F_SG_MERGE))
-   blk_queue_flag_set(QUEUE_FLAG_NO_SG_MERGE, q);
-
q->sg_reserved_size = INT_MAX;
 
INIT_DELAYED_WORK(&q->requeue_work, blk_mq_requeue_work);
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 4b1be754cc41..ba9481f1bf3c 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1698,14 +1698,6 @@ static int device_is_not_random(struct dm_target *ti, struct dm_dev *dev,
return q && !blk_queue_add_random(q);
 }
 
-static int queue_supports_sg_merge(str

[Cluster-devel] [PATCH V14 18/18] block: kill BLK_MQ_F_SG_MERGE

2019-01-21 Thread Ming Lei
QUEUE_FLAG_NO_SG_MERGE has been killed, so kill BLK_MQ_F_SG_MERGE too.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 block/blk-mq-debugfs.c   | 1 -
 drivers/block/loop.c | 2 +-
 drivers/block/nbd.c  | 2 +-
 drivers/block/rbd.c  | 2 +-
 drivers/block/skd_main.c | 1 -
 drivers/block/xen-blkfront.c | 2 +-
 drivers/md/dm-rq.c   | 2 +-
 drivers/mmc/core/queue.c | 3 +--
 drivers/scsi/scsi_lib.c  | 2 +-
 include/linux/blk-mq.h   | 1 -
 10 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 2f9a11ef5bad..2ba0aa05ce13 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -250,7 +250,6 @@ static const char *const alloc_policy_name[] = {
 static const char *const hctx_flag_name[] = {
HCTX_FLAG_NAME(SHOULD_MERGE),
HCTX_FLAG_NAME(TAG_SHARED),
-   HCTX_FLAG_NAME(SG_MERGE),
HCTX_FLAG_NAME(BLOCKING),
HCTX_FLAG_NAME(NO_SCHED),
 };
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 168a151aba49..714924cfb050 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1937,7 +1937,7 @@ static int loop_add(struct loop_device **l, int i)
lo->tag_set.queue_depth = 128;
lo->tag_set.numa_node = NUMA_NO_NODE;
lo->tag_set.cmd_size = sizeof(struct loop_cmd);
-   lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+   lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
lo->tag_set.driver_data = lo;
 
err = blk_mq_alloc_tag_set(&lo->tag_set);
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 08696f5f00bb..999c94de78e5 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -1570,7 +1570,7 @@ static int nbd_dev_add(int index)
nbd->tag_set.numa_node = NUMA_NO_NODE;
nbd->tag_set.cmd_size = sizeof(struct nbd_cmd);
nbd->tag_set.flags = BLK_MQ_F_SHOULD_MERGE |
-   BLK_MQ_F_SG_MERGE | BLK_MQ_F_BLOCKING;
+   BLK_MQ_F_BLOCKING;
nbd->tag_set.driver_data = nbd;
 
err = blk_mq_alloc_tag_set(&nbd->tag_set);
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 1e92b61d0bd5..abe9e1c89227 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3988,7 +3988,7 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
rbd_dev->tag_set.ops = _mq_ops;
rbd_dev->tag_set.queue_depth = rbd_dev->opts->queue_depth;
rbd_dev->tag_set.numa_node = NUMA_NO_NODE;
-   rbd_dev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+   rbd_dev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
rbd_dev->tag_set.nr_hw_queues = 1;
rbd_dev->tag_set.cmd_size = sizeof(struct work_struct);
 
diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c
index ab893a7571a2..7d3ad6c22ee5 100644
--- a/drivers/block/skd_main.c
+++ b/drivers/block/skd_main.c
@@ -2843,7 +2843,6 @@ static int skd_cons_disk(struct skd_device *skdev)
skdev->sgs_per_request * sizeof(struct scatterlist);
skdev->tag_set.numa_node = NUMA_NO_NODE;
skdev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE |
-   BLK_MQ_F_SG_MERGE |
BLK_ALLOC_POLICY_TO_MQ_FLAG(BLK_TAG_ALLOC_FIFO);
skdev->tag_set.driver_data = skdev;
rc = blk_mq_alloc_tag_set(&skdev->tag_set);
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 0ed4b200fa58..d43a5677ccbc 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -977,7 +977,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
} else
info->tag_set.queue_depth = BLK_RING_SIZE(info);
info->tag_set.numa_node = NUMA_NO_NODE;
-   info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+   info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
info->tag_set.cmd_size = sizeof(struct blkif_req);
info->tag_set.driver_data = info;
 
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 4eb5f8c56535..b2f8eb2365ee 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -527,7 +527,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
md->tag_set->ops = _mq_ops;
md->tag_set->queue_depth = dm_get_blk_mq_queue_depth();
md->tag_set->numa_node = md->numa_node_id;
-   md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+   md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE;
md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
md->tag_set->driver_data = md;
 
diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index 35cc138b096d..cc19e71c71d4 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -410,8 +410,7 @@ int mmc_init_queue(struct mmc_queue *mq,

[Cluster-devel] [PATCH V14 16/18] block: document usage of bio iterator helpers

2019-01-21 Thread Ming Lei
Now that multi-page bvec is supported, some helpers return data page by
page while others return it segment by segment; this patch documents the
usage.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 Documentation/block/biovecs.txt | 25 +
 1 file changed, 25 insertions(+)

diff --git a/Documentation/block/biovecs.txt b/Documentation/block/biovecs.txt
index 25689584e6e0..4a466bcb9611 100644
--- a/Documentation/block/biovecs.txt
+++ b/Documentation/block/biovecs.txt
@@ -117,3 +117,28 @@ Other implications:
size limitations and the limitations of the underlying devices. Thus
there's no need to define ->merge_bvec_fn() callbacks for individual block
drivers.
+
+Usage of helpers:
+=
+
+* The following helpers whose names have the suffix of "_all" can only be used
+on non-BIO_CLONED bio. They are usually used by filesystem code. Drivers
+shouldn't use them because the bio may have been split before it reached the
+driver.
+
+   bio_for_each_segment_all()
+   bio_first_bvec_all()
+   bio_first_page_all()
+   bio_last_bvec_all()
+
+* The following helpers iterate over single-page segment. The passed 'struct
+bio_vec' will contain a single-page IO vector during the iteration
+
+   bio_for_each_segment()
+   bio_for_each_segment_all()
+
+* The following helpers iterate over multi-page bvec. The passed 'struct
+bio_vec' will contain a multi-page IO vector during the iteration
+
+   bio_for_each_mp_bvec()
+   rq_for_each_mp_bvec()
-- 
2.9.5



[Cluster-devel] [PATCH V14 15/18] block: always define BIO_MAX_PAGES as 256

2019-01-21 Thread Ming Lei
Now multi-page bvec can cover CONFIG_THP_SWAP, so we don't need to
increase BIO_MAX_PAGES for it.

CONFIG_THP_SWAP needs to split one THP into normal pages and adds
them all to one bio. With multipage-bvec, it just takes one bvec to
hold them all.
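
The arithmetic, for the common x86-64 configuration (a sketch; the exact
HPAGE_PMD_NR depends on the architecture and page size):

	#include <stdio.h>

	int main(void)
	{
		unsigned page_size = 4096, thp_size = 2 * 1024 * 1024;
		unsigned hpage_pmd_nr = thp_size / page_size; /* 512 */

		/* old scheme: BIO_MAX_PAGES had to grow to HPAGE_PMD_NR
		 * (512 > 256) just so a whole THP could be added page by
		 * page */
		printf("single-page bvecs needed: %u\n", hpage_pmd_nr);

		/* new scheme: the THP is physically contiguous, so one
		 * bvec with bv_len == 2 MiB holds it all, and 256 bvec
		 * slots is always enough */
		printf("multi-page bvecs needed: %u\n", 1u);
		return 0;
	}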

Reviewed-by: Omar Sandoval 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 include/linux/bio.h | 8 
 1 file changed, 8 deletions(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index af288f6e8ab0..1d279a6ae737 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -34,15 +34,7 @@
 #define BIO_BUG_ON
 #endif
 
-#ifdef CONFIG_THP_SWAP
-#if HPAGE_PMD_NR > 256
-#define BIO_MAX_PAGES  HPAGE_PMD_NR
-#else
 #define BIO_MAX_PAGES  256
-#endif
-#else
-#define BIO_MAX_PAGES  256
-#endif
 
 #define bio_prio(bio)  (bio)->bi_ioprio
 #define bio_set_prio(bio, prio)((bio)->bi_ioprio = prio)
-- 
2.9.5



[Cluster-devel] [PATCH V14 12/18] bcache: avoid to use bio_for_each_segment_all() in bch_bio_alloc_pages()

2019-01-21 Thread Ming Lei
bch_bio_alloc_pages() is always called on a new bio, so it is safe
to access the bvec table directly. Given it is the only case of this
kind, open-code the bvec table access, since bio_for_each_segment_all()
will be changed to support iterating over multi-page bvecs.

Acked-by: Coly Li 
Reviewed-by: Omar Sandoval 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 drivers/md/bcache/util.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
index 20eddeac1531..62fb917f7a4f 100644
--- a/drivers/md/bcache/util.c
+++ b/drivers/md/bcache/util.c
@@ -270,7 +270,11 @@ int bch_bio_alloc_pages(struct bio *bio, gfp_t gfp_mask)
int i;
struct bio_vec *bv;
 
-   bio_for_each_segment_all(bv, bio, i) {
+   /*
+* This is called on freshly new bio, so it is safe to access the
+* bvec table directly.
+*/
+   for (i = 0, bv = bio->bi_io_vec; i < bio->bi_vcnt; bv++, i++) {
bv->bv_page = alloc_page(gfp_mask);
if (!bv->bv_page) {
while (--bv >= bio->bi_io_vec)
-- 
2.9.5



[Cluster-devel] [PATCH V14 11/18] block: loop: pass multi-page bvec to iov_iter

2019-01-21 Thread Ming Lei
iov_iter is implemented on top of the bvec iterator helpers, so it is safe
to pass multi-page bvecs to it, and this is much more efficient than
passing one page in each bvec.
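
A back-of-the-envelope comparison (assuming 4 KiB pages and a physically
contiguous request buffer; a sketch, not loop driver code):

	#include <stdio.h>

	int main(void)
	{
		unsigned req_bytes = 1024 * 1024, page_size = 4096;

		/* old: one bvec per page, so the kmalloc_array() in
		 * lo_rw_aio() and the iov_iter walk both see 256 entries */
		printf("single-page bvecs: %u\n", req_bytes / page_size);

		/* new: one bvec per physically contiguous segment; in the
		 * best case a single entry covers the whole request */
		printf("multi-page bvecs (best case): %u\n", 1u);
		return 0;
	}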

Reviewed-by: Christoph Hellwig 
Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 drivers/block/loop.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index cf5538942834..168a151aba49 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -511,21 +511,22 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
 loff_t pos, bool rw)
 {
struct iov_iter iter;
+   struct req_iterator rq_iter;
struct bio_vec *bvec;
struct request *rq = blk_mq_rq_from_pdu(cmd);
struct bio *bio = rq->bio;
struct file *file = lo->lo_backing_file;
+   struct bio_vec tmp;
unsigned int offset;
-   int segments = 0;
+   int nr_bvec = 0;
int ret;
 
+   rq_for_each_mp_bvec(tmp, rq, rq_iter)
+   nr_bvec++;
+
if (rq->bio != rq->biotail) {
-   struct req_iterator iter;
-   struct bio_vec tmp;
 
-   __rq_for_each_bio(bio, rq)
-   segments += bio_segments(bio);
-   bvec = kmalloc_array(segments, sizeof(struct bio_vec),
+   bvec = kmalloc_array(nr_bvec, sizeof(struct bio_vec),
 GFP_NOIO);
if (!bvec)
return -EIO;
@@ -534,10 +535,10 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
/*
 * The bios of the request may be started from the middle of
 * the 'bvec' because of bio splitting, so we can't directly
-* copy bio->bi_iov_vec to new bvec. The rq_for_each_segment
+* copy bio->bi_iov_vec to new bvec. The rq_for_each_mp_bvec
 * API will take care of all details for us.
 */
-   rq_for_each_segment(tmp, rq, iter) {
+   rq_for_each_mp_bvec(tmp, rq, rq_iter) {
*bvec = tmp;
bvec++;
}
@@ -551,11 +552,10 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
 */
offset = bio->bi_iter.bi_bvec_done;
bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
-   segments = bio_segments(bio);
}
atomic_set(&cmd->ref, 2);
 
-   iov_iter_bvec(&iter, rw, bvec, segments, blk_rq_bytes(rq));
+   iov_iter_bvec(&iter, rw, bvec, nr_bvec, blk_rq_bytes(rq));
iter.iov_offset = offset;
 
cmd->iocb.ki_pos = pos;
-- 
2.9.5



[Cluster-devel] [PATCH V14 09/18] fs/buffer.c: use bvec iterator to truncate the bio

2019-01-21 Thread Ming Lei
Once multi-page bvec is enabled, the last bvec may include more than one
page, so this patch uses mp_bvec_last_segment() to truncate the bio.

Reviewed-by: Omar Sandoval 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 fs/buffer.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 52d024bfdbc1..817871274c77 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3032,7 +3032,10 @@ void guard_bio_eod(int op, struct bio *bio)
 
/* ..and clear the end of the buffer for reads */
if (op == REQ_OP_READ) {
-   zero_user(bvec->bv_page, bvec->bv_offset + bvec->bv_len,
+   struct bio_vec bv;
+
+   mp_bvec_last_segment(bvec, &bv);
+   zero_user(bv.bv_page, bv.bv_offset + bv.bv_len,
truncated_bytes);
}
 }
-- 
2.9.5



[Cluster-devel] [PATCH V14 14/18] block: enable multipage bvecs

2019-01-21 Thread Ming Lei
This patch pulls the trigger for multi-page bvecs.
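
The heart of the change is the new merge test in __bio_try_merge_page():
data may now be appended to the last bvec whenever it is physically
contiguous with it, crossing a page boundary unless the caller asked for
same_page. A standalone model of that test (plain C; the physical
addresses are made up):

	#include <stdbool.h>
	#include <stdio.h>

	#define PAGE_SIZE 4096ull

	static bool can_merge(unsigned long long last_page_phys,
			      unsigned bv_offset, unsigned bv_len,
			      unsigned long long new_page_phys,
			      unsigned off, bool same_page)
	{
		unsigned long long vec_end_addr = last_page_phys +
						  bv_offset + bv_len - 1;
		unsigned long long page_addr = new_page_phys;

		/* must start exactly where the last bvec ends */
		if (vec_end_addr + 1 != page_addr + off)
			return false;
		/* optionally refuse to cross a page boundary */
		if (same_page && (vec_end_addr & ~(PAGE_SIZE - 1)) != page_addr)
			return false;
		return true;
	}

	int main(void)
	{
		/* two physically adjacent pages merge into one bvec */
		printf("%d\n", can_merge(0x10000, 0, PAGE_SIZE,
					 0x11000, 0, false));	/* 1 */
		/* same_page == true refuses the page crossing */
		printf("%d\n", can_merge(0x10000, 0, PAGE_SIZE,
					 0x11000, 0, true));	/* 0 */
		return 0;
	}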

Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 block/bio.c | 22 +++---
 fs/iomap.c  |  4 ++--
 fs/xfs/xfs_aops.c   |  4 ++--
 include/linux/bio.h |  2 +-
 4 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 968b12fea564..83a2dfa417ca 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -753,6 +753,8 @@ EXPORT_SYMBOL(bio_add_pc_page);
  * @page: page to add
  * @len: length of the data to add
  * @off: offset of the data in @page
+ * @same_page: if %true only merge if the new data is in the same physical
+ * page as the last segment of the bio.
  *
  * Try to add the data at @page + @off to the last bvec of @bio.  This is a
  * a useful optimisation for file systems with a block size smaller than the
@@ -761,19 +763,25 @@ EXPORT_SYMBOL(bio_add_pc_page);
  * Return %true on success or %false on failure.
  */
 bool __bio_try_merge_page(struct bio *bio, struct page *page,
-   unsigned int len, unsigned int off)
+   unsigned int len, unsigned int off, bool same_page)
 {
if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
return false;
 
if (bio->bi_vcnt > 0) {
struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+   phys_addr_t vec_end_addr = page_to_phys(bv->bv_page) +
+   bv->bv_offset + bv->bv_len - 1;
+   phys_addr_t page_addr = page_to_phys(page);
 
-   if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len) {
-   bv->bv_len += len;
-   bio->bi_iter.bi_size += len;
-   return true;
-   }
+   if (vec_end_addr + 1 != page_addr + off)
+   return false;
+   if (same_page && (vec_end_addr & PAGE_MASK) != page_addr)
+   return false;
+
+   bv->bv_len += len;
+   bio->bi_iter.bi_size += len;
+   return true;
}
return false;
 }
@@ -819,7 +827,7 @@ EXPORT_SYMBOL_GPL(__bio_add_page);
 int bio_add_page(struct bio *bio, struct page *page,
 unsigned int len, unsigned int offset)
 {
-   if (!__bio_try_merge_page(bio, page, len, offset)) {
+   if (!__bio_try_merge_page(bio, page, len, offset, false)) {
if (bio_full(bio))
return 0;
__bio_add_page(bio, page, len, offset);
diff --git a/fs/iomap.c b/fs/iomap.c
index af736acd9006..0c350e658b7f 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -318,7 +318,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 */
sector = iomap_sector(iomap, pos);
if (ctx->bio && bio_end_sector(ctx->bio) == sector) {
-   if (__bio_try_merge_page(ctx->bio, page, plen, poff))
+   if (__bio_try_merge_page(ctx->bio, page, plen, poff, true))
goto done;
is_contig = true;
}
@@ -349,7 +349,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
ctx->bio->bi_end_io = iomap_read_end_io;
}
 
-   __bio_add_page(ctx->bio, page, plen, poff);
+   bio_add_page(ctx->bio, page, plen, poff);
 done:
/*
 * Move the caller beyond our range so that it keeps making progress.
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 1f1829e506e8..b9fd44168f61 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -616,12 +616,12 @@ xfs_add_to_ioend(
bdev, sector);
}
 
-   if (!__bio_try_merge_page(wpc->ioend->io_bio, page, len, poff)) {
+   if (!__bio_try_merge_page(wpc->ioend->io_bio, page, len, poff, true)) {
if (iop)
atomic_inc(&iop->write_count);
if (bio_full(wpc->ioend->io_bio))
xfs_chain_bio(wpc->ioend, wbc, bdev, sector);
-   __bio_add_page(wpc->ioend->io_bio, page, len, poff);
+   bio_add_page(wpc->ioend->io_bio, page, len, poff);
}
 
wpc->ioend->io_size += len;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index e6a6f3d78afd..af288f6e8ab0 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -441,7 +441,7 @@ extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int);
 extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
   unsigned int, unsigned int);
 bool __bio_try_merge_page(struct bio *bio, struct page *page,
-   unsigned int len, unsigned int off);
+   unsigned int len, unsigned int off, bool same_page);
 void __bio_add_page(struct bio *bio, struct page *page,
unsigned int len, unsigned int off);
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter);
-- 
2.9.5



[Cluster-devel] [PATCH V14 13/18] block: allow bio_for_each_segment_all() to iterate over multi-page bvec

2019-01-21 Thread Ming Lei
This patch introduces one extra iterator variable to
bio_for_each_segment_all(), so that bio_for_each_segment_all() can iterate
over multi-page bvecs.

Given it is just one mechanical & simple change on all
bio_for_each_segment_all() users, this patch makes the tree-wide change in
one single patch, so that we can avoid using a temporary helper for this
conversion.
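
To see what the extra iterator buys, here is a standalone userspace model
(assuming 4 KiB pages; not kernel code) of the decomposition the reworked
macro performs: callers still receive single-page bvecs even though the
stored table is multi-page:

	#include <stdio.h>

	#define PAGE_SIZE 4096u

	struct bvec { unsigned page_idx, bv_offset, bv_len; };

	int main(void)
	{
		/* stored table: one 2-page bvec and one 1-page bvec */
		struct bvec tbl[] = {
			{ 0, 0, 2 * PAGE_SIZE },
			{ 5, 0, PAGE_SIZE },
		};
		unsigned i;

		for (i = 0; i < 2; i++) {
			unsigned done = 0;

			/* role of bvec_iter_all: hand out one single-page
			 * piece of the current bvec per iteration */
			while (done < tbl[i].bv_len) {
				unsigned pos = tbl[i].bv_offset + done;
				unsigned off = pos % PAGE_SIZE;
				unsigned pg = tbl[i].page_idx + pos / PAGE_SIZE;
				unsigned len = PAGE_SIZE - off;

				if (len > tbl[i].bv_len - done)
					len = tbl[i].bv_len - done;
				printf("page %u, offset %u, len %u\n",
				       pg, off, len);
				done += len;
			}
		}
		return 0;	/* three single-page bvecs from two entries */
	}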

Reviewed-by: Omar Sandoval 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 block/bio.c   | 27 ++-
 block/bounce.c|  6 --
 drivers/md/bcache/btree.c |  3 ++-
 drivers/md/dm-crypt.c |  3 ++-
 drivers/md/raid1.c|  3 ++-
 drivers/staging/erofs/data.c  |  3 ++-
 drivers/staging/erofs/unzip_vle.c |  3 ++-
 fs/block_dev.c|  6 --
 fs/btrfs/compression.c|  3 ++-
 fs/btrfs/disk-io.c|  3 ++-
 fs/btrfs/extent_io.c  |  9 ++---
 fs/btrfs/inode.c  |  6 --
 fs/btrfs/raid56.c |  3 ++-
 fs/crypto/bio.c   |  3 ++-
 fs/direct-io.c|  4 +++-
 fs/exofs/ore.c|  3 ++-
 fs/exofs/ore_raid.c   |  3 ++-
 fs/ext4/page-io.c |  3 ++-
 fs/ext4/readpage.c|  3 ++-
 fs/f2fs/data.c|  9 ++---
 fs/gfs2/lops.c|  9 ++---
 fs/gfs2/meta_io.c |  3 ++-
 fs/iomap.c|  6 --
 fs/mpage.c|  3 ++-
 fs/xfs/xfs_aops.c |  5 +++--
 include/linux/bio.h   | 11 +--
 include/linux/bvec.h  | 30 ++
 27 files changed, 127 insertions(+), 46 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 4db1008309ed..968b12fea564 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1072,8 +1072,9 @@ static int bio_copy_from_iter(struct bio *bio, struct iov_iter *iter)
 {
int i;
struct bio_vec *bvec;
+   struct bvec_iter_all iter_all;
 
-   bio_for_each_segment_all(bvec, bio, i) {
+   bio_for_each_segment_all(bvec, bio, i, iter_all) {
ssize_t ret;
 
ret = copy_page_from_iter(bvec->bv_page,
@@ -1103,8 +1104,9 @@ static int bio_copy_to_iter(struct bio *bio, struct iov_iter iter)
 {
int i;
struct bio_vec *bvec;
+   struct bvec_iter_all iter_all;
 
-   bio_for_each_segment_all(bvec, bio, i) {
+   bio_for_each_segment_all(bvec, bio, i, iter_all) {
ssize_t ret;
 
ret = copy_page_to_iter(bvec->bv_page,
@@ -1126,8 +1128,9 @@ void bio_free_pages(struct bio *bio)
 {
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all iter_all;
 
-   bio_for_each_segment_all(bvec, bio, i)
+   bio_for_each_segment_all(bvec, bio, i, iter_all)
__free_page(bvec->bv_page);
 }
 EXPORT_SYMBOL(bio_free_pages);
@@ -1295,6 +1298,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
struct bio *bio;
int ret;
struct bio_vec *bvec;
+   struct bvec_iter_all iter_all;
 
if (!iov_iter_count(iter))
return ERR_PTR(-EINVAL);
@@ -1368,7 +1372,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
return bio;
 
  out_unmap:
-   bio_for_each_segment_all(bvec, bio, j) {
+   bio_for_each_segment_all(bvec, bio, j, iter_all) {
put_page(bvec->bv_page);
}
bio_put(bio);
@@ -1379,11 +1383,12 @@ static void __bio_unmap_user(struct bio *bio)
 {
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all iter_all;
 
/*
 * make sure we dirty pages we wrote to
 */
-   bio_for_each_segment_all(bvec, bio, i) {
+   bio_for_each_segment_all(bvec, bio, i, iter_all) {
if (bio_data_dir(bio) == READ)
set_page_dirty_lock(bvec->bv_page);
 
@@ -1475,8 +1480,9 @@ static void bio_copy_kern_endio_read(struct bio *bio)
char *p = bio->bi_private;
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all iter_all;
 
-   bio_for_each_segment_all(bvec, bio, i) {
+   bio_for_each_segment_all(bvec, bio, i, iter_all) {
memcpy(p, page_address(bvec->bv_page), bvec->bv_len);
p += bvec->bv_len;
}
@@ -1585,8 +1591,9 @@ void bio_set_pages_dirty(struct bio *bio)
 {
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all iter_all;
 
-   bio_for_each_segment_all(bvec, bio, i) {
+   bio_for_each_segment_all(bvec, bio, i, iter_all) {
if (!PageCompound(bvec->bv_page))
set_page_dirty_lock(bvec->bv_page);
}
@@ -1596,8 +1603,9 @@ static void bio_release_pages(struct bio *bio)
 {
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all iter_all;
 

[Cluster-devel] [PATCH V14 10/18] btrfs: use mp_bvec_last_segment to get bio's last page

2019-01-21 Thread Ming Lei
Preparing for supporting multi-page bvec.

Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 fs/btrfs/extent_io.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index dc8ba3ee515d..986ef49b0269 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2697,11 +2697,12 @@ static int __must_check submit_one_bio(struct bio *bio, int mirror_num,
 {
blk_status_t ret = 0;
struct bio_vec *bvec = bio_last_bvec_all(bio);
-   struct page *page = bvec->bv_page;
+   struct bio_vec bv;
struct extent_io_tree *tree = bio->bi_private;
u64 start;
 
-   start = page_offset(page) + bvec->bv_offset;
+   mp_bvec_last_segment(bvec, &bv);
+   start = page_offset(bv.bv_page) + bv.bv_offset;
 
bio->bi_private = NULL;
 
-- 
2.9.5



[Cluster-devel] [PATCH V14 07/18] block: use bio_for_each_mp_bvec() to map sg

2019-01-21 Thread Ming Lei
It is more efficient to use bio_for_each_mp_bvec() to map sg; meanwhile
we have to consider splitting the multi-page bvec, as is done in
blk_bio_segment_split().
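
A standalone model of the splitting that blk_bvec_map_sg() performs
(assuming 4 KiB pages, bv_offset 0 and no segment boundary mask; the real
helper also honours queue_segment_boundary()):

	#include <stdio.h>

	#define PAGE_SIZE 4096u

	int main(void)
	{
		/* one 160 KiB multi-page bvec, max_segment_size = 64 KiB */
		unsigned nbytes = 160 * 1024, max_seg = 64 * 1024;
		unsigned nsegs = 0, total = 0;

		while (nbytes) {
			unsigned seg = max_seg < nbytes ? max_seg : nbytes;

			printf("sg[%u]: page idx %u, offset %u, len %u\n",
			       nsegs, total / PAGE_SIZE, total % PAGE_SIZE,
			       seg);
			total += seg;
			nbytes -= seg;
			nsegs++;
		}
		/* -> 3 scatterlist entries: 64K + 64K + 32K */
		return 0;
	}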

Reviewed-by: Omar Sandoval 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 block/blk-merge.c | 70 +++
 1 file changed, 50 insertions(+), 20 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 2dfc30d8bc77..8a498f29636f 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -464,6 +464,54 @@ static int blk_phys_contig_segment(struct request_queue *q, struct bio *bio,
return biovec_phys_mergeable(q, &end_bv, &nxt_bv);
 }
 
+static struct scatterlist *blk_next_sg(struct scatterlist **sg,
+   struct scatterlist *sglist)
+{
+   if (!*sg)
+   return sglist;
+
+   /*
+* If the driver previously mapped a shorter list, we could see a
+* termination bit prematurely unless it fully inits the sg table
+* on each mapping. We KNOW that there must be more entries here
+* or the driver would be buggy, so force clear the termination bit
+* to avoid doing a full sg_init_table() in drivers for each command.
+*/
+   sg_unmark_end(*sg);
+   return sg_next(*sg);
+}
+
+static unsigned blk_bvec_map_sg(struct request_queue *q,
+   struct bio_vec *bvec, struct scatterlist *sglist,
+   struct scatterlist **sg)
+{
+   unsigned nbytes = bvec->bv_len;
+   unsigned nsegs = 0, total = 0, offset = 0;
+
+   while (nbytes > 0) {
+   unsigned seg_size;
+   struct page *pg;
+   unsigned idx;
+
+   *sg = blk_next_sg(sg, sglist);
+
+   seg_size = get_max_segment_size(q, bvec->bv_offset + total);
+   seg_size = min(nbytes, seg_size);
+
+   offset = (total + bvec->bv_offset) % PAGE_SIZE;
+   idx = (total + bvec->bv_offset) / PAGE_SIZE;
+   pg = nth_page(bvec->bv_page, idx);
+
+   sg_set_page(*sg, pg, seg_size, offset);
+
+   total += seg_size;
+   nbytes -= seg_size;
+   nsegs++;
+   }
+
+   return nsegs;
+}
+
 static inline void
 __blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
 struct scatterlist *sglist, struct bio_vec *bvprv,
@@ -481,25 +529,7 @@ __blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
(*sg)->length += nbytes;
} else {
 new_segment:
-   if (!*sg)
-   *sg = sglist;
-   else {
-   /*
-* If the driver previously mapped a shorter
-* list, we could see a termination bit
-* prematurely unless it fully inits the sg
-* table on each mapping. We KNOW that there
-* must be more entries here or the driver
-* would be buggy, so force clear the
-* termination bit to avoid doing a full
-* sg_init_table() in drivers for each command.
-*/
-   sg_unmark_end(*sg);
-   *sg = sg_next(*sg);
-   }
-
-   sg_set_page(*sg, bvec->bv_page, nbytes, bvec->bv_offset);
-   (*nsegs)++;
+   (*nsegs) += blk_bvec_map_sg(q, bvec, sglist, sg);
}
*bvprv = *bvec;
 }
@@ -521,7 +551,7 @@ static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
int nsegs = 0;
 
for_each_bio(bio)
-   bio_for_each_segment(bvec, bio, iter)
+   bio_for_each_mp_bvec(bvec, bio, iter)
__blk_segment_map_sg(q, &bvec, sglist, &bvprv, sg,
 &nsegs);
 
-- 
2.9.5



[Cluster-devel] [PATCH V14 06/18] block: use bio_for_each_mp_bvec() to compute multi-page bvec count

2019-01-21 Thread Ming Lei
First, it is more efficient to use bio_for_each_mp_bvec() in both
blk_bio_segment_split() and __blk_recalc_rq_segments() to compute how
many multi-page bvecs there are in the bio.

Secondly, once bio_for_each_mp_bvec() is used, a bvec may need to be
split because its length can be much longer than the max segment size,
so we have to split the big bvec into several segments.

Thirdly, when splitting a multi-page bvec into segments, the max segment
limit may be reached, so the bio split needs to be considered in this
situation too.
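
A standalone model of the get_max_segment_size() arithmetic that
bvec_split_segs() relies on (simplified: the real helper special-cases
the default BLK_SEG_BOUNDARY_MASK; the numbers are hypothetical):

	#include <stdio.h>

	/* bytes from offset to the next boundary, capped by the queue's
	 * max segment size */
	static unsigned max_seg_size(unsigned long boundary_mask,
				     unsigned queue_max, unsigned offset)
	{
		unsigned long room = boundary_mask -
				     (boundary_mask & offset) + 1;

		return room < queue_max ? (unsigned)room : queue_max;
	}

	int main(void)
	{
		unsigned long mask = 0xffff;	/* 64 KiB boundary */

		/* offset 0x3000: the segment must stop at the boundary */
		printf("%u\n", max_seg_size(mask, 0x10000, 0x3000)); /* 53248 */
		/* aligned offset: the full max segment size is available */
		printf("%u\n", max_seg_size(mask, 0x10000, 0x0));    /* 65536 */
		return 0;
	}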

Reviewed-by: Christoph Hellwig 
Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 block/blk-merge.c | 103 +++---
 1 file changed, 83 insertions(+), 20 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index f85d878f313d..2dfc30d8bc77 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -161,6 +161,73 @@ static inline unsigned get_max_io_size(struct request_queue *q,
return sectors;
 }
 
+static unsigned get_max_segment_size(struct request_queue *q,
+unsigned offset)
+{
+   unsigned long mask = queue_segment_boundary(q);
+
+   /* default segment boundary mask means no boundary limit */
+   if (mask == BLK_SEG_BOUNDARY_MASK)
+   return queue_max_segment_size(q);
+
+   return min_t(unsigned long, mask - (mask & offset) + 1,
+queue_max_segment_size(q));
+}
+
+/*
+ * Split the bvec @bv into segments, and update all kinds of
+ * variables.
+ */
+static bool bvec_split_segs(struct request_queue *q, struct bio_vec *bv,
+   unsigned *nsegs, unsigned *last_seg_size,
+   unsigned *front_seg_size, unsigned *sectors)
+{
+   unsigned len = bv->bv_len;
+   unsigned total_len = 0;
+   unsigned new_nsegs = 0, seg_size = 0;
+
+   /*
+* Multi-page bvec may be too big to hold in one segment, so the
+* current bvec has to be splitted as multiple segments.
+*/
+   while (len && new_nsegs + *nsegs < queue_max_segments(q)) {
+   seg_size = get_max_segment_size(q, bv->bv_offset + total_len);
+   seg_size = min(seg_size, len);
+
+   new_nsegs++;
+   total_len += seg_size;
+   len -= seg_size;
+
+   if ((bv->bv_offset + total_len) & queue_virt_boundary(q))
+   break;
+   }
+
+   if (!new_nsegs)
+   return !!len;
+
+   /* update front segment size */
+   if (!*nsegs) {
+   unsigned first_seg_size;
+
+   if (new_nsegs == 1)
+   first_seg_size = get_max_segment_size(q, bv->bv_offset);
+   else
+   first_seg_size = queue_max_segment_size(q);
+
+   if (*front_seg_size < first_seg_size)
+   *front_seg_size = first_seg_size;
+   }
+
+   /* update other varibles */
+   *last_seg_size = seg_size;
+   *nsegs += new_nsegs;
+   if (sectors)
+   *sectors += total_len >> 9;
+
+   /* split in the middle of the bvec if len != 0 */
+   return !!len;
+}
+
 static struct bio *blk_bio_segment_split(struct request_queue *q,
 struct bio *bio,
 struct bio_set *bs,
@@ -174,7 +241,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
struct bio *new = NULL;
const unsigned max_sectors = get_max_io_size(q, bio);
 
-   bio_for_each_segment(bv, bio, iter) {
+   bio_for_each_mp_bvec(bv, bio, iter) {
/*
 * If the queue doesn't support SG gaps and adding this
 * offset would create a gap, disallow it.
@@ -189,8 +256,12 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 */
if (nsegs < queue_max_segments(q) &&
sectors < max_sectors) {
-   nsegs++;
-   sectors = max_sectors;
+   /* split in the middle of bvec */
+   bv.bv_len = (max_sectors - sectors) << 9;
+   bvec_split_segs(q, &bv, &nsegs,
+   &seg_size,
+   &front_seg_size,
+   &sectors);
}
goto split;
}
@@ -212,14 +283,12 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
if (nsegs == queue_max_segments(q))
goto split;
 
-   if (nsegs == 1 && seg_size > front_seg_size)
-   front_seg_size = seg_size;
-
-   nsegs++;
bvp

[Cluster-devel] [PATCH V14 08/18] block: introduce mp_bvec_last_segment()

2019-01-21 Thread Ming Lei
BTRFS and guard_bio_eod() need to get the last singlepage segment
from one multipage bvec, so introduce this helper to make them happy.

Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 include/linux/bvec.h | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 0ae729b1c9fe..21f76bad7be2 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -131,4 +131,26 @@ static inline bool bvec_iter_advance(const struct bio_vec *bv,
.bi_bvec_done   = 0,\
 }
 
+/*
+ * Get the last single-page segment from the multi-page bvec and store it
+ * in @seg
+ */
+static inline void mp_bvec_last_segment(const struct bio_vec *bvec,
+   struct bio_vec *seg)
+{
+   unsigned total = bvec->bv_offset + bvec->bv_len;
+   unsigned last_page = (total - 1) / PAGE_SIZE;
+
+   seg->bv_page = nth_page(bvec->bv_page, last_page);
+
+   /* the whole segment is inside the last page */
+   if (bvec->bv_offset >= last_page * PAGE_SIZE) {
+   seg->bv_offset = bvec->bv_offset % PAGE_SIZE;
+   seg->bv_len = bvec->bv_len;
+   } else {
+   seg->bv_offset = 0;
+   seg->bv_len = total - last_page * PAGE_SIZE;
+   }
+}
+
 #endif /* __LINUX_BVEC_ITER_H */
-- 
2.9.5



[Cluster-devel] [PATCH V14 04/18] block: introduce multi-page bvec helpers

2019-01-21 Thread Ming Lei
This patch introduces helpers of 'mp_bvec_iter_*' for multi-page bvec
support.

The introduced helpers treat one bvec as a real multi-page segment,
which may include more than one page.

The existing bvec_iter_* helpers remain the interfaces for the current
bvec iterator, which drivers, fs, dm and so on assume to be single-page.
These helpers are redefined to build single-page bvecs in flight, so this
change won't break current bio/bvec users, which need no changes.

Follows some multi-page bvec background:

- bvecs stored in bio->bi_io_vec are always multi-page style

- bvec(struct bio_vec) represents one physically contiguous I/O
  buffer; now the buffer may include more than one page after
  multi-page bvec is supported, and all the pages represented
  by one bvec are physically contiguous. Before multi-page bvec
  support, at most one page was included in one bvec; we call that a
  single-page bvec.

- .bv_page of the bvec points to the 1st page in the multi-page bvec

- .bv_offset of the bvec is the offset of the buffer in the bvec

The effect on the current drivers/filesystem/dm/bcache/...:

- almost everyone supposes that one bvec only includes one single
  page, so we keep the sp (single-page) interface unchanged; for
  example, bio_for_each_segment() still returns single-page bvecs

- bio_for_each_segment_all() will return single-page bvecs too

- during iterating, the iterator variable (struct bvec_iter) is always
  updated in multi-page bvec style, and bvec_iter_advance() is kept
  unchanged

- the returned (copied) single-page bvec is built in flight by the bvec
  helpers from the stored multi-page bvec

Reviewed-by: Christoph Hellwig 
Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 include/linux/bvec.h | 30 +++---
 1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index ba0ae40e77c9..0ae729b1c9fe 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * was unsigned short, but we might as well be ready for > 64kB I/O pages
@@ -50,16 +51,39 @@ struct bvec_iter {
  */
 #define __bvec_iter_bvec(bvec, iter)   (&(bvec)[(iter).bi_idx])
 
-#define bvec_iter_page(bvec, iter) \
+/* multi-page (mp_bvec) helpers */
+#define mp_bvec_iter_page(bvec, iter)  \
(__bvec_iter_bvec((bvec), (iter))->bv_page)
 
-#define bvec_iter_len(bvec, iter)  \
+#define mp_bvec_iter_len(bvec, iter)   \
min((iter).bi_size, \
__bvec_iter_bvec((bvec), (iter))->bv_len - (iter).bi_bvec_done)
 
-#define bvec_iter_offset(bvec, iter)   \
+#define mp_bvec_iter_offset(bvec, iter)\
(__bvec_iter_bvec((bvec), (iter))->bv_offset + (iter).bi_bvec_done)
 
+#define mp_bvec_iter_page_idx(bvec, iter)  \
+   (mp_bvec_iter_offset((bvec), (iter)) / PAGE_SIZE)
+
+#define mp_bvec_iter_bvec(bvec, iter)  \
+((struct bio_vec) {\
+   .bv_page= mp_bvec_iter_page((bvec), (iter)),\
+   .bv_len = mp_bvec_iter_len((bvec), (iter)), \
+   .bv_offset  = mp_bvec_iter_offset((bvec), (iter)),  \
+})
+
+/* For building single-page bvec in flight */
+ #define bvec_iter_offset(bvec, iter)  \
+   (mp_bvec_iter_offset((bvec), (iter)) % PAGE_SIZE)
+
+#define bvec_iter_len(bvec, iter)  \
+   min_t(unsigned, mp_bvec_iter_len((bvec), (iter)),   \
+ PAGE_SIZE - bvec_iter_offset((bvec), (iter)))
+
+#define bvec_iter_page(bvec, iter) \
+   nth_page(mp_bvec_iter_page((bvec), (iter)), \
+mp_bvec_iter_page_idx((bvec), (iter)))
+
 #define bvec_iter_bvec(bvec, iter) \
 ((struct bio_vec) {\
.bv_page= bvec_iter_page((bvec), (iter)),   \
-- 
2.9.5



[Cluster-devel] [PATCH V14 03/18] block: remove bvec_iter_rewind()

2019-01-21 Thread Ming Lei
Commit 7759eb23fd980 ("block: remove bio_rewind_iter()") removes
bio_rewind_iter(), then no one uses bvec_iter_rewind() any more,
so remove it.

Reviewed-by: Omar Sandoval 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 include/linux/bvec.h | 24 
 1 file changed, 24 deletions(-)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 02c73c6aa805..ba0ae40e77c9 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -92,30 +92,6 @@ static inline bool bvec_iter_advance(const struct bio_vec *bv,
return true;
 }
 
-static inline bool bvec_iter_rewind(const struct bio_vec *bv,
-struct bvec_iter *iter,
-unsigned int bytes)
-{
-   while (bytes) {
-   unsigned len = min(bytes, iter->bi_bvec_done);
-
-   if (iter->bi_bvec_done == 0) {
-   if (WARN_ONCE(iter->bi_idx == 0,
- "Attempted to rewind iter beyond "
- "bvec's boundaries\n")) {
-   return false;
-   }
-   iter->bi_idx--;
-   iter->bi_bvec_done = __bvec_iter_bvec(bv, *iter)->bv_len;
-   continue;
-   }
-   bytes -= len;
-   iter->bi_size += len;
-   iter->bi_bvec_done -= len;
-   }
-   return true;
-}
-
 #define for_each_bvec(bvl, bio_vec, iter, start)   \
for (iter = (start);\
 (iter).bi_size &&  \
-- 
2.9.5



[Cluster-devel] [PATCH V14 05/18] block: introduce bio_for_each_mp_bvec() and rq_for_each_mp_bvec()

2019-01-21 Thread Ming Lei
bio_for_each_mp_bvec() is used for iterating over multi-page bvec for bio
split & merge code.

rq_for_each_mp_bvec() can be used for drivers which may handle the
multi-page bvec directly, so far loop is one perfect use case.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 include/linux/bio.h| 10 ++
 include/linux/blkdev.h |  4 
 2 files changed, 14 insertions(+)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 72b4f7be2106..730288145568 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -156,6 +156,16 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
 #define bio_for_each_segment(bvl, bio, iter)   \
__bio_for_each_segment(bvl, bio, iter, (bio)->bi_iter)
 
+#define __bio_for_each_mp_bvec(bvl, bio, iter, start)  \
+   for (iter = (start);\
+(iter).bi_size &&  \
+   ((bvl = mp_bvec_iter_bvec((bio)->bi_io_vec, (iter))), 1); \
+bio_advance_iter((bio), &(iter), (bvl).bv_len))
+
+/* iterate over multi-page bvec */
+#define bio_for_each_mp_bvec(bvl, bio, iter)   \
+   __bio_for_each_mp_bvec(bvl, bio, iter, (bio)->bi_iter)
+
 #define bio_iter_last(bvec, iter) ((iter).bi_size == (bvec).bv_len)
 
 static inline unsigned bio_segments(struct bio *bio)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 338604dff7d0..6ebae3ee8f44 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -797,6 +797,10 @@ struct req_iterator {
__rq_for_each_bio(_iter.bio, _rq)   \
bio_for_each_segment(bvl, _iter.bio, _iter.iter)
 
+#define rq_for_each_mp_bvec(bvl, _rq, _iter)   \
+   __rq_for_each_bio(_iter.bio, _rq)   \
+   bio_for_each_mp_bvec(bvl, _iter.bio, _iter.iter)
+
 #define rq_iter_last(bvec, _iter)  \
(_iter.bio->bi_next == NULL &&  \
 bio_iter_last(bvec, _iter.iter))
-- 
2.9.5



[Cluster-devel] [PATCH V14 02/18] block: don't use bio->bi_vcnt to figure out segment number

2019-01-21 Thread Ming Lei
It is wrong to use bio->bi_vcnt to figure out how many segments
there are in the bio even when the CLONED flag isn't set on the bio,
because the bio may have been split or advanced.

So always use bio_segments() in blk_recount_segments(); it shouldn't
cause any performance loss now because the physical segment number is
figured out in blk_queue_split() and BIO_SEG_VALID is set there too, since
bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting").

Reviewed-by: Omar Sandoval 
Reviewed-by: Christoph Hellwig 
Fixes: 76d8137a3113 ("blk-merge: recaculate segment if it isn't less than max 
segments")
Signed-off-by: Ming Lei 
---
 block/blk-merge.c | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 71e9ac03f621..f85d878f313d 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -367,13 +367,7 @@ void blk_recalc_rq_segments(struct request *rq)
 
 void blk_recount_segments(struct request_queue *q, struct bio *bio)
 {
-   unsigned short seg_cnt;
-
-   /* estimate segment number by bi_vcnt for non-cloned bio */
-   if (bio_flagged(bio, BIO_CLONED))
-   seg_cnt = bio_segments(bio);
-   else
-   seg_cnt = bio->bi_vcnt;
+   unsigned short seg_cnt = bio_segments(bio);
 
if (test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags) &&
(seg_cnt < queue_max_segments(q)))
-- 
2.9.5



[Cluster-devel] [PATCH V14 00/18] block: support multi-page bvec

2019-01-21 Thread Ming Lei
V8:
- remove prepare patches which all are merged to linus tree
- rebase on for-4.21/block
- address comments on V7
- add patches of killing NO_SG_MERGE

V7:
- include Christoph and Mike's bio_clone_bioset() patches, which is
  actually prepare patches for multipage bvec
- address Christoph's comments

V6:
- avoid to introduce lots of renaming, follow Jen's suggestion of
using the name of chunk for multipage io vector
- include Christoph's three prepare patches
- decrease stack usage for using bio_for_each_chunk_segment_all()
- address Kent's comment

V5:
- remove some of prepare patches, which have been merged already
- add bio_clone_seg_bioset() to fix DM's bio clone, which
is introduced by 18a25da84354c6b (dm: ensure bio submission follows
a depth-first tree walk)
- rebase on the latest block for-v4.18

V4:
- rename bio_for_each_segment*() as bio_for_each_page*(), rename
bio_segments() as bio_pages(), rename rq_for_each_segment() as
rq_for_each_pages(), because these helpers never return real
segment, and they always return single page bvec

- introducing segment_for_each_page_all()

- introduce new 
bio_for_each_segment*()/rq_for_each_segment()/bio_segments()
for returning real multipage segment

- rewrite segment_last_page()

- rename bvec iterator helper as suggested by Christoph

- replace comment with applying bio helpers as suggested by Christoph

- document usage of bio iterator helpers

- redefine BIO_MAX_PAGES as 256 to make the biggest bvec table
accommodated in 4K page

- move bio_alloc_pages() into bcache as suggested by Christoph

V3:
- rebase on v4.13-rc3 with for-next of block tree
- run more xfstests: xfs/ext4 over NVMe, Sata, DM(linear),
MD(raid1), and not see regressions triggered
- add Reviewed-by on some btrfs patches
- remove two MD patches because both are merged to linus tree
  already

V2:
- bvec table direct access in raid has been cleaned, so NO_MP
flag is dropped
- rebase on recent Neil Brown's change on bio and bounce code
- reorganize the patchset

V1:
- against v4.10-rc1 and some cleanup in V0 are in -linus already
- handle queue_virt_boundary() in mp bvec change and make NVMe happy
- further BTRFS cleanup
- remove QUEUE_FLAG_SPLIT_MP
- rename for two new helpers of bio_for_each_segment_all()
- fix bounce convertion
- address comments in V0

[1], http://marc.info/?l=linux-kernel=141680246629547=2
[2], https://patchwork.kernel.org/patch/9451523/
[3], http://marc.info/?t=14773544711=1=2
[4], http://marc.info/?l=linux-mm=147745525801433=2
[5], http://marc.info/?t=14956948457=1=2
[6], http://marc.info/?t=14982021534=1=2


Christoph Hellwig (1):
  btrfs: look at bi_size for repair decisions

Ming Lei (17):
  block: don't use bio->bi_vcnt to figure out segment number
  block: remove bvec_iter_rewind()
  block: introduce multi-page bvec helpers
  block: introduce bio_for_each_mp_bvec() and rq_for_each_mp_bvec()
  block: use bio_for_each_mp_bvec() to compute multi-page bvec count
  block: use bio_for_each_mp_bvec() to map sg
  block: introduce mp_bvec_last_segment()
  fs/buffer.c: use bvec iterator to truncate the bio
  btrfs: use mp_bvec_last_segment to get bio's last page
  block: loop: pass multi-page bvec to iov_iter
  bcache: avoid to use bio_for_each_segment_all() in
bch_bio_alloc_pages()
  block: allow bio_for_each_segment_all() to iterate over multi-page
bvec
  block: enable multipage bvecs
  block: always define BIO_MAX_PAGES as 256
  block: document usage of bio iterator helpers
  block: kill QUEUE_FLAG_NO_SG_MERGE
  block: kill BLK_MQ_F_SG_MERGE

 Documentation/block/biovecs.txt   |  25 +
 block/bio.c   |  49 ++---
 block/blk-merge.c | 210 +-
 block/blk-mq-debugfs.c|   2 -
 block/blk-mq.c|   3 -
 block/bounce.c|   6 +-
 drivers/block/loop.c  |  22 ++--
 drivers/block/nbd.c   |   2 +-
 drivers/block/rbd.c   |   2 +-
 drivers/block/skd_main.c  |   1 -
 drivers/block/xen-blkfront.c  |   2 +-
 drivers/md/bcache/btree.c |   3 +-
 drivers/md/bcache/util.c  |   6 +-
 drivers/md/dm-crypt.c |   3 +-
 drivers/md/dm-rq.c|   2 +-
 drivers/md/dm-table.c |  13 ---
 drivers/md/raid1.c|   3 +-
 drivers/mmc/core/queue.c  |   3 +-
 drivers/scsi/scsi_lib.c   |   2 +-
 drivers/staging/erofs/data.c  |   3 +-
 drivers/staging/erofs/unzip_vle.c |   3 +-
 fs/block_dev.c|   6 +-
 fs/btrfs/compression.c|   3 +-
 fs/btrfs/disk-io.c|   3 +

[Cluster-devel] [PATCH V14 01/18] btrfs: look at bi_size for repair decisions

2019-01-21 Thread Ming Lei
From: Christoph Hellwig 

bio_readpage_error currently uses bi_vcnt to decide if it is worth
retrying an I/O.  But the vector count is mostly an implementation
artifact - it really should figure out if there is more than a
single sector worth retrying.  Use bi_size for that and shift by
PAGE_SHIFT.  This really should be blocks/sectors, but given that
btrfs doesn't support a sector size different from PAGE_SIZE,
using the page size keeps the changes to a minimum.

Reviewed-by: Omar Sandoval 
Reviewed-by: David Sterba 
Signed-off-by: Christoph Hellwig 
---
 fs/btrfs/extent_io.c | 2 +-
 include/linux/bio.h  | 6 --
 2 files changed, 1 insertion(+), 7 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 52abe4082680..dc8ba3ee515d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2350,7 +2350,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 
phy_offset,
int read_mode = 0;
blk_status_t status;
int ret;
-   unsigned failed_bio_pages = bio_pages_all(failed_bio);
+   unsigned failed_bio_pages = failed_bio->bi_iter.bi_size >> PAGE_SHIFT;
 
BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
 
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7380b094dcca..72b4f7be2106 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -263,12 +263,6 @@ static inline void bio_get_last_bvec(struct bio *bio, 
struct bio_vec *bv)
bv->bv_len = iter.bi_bvec_done;
 }
 
-static inline unsigned bio_pages_all(struct bio *bio)
-{
-   WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
-   return bio->bi_vcnt;
-}
-
 static inline struct bio_vec *bio_first_bvec_all(struct bio *bio)
 {
WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
-- 
2.9.5
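
In numbers (illustrative): for a 16 KiB failed read, the new computation
reports four pages no matter how the data is laid out in the bvec table:

    /* bi_size == 16384 and PAGE_SHIFT == 12 (4K pages) */
    unsigned failed_bio_pages = failed_bio->bi_iter.bi_size >> PAGE_SHIFT;
    /* failed_bio_pages == 4, whether the bio carries 4 bvecs or 1 */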



[Cluster-devel] [PATCH V13 17/19] block: document usage of bio iterator helpers

2019-01-11 Thread Ming Lei
Now that multi-page bvec is supported, some helpers may return data page
by page, while others may return it segment by segment. This patch
documents the usage.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 Documentation/block/biovecs.txt | 25 +
 1 file changed, 25 insertions(+)

diff --git a/Documentation/block/biovecs.txt b/Documentation/block/biovecs.txt
index 25689584e6e0..ce6eccaf5df7 100644
--- a/Documentation/block/biovecs.txt
+++ b/Documentation/block/biovecs.txt
@@ -117,3 +117,28 @@ Other implications:
size limitations and the limitations of the underlying devices. Thus
there's no need to define ->merge_bvec_fn() callbacks for individual block
drivers.
+
+Usage of helpers:
+=
+
+* The following helpers whose names have the suffix of "_all" can only be used
+on non-BIO_CLONED bio. They are usually used by filesystem code. Drivers
+shouldn't use them because the bio may have been split before it reached the
+driver.
+
+   bio_for_each_segment_all()
+   bio_first_bvec_all()
+   bio_first_page_all()
+   bio_last_bvec_all()
+
+* The following helpers iterate over single-page segments. The passed 'struct
+bio_vec' will contain a single-page IO vector during the iteration
+
+   bio_for_each_segment()
+   bio_for_each_segment_all()
+
+* The following helpers iterate over multi-page bvecs. The passed 'struct
+bio_vec' will contain a multi-page IO vector during the iteration
+
+   bio_for_each_bvec()
+   rq_for_each_bvec()
-- 
2.9.5
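
A side-by-side sketch of the two granularities documented above (hypothetical
code; assume the bio's only bvec covers two physically contiguous 4K pages):

    #include <linux/bio.h>
    #include <linux/printk.h>

    static void iterate_both_ways(struct bio *bio)
    {
            struct bvec_iter iter;
            struct bio_vec bv;

            /* single-page view: two iterations, bv.bv_len == 4K each */
            bio_for_each_segment(bv, bio, iter)
                    pr_info("segment: len=%u\n", bv.bv_len);

            /* multi-page view: one iteration, bv.bv_len == 8K */
            bio_for_each_bvec(bv, bio, iter)
                    pr_info("bvec: len=%u\n", bv.bv_len);
    }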



[Cluster-devel] [PATCH V13 16/19] block: always define BIO_MAX_PAGES as 256

2019-01-11 Thread Ming Lei
Now multi-page bvec can cover CONFIG_THP_SWAP, so we don't need to
increase BIO_MAX_PAGES for it.

CONFIG_THP_SWAP needs to split one THP into normal pages and adds
them all to one bio. With multipage-bvec, it just takes one bvec to
hold them all.

Reviewed-by: Omar Sandoval 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 include/linux/bio.h | 8 
 1 file changed, 8 deletions(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 1ece9f30294b..54ef81f11f83 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -34,15 +34,7 @@
 #define BIO_BUG_ON
 #endif
 
-#ifdef CONFIG_THP_SWAP
-#if HPAGE_PMD_NR > 256
-#define BIO_MAX_PAGES  HPAGE_PMD_NR
-#else
 #define BIO_MAX_PAGES  256
-#endif
-#else
-#define BIO_MAX_PAGES  256
-#endif
 
 #define bio_prio(bio)  (bio)->bi_ioprio
 #define bio_set_prio(bio, prio)((bio)->bi_ioprio = prio)
-- 
2.9.5
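
The 256 figure keeps the biggest bvec table within one 4K page (a sanity
sketch, assuming the usual 16-byte struct bio_vec on 64-bit):

    /* 256 vectors * 16 bytes (page pointer + bv_len + bv_offset) == 4096 */
    BUILD_BUG_ON(BIO_MAX_PAGES * sizeof(struct bio_vec) > PAGE_SIZE);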



[Cluster-devel] [PATCH V13 18/19] block: kill QUEUE_FLAG_NO_SG_MERGE

2019-01-11 Thread Ming Lei
Since bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting"),
physical segment number is mainly figured out in blk_queue_split() for
fast path, and the flag of BIO_SEG_VALID is set there too.

Now only blk_recount_segments() and blk_recalc_rq_segments() use this
flag.

Basically blk_recount_segments() is bypassed in fast path given BIO_SEG_VALID
is set in blk_queue_split().

For the other user of blk_recalc_rq_segments():

- it runs in the partial completion branch of blk_update_request(), which is
an unusual case

- it runs in blk_cloned_rq_check_limits(); still not a big problem if the flag
is killed, since dm-rq is the only user.

Multi-page bvec is enabled now; not doing S/G merging is rather pointless with
the current setup of the I/O path, as it isn't going to save a significant
amount of cycles.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 block/blk-merge.c  | 31 ++-
 block/blk-mq-debugfs.c |  1 -
 block/blk-mq.c |  3 ---
 drivers/md/dm-table.c  | 13 -
 include/linux/blkdev.h |  1 -
 5 files changed, 6 insertions(+), 43 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index bf736d2b3710..dc4877eaf9f9 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -354,8 +354,7 @@ void blk_queue_split(struct request_queue *q, struct bio 
**bio)
 EXPORT_SYMBOL(blk_queue_split);
 
 static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
-struct bio *bio,
-bool no_sg_merge)
+struct bio *bio)
 {
struct bio_vec bv, bvprv = { NULL };
int prev = 0;
@@ -381,13 +380,6 @@ static unsigned int __blk_recalc_rq_segments(struct 
request_queue *q,
nr_phys_segs = 0;
for_each_bio(bio) {
bio_for_each_bvec(bv, bio, iter) {
-   /*
-* If SG merging is disabled, each bio vector is
-* a segment
-*/
-   if (no_sg_merge)
-   goto new_segment;
-
if (prev) {
if (seg_size + bv.bv_len
> queue_max_segment_size(q))
@@ -417,27 +409,16 @@ static unsigned int __blk_recalc_rq_segments(struct 
request_queue *q,
 
 void blk_recalc_rq_segments(struct request *rq)
 {
-   bool no_sg_merge = !!test_bit(QUEUE_FLAG_NO_SG_MERGE,
-   &rq->q->queue_flags);
-
-   rq->nr_phys_segments = __blk_recalc_rq_segments(rq->q, rq->bio,
-   no_sg_merge);
+   rq->nr_phys_segments = __blk_recalc_rq_segments(rq->q, rq->bio);
 }
 
 void blk_recount_segments(struct request_queue *q, struct bio *bio)
 {
-   unsigned short seg_cnt = bio_segments(bio);
-
-   if (test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags) &&
-   (seg_cnt < queue_max_segments(q)))
-   bio->bi_phys_segments = seg_cnt;
-   else {
-   struct bio *nxt = bio->bi_next;
+   struct bio *nxt = bio->bi_next;
 
-   bio->bi_next = NULL;
-   bio->bi_phys_segments = __blk_recalc_rq_segments(q, bio, false);
-   bio->bi_next = nxt;
-   }
+   bio->bi_next = NULL;
+   bio->bi_phys_segments = __blk_recalc_rq_segments(q, bio);
+   bio->bi_next = nxt;
 
bio_set_flag(bio, BIO_SEG_VALID);
 }
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 90d68760af08..2f9a11ef5bad 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -128,7 +128,6 @@ static const char *const blk_queue_flag_name[] = {
QUEUE_FLAG_NAME(SAME_FORCE),
QUEUE_FLAG_NAME(DEAD),
QUEUE_FLAG_NAME(INIT_DONE),
-   QUEUE_FLAG_NAME(NO_SG_MERGE),
QUEUE_FLAG_NAME(POLL),
QUEUE_FLAG_NAME(WC),
QUEUE_FLAG_NAME(FUA),
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3ba37b9e15e9..fa45817a7e62 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2829,9 +2829,6 @@ struct request_queue *blk_mq_init_allocated_queue(struct 
blk_mq_tag_set *set,
set->map[HCTX_TYPE_POLL].nr_queues)
blk_queue_flag_set(QUEUE_FLAG_POLL, q);
 
-   if (!(set->flags & BLK_MQ_F_SG_MERGE))
-   blk_queue_flag_set(QUEUE_FLAG_NO_SG_MERGE, q);
-
q->sg_reserved_size = INT_MAX;
 
	INIT_DELAYED_WORK(&q->requeue_work, blk_mq_requeue_work);
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 4b1be754cc41..ba9481f1bf3c 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1698,14 +1698,6 @@ static int device_is_not_random(struct dm_target *ti, 
struct dm_dev *dev,
return q && !blk_queue_add_random(q);
 }
 
-static int queue_supports_sg_merge(str

[Cluster-devel] [PATCH V13 15/19] block: enable multipage bvecs

2019-01-11 Thread Ming Lei
This patch pulls the trigger for multi-page bvecs.

Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 block/bio.c | 22 +++---
 fs/iomap.c  |  4 ++--
 fs/xfs/xfs_aops.c   |  4 ++--
 include/linux/bio.h |  2 +-
 4 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 968b12fea564..83a2dfa417ca 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -753,6 +753,8 @@ EXPORT_SYMBOL(bio_add_pc_page);
  * @page: page to add
  * @len: length of the data to add
  * @off: offset of the data in @page
+ * @same_page: if %true only merge if the new data is in the same physical
+ * page as the last segment of the bio.
  *
 * Try to add the data at @page + @off to the last bvec of @bio.  This is a
 * useful optimisation for file systems with a block size smaller than the
@@ -761,19 +763,25 @@ EXPORT_SYMBOL(bio_add_pc_page);
  * Return %true on success or %false on failure.
  */
 bool __bio_try_merge_page(struct bio *bio, struct page *page,
-   unsigned int len, unsigned int off)
+   unsigned int len, unsigned int off, bool same_page)
 {
if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
return false;
 
if (bio->bi_vcnt > 0) {
	struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+   phys_addr_t vec_end_addr = page_to_phys(bv->bv_page) +
+   bv->bv_offset + bv->bv_len - 1;
+   phys_addr_t page_addr = page_to_phys(page);
 
-   if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len) {
-   bv->bv_len += len;
-   bio->bi_iter.bi_size += len;
-   return true;
-   }
+   if (vec_end_addr + 1 != page_addr + off)
+   return false;
+   if (same_page && (vec_end_addr & PAGE_MASK) != page_addr)
+   return false;
+
+   bv->bv_len += len;
+   bio->bi_iter.bi_size += len;
+   return true;
}
return false;
 }
@@ -819,7 +827,7 @@ EXPORT_SYMBOL_GPL(__bio_add_page);
 int bio_add_page(struct bio *bio, struct page *page,
 unsigned int len, unsigned int offset)
 {
-   if (!__bio_try_merge_page(bio, page, len, offset)) {
+   if (!__bio_try_merge_page(bio, page, len, offset, false)) {
if (bio_full(bio))
return 0;
__bio_add_page(bio, page, len, offset);
diff --git a/fs/iomap.c b/fs/iomap.c
index af736acd9006..0c350e658b7f 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -318,7 +318,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, 
loff_t length, void *data,
 */
sector = iomap_sector(iomap, pos);
if (ctx->bio && bio_end_sector(ctx->bio) == sector) {
-   if (__bio_try_merge_page(ctx->bio, page, plen, poff))
+   if (__bio_try_merge_page(ctx->bio, page, plen, poff, true))
goto done;
is_contig = true;
}
@@ -349,7 +349,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, 
loff_t length, void *data,
ctx->bio->bi_end_io = iomap_read_end_io;
}
 
-   __bio_add_page(ctx->bio, page, plen, poff);
+   bio_add_page(ctx->bio, page, plen, poff);
 done:
/*
 * Move the caller beyond our range so that it keeps making progress.
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 1f1829e506e8..b9fd44168f61 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -616,12 +616,12 @@ xfs_add_to_ioend(
bdev, sector);
}
 
-   if (!__bio_try_merge_page(wpc->ioend->io_bio, page, len, poff)) {
+   if (!__bio_try_merge_page(wpc->ioend->io_bio, page, len, poff, true)) {
if (iop)
	atomic_inc(&iop->write_count);
if (bio_full(wpc->ioend->io_bio))
xfs_chain_bio(wpc->ioend, wbc, bdev, sector);
-   __bio_add_page(wpc->ioend->io_bio, page, len, poff);
+   bio_add_page(wpc->ioend->io_bio, page, len, poff);
}
 
wpc->ioend->io_size += len;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index c5231e5c7e85..1ece9f30294b 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -441,7 +441,7 @@ extern int bio_add_page(struct bio *, struct page *, 
unsigned int,unsigned int);
 extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
   unsigned int, unsigned int);
 bool __bio_try_merge_page(struct bio *bio, struct page *page,
-   unsigned int len, unsigned int off);
+   unsigned int len, unsigned int off, bool same_page);
 void __bio_add_page(struct bio *bio, struct page *page,
unsigned int len, unsigned int off);
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter);
-- 
2.9.5
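
The merge condition above, with concrete addresses (an illustrative worked
example, 4K pages):

    /*
     * Suppose the last bvec covers phys 0x1000..0x1fff (one full page)
     * and the new page sits at phys 0x2000 with off == 0:
     *
     *   vec_end_addr == 0x1fff, page_addr == 0x2000
     *   vec_end_addr + 1 == page_addr + off  -> contiguous, merge OK,
     *   and the bvec grows to cover 0x1000..0x2fff (multi-page).
     *
     * With same_page == true the merge is refused, because
     * (0x1fff & PAGE_MASK) == 0x1000 != 0x2000.
     */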



[Cluster-devel] [PATCH V13 19/19] block: kill BLK_MQ_F_SG_MERGE

2019-01-11 Thread Ming Lei
QUEUE_FLAG_NO_SG_MERGE has been killed, so kill BLK_MQ_F_SG_MERGE too.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 block/blk-mq-debugfs.c   | 1 -
 drivers/block/loop.c | 2 +-
 drivers/block/nbd.c  | 2 +-
 drivers/block/rbd.c  | 2 +-
 drivers/block/skd_main.c | 1 -
 drivers/block/xen-blkfront.c | 2 +-
 drivers/md/dm-rq.c   | 2 +-
 drivers/mmc/core/queue.c | 3 +--
 drivers/scsi/scsi_lib.c  | 2 +-
 include/linux/blk-mq.h   | 1 -
 10 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 2f9a11ef5bad..2ba0aa05ce13 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -250,7 +250,6 @@ static const char *const alloc_policy_name[] = {
 static const char *const hctx_flag_name[] = {
HCTX_FLAG_NAME(SHOULD_MERGE),
HCTX_FLAG_NAME(TAG_SHARED),
-   HCTX_FLAG_NAME(SG_MERGE),
HCTX_FLAG_NAME(BLOCKING),
HCTX_FLAG_NAME(NO_SCHED),
 };
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 28dd22c6f83f..e3b9212ec7a1 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1906,7 +1906,7 @@ static int loop_add(struct loop_device **l, int i)
lo->tag_set.queue_depth = 128;
lo->tag_set.numa_node = NUMA_NO_NODE;
lo->tag_set.cmd_size = sizeof(struct loop_cmd);
-   lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+   lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
lo->tag_set.driver_data = lo;
 
	err = blk_mq_alloc_tag_set(&lo->tag_set);
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 08696f5f00bb..999c94de78e5 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -1570,7 +1570,7 @@ static int nbd_dev_add(int index)
nbd->tag_set.numa_node = NUMA_NO_NODE;
nbd->tag_set.cmd_size = sizeof(struct nbd_cmd);
nbd->tag_set.flags = BLK_MQ_F_SHOULD_MERGE |
-   BLK_MQ_F_SG_MERGE | BLK_MQ_F_BLOCKING;
+   BLK_MQ_F_BLOCKING;
nbd->tag_set.driver_data = nbd;
 
	err = blk_mq_alloc_tag_set(&nbd->tag_set);
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 8e5140bbf241..3dfd300b5283 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3988,7 +3988,7 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
	rbd_dev->tag_set.ops = &rbd_mq_ops;
rbd_dev->tag_set.queue_depth = rbd_dev->opts->queue_depth;
rbd_dev->tag_set.numa_node = NUMA_NO_NODE;
-   rbd_dev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+   rbd_dev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
rbd_dev->tag_set.nr_hw_queues = 1;
rbd_dev->tag_set.cmd_size = sizeof(struct work_struct);
 
diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c
index a10d5736d8f7..a7040f9a1b1b 100644
--- a/drivers/block/skd_main.c
+++ b/drivers/block/skd_main.c
@@ -2843,7 +2843,6 @@ static int skd_cons_disk(struct skd_device *skdev)
skdev->sgs_per_request * sizeof(struct scatterlist);
skdev->tag_set.numa_node = NUMA_NO_NODE;
skdev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE |
-   BLK_MQ_F_SG_MERGE |
BLK_ALLOC_POLICY_TO_MQ_FLAG(BLK_TAG_ALLOC_FIFO);
skdev->tag_set.driver_data = skdev;
	rc = blk_mq_alloc_tag_set(&skdev->tag_set);
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 0ed4b200fa58..d43a5677ccbc 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -977,7 +977,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 
sector_size,
} else
info->tag_set.queue_depth = BLK_RING_SIZE(info);
info->tag_set.numa_node = NUMA_NO_NODE;
-   info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+   info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
info->tag_set.cmd_size = sizeof(struct blkif_req);
info->tag_set.driver_data = info;
 
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 4eb5f8c56535..b2f8eb2365ee 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -527,7 +527,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, 
struct dm_table *t)
	md->tag_set->ops = &dm_mq_ops;
md->tag_set->queue_depth = dm_get_blk_mq_queue_depth();
md->tag_set->numa_node = md->numa_node_id;
-   md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+   md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE;
md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
md->tag_set->driver_data = md;
 
diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index 35cc138b096d..cc19e71c71d4 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -410,8 +410,7 @@ int mmc_init_queue(struct mmc_queue *mq,

[Cluster-devel] [PATCH V13 14/19] block: allow bio_for_each_segment_all() to iterate over multi-page bvec

2019-01-11 Thread Ming Lei
This patch introduces one extra iterator variable to bio_for_each_segment_all(),
so that bio_for_each_segment_all() can iterate over multi-page bvecs.

Given it is just one mechanical & simple change on all
bio_for_each_segment_all() users, this patch does the tree-wide change in one
single patch, so that we can avoid using a temporary helper for this
conversion.

Reviewed-by: Omar Sandoval 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 block/bio.c   | 27 ++-
 block/bounce.c|  6 --
 drivers/md/bcache/btree.c |  3 ++-
 drivers/md/dm-crypt.c |  3 ++-
 drivers/md/raid1.c|  3 ++-
 drivers/staging/erofs/data.c  |  3 ++-
 drivers/staging/erofs/unzip_vle.c |  3 ++-
 fs/block_dev.c|  6 --
 fs/btrfs/compression.c|  3 ++-
 fs/btrfs/disk-io.c|  3 ++-
 fs/btrfs/extent_io.c  |  9 ++---
 fs/btrfs/inode.c  |  6 --
 fs/btrfs/raid56.c |  3 ++-
 fs/crypto/bio.c   |  3 ++-
 fs/direct-io.c|  4 +++-
 fs/exofs/ore.c|  3 ++-
 fs/exofs/ore_raid.c   |  3 ++-
 fs/ext4/page-io.c |  3 ++-
 fs/ext4/readpage.c|  3 ++-
 fs/f2fs/data.c|  9 ++---
 fs/gfs2/lops.c|  6 --
 fs/gfs2/meta_io.c |  3 ++-
 fs/iomap.c|  6 --
 fs/mpage.c|  3 ++-
 fs/xfs/xfs_aops.c |  5 +++--
 include/linux/bio.h   | 11 +--
 include/linux/bvec.h  | 30 ++
 27 files changed, 125 insertions(+), 45 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 4db1008309ed..968b12fea564 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1072,8 +1072,9 @@ static int bio_copy_from_iter(struct bio *bio, struct 
iov_iter *iter)
 {
int i;
struct bio_vec *bvec;
+   struct bvec_iter_all iter_all;
 
-   bio_for_each_segment_all(bvec, bio, i) {
+   bio_for_each_segment_all(bvec, bio, i, iter_all) {
ssize_t ret;
 
ret = copy_page_from_iter(bvec->bv_page,
@@ -1103,8 +1104,9 @@ static int bio_copy_to_iter(struct bio *bio, struct 
iov_iter iter)
 {
int i;
struct bio_vec *bvec;
+   struct bvec_iter_all iter_all;
 
-   bio_for_each_segment_all(bvec, bio, i) {
+   bio_for_each_segment_all(bvec, bio, i, iter_all) {
ssize_t ret;
 
ret = copy_page_to_iter(bvec->bv_page,
@@ -1126,8 +1128,9 @@ void bio_free_pages(struct bio *bio)
 {
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all iter_all;
 
-   bio_for_each_segment_all(bvec, bio, i)
+   bio_for_each_segment_all(bvec, bio, i, iter_all)
__free_page(bvec->bv_page);
 }
 EXPORT_SYMBOL(bio_free_pages);
@@ -1295,6 +1298,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
struct bio *bio;
int ret;
struct bio_vec *bvec;
+   struct bvec_iter_all iter_all;
 
if (!iov_iter_count(iter))
return ERR_PTR(-EINVAL);
@@ -1368,7 +1372,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
return bio;
 
  out_unmap:
-   bio_for_each_segment_all(bvec, bio, j) {
+   bio_for_each_segment_all(bvec, bio, j, iter_all) {
put_page(bvec->bv_page);
}
bio_put(bio);
@@ -1379,11 +1383,12 @@ static void __bio_unmap_user(struct bio *bio)
 {
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all iter_all;
 
/*
 * make sure we dirty pages we wrote to
 */
-   bio_for_each_segment_all(bvec, bio, i) {
+   bio_for_each_segment_all(bvec, bio, i, iter_all) {
if (bio_data_dir(bio) == READ)
set_page_dirty_lock(bvec->bv_page);
 
@@ -1475,8 +1480,9 @@ static void bio_copy_kern_endio_read(struct bio *bio)
char *p = bio->bi_private;
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all iter_all;
 
-   bio_for_each_segment_all(bvec, bio, i) {
+   bio_for_each_segment_all(bvec, bio, i, iter_all) {
memcpy(p, page_address(bvec->bv_page), bvec->bv_len);
p += bvec->bv_len;
}
@@ -1585,8 +1591,9 @@ void bio_set_pages_dirty(struct bio *bio)
 {
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all iter_all;
 
-   bio_for_each_segment_all(bvec, bio, i) {
+   bio_for_each_segment_all(bvec, bio, i, iter_all) {
if (!PageCompound(bvec->bv_page))
set_page_dirty_lock(bvec->bv_page);
}
@@ -1596,8 +1603,9 @@ static void bio_release_pages(struct bio *bio)
 {
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all iter_all;
 
-
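For out-of-tree callers, the conversion is the same mechanical pattern
(a before/after sketch with a hypothetical user, not from the patch):

    #include <linux/bio.h>
    #include <linux/mm.h>

    static void dirty_all_pages(struct bio *bio)
    {
            struct bio_vec *bvec;
            int i;
            struct bvec_iter_all iter_all;  /* new: on-stack iterator state */

            /* old form: bio_for_each_segment_all(bvec, bio, i) { ... } */
            bio_for_each_segment_all(bvec, bio, i, iter_all)
                    set_page_dirty_lock(bvec->bv_page); /* still single-page */
    }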

[Cluster-devel] [PATCH V13 13/19] bcache: avoid to use bio_for_each_segment_all() in bch_bio_alloc_pages()

2019-01-11 Thread Ming Lei
bch_bio_alloc_pages() is always called on a new bio, so it is safe
to access the bvec table directly. Given it is the only case of this
kind, open code the bvec table access, since bio_for_each_segment_all()
will be changed to support iterating over multi-page bvecs.

Acked-by: Coly Li 
Reviewed-by: Omar Sandoval 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 drivers/md/bcache/util.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
index 20eddeac1531..62fb917f7a4f 100644
--- a/drivers/md/bcache/util.c
+++ b/drivers/md/bcache/util.c
@@ -270,7 +270,11 @@ int bch_bio_alloc_pages(struct bio *bio, gfp_t gfp_mask)
int i;
struct bio_vec *bv;
 
-   bio_for_each_segment_all(bv, bio, i) {
+   /*
+* This is called on freshly new bio, so it is safe to access the
+* bvec table directly.
+*/
+   for (i = 0, bv = bio->bi_io_vec; i < bio->bi_vcnt; bv++, i++) {
bv->bv_page = alloc_page(gfp_mask);
if (!bv->bv_page) {
while (--bv >= bio->bi_io_vec)
-- 
2.9.5



[Cluster-devel] [PATCH V13 12/19] block: loop: pass multi-page bvec to iov_iter

2019-01-11 Thread Ming Lei
iov_iter is implemented on top of the bvec iterator helpers, so it is safe to
pass multi-page bvecs to it, and this way is much more efficient than passing
one page in each bvec.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 drivers/block/loop.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index b8a0720d3653..28dd22c6f83f 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -511,21 +511,22 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
 loff_t pos, bool rw)
 {
struct iov_iter iter;
+   struct req_iterator rq_iter;
struct bio_vec *bvec;
struct request *rq = blk_mq_rq_from_pdu(cmd);
struct bio *bio = rq->bio;
struct file *file = lo->lo_backing_file;
+   struct bio_vec tmp;
unsigned int offset;
-   int segments = 0;
+   int nr_bvec = 0;
int ret;
 
+   rq_for_each_bvec(tmp, rq, rq_iter)
+   nr_bvec++;
+
if (rq->bio != rq->biotail) {
-   struct req_iterator iter;
-   struct bio_vec tmp;
 
-   __rq_for_each_bio(bio, rq)
-   segments += bio_segments(bio);
-   bvec = kmalloc_array(segments, sizeof(struct bio_vec),
+   bvec = kmalloc_array(nr_bvec, sizeof(struct bio_vec),
 GFP_NOIO);
if (!bvec)
return -EIO;
@@ -534,10 +535,10 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
/*
 * The bios of the request may be started from the middle of
 * the 'bvec' because of bio splitting, so we can't directly
-* copy bio->bi_iov_vec to new bvec. The rq_for_each_segment
+* copy bio->bi_iov_vec to new bvec. The rq_for_each_bvec
 * API will take care of all details for us.
 */
-   rq_for_each_segment(tmp, rq, iter) {
+   rq_for_each_bvec(tmp, rq, rq_iter) {
*bvec = tmp;
bvec++;
}
@@ -551,11 +552,10 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
 */
offset = bio->bi_iter.bi_bvec_done;
bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
-   segments = bio_segments(bio);
}
	atomic_set(&cmd->ref, 2);
 
	iov_iter_bvec(&iter, rw, bvec, nr_bvec, blk_rq_bytes(rq));
+   iov_iter_bvec(, rw, bvec, nr_bvec, blk_rq_bytes(rq));
iter.iov_offset = offset;
 
cmd->iocb.ki_pos = pos;
-- 
2.9.5
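
The subtle part of the single-bio fast path above, condensed (an annotated
restatement of the patch's logic, not new code): when the request starts
mid-bvec after a split, the iov_iter points at the original bvec table and
iov_offset absorbs the partial-bvec offset:

    /* the bio may start inside its first bvec after splitting */
    offset = bio->bi_iter.bi_bvec_done;
    bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);

    iov_iter_bvec(&iter, rw, bvec, nr_bvec, blk_rq_bytes(rq));
    iter.iov_offset = offset;   /* skip already-consumed bytes */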



[Cluster-devel] [PATCH V13 11/19] btrfs: use bvec_last_segment to get bio's last page

2019-01-11 Thread Ming Lei
Preparing for supporting multi-page bvec.

Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 fs/btrfs/extent_io.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index dc8ba3ee515d..c092f88700bd 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2697,11 +2697,12 @@ static int __must_check submit_one_bio(struct bio *bio, 
int mirror_num,
 {
blk_status_t ret = 0;
struct bio_vec *bvec = bio_last_bvec_all(bio);
-   struct page *page = bvec->bv_page;
+   struct bio_vec bv;
struct extent_io_tree *tree = bio->bi_private;
u64 start;
 
-   start = page_offset(page) + bvec->bv_offset;
	bvec_last_segment(bvec, &bv);
+   start = page_offset(bv.bv_page) + bv.bv_offset;
 
bio->bi_private = NULL;
 
-- 
2.9.5



[Cluster-devel] [PATCH V13 09/19] block: introduce bvec_last_segment()

2019-01-11 Thread Ming Lei
BTRFS and guard_bio_eod() need to get the last single-page segment
from one multi-page bvec, so introduce this helper to make them happy.

Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 include/linux/bvec.h | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index d441486db605..ca6e630f88ab 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -131,4 +131,26 @@ static inline bool bvec_iter_advance(const struct bio_vec 
*bv,
.bi_bvec_done   = 0,\
 }
 
+/*
+ * Get the last single-page segment from the multi-page bvec and store it
+ * in @seg
+ */
+static inline void bvec_last_segment(const struct bio_vec *bvec,
+struct bio_vec *seg)
+{
+   unsigned total = bvec->bv_offset + bvec->bv_len;
+   unsigned last_page = (total - 1) / PAGE_SIZE;
+
+   seg->bv_page = nth_page(bvec->bv_page, last_page);
+
+   /* the whole segment is inside the last page */
+   if (bvec->bv_offset >= last_page * PAGE_SIZE) {
+   seg->bv_offset = bvec->bv_offset % PAGE_SIZE;
+   seg->bv_len = bvec->bv_len;
+   } else {
+   seg->bv_offset = 0;
+   seg->bv_len = total - last_page * PAGE_SIZE;
+   }
+}
+
 #endif /* __LINUX_BVEC_ITER_H */
-- 
2.9.5
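
A worked example of the helper (illustrative numbers, 4K pages):

    /*
     * bvec->bv_offset == 0x800, bvec->bv_len == 0x2000:
     * the data covers bytes 0x800..0x27ff, so it spills into a
     * third page (last_page == (0x2800 - 1) / 0x1000 == 2).
     *
     * bvec_last_segment() then yields:
     *   seg.bv_page   == nth_page(bvec->bv_page, 2)
     *   seg.bv_offset == 0
     *   seg.bv_len    == 0x800   (== 0x2800 - 2 * 0x1000)
     */
    struct bio_vec seg;

    bvec_last_segment(bvec, &seg);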



[Cluster-devel] [PATCH V13 08/19] block: use bio_for_each_bvec() to map sg

2019-01-11 Thread Ming Lei
It is more efficient to use bio_for_each_bvec() to map sg; meanwhile
we have to consider splitting the multi-page bvec as done in
blk_bio_segment_split().

Reviewed-by: Omar Sandoval 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 block/blk-merge.c | 70 +++
 1 file changed, 50 insertions(+), 20 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index abe1c89c1253..bf736d2b3710 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -460,6 +460,54 @@ static int blk_phys_contig_segment(struct request_queue 
*q, struct bio *bio,
	return biovec_phys_mergeable(q, &end_bv, &nxt_bv);
 }
 
+static struct scatterlist *blk_next_sg(struct scatterlist **sg,
+   struct scatterlist *sglist)
+{
+   if (!*sg)
+   return sglist;
+
+   /*
+* If the driver previously mapped a shorter list, we could see a
+* termination bit prematurely unless it fully inits the sg table
+* on each mapping. We KNOW that there must be more entries here
+* or the driver would be buggy, so force clear the termination bit
+* to avoid doing a full sg_init_table() in drivers for each command.
+*/
+   sg_unmark_end(*sg);
+   return sg_next(*sg);
+}
+
+static unsigned blk_bvec_map_sg(struct request_queue *q,
+   struct bio_vec *bvec, struct scatterlist *sglist,
+   struct scatterlist **sg)
+{
+   unsigned nbytes = bvec->bv_len;
+   unsigned nsegs = 0, total = 0, offset = 0;
+
+   while (nbytes > 0) {
+   unsigned seg_size;
+   struct page *pg;
+   unsigned idx;
+
+   *sg = blk_next_sg(sg, sglist);
+
+   seg_size = get_max_segment_size(q, bvec->bv_offset + total);
+   seg_size = min(nbytes, seg_size);
+
+   offset = (total + bvec->bv_offset) % PAGE_SIZE;
+   idx = (total + bvec->bv_offset) / PAGE_SIZE;
+   pg = nth_page(bvec->bv_page, idx);
+
+   sg_set_page(*sg, pg, seg_size, offset);
+
+   total += seg_size;
+   nbytes -= seg_size;
+   nsegs++;
+   }
+
+   return nsegs;
+}
+
 static inline void
 __blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
 struct scatterlist *sglist, struct bio_vec *bvprv,
@@ -477,25 +525,7 @@ __blk_segment_map_sg(struct request_queue *q, struct 
bio_vec *bvec,
(*sg)->length += nbytes;
} else {
 new_segment:
-   if (!*sg)
-   *sg = sglist;
-   else {
-   /*
-* If the driver previously mapped a shorter
-* list, we could see a termination bit
-* prematurely unless it fully inits the sg
-* table on each mapping. We KNOW that there
-* must be more entries here or the driver
-* would be buggy, so force clear the
-* termination bit to avoid doing a full
-* sg_init_table() in drivers for each command.
-*/
-   sg_unmark_end(*sg);
-   *sg = sg_next(*sg);
-   }
-
-   sg_set_page(*sg, bvec->bv_page, nbytes, bvec->bv_offset);
-   (*nsegs)++;
+   (*nsegs) += blk_bvec_map_sg(q, bvec, sglist, sg);
}
*bvprv = *bvec;
 }
@@ -517,7 +547,7 @@ static int __blk_bios_map_sg(struct request_queue *q, 
struct bio *bio,
int nsegs = 0;
 
for_each_bio(bio)
-   bio_for_each_segment(bvec, bio, iter)
+   bio_for_each_bvec(bvec, bio, iter)
			__blk_segment_map_sg(q, &bvec, sglist, &bvprv, sg,
					     &nsegs);
 
-- 
2.9.5
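
To see the splitting in action (illustrative numbers, assuming the default
segment boundary): mapping a single 128K bvec on a queue whose max segment
size is 64K yields two scatterlist entries:

    /*
     * bvec->bv_len == 128K, bv_offset == 0,
     * queue_max_segment_size(q) == 64K:
     *
     * pass 1: seg_size = min(128K, 64K) = 64K
     *         sg_set_page(*sg, page 0, 64K, 0)
     * pass 2: seg_size = min(64K, 64K) = 64K
     *         sg_set_page(*sg, page 16, 64K, 0)
     *
     * -> blk_bvec_map_sg() returns nsegs == 2.
     */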



[Cluster-devel] [PATCH V13 10/19] fs/buffer.c: use bvec iterator to truncate the bio

2019-01-11 Thread Ming Lei
Once multi-page bvec is enabled, the last bvec may include more than one
page; this patch uses bvec_last_segment() to truncate the bio.

Reviewed-by: Omar Sandoval 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 fs/buffer.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 52d024bfdbc1..fb72ac21f2b1 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3032,7 +3032,10 @@ void guard_bio_eod(int op, struct bio *bio)
 
/* ..and clear the end of the buffer for reads */
if (op == REQ_OP_READ) {
-   zero_user(bvec->bv_page, bvec->bv_offset + bvec->bv_len,
+   struct bio_vec bv;
+
	bvec_last_segment(bvec, &bv);
+   zero_user(bv.bv_page, bv.bv_offset + bv.bv_len,
truncated_bytes);
}
 }
-- 
2.9.5



[Cluster-devel] [PATCH V13 05/19] block: introduce multi-page bvec helpers

2019-01-11 Thread Ming Lei
This patch introduces helpers of 'bvec_iter_*' for multi-page bvec
support.

The introduced helpers treat one bvec as a real multi-page segment,
which may include more than one page.

The existing bvec_iter_* helpers are interfaces for supporting the
current bvec iterator, which drivers, fs, dm and others treat as
single-page. The introduced helpers build single-page bvecs in flight,
so this way won't break current bio/bvec users, which need no change.

Follows some multi-page bvec background:

- bvecs stored in bio->bi_io_vec is always multi-page style

- bvec(struct bio_vec) represents one physically contiguous I/O
  buffer, now the buffer may include more than one page after
  multi-page bvec is supported, and all these pages represented
  by one bvec is physically contiguous. Before multi-page bvec
  support, at most one page is included in one bvec, we call it
  single-page bvec.

- .bv_page of the bvec points to the 1st page in the multi-page bvec

- .bv_offset of the bvec is the offset of the buffer in the bvec

The effect on the current drivers/filesystem/dm/bcache/...:

- almost everyone supposes that one bvec only includes one single
  page, so we keep the single-page interface unchanged; for example,
  bio_for_each_segment() still returns single-page bvecs

- bio_for_each_segment_all() will return single-page bvec too

- during iterating, iterator variable(struct bvec_iter) is always
  updated in multi-page bvec style, and bvec_iter_advance() is kept
  not changed

- returned(copied) single-page bvec is built in flight by bvec
  helpers from the stored multi-page bvec

Reviewed-by: Christoph Hellwig 
Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 include/linux/bvec.h | 25 -
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 716a87b26a6a..babc6316c117 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * was unsigned short, but we might as well be ready for > 64kB I/O pages
@@ -50,16 +51,32 @@ struct bvec_iter {
  */
 #define __bvec_iter_bvec(bvec, iter)   (&(bvec)[(iter).bi_idx])
 
-#define segment_iter_page(bvec, iter)  \
+/* multi-page (segment) helpers */
+#define bvec_iter_page(bvec, iter) \
(__bvec_iter_bvec((bvec), (iter))->bv_page)
 
-#define segment_iter_len(bvec, iter)   \
+#define bvec_iter_len(bvec, iter)  \
min((iter).bi_size, \
__bvec_iter_bvec((bvec), (iter))->bv_len - (iter).bi_bvec_done)
 
-#define segment_iter_offset(bvec, iter)\
+#define bvec_iter_offset(bvec, iter)   \
(__bvec_iter_bvec((bvec), (iter))->bv_offset + (iter).bi_bvec_done)
 
+#define bvec_iter_page_idx(bvec, iter) \
+   (bvec_iter_offset((bvec), (iter)) / PAGE_SIZE)
+
+/* For building single-page bvec(segment) in flight */
+ #define segment_iter_offset(bvec, iter)   \
+   (bvec_iter_offset((bvec), (iter)) % PAGE_SIZE)
+
+#define segment_iter_len(bvec, iter)   \
+   min_t(unsigned, bvec_iter_len((bvec), (iter)),  \
+ PAGE_SIZE - segment_iter_offset((bvec), (iter)))
+
+#define segment_iter_page(bvec, iter)  \
+   nth_page(bvec_iter_page((bvec), (iter)),\
+bvec_iter_page_idx((bvec), (iter)))
+
 #define segment_iter_bvec(bvec, iter)  \
 ((struct bio_vec) {\
.bv_page= segment_iter_page((bvec), (iter)),\
@@ -67,8 +84,6 @@ struct bvec_iter {
.bv_offset  = segment_iter_offset((bvec), (iter)),  \
 })
 
-#define bvec_iter_len  segment_iter_len
-
 static inline bool bvec_iter_advance(const struct bio_vec *bv,
struct bvec_iter *iter, unsigned bytes)
 {
-- 
2.9.5
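
Concretely (illustrative numbers, 4K pages): for a bvec covering two
contiguous pages (bv_offset == 0, bv_len == 8K) and an iterator with
bi_bvec_done == 5K and bi_size >= 3K, the two families report:

    /*
     * multi-page view:
     *   bvec_iter_page()      == bv_page        (first page)
     *   bvec_iter_offset()    == 5K             (into the buffer)
     *   bvec_iter_len()       == 3K             (bytes left in bvec)
     *
     * single-page view, built in flight:
     *   segment_iter_page()   == nth_page(bv_page, 1)  (second page)
     *   segment_iter_offset() == 1K             (5K % 4K)
     *   segment_iter_len()    == 3K             (min(3K, 4K - 1K))
     */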



[Cluster-devel] [PATCH V13 06/19] block: introduce bio_for_each_bvec() and rq_for_each_bvec()

2019-01-11 Thread Ming Lei
bio_for_each_bvec() is used for iterating over multi-page bvecs in the bio
split & merge code.

rq_for_each_bvec() can be used by drivers which may handle the
multi-page bvec directly; so far the loop driver is one perfect use case.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 include/linux/bio.h| 10 ++
 include/linux/blkdev.h |  4 
 include/linux/bvec.h   |  7 +++
 3 files changed, 21 insertions(+)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 16a65361535f..06888d45beb4 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -156,6 +156,16 @@ static inline void bio_advance_iter(struct bio *bio, 
struct bvec_iter *iter,
 #define bio_for_each_segment(bvl, bio, iter)   \
__bio_for_each_segment(bvl, bio, iter, (bio)->bi_iter)
 
+#define __bio_for_each_bvec(bvl, bio, iter, start) \
+   for (iter = (start);\
+(iter).bi_size &&  \
+   ((bvl = bvec_iter_bvec((bio)->bi_io_vec, (iter))), 1); \
+bio_advance_iter((bio), &(iter), (bvl).bv_len))
+
+/* iterate over multi-page bvec */
+#define bio_for_each_bvec(bvl, bio, iter)  \
+   __bio_for_each_bvec(bvl, bio, iter, (bio)->bi_iter)
+
 #define bio_iter_last(bvec, iter) ((iter).bi_size == (bvec).bv_len)
 
 static inline unsigned bio_segments(struct bio *bio)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 338604dff7d0..7f4ca073e2f3 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -797,6 +797,10 @@ struct req_iterator {
__rq_for_each_bio(_iter.bio, _rq)   \
bio_for_each_segment(bvl, _iter.bio, _iter.iter)
 
+#define rq_for_each_bvec(bvl, _rq, _iter)  \
+   __rq_for_each_bio(_iter.bio, _rq)   \
+   bio_for_each_bvec(bvl, _iter.bio, _iter.iter)
+
 #define rq_iter_last(bvec, _iter)  \
(_iter.bio->bi_next == NULL &&  \
 bio_iter_last(bvec, _iter.iter))
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index babc6316c117..d441486db605 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -65,6 +65,13 @@ struct bvec_iter {
 #define bvec_iter_page_idx(bvec, iter) \
(bvec_iter_offset((bvec), (iter)) / PAGE_SIZE)
 
+#define bvec_iter_bvec(bvec, iter) \
+((struct bio_vec) {\
+   .bv_page= bvec_iter_page((bvec), (iter)),   \
+   .bv_len = bvec_iter_len((bvec), (iter)),\
+   .bv_offset  = bvec_iter_offset((bvec), (iter)), \
+})
+
 /* For building single-page bvec(segment) in flight */
  #define segment_iter_offset(bvec, iter)   \
(bvec_iter_offset((bvec), (iter)) % PAGE_SIZE)
-- 
2.9.5



[Cluster-devel] [PATCH V13 07/19] block: use bio_for_each_bvec() to compute multi-page bvec count

2019-01-11 Thread Ming Lei
First it is more efficient to use bio_for_each_bvec() in both
blk_bio_segment_split() and __blk_recalc_rq_segments() to compute how
many multi-page bvecs there are in the bio.

Secondly, once bio_for_each_bvec() is used, the bvec may need to be
split because its length can be much longer than the max segment size,
so we have to split the big bvec into several segments.

Thirdly, when splitting a multi-page bvec into segments, the max segment
limit may be reached, so the bio split needs to be considered in
this situation too.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 block/blk-merge.c | 99 ---
 1 file changed, 79 insertions(+), 20 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index f85d878f313d..abe1c89c1253 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -161,6 +161,69 @@ static inline unsigned get_max_io_size(struct 
request_queue *q,
return sectors;
 }
 
+static unsigned get_max_segment_size(struct request_queue *q,
+unsigned offset)
+{
+   unsigned long mask = queue_segment_boundary(q);
+
+   return min_t(unsigned long, mask - (mask & offset) + 1,
+queue_max_segment_size(q));
+}
+
+/*
+ * Split the bvec @bv into segments, and update all kinds of
+ * variables.
+ */
+static bool bvec_split_segs(struct request_queue *q, struct bio_vec *bv,
+   unsigned *nsegs, unsigned *last_seg_size,
+   unsigned *front_seg_size, unsigned *sectors)
+{
+   unsigned len = bv->bv_len;
+   unsigned total_len = 0;
+   unsigned new_nsegs = 0, seg_size = 0;
+
+   /*
+* Multi-page bvec may be too big to hold in one segment, so the
+* current bvec has to be split into multiple segments.
+*/
+   while (len && new_nsegs + *nsegs < queue_max_segments(q)) {
+   seg_size = get_max_segment_size(q, bv->bv_offset + total_len);
+   seg_size = min(seg_size, len);
+
+   new_nsegs++;
+   total_len += seg_size;
+   len -= seg_size;
+
+   if ((bv->bv_offset + total_len) & queue_virt_boundary(q))
+   break;
+   }
+
+   if (!new_nsegs)
+   return !!len;
+
+   /* update front segment size */
+   if (!*nsegs) {
+   unsigned first_seg_size;
+
+   if (new_nsegs == 1)
+   first_seg_size = get_max_segment_size(q, bv->bv_offset);
+   else
+   first_seg_size = queue_max_segment_size(q);
+
+   if (*front_seg_size < first_seg_size)
+   *front_seg_size = first_seg_size;
+   }
+
+   /* update other variables */
+   *last_seg_size = seg_size;
+   *nsegs += new_nsegs;
+   if (sectors)
+   *sectors += total_len >> 9;
+
+   /* split in the middle of the bvec if len != 0 */
+   return !!len;
+}
+
 static struct bio *blk_bio_segment_split(struct request_queue *q,
 struct bio *bio,
 struct bio_set *bs,
@@ -174,7 +237,7 @@ static struct bio *blk_bio_segment_split(struct 
request_queue *q,
struct bio *new = NULL;
const unsigned max_sectors = get_max_io_size(q, bio);
 
-   bio_for_each_segment(bv, bio, iter) {
+   bio_for_each_bvec(bv, bio, iter) {
/*
 * If the queue doesn't support SG gaps and adding this
 * offset would create a gap, disallow it.
@@ -189,8 +252,12 @@ static struct bio *blk_bio_segment_split(struct 
request_queue *q,
 */
if (nsegs < queue_max_segments(q) &&
sectors < max_sectors) {
-   nsegs++;
-   sectors = max_sectors;
+   /* split in the middle of bvec */
+   bv.bv_len = (max_sectors - sectors) << 9;
+   bvec_split_segs(q, &bv, &nsegs,
+   &seg_size,
+   &front_seg_size,
+   &sectors);
}
goto split;
}
@@ -212,14 +279,12 @@ static struct bio *blk_bio_segment_split(struct 
request_queue *q,
if (nsegs == queue_max_segments(q))
goto split;
 
-   if (nsegs == 1 && seg_size > front_seg_size)
-   front_seg_size = seg_size;
-
-   nsegs++;
bvprv = bv;
	bvprvp = &bvprv;
-   seg_size = bv.bv_len;
-   sectors += bv.bv_len >> 9;
+
+   if (bvec_split_segs(q, &bv, &nsegs, &seg_size,
+  
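The loop above, with numbers (illustrative; assumes no virt boundary and
plenty of segments left):

    /*
     * bv->bv_len == 96K, bv_offset == 0,
     * queue_max_segment_size(q) == 64K:
     *
     * pass 1: seg_size = min(64K, 96K) = 64K, len: 96K -> 32K
     * pass 2: seg_size = min(64K, 32K) = 32K, len: 32K -> 0
     *
     * -> new_nsegs == 2, *sectors grows by 96K >> 9 == 192,
     *    and the return value is false (no mid-bvec split needed).
     */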

[Cluster-devel] [PATCH V13 03/19] block: remove bvec_iter_rewind()

2019-01-11 Thread Ming Lei
Commit 7759eb23fd980 ("block: remove bio_rewind_iter()") removed
bio_rewind_iter(); now no one uses bvec_iter_rewind() any more,
so remove it.

Reviewed-by: Omar Sandoval 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 include/linux/bvec.h | 24 
 1 file changed, 24 deletions(-)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 02c73c6aa805..ba0ae40e77c9 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -92,30 +92,6 @@ static inline bool bvec_iter_advance(const struct bio_vec 
*bv,
return true;
 }
 
-static inline bool bvec_iter_rewind(const struct bio_vec *bv,
-struct bvec_iter *iter,
-unsigned int bytes)
-{
-   while (bytes) {
-   unsigned len = min(bytes, iter->bi_bvec_done);
-
-   if (iter->bi_bvec_done == 0) {
-   if (WARN_ONCE(iter->bi_idx == 0,
- "Attempted to rewind iter beyond "
- "bvec's boundaries\n")) {
-   return false;
-   }
-   iter->bi_idx--;
-   iter->bi_bvec_done = __bvec_iter_bvec(bv, 
*iter)->bv_len;
-   continue;
-   }
-   bytes -= len;
-   iter->bi_size += len;
-   iter->bi_bvec_done -= len;
-   }
-   return true;
-}
-
 #define for_each_bvec(bvl, bio_vec, iter, start)   \
for (iter = (start);\
 (iter).bi_size &&  \
-- 
2.9.5



[Cluster-devel] [PATCH V13 04/19] block: rename bvec helpers

2019-01-11 Thread Ming Lei
We will support multi-page bvec soon, and have to deal with
single-page vs multi-page bvec. This patch follows Christoph's
suggestion to rename all the following helpers:

for_each_bvec
bvec_iter_bvec
bvec_iter_len
bvec_iter_page
bvec_iter_offset

into:
for_each_segment
segment_iter_bvec
segment_iter_len
segment_iter_page
segment_iter_offset

so that these helpers named with 'segment' only deal with single-page
bvecs, also called segments. We will introduce helpers named with 'bvec'
for multi-page bvecs.

bvec_iter_advance() isn't renamed because this helper always operates
on the real bvec even though multi-page bvec is supported.

Acked-by: Miguel Ojeda 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Omar Sandoval 
Suggested-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 .clang-format  |  2 +-
 drivers/md/dm-integrity.c  |  2 +-
 drivers/md/dm-io.c |  4 ++--
 drivers/nvdimm/blk.c   |  4 ++--
 drivers/nvdimm/btt.c   |  4 ++--
 include/linux/bio.h| 10 +-
 include/linux/bvec.h   | 20 +++-
 include/linux/ceph/messenger.h |  2 +-
 lib/iov_iter.c |  2 +-
 net/ceph/messenger.c   | 14 +++---
 10 files changed, 33 insertions(+), 31 deletions(-)

diff --git a/.clang-format b/.clang-format
index e6080f5834a3..049200fbab94 100644
--- a/.clang-format
+++ b/.clang-format
@@ -120,7 +120,7 @@ ForEachMacros:
   - 'for_each_available_child_of_node'
   - 'for_each_bio'
   - 'for_each_board_func_rsrc'
-  - 'for_each_bvec'
+  - 'for_each_segment'
   - 'for_each_child_of_node'
   - 'for_each_clear_bit'
   - 'for_each_clear_bit_from'
diff --git a/drivers/md/dm-integrity.c b/drivers/md/dm-integrity.c
index 457200ca6287..046b7785e3f6 100644
--- a/drivers/md/dm-integrity.c
+++ b/drivers/md/dm-integrity.c
@@ -1574,7 +1574,7 @@ static bool __journal_read_write(struct dm_integrity_io 
*dio, struct bio *bio,
char *tag_ptr = journal_entry_tag(ic, je);
 
if (bip) do {
-   struct bio_vec biv = 
bvec_iter_bvec(bip->bip_vec, bip->bip_iter);
+   struct bio_vec biv = 
segment_iter_bvec(bip->bip_vec, bip->bip_iter);
unsigned tag_now = min(biv.bv_len, 
tag_todo);
char *tag_addr;
BUG_ON(PageHighMem(biv.bv_page));
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 81ffc59d05c9..d72ec2bdd333 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -208,8 +208,8 @@ static void list_dp_init(struct dpages *dp, struct 
page_list *pl, unsigned offse
 static void bio_get_page(struct dpages *dp, struct page **p,
 unsigned long *len, unsigned *offset)
 {
-   struct bio_vec bvec = bvec_iter_bvec((struct bio_vec *)dp->context_ptr,
-dp->context_bi);
+   struct bio_vec bvec = segment_iter_bvec((struct bio_vec 
*)dp->context_ptr,
+   dp->context_bi);
 
*p = bvec.bv_page;
*len = bvec.bv_len;
diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
index db45c6bbb7bb..dfae945216bb 100644
--- a/drivers/nvdimm/blk.c
+++ b/drivers/nvdimm/blk.c
@@ -89,9 +89,9 @@ static int nd_blk_rw_integrity(struct nd_namespace_blk *nsblk,
struct bio_vec bv;
void *iobuf;
 
-   bv = bvec_iter_bvec(bip->bip_vec, bip->bip_iter);
+   bv = segment_iter_bvec(bip->bip_vec, bip->bip_iter);
/*
-* The 'bv' obtained from bvec_iter_bvec has its .bv_len and
+* The 'bv' obtained from segment_iter_bvec has its .bv_len and
 * .bv_offset already adjusted for iter->bi_bvec_done, and we
 * can use those directly
 */
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index b123b0dcf274..2bbbc90c7b91 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1154,9 +1154,9 @@ static int btt_rw_integrity(struct btt *btt, struct 
bio_integrity_payload *bip,
struct bio_vec bv;
void *mem;
 
-   bv = bvec_iter_bvec(bip->bip_vec, bip->bip_iter);
+   bv = segment_iter_bvec(bip->bip_vec, bip->bip_iter);
/*
-* The 'bv' obtained from bvec_iter_bvec has its .bv_len and
+* The 'bv' obtained from segment_iter_bvec has its .bv_len and
 * .bv_offset already adjusted for iter->bi_bvec_done, and we
 * can use those directly
 */
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 72b4f7be2106..16a65361535f 100644
--- a/include/linux/bio.h
+++ b/include

[Cluster-devel] [PATCH V13 02/19] block: don't use bio->bi_vcnt to figure out segment number

2019-01-11 Thread Ming Lei
It is wrong to use bio->bi_vcnt to figure out how many segments
there are in the bio, even when the CLONED flag isn't set on this bio,
because the bio may have been split or advanced.

So always use bio_segments() in blk_recount_segments(); it shouldn't
cause any performance loss now, because the physical segment number is
figured out in blk_queue_split() and BIO_SEG_VALID has been set there since
bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting").

Reviewed-by: Omar Sandoval 
Reviewed-by: Christoph Hellwig 
Fixes: 76d8137a3113 ("blk-merge: recaculate segment if it isn't less than max 
segments")
Signed-off-by: Ming Lei 
---
 block/blk-merge.c | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 71e9ac03f621..f85d878f313d 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -367,13 +367,7 @@ void blk_recalc_rq_segments(struct request *rq)
 
 void blk_recount_segments(struct request_queue *q, struct bio *bio)
 {
-   unsigned short seg_cnt;
-
-   /* estimate segment number by bi_vcnt for non-cloned bio */
-   if (bio_flagged(bio, BIO_CLONED))
-   seg_cnt = bio_segments(bio);
-   else
-   seg_cnt = bio->bi_vcnt;
+   unsigned short seg_cnt = bio_segments(bio);
 
	if (test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags) &&
(seg_cnt < queue_max_segments(q)))
-- 
2.9.5



[Cluster-devel] [PATCH V13 00/19] block: support multi-page bvec

2019-01-11 Thread Ming Lei
ow Jen's suggestion of
using the name of chunk for multipage io vector
- include Christoph's three prepare patches
- decrease stack usage for using bio_for_each_chunk_segment_all()
- address Kent's comment

V5:
- remove some of prepare patches, which have been merged already
- add bio_clone_seg_bioset() to fix DM's bio clone, which
is introduced by 18a25da84354c6b (dm: ensure bio submission follows
a depth-first tree walk)
- rebase on the latest block for-v4.18

V4:
- rename bio_for_each_segment*() as bio_for_each_page*(), rename
bio_segments() as bio_pages(), rename rq_for_each_segment() as
rq_for_each_pages(), because these helpers never return real
segment, and they always return single page bvec

- introducing segment_for_each_page_all()

- introduce new 
bio_for_each_segment*()/rq_for_each_segment()/bio_segments()
for returning real multipage segment

- rewrite segment_last_page()

- rename bvec iterator helper as suggested by Christoph

- replace comment with applying bio helpers as suggested by Christoph

- document usage of bio iterator helpers

- redefine BIO_MAX_PAGES as 256 to make the biggest bvec table
accommodated in 4K page

- move bio_alloc_pages() into bcache as suggested by Christoph

V3:
- rebase on v4.13-rc3 with for-next of block tree
- run more xfstests: xfs/ext4 over NVMe, Sata, DM(linear),
MD(raid1), and not see regressions triggered
- add Reviewed-by on some btrfs patches
- remove two MD patches because both are merged to linus tree
  already

V2:
- bvec table direct access in raid has been cleaned, so NO_MP
flag is dropped
- rebase on recent Neil Brown's change on bio and bounce code
- reorganize the patchset

V1:
- against v4.10-rc1 and some cleanup in V0 are in -linus already
- handle queue_virt_boundary() in mp bvec change and make NVMe happy
- further BTRFS cleanup
- remove QUEUE_FLAG_SPLIT_MP
- rename for two new helpers of bio_for_each_segment_all()
- fix bounce convertion
- address comments in V0

[1], http://marc.info/?l=linux-kernel=141680246629547=2
[2], https://patchwork.kernel.org/patch/9451523/
[3], http://marc.info/?t=14773544711=1=2
[4], http://marc.info/?l=linux-mm=147745525801433=2
[5], http://marc.info/?t=14956948457=1=2
[6], http://marc.info/?t=14982021534=1=2




Christoph Hellwig (1):
  btrfs: look at bi_size for repair decisions

Ming Lei (18):
  block: don't use bio->bi_vcnt to figure out segment number
  block: remove bvec_iter_rewind()
  block: rename bvec helpers
  block: introduce multi-page bvec helpers
  block: introduce bio_for_each_bvec() and rq_for_each_bvec()
  block: use bio_for_each_bvec() to compute multi-page bvec count
  block: use bio_for_each_bvec() to map sg
  block: introduce bvec_last_segment()
  fs/buffer.c: use bvec iterator to truncate the bio
  btrfs: use bvec_last_segment to get bio's last page
  block: loop: pass multi-page bvec to iov_iter
  bcache: avoid to use bio_for_each_segment_all() in
bch_bio_alloc_pages()
  block: allow bio_for_each_segment_all() to iterate over multi-page
bvec
  block: enable multipage bvecs
  block: always define BIO_MAX_PAGES as 256
  block: document usage of bio iterator helpers
  block: kill QUEUE_FLAG_NO_SG_MERGE
  block: kill BLK_MQ_F_SG_MERGE

 .clang-format |   2 +-
 Documentation/block/biovecs.txt   |  25 +
 block/bio.c   |  49 ++---
 block/blk-merge.c | 206 +-
 block/blk-mq-debugfs.c|   2 -
 block/blk-mq.c|   3 -
 block/bounce.c|   6 +-
 drivers/block/loop.c  |  22 ++--
 drivers/block/nbd.c   |   2 +-
 drivers/block/rbd.c   |   2 +-
 drivers/block/skd_main.c  |   1 -
 drivers/block/xen-blkfront.c  |   2 +-
 drivers/md/bcache/btree.c |   3 +-
 drivers/md/bcache/util.c  |   6 +-
 drivers/md/dm-crypt.c |   3 +-
 drivers/md/dm-integrity.c |   2 +-
 drivers/md/dm-io.c|   4 +-
 drivers/md/dm-rq.c|   2 +-
 drivers/md/dm-table.c |  13 ---
 drivers/md/raid1.c|   3 +-
 drivers/mmc/core/queue.c  |   3 +-
 drivers/nvdimm/blk.c  |   4 +-
 drivers/nvdimm/btt.c  |   4 +-
 drivers/scsi/scsi_lib.c   |   2 +-
 drivers/staging/erofs/data.c  |   3 +-
 drivers/staging/erofs/unzip_vle.c |   3 +-
 fs/block_dev.c|   6 +-
 fs/btrfs/compression.c|   3 +-
 fs/btrfs/disk-io.c|   3 +-
 fs/btrfs/extent_io.c  |  16 +--
 fs/btrfs/inode.c  |   6 +-
 fs/btrfs/raid56.c   

[Cluster-devel] [PATCH V13 01/19] btrfs: look at bi_size for repair decisions

2019-01-11 Thread Ming Lei
From: Christoph Hellwig 

bio_readpage_error currently uses bi_vcnt to decide if it is worth
retrying an I/O.  But the vector count is mostly an implementation
artifact - it really should figure out if there is more than a
single sector worth retrying.  Use bi_size for that and shift by
PAGE_SHIFT.  This really should be blocks/sectors, but given that
btrfs doesn't support a sector size different from PAGE_SIZE,
using the page size keeps the changes to a minimum.

Reviewed-by: Omar Sandoval 
Reviewed-by: David Sterba 
Signed-off-by: Christoph Hellwig 
---
 fs/btrfs/extent_io.c | 2 +-
 include/linux/bio.h  | 6 --
 2 files changed, 1 insertion(+), 7 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 52abe4082680..dc8ba3ee515d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2350,7 +2350,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 
phy_offset,
int read_mode = 0;
blk_status_t status;
int ret;
-   unsigned failed_bio_pages = bio_pages_all(failed_bio);
+   unsigned failed_bio_pages = failed_bio->bi_iter.bi_size >> PAGE_SHIFT;
 
BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
 
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7380b094dcca..72b4f7be2106 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -263,12 +263,6 @@ static inline void bio_get_last_bvec(struct bio *bio, 
struct bio_vec *bv)
bv->bv_len = iter.bi_bvec_done;
 }
 
-static inline unsigned bio_pages_all(struct bio *bio)
-{
-   WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
-   return bio->bi_vcnt;
-}
-
 static inline struct bio_vec *bio_first_bvec_all(struct bio *bio)
 {
WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
-- 
2.9.5



Re: [Cluster-devel] [PATCH V12 00/20] block: support multi-page bvec

2018-11-28 Thread Ming Lei
On Wed, Nov 28, 2018 at 07:20:51PM -0700, Jens Axboe wrote:
> On 11/28/18 6:30 PM, Ming Lei wrote:
> >> I'm going back and forth on this one a bit. Any concerns with
> >> pushing this to 4.22?
> > 
> > My only concern is about the warning of
> > "blk_cloned_rq_check_limits: over max segments limit" on dm multipath,
> > and it seems Ewan and Mike are waiting for this fix.
> 
> Not familiar with this issue, can you post a link to it? I'd be fine
> working around anything until 4.22, it's not going to be a new issue.

https://marc.info/?t=15330342572&r=1&w=2

thanks,
Ming



Re: [Cluster-devel] [PATCH V12 00/20] block: support multi-page bvec

2018-11-28 Thread Ming Lei
On Wed, Nov 28, 2018 at 06:44:00AM -0700, Jens Axboe wrote:
> On 11/25/18 7:17 PM, Ming Lei wrote:
> > Hi,
> > 
> > This patchset brings multi-page bvec into block layer:
> > 
> > 1) what is multi-page bvec?
> > 
> > Multipage bvecs means that one 'struct bio_vec' can hold multiple pages
> > which are physically contiguous, instead of the single page per bvec
> > that the linux kernel has used for a long time.
> > 
> > 2) why is multi-page bvec introduced?
> > 
> > Kent proposed the idea[1] first. 
> > 
> > As system RAM becomes much bigger than before, and huge pages,
> > transparent huge pages and memory compaction are widely used, it is
> > now quite easy to see physically contiguous pages from fs in I/O. On
> > the other hand, from the block layer's view, it isn't necessary to
> > store intermediate pages in the bvec, and it is enough to just store
> > the physically contiguous 'segment' in each io vector.
> > 
> > Also huge pages are being brought to filesystems and swap [2][6]; we can
> > do IO on a hugepage each time[3], which requires that one bio can transfer
> > at least one huge page at a time. It turns out it isn't flexible to change
> > BIO_MAX_PAGES simply[3][5]. Multipage bvec fits this case very well.
> > As we saw, if CONFIG_THP_SWAP is enabled, BIO_MAX_PAGES can be configured
> > much bigger, such as 512, which requires at least two 4K pages for holding
> > the bvec table.
> 
> I'm pretty happy with this patchset at this point, looks like it just
> needs a respin to address the last comments. My only concern is whether

I will address the last comment from Omar on '[PATCH V12 01/20] btrfs:
remove various bio_offset arguments'; we may simply use the approach from
V11.

> it's a good idea to target this for 4.21, or if we should wait until
> 4.22. 4.21 has a fairly substantial amount of changes in terms of block
> already, it's not the best timing for something of this magnitude too.

Yeah, I understand.

> 
> I'm going back and forth on this one a bit. Any concerns with
> pushing this to 4.22?

My only concern is about the warning of "blk_cloned_rq_check_limits:
over max segments limit" on dm multipath; it seems Ewan and Mike are
waiting for this fix.

thanks,
Ming



Re: [Cluster-devel] [PATCH V12 16/20] block: enable multipage bvecs

2018-11-26 Thread Ming Lei
On Mon, Nov 26, 2018 at 01:58:42PM +0100, Christoph Hellwig wrote:
> > +   phys_addr_t vec_end_addr = page_to_phys(bv->bv_page) +
> > +   bv->bv_offset + bv->bv_len;
> 
> The name is a little confusing, as the real end addr would be -1.  Maybe
> throw the -1 in here, and adjust for it in the contiguous check below?

Yeah, it makes sense.

thanks,
Ming



[Cluster-devel] [PATCH V12 20/20] block: kill BLK_MQ_F_SG_MERGE

2018-11-25 Thread Ming Lei
QUEUE_FLAG_NO_SG_MERGE has been killed, so kill BLK_MQ_F_SG_MERGE too.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 block/blk-mq-debugfs.c   | 1 -
 drivers/block/loop.c | 2 +-
 drivers/block/nbd.c  | 2 +-
 drivers/block/rbd.c  | 2 +-
 drivers/block/skd_main.c | 1 -
 drivers/block/xen-blkfront.c | 2 +-
 drivers/md/dm-rq.c   | 2 +-
 drivers/mmc/core/queue.c | 3 +--
 drivers/scsi/scsi_lib.c  | 2 +-
 include/linux/blk-mq.h   | 1 -
 10 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index d752fe4461af..a6ec055b54fa 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -249,7 +249,6 @@ static const char *const alloc_policy_name[] = {
 static const char *const hctx_flag_name[] = {
HCTX_FLAG_NAME(SHOULD_MERGE),
HCTX_FLAG_NAME(TAG_SHARED),
-   HCTX_FLAG_NAME(SG_MERGE),
HCTX_FLAG_NAME(BLOCKING),
HCTX_FLAG_NAME(NO_SCHED),
 };
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index e3683211f12d..4cf5486689de 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1906,7 +1906,7 @@ static int loop_add(struct loop_device **l, int i)
lo->tag_set.queue_depth = 128;
lo->tag_set.numa_node = NUMA_NO_NODE;
lo->tag_set.cmd_size = sizeof(struct loop_cmd);
-   lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+   lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
lo->tag_set.driver_data = lo;
 
err = blk_mq_alloc_tag_set(>tag_set);
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 08696f5f00bb..999c94de78e5 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -1570,7 +1570,7 @@ static int nbd_dev_add(int index)
nbd->tag_set.numa_node = NUMA_NO_NODE;
nbd->tag_set.cmd_size = sizeof(struct nbd_cmd);
nbd->tag_set.flags = BLK_MQ_F_SHOULD_MERGE |
-   BLK_MQ_F_SG_MERGE | BLK_MQ_F_BLOCKING;
+   BLK_MQ_F_BLOCKING;
nbd->tag_set.driver_data = nbd;
 
err = blk_mq_alloc_tag_set(>tag_set);
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 8e5140bbf241..3dfd300b5283 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3988,7 +3988,7 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
rbd_dev->tag_set.ops = _mq_ops;
rbd_dev->tag_set.queue_depth = rbd_dev->opts->queue_depth;
rbd_dev->tag_set.numa_node = NUMA_NO_NODE;
-   rbd_dev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+   rbd_dev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
rbd_dev->tag_set.nr_hw_queues = 1;
rbd_dev->tag_set.cmd_size = sizeof(struct work_struct);
 
diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c
index a10d5736d8f7..a7040f9a1b1b 100644
--- a/drivers/block/skd_main.c
+++ b/drivers/block/skd_main.c
@@ -2843,7 +2843,6 @@ static int skd_cons_disk(struct skd_device *skdev)
skdev->sgs_per_request * sizeof(struct scatterlist);
skdev->tag_set.numa_node = NUMA_NO_NODE;
skdev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE |
-   BLK_MQ_F_SG_MERGE |
BLK_ALLOC_POLICY_TO_MQ_FLAG(BLK_TAG_ALLOC_FIFO);
skdev->tag_set.driver_data = skdev;
rc = blk_mq_alloc_tag_set(>tag_set);
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 0ed4b200fa58..d43a5677ccbc 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -977,7 +977,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 
sector_size,
} else
info->tag_set.queue_depth = BLK_RING_SIZE(info);
info->tag_set.numa_node = NUMA_NO_NODE;
-   info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+   info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
info->tag_set.cmd_size = sizeof(struct blkif_req);
info->tag_set.driver_data = info;
 
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 1f1fe9a618ea..afbac62a02a2 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -536,7 +536,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, 
struct dm_table *t)
md->tag_set->ops = _mq_ops;
md->tag_set->queue_depth = dm_get_blk_mq_queue_depth();
md->tag_set->numa_node = md->numa_node_id;
-   md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+   md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE;
md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
md->tag_set->driver_data = md;
 
diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index 35cc138b096d..cc19e71c71d4 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -410,8 +410,7 @@ int mmc_init_queue(struct mmc_queue *mq,

[Cluster-devel] [PATCH V12 19/20] block: kill QUEUE_FLAG_NO_SG_MERGE

2018-11-25 Thread Ming Lei
Since bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting"),
the physical segment number is mainly figured out in blk_queue_split() for
the fast path, and the BIO_SEG_VALID flag is set there too.

Now only blk_recount_segments() and blk_recalc_rq_segments() use this
flag.

Basically blk_recount_segments() is bypassed in the fast path given
BIO_SEG_VALID is set in blk_queue_split().

For the other user, blk_recalc_rq_segments():

- it runs in the partial completion branch of blk_update_request(), which
is an unusual case

- it runs in blk_cloned_rq_check_limits(), still not a big problem if the
flag is killed since dm-rq is the only user.

Multi-page bvec is enabled now; not doing S/G merging is rather pointless
with the current setup of the I/O path, as it isn't going to save a
significant amount of cycles.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 block/blk-merge.c  | 31 ++-
 block/blk-mq-debugfs.c |  1 -
 block/blk-mq.c |  3 ---
 drivers/md/dm-table.c  | 13 -
 include/linux/blkdev.h |  1 -
 5 files changed, 6 insertions(+), 43 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 20b5b0c3e182..9a7fd8b1f90a 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -355,8 +355,7 @@ void blk_queue_split(struct request_queue *q, struct bio 
**bio)
 EXPORT_SYMBOL(blk_queue_split);
 
 static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
-struct bio *bio,
-bool no_sg_merge)
+struct bio *bio)
 {
struct bio_vec bv, bvprv = { NULL };
unsigned int seg_size, nr_phys_segs;
@@ -382,13 +381,6 @@ static unsigned int __blk_recalc_rq_segments(struct 
request_queue *q,
nr_phys_segs = 0;
for_each_bio(bio) {
bio_for_each_bvec(bv, bio, iter) {
-   /*
-* If SG merging is disabled, each bio vector is
-* a segment
-*/
-   if (no_sg_merge)
-   goto new_segment;
-
if (prev) {
if (seg_size + bv.bv_len
> queue_max_segment_size(q))
@@ -418,27 +410,16 @@ static unsigned int __blk_recalc_rq_segments(struct 
request_queue *q,
 
 void blk_recalc_rq_segments(struct request *rq)
 {
-   bool no_sg_merge = !!test_bit(QUEUE_FLAG_NO_SG_MERGE,
-   >q->queue_flags);
-
-   rq->nr_phys_segments = __blk_recalc_rq_segments(rq->q, rq->bio,
-   no_sg_merge);
+   rq->nr_phys_segments = __blk_recalc_rq_segments(rq->q, rq->bio);
 }
 
 void blk_recount_segments(struct request_queue *q, struct bio *bio)
 {
-   unsigned short seg_cnt = bio_segments(bio);
-
-   if (test_bit(QUEUE_FLAG_NO_SG_MERGE, >queue_flags) &&
-   (seg_cnt < queue_max_segments(q)))
-   bio->bi_phys_segments = seg_cnt;
-   else {
-   struct bio *nxt = bio->bi_next;
+   struct bio *nxt = bio->bi_next;
 
-   bio->bi_next = NULL;
-   bio->bi_phys_segments = __blk_recalc_rq_segments(q, bio, false);
-   bio->bi_next = nxt;
-   }
+   bio->bi_next = NULL;
+   bio->bi_phys_segments = __blk_recalc_rq_segments(q, bio);
+   bio->bi_next = nxt;
 
bio_set_flag(bio, BIO_SEG_VALID);
 }
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index a32bb79d6c95..d752fe4461af 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -127,7 +127,6 @@ static const char *const blk_queue_flag_name[] = {
QUEUE_FLAG_NAME(SAME_FORCE),
QUEUE_FLAG_NAME(DEAD),
QUEUE_FLAG_NAME(INIT_DONE),
-   QUEUE_FLAG_NAME(NO_SG_MERGE),
QUEUE_FLAG_NAME(POLL),
QUEUE_FLAG_NAME(WC),
QUEUE_FLAG_NAME(FUA),
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b16204df65d1..7b17191d755b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2780,9 +2780,6 @@ struct request_queue *blk_mq_init_allocated_queue(struct 
blk_mq_tag_set *set,
 
q->queue_flags |= QUEUE_FLAG_MQ_DEFAULT;
 
-   if (!(set->flags & BLK_MQ_F_SG_MERGE))
-   blk_queue_flag_set(QUEUE_FLAG_NO_SG_MERGE, q);
-
q->sg_reserved_size = INT_MAX;
 
INIT_DELAYED_WORK(>requeue_work, blk_mq_requeue_work);
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 844f7d0f2ef8..a41832cf0c98 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1698,14 +1698,6 @@ static int device_is_not_random(struct dm_target *ti, 
struct dm_dev *dev,
return q && !blk_queue_add_random(q);
 }
 
-static int queue_supports_sg_merge(struct dm_target *ti, struct dm_de

[Cluster-devel] [PATCH V12 18/20] block: document usage of bio iterator helpers

2018-11-25 Thread Ming Lei
Now that multi-page bvec is supported, some helpers return data page by
page while others return it segment by segment; this patch documents
the usage.

Signed-off-by: Ming Lei 
---
 Documentation/block/biovecs.txt | 25 +
 1 file changed, 25 insertions(+)

diff --git a/Documentation/block/biovecs.txt b/Documentation/block/biovecs.txt
index 25689584e6e0..ce6eccaf5df7 100644
--- a/Documentation/block/biovecs.txt
+++ b/Documentation/block/biovecs.txt
@@ -117,3 +117,28 @@ Other implications:
size limitations and the limitations of the underlying devices. Thus
there's no need to define ->merge_bvec_fn() callbacks for individual block
drivers.
+
+Usage of helpers:
+=
+
+* The following helpers whose names have the suffix of "_all" can only be used
+on non-BIO_CLONED bio. They are usually used by filesystem code. Drivers
+shouldn't use them because the bio may have been split before it reached the
+driver.
+
+   bio_for_each_segment_all()
+   bio_first_bvec_all()
+   bio_first_page_all()
+   bio_last_bvec_all()
+
+* The following helpers iterate over single-page segment. The passed 'struct
+bio_vec' will contain a single-page IO vector during the iteration
+
+   bio_for_each_segment()
+   bio_for_each_segment_all()
+
+* The following helpers iterate over multi-page bvec. The passed 'struct
+bio_vec' will contain a multi-page IO vector during the iteration
+
+   bio_for_each_bvec()
+   rq_for_each_bvec()
-- 
2.9.5



[Cluster-devel] [PATCH V12 17/20] block: always define BIO_MAX_PAGES as 256

2018-11-25 Thread Ming Lei
Now multi-page bvec can cover CONFIG_THP_SWAP, so we don't need to
increase BIO_MAX_PAGES for it.

CONFIG_THP_SWAP needs to split one THP into normal pages and add
them all to one bio. With multi-page bvec, it just takes one bvec to
hold them all.
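
A minimal illustrative sketch (not from this patch; 'thp_head' is a
hypothetical THP head page, 4K base pages assumed):

	/* one bvec is now enough for a whole 2MB THP */
	struct bio_vec bv = {
		.bv_page   = thp_head,
		.bv_len    = HPAGE_PMD_NR * PAGE_SIZE,	/* 512 * 4K = 2MB */
		.bv_offset = 0,
	};

so a bio limited to BIO_MAX_PAGES (256) bvecs can still carry far more
than 256 pages, and HPAGE_PMD_NR no longer matters for the table size.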

Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 include/linux/bio.h | 8 
 1 file changed, 8 deletions(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 5505f74aef8b..7be48c55b14a 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -34,15 +34,7 @@
 #define BIO_BUG_ON
 #endif
 
-#ifdef CONFIG_THP_SWAP
-#if HPAGE_PMD_NR > 256
-#define BIO_MAX_PAGES  HPAGE_PMD_NR
-#else
 #define BIO_MAX_PAGES  256
-#endif
-#else
-#define BIO_MAX_PAGES  256
-#endif
 
 #define bio_prio(bio)  (bio)->bi_ioprio
 #define bio_set_prio(bio, prio)((bio)->bi_ioprio = prio)
-- 
2.9.5



[Cluster-devel] [PATCH V12 16/20] block: enable multipage bvecs

2018-11-25 Thread Ming Lei
This patch pulls the trigger for multi-page bvecs.

Signed-off-by: Ming Lei 
---
 block/bio.c | 22 +++---
 fs/iomap.c  |  4 ++--
 fs/xfs/xfs_aops.c   |  4 ++--
 include/linux/bio.h |  2 +-
 4 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 75fde30af51f..8bf9338d4783 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -753,6 +753,8 @@ EXPORT_SYMBOL(bio_add_pc_page);
  * @page: page to add
  * @len: length of the data to add
  * @off: offset of the data in @page
+ * @same_page: if %true only merge if the new data is in the same physical
+ * page as the last segment of the bio.
  *
  * Try to add the data at @page + @off to the last bvec of @bio.  This is a
  * a useful optimisation for file systems with a block size smaller than the
@@ -761,19 +763,25 @@ EXPORT_SYMBOL(bio_add_pc_page);
  * Return %true on success or %false on failure.
  */
 bool __bio_try_merge_page(struct bio *bio, struct page *page,
-   unsigned int len, unsigned int off)
+   unsigned int len, unsigned int off, bool same_page)
 {
if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
return false;
 
if (bio->bi_vcnt > 0) {
struct bio_vec *bv = >bi_io_vec[bio->bi_vcnt - 1];
+   phys_addr_t vec_end_addr = page_to_phys(bv->bv_page) +
+   bv->bv_offset + bv->bv_len;
+   phys_addr_t page_addr = page_to_phys(page);
 
-   if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len) {
-   bv->bv_len += len;
-   bio->bi_iter.bi_size += len;
-   return true;
-   }
+   if (vec_end_addr != page_addr + off)
+   return false;
+   if (same_page && ((vec_end_addr - 1) & PAGE_MASK) != page_addr)
+   return false;
+
+   bv->bv_len += len;
+   bio->bi_iter.bi_size += len;
+   return true;
}
return false;
 }
@@ -819,7 +827,7 @@ EXPORT_SYMBOL_GPL(__bio_add_page);
 int bio_add_page(struct bio *bio, struct page *page,
 unsigned int len, unsigned int offset)
 {
-   if (!__bio_try_merge_page(bio, page, len, offset)) {
+   if (!__bio_try_merge_page(bio, page, len, offset, false)) {
if (bio_full(bio))
return 0;
__bio_add_page(bio, page, len, offset);
diff --git a/fs/iomap.c b/fs/iomap.c
index 1f648d098a3b..ec5527b0fba4 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -313,7 +313,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, 
loff_t length, void *data,
 */
sector = iomap_sector(iomap, pos);
if (ctx->bio && bio_end_sector(ctx->bio) == sector) {
-   if (__bio_try_merge_page(ctx->bio, page, plen, poff))
+   if (__bio_try_merge_page(ctx->bio, page, plen, poff, true))
goto done;
is_contig = true;
}
@@ -344,7 +344,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, 
loff_t length, void *data,
ctx->bio->bi_end_io = iomap_read_end_io;
}
 
-   __bio_add_page(ctx->bio, page, plen, poff);
+   bio_add_page(ctx->bio, page, plen, poff);
 done:
/*
 * Move the caller beyond our range so that it keeps making progress.
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 1f1829e506e8..b9fd44168f61 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -616,12 +616,12 @@ xfs_add_to_ioend(
bdev, sector);
}
 
-   if (!__bio_try_merge_page(wpc->ioend->io_bio, page, len, poff)) {
+   if (!__bio_try_merge_page(wpc->ioend->io_bio, page, len, poff, true)) {
if (iop)
atomic_inc(>write_count);
if (bio_full(wpc->ioend->io_bio))
xfs_chain_bio(wpc->ioend, wbc, bdev, sector);
-   __bio_add_page(wpc->ioend->io_bio, page, len, poff);
+   bio_add_page(wpc->ioend->io_bio, page, len, poff);
}
 
wpc->ioend->io_size += len;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index c35997dd02c2..5505f74aef8b 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -441,7 +441,7 @@ extern int bio_add_page(struct bio *, struct page *, 
unsigned int,unsigned int);
 extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
   unsigned int, unsigned int);
 bool __bio_try_merge_page(struct bio *bio, struct page *page,
-   unsigned int len, unsigned int off);
+   unsigned int len, unsigned int off, bool same_page);
 void __bio_add_page(struct bio *bio, struct page *page,
unsigned int len, unsigned int off);
 int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter);
-- 
2.9.5



[Cluster-devel] [PATCH V12 12/20] fs/buffer.c: use bvec iterator to truncate the bio

2018-11-25 Thread Ming Lei
Once multi-page bvec is enabled, the last bvec may include more than one
page; this patch uses bvec_last_segment() to truncate the bio.

Reviewed-by: Omar Sandoval 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 fs/buffer.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 1286c2b95498..fa37ad52e962 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3032,7 +3032,10 @@ void guard_bio_eod(int op, struct bio *bio)
 
/* ..and clear the end of the buffer for reads */
if (op == REQ_OP_READ) {
-   zero_user(bvec->bv_page, bvec->bv_offset + bvec->bv_len,
+   struct bio_vec bv;
+
+   bvec_last_segment(bvec, );
+   zero_user(bv.bv_page, bv.bv_offset + bv.bv_len,
truncated_bytes);
}
 }
-- 
2.9.5



[Cluster-devel] [PATCH V12 14/20] bcache: avoid to use bio_for_each_segment_all() in bch_bio_alloc_pages()

2018-11-25 Thread Ming Lei
bch_bio_alloc_pages() is always called on a new bio, so it is safe
to access the bvec table directly. Given it is the only case of this
kind, open code the bvec table access, since bio_for_each_segment_all()
will be changed to support iterating over multi-page bvecs.

Acked-by: Coly Li 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 drivers/md/bcache/util.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
index 20eddeac1531..62fb917f7a4f 100644
--- a/drivers/md/bcache/util.c
+++ b/drivers/md/bcache/util.c
@@ -270,7 +270,11 @@ int bch_bio_alloc_pages(struct bio *bio, gfp_t gfp_mask)
int i;
struct bio_vec *bv;
 
-   bio_for_each_segment_all(bv, bio, i) {
+   /*
+* This is called on freshly new bio, so it is safe to access the
+* bvec table directly.
+*/
+   for (i = 0, bv = bio->bi_io_vec; i < bio->bi_vcnt; bv++, i++) {
bv->bv_page = alloc_page(gfp_mask);
if (!bv->bv_page) {
while (--bv >= bio->bi_io_vec)
-- 
2.9.5



[Cluster-devel] [PATCH V12 15/20] block: allow bio_for_each_segment_all() to iterate over multi-page bvec

2018-11-25 Thread Ming Lei
This patch introduces one extra iterator variable to
bio_for_each_segment_all(), so that bio_for_each_segment_all() can
iterate over multi-page bvecs.

Given it is just a mechanical & simple change on all
bio_for_each_segment_all() users, this patch makes the tree-wide change
in one single patch, so that we can avoid using a temporary helper for
this conversion.

Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 block/bio.c   | 27 ++-
 block/bounce.c|  6 --
 drivers/md/bcache/btree.c |  3 ++-
 drivers/md/dm-crypt.c |  3 ++-
 drivers/md/raid1.c|  3 ++-
 drivers/staging/erofs/data.c  |  3 ++-
 drivers/staging/erofs/unzip_vle.c |  3 ++-
 fs/block_dev.c|  6 --
 fs/btrfs/compression.c|  3 ++-
 fs/btrfs/disk-io.c|  3 ++-
 fs/btrfs/extent_io.c  |  9 ++---
 fs/btrfs/inode.c  |  6 --
 fs/btrfs/raid56.c |  3 ++-
 fs/crypto/bio.c   |  3 ++-
 fs/direct-io.c|  4 +++-
 fs/exofs/ore.c|  3 ++-
 fs/exofs/ore_raid.c   |  3 ++-
 fs/ext4/page-io.c |  3 ++-
 fs/ext4/readpage.c|  3 ++-
 fs/f2fs/data.c|  9 ++---
 fs/gfs2/lops.c|  6 --
 fs/gfs2/meta_io.c |  3 ++-
 fs/iomap.c|  6 --
 fs/mpage.c|  3 ++-
 fs/xfs/xfs_aops.c |  5 +++--
 include/linux/bio.h   | 11 +--
 include/linux/bvec.h  | 30 ++
 27 files changed, 125 insertions(+), 45 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 03895cc0d74a..75fde30af51f 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1073,8 +1073,9 @@ static int bio_copy_from_iter(struct bio *bio, struct 
iov_iter *iter)
 {
int i;
struct bio_vec *bvec;
+   struct bvec_iter_all iter_all;
 
-   bio_for_each_segment_all(bvec, bio, i) {
+   bio_for_each_segment_all(bvec, bio, i, iter_all) {
ssize_t ret;
 
ret = copy_page_from_iter(bvec->bv_page,
@@ -1104,8 +1105,9 @@ static int bio_copy_to_iter(struct bio *bio, struct 
iov_iter iter)
 {
int i;
struct bio_vec *bvec;
+   struct bvec_iter_all iter_all;
 
-   bio_for_each_segment_all(bvec, bio, i) {
+   bio_for_each_segment_all(bvec, bio, i, iter_all) {
ssize_t ret;
 
ret = copy_page_to_iter(bvec->bv_page,
@@ -1127,8 +1129,9 @@ void bio_free_pages(struct bio *bio)
 {
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all iter_all;
 
-   bio_for_each_segment_all(bvec, bio, i)
+   bio_for_each_segment_all(bvec, bio, i, iter_all)
__free_page(bvec->bv_page);
 }
 EXPORT_SYMBOL(bio_free_pages);
@@ -1295,6 +1298,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
struct bio *bio;
int ret;
struct bio_vec *bvec;
+   struct bvec_iter_all iter_all;
 
if (!iov_iter_count(iter))
return ERR_PTR(-EINVAL);
@@ -1368,7 +1372,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
return bio;
 
  out_unmap:
-   bio_for_each_segment_all(bvec, bio, j) {
+   bio_for_each_segment_all(bvec, bio, j, iter_all) {
put_page(bvec->bv_page);
}
bio_put(bio);
@@ -1379,11 +1383,12 @@ static void __bio_unmap_user(struct bio *bio)
 {
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all iter_all;
 
/*
 * make sure we dirty pages we wrote to
 */
-   bio_for_each_segment_all(bvec, bio, i) {
+   bio_for_each_segment_all(bvec, bio, i, iter_all) {
if (bio_data_dir(bio) == READ)
set_page_dirty_lock(bvec->bv_page);
 
@@ -1475,8 +1480,9 @@ static void bio_copy_kern_endio_read(struct bio *bio)
char *p = bio->bi_private;
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all iter_all;
 
-   bio_for_each_segment_all(bvec, bio, i) {
+   bio_for_each_segment_all(bvec, bio, i, iter_all) {
memcpy(p, page_address(bvec->bv_page), bvec->bv_len);
p += bvec->bv_len;
}
@@ -1585,8 +1591,9 @@ void bio_set_pages_dirty(struct bio *bio)
 {
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all iter_all;
 
-   bio_for_each_segment_all(bvec, bio, i) {
+   bio_for_each_segment_all(bvec, bio, i, iter_all) {
if (!PageCompound(bvec->bv_page))
set_page_dirty_lock(bvec->bv_page);
}
@@ -1597,8 +1604,9 @@ static void bio_release_pages(struct bio *bio)
 {
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all iter_all;
 
-   

[Cluster-devel] [PATCH V12 13/20] block: loop: pass multi-page bvec to iov_iter

2018-11-25 Thread Ming Lei
iov_iter is implemented on bvec iterator helpers, so it is safe to pass
multi-page bvecs to it, and this way is much more efficient than passing
one page in each bvec.

Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 drivers/block/loop.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 176ab1f28eca..e3683211f12d 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -510,21 +510,22 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
 loff_t pos, bool rw)
 {
struct iov_iter iter;
+   struct req_iterator rq_iter;
struct bio_vec *bvec;
struct request *rq = blk_mq_rq_from_pdu(cmd);
struct bio *bio = rq->bio;
struct file *file = lo->lo_backing_file;
+   struct bio_vec tmp;
unsigned int offset;
-   int segments = 0;
+   int nr_bvec = 0;
int ret;
 
+   rq_for_each_bvec(tmp, rq, rq_iter)
+   nr_bvec++;
+
if (rq->bio != rq->biotail) {
-   struct req_iterator iter;
-   struct bio_vec tmp;
 
-   __rq_for_each_bio(bio, rq)
-   segments += bio_segments(bio);
-   bvec = kmalloc_array(segments, sizeof(struct bio_vec),
+   bvec = kmalloc_array(nr_bvec, sizeof(struct bio_vec),
 GFP_NOIO);
if (!bvec)
return -EIO;
@@ -533,10 +534,10 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
/*
 * The bios of the request may be started from the middle of
 * the 'bvec' because of bio splitting, so we can't directly
-* copy bio->bi_iov_vec to new bvec. The rq_for_each_segment
+* copy bio->bi_iov_vec to new bvec. The rq_for_each_bvec
 * API will take care of all details for us.
 */
-   rq_for_each_segment(tmp, rq, iter) {
+   rq_for_each_bvec(tmp, rq, rq_iter) {
*bvec = tmp;
bvec++;
}
@@ -550,11 +551,10 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
 */
offset = bio->bi_iter.bi_bvec_done;
bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
-   segments = bio_segments(bio);
}
atomic_set(>ref, 2);
 
-   iov_iter_bvec(, rw, bvec, segments, blk_rq_bytes(rq));
+   iov_iter_bvec(, rw, bvec, nr_bvec, blk_rq_bytes(rq));
iter.iov_offset = offset;
 
cmd->iocb.ki_pos = pos;
-- 
2.9.5



[Cluster-devel] [PATCH V12 11/20] block: introduce bvec_last_segment()

2018-11-25 Thread Ming Lei
BTRFS and guard_bio_eod() need to get the last single-page segment
from one multi-page bvec, so introduce this helper to make them happy.
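
A worked example (illustrative, assuming 4K pages): for the two-page
bvec { .bv_page = p, .bv_offset = 0, .bv_len = 8192 },
bvec_last_segment() stores { .bv_page = nth_page(p, 1), .bv_offset = 0,
.bv_len = 4096 } into @seg.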

Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 include/linux/bvec.h | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index d441486db605..ca6e630f88ab 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -131,4 +131,26 @@ static inline bool bvec_iter_advance(const struct bio_vec 
*bv,
.bi_bvec_done   = 0,\
 }
 
+/*
+ * Get the last single-page segment from the multi-page bvec and store it
+ * in @seg
+ */
+static inline void bvec_last_segment(const struct bio_vec *bvec,
+struct bio_vec *seg)
+{
+   unsigned total = bvec->bv_offset + bvec->bv_len;
+   unsigned last_page = (total - 1) / PAGE_SIZE;
+
+   seg->bv_page = nth_page(bvec->bv_page, last_page);
+
+   /* the whole segment is inside the last page */
+   if (bvec->bv_offset >= last_page * PAGE_SIZE) {
+   seg->bv_offset = bvec->bv_offset % PAGE_SIZE;
+   seg->bv_len = bvec->bv_len;
+   } else {
+   seg->bv_offset = 0;
+   seg->bv_len = total - last_page * PAGE_SIZE;
+   }
+}
+
 #endif /* __LINUX_BVEC_ITER_H */
-- 
2.9.5



[Cluster-devel] [PATCH V12 09/20] block: use bio_for_each_bvec() to compute multi-page bvec count

2018-11-25 Thread Ming Lei
First, it is more efficient to use bio_for_each_bvec() in both
blk_bio_segment_split() and __blk_recalc_rq_segments() to compute how
many multi-page bvecs there are in the bio.

Secondly, once bio_for_each_bvec() is used, a bvec may need to be
split because its length can be much larger than the max segment size,
so we have to split the big bvec into several segments.

Thirdly, when splitting a multi-page bvec into segments, the max segment
limit may be reached, so bio splitting needs to be considered in
this situation too.
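
As a worked example (illustrative, ignoring the virt boundary): with
queue_max_segment_size(q) == 64K, one 128K multi-page bvec is accounted
by bvec_split_segs() as two 64K segments, and if that makes nsegs reach
queue_max_segments(q), the bio has to be split right there.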

Signed-off-by: Ming Lei 
---
 block/blk-merge.c | 100 +++---
 1 file changed, 80 insertions(+), 20 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 51ec6ca56a0a..2d8f388d43de 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -161,6 +161,70 @@ static inline unsigned get_max_io_size(struct 
request_queue *q,
return sectors;
 }
 
+static unsigned get_max_segment_size(struct request_queue *q,
+unsigned offset)
+{
+   unsigned long mask = queue_segment_boundary(q);
+
+   return min_t(unsigned long, mask - (mask & offset) + 1,
+queue_max_segment_size(q));
+}
+
+/*
+ * Split the bvec @bv into segments, and update all kinds of
+ * variables.
+ */
+static bool bvec_split_segs(struct request_queue *q, struct bio_vec *bv,
+   unsigned *nsegs, unsigned *last_seg_size,
+   unsigned *front_seg_size, unsigned *sectors)
+{
+   unsigned len = bv->bv_len;
+   unsigned total_len = 0;
+   unsigned new_nsegs = 0, seg_size = 0;
+
+   /*
+* Multipage bvec may be too big to hold in one segment,
+* so the current bvec has to be splitted as multiple
+* segments.
+*/
+   while (len && new_nsegs + *nsegs < queue_max_segments(q)) {
+   seg_size = get_max_segment_size(q, bv->bv_offset + total_len);
+   seg_size = min(seg_size, len);
+
+   new_nsegs++;
+   total_len += seg_size;
+   len -= seg_size;
+
+   if ((bv->bv_offset + total_len) & queue_virt_boundary(q))
+   break;
+   }
+
+   if (!new_nsegs)
+   return !!len;
+
+   /* update front segment size */
+   if (!*nsegs) {
+   unsigned first_seg_size;
+
+   if (new_nsegs == 1)
+   first_seg_size = get_max_segment_size(q, bv->bv_offset);
+   else
+   first_seg_size = queue_max_segment_size(q);
+
+   if (*front_seg_size < first_seg_size)
+   *front_seg_size = first_seg_size;
+   }
+
+   /* update other varibles */
+   *last_seg_size = seg_size;
+   *nsegs += new_nsegs;
+   if (sectors)
+   *sectors += total_len >> 9;
+
+   /* split in the middle of the bvec if len != 0 */
+   return !!len;
+}
+
 static struct bio *blk_bio_segment_split(struct request_queue *q,
 struct bio *bio,
 struct bio_set *bs,
@@ -174,7 +238,7 @@ static struct bio *blk_bio_segment_split(struct 
request_queue *q,
struct bio *new = NULL;
const unsigned max_sectors = get_max_io_size(q, bio);
 
-   bio_for_each_segment(bv, bio, iter) {
+   bio_for_each_bvec(bv, bio, iter) {
/*
 * If the queue doesn't support SG gaps and adding this
 * offset would create a gap, disallow it.
@@ -189,8 +253,12 @@ static struct bio *blk_bio_segment_split(struct 
request_queue *q,
 */
if (nsegs < queue_max_segments(q) &&
sectors < max_sectors) {
-   nsegs++;
-   sectors = max_sectors;
+   /* split in the middle of bvec */
+   bv.bv_len = (max_sectors - sectors) << 9;
+   bvec_split_segs(q, , ,
+   _size,
+   _seg_size,
+   );
}
goto split;
}
@@ -212,14 +280,12 @@ static struct bio *blk_bio_segment_split(struct 
request_queue *q,
if (nsegs == queue_max_segments(q))
goto split;
 
-   if (nsegs == 1 && seg_size > front_seg_size)
-   front_seg_size = seg_size;
-
-   nsegs++;
bvprv = bv;
bvprvp = 
-   seg_size = bv.bv_len;
-   sectors += bv.bv_len >> 9;
+
+   if (bvec_split_segs(q, , , _size,
+   _seg_size, ))
+ 

[Cluster-devel] [PATCH V12 10/20] block: use bio_for_each_bvec() to map sg

2018-11-25 Thread Ming Lei
It is more efficient to use bio_for_each_bvec() to map sg; meanwhile we
have to consider splitting multi-page bvecs, as done in
blk_bio_segment_split().
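
For instance (illustrative, assuming 4K pages and
queue_max_segment_size(q) == 64K): blk_bvec_map_sg() below maps one
128K multi-page bvec to two 64K scatterlist entries, computing the page
and in-page offset of each piece via nth_page() before sg_set_page().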

Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 block/blk-merge.c | 70 +++
 1 file changed, 50 insertions(+), 20 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 2d8f388d43de..20b5b0c3e182 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -462,6 +462,54 @@ static int blk_phys_contig_segment(struct request_queue 
*q, struct bio *bio,
return biovec_phys_mergeable(q, _bv, _bv);
 }
 
+static struct scatterlist *blk_next_sg(struct scatterlist **sg,
+   struct scatterlist *sglist)
+{
+   if (!*sg)
+   return sglist;
+
+   /*
+* If the driver previously mapped a shorter list, we could see a
+* termination bit prematurely unless it fully inits the sg table
+* on each mapping. We KNOW that there must be more entries here
+* or the driver would be buggy, so force clear the termination bit
+* to avoid doing a full sg_init_table() in drivers for each command.
+*/
+   sg_unmark_end(*sg);
+   return sg_next(*sg);
+}
+
+static unsigned blk_bvec_map_sg(struct request_queue *q,
+   struct bio_vec *bvec, struct scatterlist *sglist,
+   struct scatterlist **sg)
+{
+   unsigned nbytes = bvec->bv_len;
+   unsigned nsegs = 0, total = 0, offset = 0;
+
+   while (nbytes > 0) {
+   unsigned seg_size;
+   struct page *pg;
+   unsigned idx;
+
+   *sg = blk_next_sg(sg, sglist);
+
+   seg_size = get_max_segment_size(q, bvec->bv_offset + total);
+   seg_size = min(nbytes, seg_size);
+
+   offset = (total + bvec->bv_offset) % PAGE_SIZE;
+   idx = (total + bvec->bv_offset) / PAGE_SIZE;
+   pg = nth_page(bvec->bv_page, idx);
+
+   sg_set_page(*sg, pg, seg_size, offset);
+
+   total += seg_size;
+   nbytes -= seg_size;
+   nsegs++;
+   }
+
+   return nsegs;
+}
+
 static inline void
 __blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
 struct scatterlist *sglist, struct bio_vec *bvprv,
@@ -479,25 +527,7 @@ __blk_segment_map_sg(struct request_queue *q, struct 
bio_vec *bvec,
(*sg)->length += nbytes;
} else {
 new_segment:
-   if (!*sg)
-   *sg = sglist;
-   else {
-   /*
-* If the driver previously mapped a shorter
-* list, we could see a termination bit
-* prematurely unless it fully inits the sg
-* table on each mapping. We KNOW that there
-* must be more entries here or the driver
-* would be buggy, so force clear the
-* termination bit to avoid doing a full
-* sg_init_table() in drivers for each command.
-*/
-   sg_unmark_end(*sg);
-   *sg = sg_next(*sg);
-   }
-
-   sg_set_page(*sg, bvec->bv_page, nbytes, bvec->bv_offset);
-   (*nsegs)++;
+   (*nsegs) += blk_bvec_map_sg(q, bvec, sglist, sg);
}
*bvprv = *bvec;
 }
@@ -519,7 +549,7 @@ static int __blk_bios_map_sg(struct request_queue *q, 
struct bio *bio,
int nsegs = 0;
 
for_each_bio(bio)
-   bio_for_each_segment(bvec, bio, iter)
+   bio_for_each_bvec(bvec, bio, iter)
__blk_segment_map_sg(q, , sglist, , sg,
 );
 
-- 
2.9.5



[Cluster-devel] [PATCH V12 08/20] block: introduce bio_for_each_bvec() and rq_for_each_bvec()

2018-11-25 Thread Ming Lei
bio_for_each_bvec() is used for iterating over multi-page bvecs in the
bio split & merge code.

rq_for_each_bvec() can be used by drivers which may handle the
multi-page bvec directly; so far loop is a perfect use case.
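
A minimal usage sketch (hypothetical driver code, not from this patch;
'rq' is the request being handled):

	struct req_iterator rq_iter;
	struct bio_vec bv;
	unsigned int bytes = 0;

	/* each 'bv' may span several physically contiguous pages */
	rq_for_each_bvec(bv, rq, rq_iter)
		bytes += bv.bv_len;

which is also how the loop conversion in this series counts the bvec
slots it needs before building an iov_iter.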

Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 include/linux/bio.h| 10 ++
 include/linux/blkdev.h |  4 
 include/linux/bvec.h   |  7 +++
 3 files changed, 21 insertions(+)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 6a0ff02f4d1c..46fd0e03233b 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -156,6 +156,16 @@ static inline void bio_advance_iter(struct bio *bio, 
struct bvec_iter *iter,
 #define bio_for_each_segment(bvl, bio, iter)   \
__bio_for_each_segment(bvl, bio, iter, (bio)->bi_iter)
 
+#define __bio_for_each_bvec(bvl, bio, iter, start) \
+   for (iter = (start);\
+(iter).bi_size &&  \
+   ((bvl = bvec_iter_bvec((bio)->bi_io_vec, (iter))), 1); \
+bio_advance_iter((bio), &(iter), (bvl).bv_len))
+
+/* iterate over multi-page bvec */
+#define bio_for_each_bvec(bvl, bio, iter)  \
+   __bio_for_each_bvec(bvl, bio, iter, (bio)->bi_iter)
+
 #define bio_iter_last(bvec, iter) ((iter).bi_size == (bvec).bv_len)
 
 static inline unsigned bio_segments(struct bio *bio)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 399a7a415609..fa263de3f1d1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -799,6 +799,10 @@ struct req_iterator {
__rq_for_each_bio(_iter.bio, _rq)   \
bio_for_each_segment(bvl, _iter.bio, _iter.iter)
 
+#define rq_for_each_bvec(bvl, _rq, _iter)  \
+   __rq_for_each_bio(_iter.bio, _rq)   \
+   bio_for_each_bvec(bvl, _iter.bio, _iter.iter)
+
 #define rq_iter_last(bvec, _iter)  \
(_iter.bio->bi_next == NULL &&  \
 bio_iter_last(bvec, _iter.iter))
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index babc6316c117..d441486db605 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -65,6 +65,13 @@ struct bvec_iter {
 #define bvec_iter_page_idx(bvec, iter) \
(bvec_iter_offset((bvec), (iter)) / PAGE_SIZE)
 
+#define bvec_iter_bvec(bvec, iter) \
+((struct bio_vec) {\
+   .bv_page= bvec_iter_page((bvec), (iter)),   \
+   .bv_len = bvec_iter_len((bvec), (iter)),\
+   .bv_offset  = bvec_iter_offset((bvec), (iter)), \
+})
+
 /* For building single-page bvec(segment) in flight */
  #define segment_iter_offset(bvec, iter)   \
(bvec_iter_offset((bvec), (iter)) % PAGE_SIZE)
-- 
2.9.5



[Cluster-devel] [PATCH V12 02/20] btrfs: look at bi_size for repair decisions

2018-11-25 Thread Ming Lei
From: Christoph Hellwig 

bio_readpage_error currently uses bi_vcnt to decide if it is worth
retrying an I/O.  But the vector count is mostly an implementation
artifact - it really should figure out if there is more than a
single sector worth retrying.  Use bi_size for that and shift by
PAGE_SHIFT.  This really should be blocks/sectors, but given that
btrfs doesn't support a sector size different from the PAGE_SIZE
using the page size keeps the changes to a minimum.

Reviewed-by: David Sterba 
Signed-off-by: Christoph Hellwig 
---
 fs/btrfs/extent_io.c | 2 +-
 include/linux/bio.h  | 6 --
 2 files changed, 1 insertion(+), 7 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 15fd46582bb2..40751e86a2a9 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2368,7 +2368,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 
phy_offset,
int read_mode = 0;
blk_status_t status;
int ret;
-   unsigned failed_bio_pages = bio_pages_all(failed_bio);
+   unsigned failed_bio_pages = failed_bio->bi_iter.bi_size >> PAGE_SHIFT;
 
BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
 
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 056fb627edb3..6f6bc331a5d1 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -263,12 +263,6 @@ static inline void bio_get_last_bvec(struct bio *bio, 
struct bio_vec *bv)
bv->bv_len = iter.bi_bvec_done;
 }
 
-static inline unsigned bio_pages_all(struct bio *bio)
-{
-   WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
-   return bio->bi_vcnt;
-}
-
 static inline struct bio_vec *bio_first_bvec_all(struct bio *bio)
 {
WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
-- 
2.9.5



[Cluster-devel] [PATCH V12 06/20] block: rename bvec helpers

2018-11-25 Thread Ming Lei
We will support multi-page bvec soon, and have to deal with
single-page vs multi-page bvec. This patch follows Christoph's
suggestion to rename all the following helpers:

for_each_bvec
bvec_iter_bvec
bvec_iter_len
bvec_iter_page
bvec_iter_offset

into:
for_each_segment
segment_iter_bvec
segment_iter_len
segment_iter_page
segment_iter_offset

so that the helpers named with 'segment' only deal with single-page
bvecs, which we call segments. We will introduce helpers named with
'bvec' for multi-page bvecs.

bvec_iter_advance() isn't renamed because this helper always operates
on the real bvec even though multi-page bvec is supported.

Suggested-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 .clang-format  |  2 +-
 drivers/md/dm-integrity.c  |  2 +-
 drivers/md/dm-io.c |  4 ++--
 drivers/nvdimm/blk.c   |  4 ++--
 drivers/nvdimm/btt.c   |  4 ++--
 include/linux/bio.h| 10 +-
 include/linux/bvec.h   | 20 +++-
 include/linux/ceph/messenger.h |  2 +-
 lib/iov_iter.c |  2 +-
 net/ceph/messenger.c   | 14 +++---
 10 files changed, 33 insertions(+), 31 deletions(-)

diff --git a/.clang-format b/.clang-format
index e6080f5834a3..049200fbab94 100644
--- a/.clang-format
+++ b/.clang-format
@@ -120,7 +120,7 @@ ForEachMacros:
   - 'for_each_available_child_of_node'
   - 'for_each_bio'
   - 'for_each_board_func_rsrc'
-  - 'for_each_bvec'
+  - 'for_each_segment'
   - 'for_each_child_of_node'
   - 'for_each_clear_bit'
   - 'for_each_clear_bit_from'
diff --git a/drivers/md/dm-integrity.c b/drivers/md/dm-integrity.c
index bb3096bf2cc6..bb037ed2b4eb 100644
--- a/drivers/md/dm-integrity.c
+++ b/drivers/md/dm-integrity.c
@@ -1574,7 +1574,7 @@ static bool __journal_read_write(struct dm_integrity_io 
*dio, struct bio *bio,
char *tag_ptr = journal_entry_tag(ic, je);
 
if (bip) do {
-   struct bio_vec biv = 
bvec_iter_bvec(bip->bip_vec, bip->bip_iter);
+   struct bio_vec biv = 
segment_iter_bvec(bip->bip_vec, bip->bip_iter);
unsigned tag_now = min(biv.bv_len, 
tag_todo);
char *tag_addr;
BUG_ON(PageHighMem(biv.bv_page));
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index 81ffc59d05c9..d72ec2bdd333 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -208,8 +208,8 @@ static void list_dp_init(struct dpages *dp, struct 
page_list *pl, unsigned offse
 static void bio_get_page(struct dpages *dp, struct page **p,
 unsigned long *len, unsigned *offset)
 {
-   struct bio_vec bvec = bvec_iter_bvec((struct bio_vec *)dp->context_ptr,
-dp->context_bi);
+   struct bio_vec bvec = segment_iter_bvec((struct bio_vec 
*)dp->context_ptr,
+   dp->context_bi);
 
*p = bvec.bv_page;
*len = bvec.bv_len;
diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
index db45c6bbb7bb..dfae945216bb 100644
--- a/drivers/nvdimm/blk.c
+++ b/drivers/nvdimm/blk.c
@@ -89,9 +89,9 @@ static int nd_blk_rw_integrity(struct nd_namespace_blk *nsblk,
struct bio_vec bv;
void *iobuf;
 
-   bv = bvec_iter_bvec(bip->bip_vec, bip->bip_iter);
+   bv = segment_iter_bvec(bip->bip_vec, bip->bip_iter);
/*
-* The 'bv' obtained from bvec_iter_bvec has its .bv_len and
+* The 'bv' obtained from segment_iter_bvec has its .bv_len and
 * .bv_offset already adjusted for iter->bi_bvec_done, and we
 * can use those directly
 */
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index b123b0dcf274..2bbbc90c7b91 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1154,9 +1154,9 @@ static int btt_rw_integrity(struct btt *btt, struct 
bio_integrity_payload *bip,
struct bio_vec bv;
void *mem;
 
-   bv = bvec_iter_bvec(bip->bip_vec, bip->bip_iter);
+   bv = segment_iter_bvec(bip->bip_vec, bip->bip_iter);
/*
-* The 'bv' obtained from bvec_iter_bvec has its .bv_len and
+* The 'bv' obtained from segment_iter_bvec has its .bv_len and
 * .bv_offset already adjusted for iter->bi_bvec_done, and we
 * can use those directly
 */
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 6f6bc331a5d1..6a0ff02f4d1c 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -48,14 +48,14 @@
 #define bio_set_prio(bio, prio)(

[Cluster-devel] [PATCH V12 07/20] block: introduce multi-page bvec helpers

2018-11-25 Thread Ming Lei
This patch introduces helpers of 'bvec_iter_*' for multi-page bvec
support.

The introduced helpers treat one bvec as a real multi-page segment,
which may include more than one page.

The existing bvec_iter_* helpers are interfaces for supporting the
current bvec iterator, which is treated as single-page by drivers, fs,
dm and so on. The renamed segment_iter_* helpers build single-page bvecs
in flight, so this way won't break current bio/bvec users, which need no
change.

Follows some multi-page bvec background:

- bvecs stored in bio->bi_io_vec is always multi-page style

- bvec(struct bio_vec) represents one physically contiguous I/O
  buffer, now the buffer may include more than one page after
  multi-page bvec is supported, and all these pages represented
  by one bvec is physically contiguous. Before multi-page bvec
  support, at most one page is included in one bvec, we call it
  single-page bvec.

- .bv_page of the bvec points to the 1st page in the multi-page bvec

- .bv_offset of the bvec is the offset of the buffer in the bvec

The effect on the current drivers/filesystem/dm/bcache/...:

- almost everyone supposes that one bvec only includes one single
  page, so we keep the sp interface not changed, for example,
  bio_for_each_segment() still returns single-page bvec

- bio_for_each_segment_all() will return single-page bvec too

- during iterating, iterator variable(struct bvec_iter) is always
  updated in multi-page bvec style, and bvec_iter_advance() is kept
  not changed

- returned(copied) single-page bvec is built in flight by bvec
  helpers from the stored multi-page bvec
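
A worked example (illustrative, 4K pages, iterator at bi_bvec_done == 0
with bi_size >= 8192): for a stored bvec { .bv_page = p, .bv_offset = 0,
.bv_len = 8192 },

	bvec_iter_len(bvec, iter)	== 8192	/* whole segment */
	segment_iter_len(bvec, iter)	== 4096	/* clamped to one page */

so bio_for_each_bvec() sees the full two-page segment, while
bio_for_each_segment() still yields two single-page bvecs.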

Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 include/linux/bvec.h | 25 -
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 716a87b26a6a..babc6316c117 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * was unsigned short, but we might as well be ready for > 64kB I/O pages
@@ -50,16 +51,32 @@ struct bvec_iter {
  */
 #define __bvec_iter_bvec(bvec, iter)   (&(bvec)[(iter).bi_idx])
 
-#define segment_iter_page(bvec, iter)  \
+/* multi-page (segment) helpers */
+#define bvec_iter_page(bvec, iter) \
(__bvec_iter_bvec((bvec), (iter))->bv_page)
 
-#define segment_iter_len(bvec, iter)   \
+#define bvec_iter_len(bvec, iter)  \
min((iter).bi_size, \
__bvec_iter_bvec((bvec), (iter))->bv_len - (iter).bi_bvec_done)
 
-#define segment_iter_offset(bvec, iter)\
+#define bvec_iter_offset(bvec, iter)   \
(__bvec_iter_bvec((bvec), (iter))->bv_offset + (iter).bi_bvec_done)
 
+#define bvec_iter_page_idx(bvec, iter) \
+   (bvec_iter_offset((bvec), (iter)) / PAGE_SIZE)
+
+/* For building single-page bvec(segment) in flight */
+ #define segment_iter_offset(bvec, iter)   \
+   (bvec_iter_offset((bvec), (iter)) % PAGE_SIZE)
+
+#define segment_iter_len(bvec, iter)   \
+   min_t(unsigned, bvec_iter_len((bvec), (iter)),  \
+ PAGE_SIZE - segment_iter_offset((bvec), (iter)))
+
+#define segment_iter_page(bvec, iter)  \
+   nth_page(bvec_iter_page((bvec), (iter)),\
+bvec_iter_page_idx((bvec), (iter)))
+
 #define segment_iter_bvec(bvec, iter)  \
 ((struct bio_vec) {\
.bv_page= segment_iter_page((bvec), (iter)),\
@@ -67,8 +84,6 @@ struct bvec_iter {
.bv_offset  = segment_iter_offset((bvec), (iter)),  \
 })
 
-#define bvec_iter_len  segment_iter_len
-
 static inline bool bvec_iter_advance(const struct bio_vec *bv,
struct bvec_iter *iter, unsigned bytes)
 {
-- 
2.9.5



[Cluster-devel] [PATCH V12 05/20] block: remove bvec_iter_rewind()

2018-11-25 Thread Ming Lei
Commit 7759eb23fd980 ("block: remove bio_rewind_iter()") removed
bio_rewind_iter(); since then no one uses bvec_iter_rewind() any more,
so remove it.

Signed-off-by: Ming Lei 
---
 include/linux/bvec.h | 24 
 1 file changed, 24 deletions(-)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 02c73c6aa805..ba0ae40e77c9 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -92,30 +92,6 @@ static inline bool bvec_iter_advance(const struct bio_vec 
*bv,
return true;
 }
 
-static inline bool bvec_iter_rewind(const struct bio_vec *bv,
-struct bvec_iter *iter,
-unsigned int bytes)
-{
-   while (bytes) {
-   unsigned len = min(bytes, iter->bi_bvec_done);
-
-   if (iter->bi_bvec_done == 0) {
-   if (WARN_ONCE(iter->bi_idx == 0,
- "Attempted to rewind iter beyond "
- "bvec's boundaries\n")) {
-   return false;
-   }
-   iter->bi_idx--;
-   iter->bi_bvec_done = __bvec_iter_bvec(bv, 
*iter)->bv_len;
-   continue;
-   }
-   bytes -= len;
-   iter->bi_size += len;
-   iter->bi_bvec_done -= len;
-   }
-   return true;
-}
-
 #define for_each_bvec(bvl, bio_vec, iter, start)   \
for (iter = (start);\
 (iter).bi_size &&  \
-- 
2.9.5



[Cluster-devel] [PATCH V12 04/20] block: don't use bio->bi_vcnt to figure out segment number

2018-11-25 Thread Ming Lei
It is wrong to use bio->bi_vcnt to figure out how many segments
there are in the bio, even though the CLONED flag isn't set on this bio,
because this bio may have been split or advanced.

So always use bio_segments() in blk_recount_segments(); it shouldn't
cause any performance loss now because the physical segment number is
figured out in blk_queue_split() and BIO_SEG_VALID is set there as well,
since bdced438acd83ad83a6c ("block: setup bi_phys_segments after
splitting").
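
As an illustrative example (assuming single-page bvecs): after

	bio_advance(bio, bio->bi_io_vec[0].bv_len);

bio->bi_vcnt is unchanged, but bio_segments(bio) walks the remaining
bi_iter range and correctly reports one segment fewer.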

Reviewed-by: Christoph Hellwig 
Fixes: 76d8137a3113 ("blk-merge: recaculate segment if it isn't less than max 
segments")
Signed-off-by: Ming Lei 
---
 block/blk-merge.c | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index e69d8f8ba819..51ec6ca56a0a 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -367,13 +367,7 @@ void blk_recalc_rq_segments(struct request *rq)
 
 void blk_recount_segments(struct request_queue *q, struct bio *bio)
 {
-   unsigned short seg_cnt;
-
-   /* estimate segment number by bi_vcnt for non-cloned bio */
-   if (bio_flagged(bio, BIO_CLONED))
-   seg_cnt = bio_segments(bio);
-   else
-   seg_cnt = bio->bi_vcnt;
+   unsigned short seg_cnt = bio_segments(bio);
 
if (test_bit(QUEUE_FLAG_NO_SG_MERGE, >queue_flags) &&
(seg_cnt < queue_max_segments(q)))
-- 
2.9.5



[Cluster-devel] [PATCH V12 03/20] block: remove the "cluster" flag

2018-11-25 Thread Ming Lei
From: Christoph Hellwig 

The cluster flag implements some very old SCSI behavior.  As far as I
can tell the original intent was to enable or disable any kind of
segment merging.  But the actual visible effect on the LLDD is that
it limits each segment to a single page, which we can
also achieve by setting the maximum segment size and the segment
boundary.
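
A minimal sketch (illustrative only, not necessarily the exact hunk of
this patch) of expressing the old "no clustering" behaviour with the
remaining generic limits, assuming a SCSI host that disables clustering:

	if (!shost->use_clustering) {
		/* keep every segment within a single page */
		blk_queue_max_segment_size(q, PAGE_SIZE);
		blk_queue_segment_boundary(q, PAGE_SIZE - 1);
	}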

Signed-off-by: Christoph Hellwig 

Replace virt boundary with segment boundary limit.

Signed-off-by: Ming Lei 
---
 block/blk-merge.c   | 20 
 block/blk-settings.c|  3 ---
 block/blk-sysfs.c   |  5 +
 drivers/scsi/scsi_lib.c | 20 
 include/linux/blkdev.h  |  6 --
 5 files changed, 25 insertions(+), 29 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 6be04ef8da5b..e69d8f8ba819 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -195,7 +195,7 @@ static struct bio *blk_bio_segment_split(struct 
request_queue *q,
goto split;
}
 
-   if (bvprvp && blk_queue_cluster(q)) {
+   if (bvprvp) {
if (seg_size + bv.bv_len > queue_max_segment_size(q))
goto new_segment;
if (!biovec_phys_mergeable(q, bvprvp, ))
@@ -295,10 +295,10 @@ static unsigned int __blk_recalc_rq_segments(struct 
request_queue *q,
 bool no_sg_merge)
 {
struct bio_vec bv, bvprv = { NULL };
-   int cluster, prev = 0;
unsigned int seg_size, nr_phys_segs;
struct bio *fbio, *bbio;
struct bvec_iter iter;
+   bool prev = false;
 
if (!bio)
return 0;
@@ -313,7 +313,6 @@ static unsigned int __blk_recalc_rq_segments(struct 
request_queue *q,
}
 
fbio = bio;
-   cluster = blk_queue_cluster(q);
seg_size = 0;
nr_phys_segs = 0;
for_each_bio(bio) {
@@ -325,7 +324,7 @@ static unsigned int __blk_recalc_rq_segments(struct 
request_queue *q,
if (no_sg_merge)
goto new_segment;
 
-   if (prev && cluster) {
+   if (prev) {
if (seg_size + bv.bv_len
> queue_max_segment_size(q))
goto new_segment;
@@ -343,7 +342,7 @@ static unsigned int __blk_recalc_rq_segments(struct 
request_queue *q,
 
nr_phys_segs++;
bvprv = bv;
-   prev = 1;
+   prev = true;
seg_size = bv.bv_len;
}
bbio = bio;
@@ -396,9 +395,6 @@ static int blk_phys_contig_segment(struct request_queue *q, 
struct bio *bio,
 {
struct bio_vec end_bv = { NULL }, nxt_bv;
 
-   if (!blk_queue_cluster(q))
-   return 0;
-
if (bio->bi_seg_back_size + nxt->bi_seg_front_size >
queue_max_segment_size(q))
return 0;
@@ -415,12 +411,12 @@ static int blk_phys_contig_segment(struct request_queue 
*q, struct bio *bio,
 static inline void
 __blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
 struct scatterlist *sglist, struct bio_vec *bvprv,
-struct scatterlist **sg, int *nsegs, int *cluster)
+struct scatterlist **sg, int *nsegs)
 {
 
int nbytes = bvec->bv_len;
 
-   if (*sg && *cluster) {
+   if (*sg) {
if ((*sg)->length + nbytes > queue_max_segment_size(q))
goto new_segment;
if (!biovec_phys_mergeable(q, bvprv, bvec))
@@ -466,12 +462,12 @@ static int __blk_bios_map_sg(struct request_queue *q, 
struct bio *bio,
 {
struct bio_vec bvec, bvprv = { NULL };
struct bvec_iter iter;
-   int cluster = blk_queue_cluster(q), nsegs = 0;
+   int nsegs = 0;
 
for_each_bio(bio)
bio_for_each_segment(bvec, bio, iter)
__blk_segment_map_sg(q, , sglist, , sg,
-, );
+);
 
return nsegs;
 }
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 3abe831e92c8..3e7038e475ee 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -56,7 +56,6 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->alignment_offset = 0;
lim->io_opt = 0;
lim->misaligned = 0;
-   lim->cluster = 1;
lim->zoned = BLK_ZONED_NONE;
 }
 EXPORT_SYMBOL(blk_set_default_limits);
@@ -547,8 +546,6 @@ int blk_stack_limits(struct queue_limits *t, struct 
queue_limits *b,
t->io_min = max(t->io_min, b->io_min);
t->io_opt = lcm_not_zero(t->io_opt, b->io_opt);
 
-   t->cluster 

[Cluster-devel] [PATCH V12 01/20] btrfs: remove various bio_offset arguments

2018-11-25 Thread Ming Lei
From: Christoph Hellwig 

The btrfs write path passes a bio_offset argument through some deep
callchains including async offloading.  In the end this is easily
calculable using page_offset plus the bvec offset for the first
page in the bio, and it is only actually used by a single function.
Just move the calculation of the offset there.
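
A sketch of the local calculation (illustrative; it reuses helpers that
already exist at this point in the series, 'bio' being the write bio
seen by the checksum hook):

	u64 bio_offset = page_offset(bio_first_page_all(bio)) +
			 bio_first_bvec_all(bio)->bv_offset;

so the async submission path no longer has to carry bio_offset around.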

Reviewed-by: David Sterba 
Signed-off-by: Christoph Hellwig 
---
 fs/btrfs/disk-io.c   | 21 +
 fs/btrfs/disk-io.h   |  2 +-
 fs/btrfs/extent_io.c |  9 ++---
 fs/btrfs/extent_io.h |  5 ++---
 fs/btrfs/inode.c | 17 -
 5 files changed, 18 insertions(+), 36 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3f0b6d1936e8..169839487ac9 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -108,11 +108,6 @@ struct async_submit_bio {
struct bio *bio;
extent_submit_bio_start_t *submit_bio_start;
int mirror_num;
-   /*
-* bio_offset is optional, can be used if the pages in the bio
-* can't tell us where in the file the bio should go
-*/
-   u64 bio_offset;
struct btrfs_work work;
blk_status_t status;
 };
@@ -754,8 +749,7 @@ static void run_one_async_start(struct btrfs_work *work)
blk_status_t ret;
 
async = container_of(work, struct  async_submit_bio, work);
-   ret = async->submit_bio_start(async->private_data, async->bio,
- async->bio_offset);
+   ret = async->submit_bio_start(async->private_data, async->bio);
if (ret)
async->status = ret;
 }
@@ -786,7 +780,7 @@ static void run_one_async_free(struct btrfs_work *work)
 
 blk_status_t btrfs_wq_submit_bio(struct btrfs_fs_info *fs_info, struct bio 
*bio,
 int mirror_num, unsigned long bio_flags,
-u64 bio_offset, void *private_data,
+void *private_data,
 extent_submit_bio_start_t *submit_bio_start)
 {
struct async_submit_bio *async;
@@ -803,8 +797,6 @@ blk_status_t btrfs_wq_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
btrfs_init_work(&async->work, btrfs_worker_helper, run_one_async_start,
run_one_async_done, run_one_async_free);
 
-   async->bio_offset = bio_offset;
-
async->status = 0;
 
if (op_is_sync(bio->bi_opf))
@@ -831,8 +823,7 @@ static blk_status_t btree_csum_one_bio(struct bio *bio)
return errno_to_blk_status(ret);
 }
 
-static blk_status_t btree_submit_bio_start(void *private_data, struct bio *bio,
-u64 bio_offset)
+static blk_status_t btree_submit_bio_start(void *private_data, struct bio *bio)
 {
/*
 * when we're called for a write, we're already in the async
@@ -853,8 +844,7 @@ static int check_async_write(struct btrfs_inode *bi)
 }
 
 static blk_status_t btree_submit_bio_hook(void *private_data, struct bio *bio,
- int mirror_num, unsigned long bio_flags,
- u64 bio_offset)
+ int mirror_num, unsigned long bio_flags)
 {
struct inode *inode = private_data;
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
@@ -882,8 +872,7 @@ static blk_status_t btree_submit_bio_hook(void *private_data, struct bio *bio,
 * checksumming can happen in parallel across all CPUs
 */
ret = btrfs_wq_submit_bio(fs_info, bio, mirror_num, 0,
- bio_offset, private_data,
- btree_submit_bio_start);
+ private_data, btree_submit_bio_start);
}
 
if (ret)
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 4cccba22640f..b48b3ec353fc 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -119,7 +119,7 @@ blk_status_t btrfs_bio_wq_end_io(struct btrfs_fs_info *info, struct bio *bio,
enum btrfs_wq_endio_type metadata);
blk_status_t btrfs_wq_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
int mirror_num, unsigned long bio_flags,
-   u64 bio_offset, void *private_data,
+   void *private_data,
extent_submit_bio_start_t *submit_bio_start);
 blk_status_t btrfs_submit_bio_done(void *private_data, struct bio *bio,
  int mirror_num);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d228f706ff3e..15fd46582bb2 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2397,7 +2397,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
read_mode, failrec->this_mirror, failrec->in_validation);
 
status = tree->ops->submit_bio_hook(tree->private_data, 

[Cluster-devel] [PATCH V12 00/20] block: support multi-page bvec

2018-11-25 Thread Ming Lei
or using bio_for_each_chunk_segment_all()
- address Kent's comment

V5:
- remove some of the preparation patches, which have been merged already
- add bio_clone_seg_bioset() to fix DM's bio clone, which
was introduced by 18a25da84354c6b (dm: ensure bio submission follows
a depth-first tree walk)
- rebase on the latest block for-v4.18

V4:
- rename bio_for_each_segment*() as bio_for_each_page*(), rename
bio_segments() as bio_pages(), rename rq_for_each_segment() as
rq_for_each_pages(), because these helpers never return a real
segment, only single-page bvecs

- introducing segment_for_each_page_all()

- introduce new
bio_for_each_segment*()/rq_for_each_segment()/bio_segments()
for returning real multi-page segments

- rewrite segment_last_page()

- rename bvec iterator helper as suggested by Christoph

- replace comment with applying bio helpers as suggested by Christoph

- document usage of bio iterator helpers

- redefine BIO_MAX_PAGES as 256 so that the biggest bvec table
fits in a 4K page (see the arithmetic check after this list)

- move bio_alloc_pages() into bcache as suggested by Christoph
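
The arithmetic behind the BIO_MAX_PAGES item above, as a hedged standalone
check (the 16-byte 64-bit layout of struct bio_vec is an assumption made
explicit here, not quoted from the kernel headers):

#include <stdio.h>

/* assumed 64-bit layout of struct bio_vec: page pointer + len + offset */
struct bio_vec_64 {
	void *bv_page;		/* 8 bytes */
	unsigned int bv_len;	/* 4 bytes */
	unsigned int bv_offset;	/* 4 bytes */
};

int main(void)
{
	/* 256 * 16 == 4096: the biggest bvec table fills exactly one 4K page */
	printf("%zu\n", 256 * sizeof(struct bio_vec_64));
	return 0;
}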

V3:
- rebase on v4.13-rc3 with for-next of block tree
- run more xfstests: xfs/ext4 over NVMe, SATA, DM(linear),
MD(raid1), with no regressions triggered
- add Reviewed-by on some btrfs patches
- remove two MD patches because both are merged to linus tree
  already

V2:
- bvec table direct access in raid has been cleaned, so NO_MP
flag is dropped
- rebase on recent Neil Brown's change on bio and bounce code
- reorganize the patchset

V1:
- against v4.10-rc1; some cleanups from V0 are in -linus already
- handle queue_virt_boundary() in mp bvec change and make NVMe happy
- further BTRFS cleanup
- remove QUEUE_FLAG_SPLIT_MP
- rename for two new helpers of bio_for_each_segment_all()
- fix bounce conversion
- address comments in V0

[1], http://marc.info/?l=linux-kernel&m=141680246629547&w=2
[2], https://patchwork.kernel.org/patch/9451523/
[3], http://marc.info/?t=14773544711&r=1&w=2
[4], http://marc.info/?l=linux-mm&m=147745525801433&w=2
[5], http://marc.info/?t=14956948457&r=1&w=2
[6], http://marc.info/?t=14982021534&r=1&w=2




Christoph Hellwig (3):
  btrfs: remove various bio_offset arguments
  btrfs: look at bi_size for repair decisions
  block: remove the "cluster" flag

Ming Lei (17):
  block: don't use bio->bi_vcnt to figure out segment number
  block: remove bvec_iter_rewind()
  block: rename bvec helpers
  block: introduce multi-page bvec helpers
  block: introduce bio_for_each_bvec() and rq_for_each_bvec()
  block: use bio_for_each_bvec() to compute multi-page bvec count
  block: use bio_for_each_bvec() to map sg
  block: introduce bvec_last_segment()
  fs/buffer.c: use bvec iterator to truncate the bio
  block: loop: pass multi-page bvec to iov_iter
  bcache: avoid to use bio_for_each_segment_all() in
bch_bio_alloc_pages()
  block: allow bio_for_each_segment_all() to iterate over multi-page
bvec
  block: enable multipage bvecs
  block: always define BIO_MAX_PAGES as 256
  block: document usage of bio iterator helpers
  block: kill QUEUE_FLAG_NO_SG_MERGE
  block: kill BLK_MQ_F_SG_MERGE

 .clang-format |   2 +-
 Documentation/block/biovecs.txt   |  25 +
 block/bio.c   |  49 +---
 block/blk-merge.c | 227 --
 block/blk-mq-debugfs.c|   2 -
 block/blk-mq.c|   3 -
 block/blk-settings.c  |   3 -
 block/blk-sysfs.c |   5 +-
 block/bounce.c|   6 +-
 drivers/block/loop.c  |  22 ++--
 drivers/block/nbd.c   |   2 +-
 drivers/block/rbd.c   |   2 +-
 drivers/block/skd_main.c  |   1 -
 drivers/block/xen-blkfront.c  |   2 +-
 drivers/md/bcache/btree.c |   3 +-
 drivers/md/bcache/util.c  |   6 +-
 drivers/md/dm-crypt.c |   3 +-
 drivers/md/dm-integrity.c |   2 +-
 drivers/md/dm-io.c|   4 +-
 drivers/md/dm-rq.c|   2 +-
 drivers/md/dm-table.c |  13 ---
 drivers/md/raid1.c|   3 +-
 drivers/mmc/core/queue.c  |   3 +-
 drivers/nvdimm/blk.c  |   4 +-
 drivers/nvdimm/btt.c  |   4 +-
 drivers/scsi/scsi_lib.c   |  22 +++-
 drivers/staging/erofs/data.c  |   3 +-
 drivers/staging/erofs/unzip_vle.c |   3 +-
 fs/block_dev.c|   6 +-
 fs/btrfs/compression.c|   3 +-
 fs/btrfs/disk-io.c|  24 ++--
 fs/btrfs/disk-io.h|   2 +-
 fs/btrfs/extent_io.c  |  20 ++--
 fs/btrfs/extent_io.h  |   5 +-
 fs/btrfs/inode.c  |  23 ++

Re: [Cluster-devel] [PATCH V11 15/19] block: enable multipage bvecs

2018-11-23 Thread Ming Lei
On Wed, Nov 21, 2018 at 05:12:06PM +0100, Christoph Hellwig wrote:
> On Wed, Nov 21, 2018 at 11:48:13PM +0800, Ming Lei wrote:
> > I guess the correct check should be:
> > 
> > end_addr = vec_addr + bv->bv_offset + bv->bv_len;
> > if (same_page &&
> > (end_addr & PAGE_MASK) != (page_addr & PAGE_MASK))
> > return false;
> 
> Indeed.

The above is still not quite correct; it should have been:

	end_addr = vec_addr + bv->bv_offset + bv->bv_len - 1;
	if (same_page && (end_addr & PAGE_MASK) != page_addr)
		return false;

Also, bv->bv_len should be guaranteed to be bigger than zero.

It also shows that it is quite easy to get the last-page check
wrong, :-(
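
Put in context, the corrected test would read roughly as the sketch below.
This is an illustration under stated assumptions, not the final hunk:
vec_addr and page_addr are taken to be the physical addresses of the existing
bvec's page and of the (page-aligned) candidate page, both via page_to_phys().

static bool bvec_end_on_same_page(const struct bio_vec *bv, struct page *page,
				  bool same_page)
{
	phys_addr_t vec_addr = page_to_phys(bv->bv_page);
	phys_addr_t page_addr = page_to_phys(page);
	/* "- 1": address of the last byte actually covered by bv */
	phys_addr_t end_addr = vec_addr + bv->bv_offset + bv->bv_len - 1;

	return !same_page || (end_addr & PAGE_MASK) == page_addr;
}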


Thanks,
Ming



Re: [Cluster-devel] [PATCH V11 12/19] block: allow bio_for_each_segment_all() to iterate over multi-page bvec

2018-11-22 Thread Ming Lei
On Thu, Nov 22, 2018 at 12:03:15PM +0100, Christoph Hellwig wrote:
> > +/* used for chunk_for_each_segment */
> > +static inline void bvec_next_segment(const struct bio_vec *bvec,
> > +struct bvec_iter_all *iter_all)
> 
> FYI, chunk_for_each_segment doesn't exist anymore, this is
> bvec_for_each_segment now.  Not sure the comment helps much, though.

OK, will remove the comment.

> 
> > +{
> > +   struct bio_vec *bv = &iter_all->bv;
> > +
> > +   if (bv->bv_page) {
> > +   bv->bv_page += 1;
> 
> I think this needs to use nth_page() given that with discontigmem
> page structures might not be allocated contigously.

Good catch!
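
With both comments folded in, the helper would look something like the sketch
below. Treat it as a hedged draft: the bvec_iter_all fields (bv, done) are
assumed from the patch under review, and the exact shape may differ in V12.

static inline void bvec_next_segment(const struct bio_vec *bvec,
				     struct bvec_iter_all *iter_all)
{
	struct bio_vec *bv = &iter_all->bv;

	if (bv->bv_page) {
		/*
		 * nth_page(): with DISCONTIGMEM the struct pages of a
		 * multi-page bvec need not be contiguous in memory, so
		 * plain pointer arithmetic on bv_page is not safe.
		 */
		bv->bv_page = nth_page(bv->bv_page, 1);
		bv->bv_offset = 0;
	} else {
		bv->bv_page = bvec->bv_page;
		bv->bv_offset = bvec->bv_offset;
	}
	bv->bv_len = min_t(unsigned int, PAGE_SIZE - bv->bv_offset,
			   bvec->bv_len - iter_all->done);
	iter_all->done += bv->bv_len;
}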

Thanks,
Ming



Re: [Cluster-devel] [PATCH V11 07/19] fs/buffer.c: use bvec iterator to truncate the bio

2018-11-22 Thread Ming Lei
On Thu, Nov 22, 2018 at 11:58:49AM +0100, Christoph Hellwig wrote:
> Btw, given that this is the last user of bvec_last_segment after my
> other patches I think we should kill bvec_last_segment and do something
> like this here:
> 
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index fa37ad52e962..af5e135d2b83 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2981,6 +2981,14 @@ static void end_bio_bh_io_sync(struct bio *bio)
>   bio_put(bio);
>  }
>  
> +static void zero_trailing_sectors(struct bio_vec *bvec, unsigned bytes)
> +{
> + unsigned last_page = (bvec->bv_offset + bvec->bv_len - 1) >> PAGE_SHIFT;
> +
> + zero_user(nth_page(bvec->bv_page, last_page),
> +   bvec->bv_offset % PAGE_SIZE + bvec->bv_len, bytes);
> +}

The 'start' parameter above is computed incorrectly, and the computation
isn't very obvious, so I'd suggest keeping bvec_last_segment().
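
For reference, a sketch of the helper being kept, close to its V11 shape
(details hedged; quoted from memory rather than from the tree):

static inline void bvec_last_segment(const struct bio_vec *bvec,
				     struct bio_vec *seg)
{
	unsigned total = bvec->bv_offset + bvec->bv_len;
	unsigned last_page = (total - 1) / PAGE_SIZE;

	seg->bv_page = nth_page(bvec->bv_page, last_page);

	/* the whole segment sits inside the last page */
	if (bvec->bv_offset >= last_page * PAGE_SIZE) {
		seg->bv_offset = bvec->bv_offset % PAGE_SIZE;
		seg->bv_len = bvec->bv_len;
	} else {
		seg->bv_offset = 0;
		seg->bv_len = total - last_page * PAGE_SIZE;
	}
}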

Thanks,
Ming



Re: [Cluster-devel] [PATCH V11 03/19] block: introduce bio_for_each_bvec()

2018-11-22 Thread Ming Lei
On Thu, Nov 22, 2018 at 11:30:33AM +0100, Christoph Hellwig wrote:
> Btw, this patch instead of the plain rever might make it a little
> more clear what is going on by skipping the confusing helper altogher
> and operating on the raw bvec array:
> 
> 
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index e5b975fa0558..926550ce2d21 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -137,24 +137,18 @@ static inline bool bio_full(struct bio *bio)
>   for (i = 0, iter_all.idx = 0; iter_all.idx < (bio)->bi_vcnt; 
> iter_all.idx++)\
>   bvec_for_each_segment(bvl, &((bio)->bi_io_vec[iter_all.idx]), 
> i, iter_all)
>  
> -static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter 
> *iter,
> -   unsigned bytes, unsigned max_seg_len)
> +static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
> + unsigned bytes)
>  {
>   iter->bi_sector += bytes >> 9;
>  
>   if (bio_no_advance_iter(bio))
>   iter->bi_size -= bytes;
>   else
> - __bvec_iter_advance(bio->bi_io_vec, iter, bytes, max_seg_len);
> + bvec_iter_advance(bio->bi_io_vec, iter, bytes);
>   /* TODO: It is reasonable to complete bio with error here. */
>  }
>  
> -static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
> - unsigned bytes)
> -{
> - __bio_advance_iter(bio, iter, bytes, PAGE_SIZE);
> -}
> -
>  #define __bio_for_each_segment(bvl, bio, iter, start)
> \
>   for (iter = (start);\
>(iter).bi_size &&  \
> @@ -168,7 +162,7 @@ static inline void bio_advance_iter(struct bio *bio, 
> struct bvec_iter *iter,
>   for (iter = (start);\
>(iter).bi_size &&  \
>   ((bvl = bio_iter_mp_iovec((bio), (iter))), 1);  \
> -  __bio_advance_iter((bio), &(iter), (bvl).bv_len, BVEC_MAX_LEN))
> +  bio_advance_iter((bio), &(iter), (bvl).bv_len))
>  
>  /* returns one real segment(multi-page bvec) each time */
>  #define bio_for_each_bvec(bvl, bio, iter)\
> diff --git a/include/linux/bvec.h b/include/linux/bvec.h
> index cab36d838ed0..7d0f9bdb6f05 100644
> --- a/include/linux/bvec.h
> +++ b/include/linux/bvec.h
> @@ -25,8 +25,6 @@
>  #include 
>  #include 
>  
> -#define BVEC_MAX_LEN  ((unsigned int)-1)
> -
>  /*
>   * was unsigned short, but we might as well be ready for > 64kB I/O pages
>   */
> @@ -102,8 +100,8 @@ struct bvec_iter_all {
>   .bv_offset  = segment_iter_offset((bvec), (iter)),  \
>  })
>  
> -static inline bool __bvec_iter_advance(const struct bio_vec *bv,
> - struct bvec_iter *iter, unsigned bytes, unsigned max_seg_len)
> +static inline bool bvec_iter_advance(const struct bio_vec *bv,
> + struct bvec_iter *iter, unsigned bytes)
>  {
>   if (WARN_ONCE(bytes > iter->bi_size,
>"Attempted to advance past end of bvec iter\n")) {
> @@ -112,20 +110,15 @@ static inline bool __bvec_iter_advance(const struct 
> bio_vec *bv,
>   }
>  
>   while (bytes) {
> - unsigned segment_len = segment_iter_len(bv, *iter);
> -
> - if (max_seg_len < BVEC_MAX_LEN)
> - segment_len = min_t(unsigned, segment_len,
> - max_seg_len -
> - bvec_iter_offset(bv, *iter));
> + const struct bio_vec *cur = bv + iter->bi_idx;
> + unsigned len = min3(bytes, iter->bi_size,
> + cur->bv_len - iter->bi_bvec_done);
>  
> - segment_len = min(bytes, segment_len);
> -
> - bytes -= segment_len;
> - iter->bi_size -= segment_len;
> - iter->bi_bvec_done += segment_len;
> + bytes -= len;
> + iter->bi_size -= len;
> + iter->bi_bvec_done += len;
>  
> - if (iter->bi_bvec_done == __bvec_iter_bvec(bv, *iter)->bv_len) {
> + if (iter->bi_bvec_done == cur->bv_len) {
>   iter->bi_bvec_done = 0;
>   iter->bi_idx++;
>   }

I'd rather not do the optimization part here, since it doesn't belong in
this patchset and may decrease readability. So I plan to revert the delta
part in V12 first.

Thanks,
Ming



Re: [Cluster-devel] [PATCH V11 14/19] block: handle non-cluster bio out of blk_bio_segment_split

2018-11-22 Thread Ming Lei
On Thu, Nov 22, 2018 at 11:41:50AM +0100, Christoph Hellwig wrote:
> On Thu, Nov 22, 2018 at 06:32:09PM +0800, Ming Lei wrote:
> > On Thu, Nov 22, 2018 at 11:04:28AM +0100, Christoph Hellwig wrote:
> > > On Thu, Nov 22, 2018 at 05:33:00PM +0800, Ming Lei wrote:
> > > > However, using virt boundary limit on non-cluster seems over-kill,
> > > > because the bio will be over-split(each small bvec may be split as one 
> > > > bio)
> > > > if it includes lots of small segment.
> > > 
> > > The combination of the virt boundary of PAGE_SIZE - 1 and a
> > > max_segment_size of PAGE_SIZE will only split if the to me merged
> > > segment is in a different page than the previous one, which is exactly
> > > what we need here.  Multiple small bvec inside the same page (e.g.
> > > 512 byte buffer_heads) will still be merged.
> > > 
> > > > What we want to do is just to avoid to merge bvecs to segment, which
> > > > should have been done by NO_SG_MERGE simply. However, after multi-page
> > > > is enabled, two adjacent bvecs won't be merged any more, I just forget
> > > > to remove the bvec merge code in V11.
> > > > 
> > > > So seems we can simply avoid to use virt boundary limit for non-cluster
> > > > after multipage bvec is enabled?
> > > 
> > > No, we can't just remove it.  As explained in the patch there is one very
> > > visible difference of setting the flag amd that is no segment will span a
> > > page boundary, and at least the iSCSI code seems to rely on that.
> > 
> > IMO, we should use queue_segment_boundary() to enhance the rule during 
> > splitting
> > segment after multi-page bvec is enabled.
> > 
> > Seems we miss the segment boundary limit in bvec_split_segs().
> 
> Yes, that looks like the right fix!

Then your patch should work by just replacing the virt boundary with the
segment boundary limit. I will do that change in V12 if you don't object.
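
Concretely, the non-cluster emulation would then be set up along these lines.
This is a sketch of the direction agreed above, using the standard
queue-limit helpers; the actual call sites live in the LLDD setup paths.

	/*
	 * non-cluster: no segment may span a page boundary, but bios must
	 * not be over-split, so bound segments rather than bios.
	 */
	blk_queue_max_segment_size(q, PAGE_SIZE);
	blk_queue_segment_boundary(q, PAGE_SIZE - 1);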


Thanks,
Ming



Re: [Cluster-devel] [PATCH V11 14/19] block: handle non-cluster bio out of blk_bio_segment_split

2018-11-22 Thread Ming Lei
On Thu, Nov 22, 2018 at 11:04:28AM +0100, Christoph Hellwig wrote:
> On Thu, Nov 22, 2018 at 05:33:00PM +0800, Ming Lei wrote:
> > However, using virt boundary limit on non-cluster seems over-kill,
> > because the bio will be over-split(each small bvec may be split as one bio)
> > if it includes lots of small segment.
> 
> The combination of the virt boundary of PAGE_SIZE - 1 and a
> max_segment_size of PAGE_SIZE will only split if the to me merged
> segment is in a different page than the previous one, which is exactly
> what we need here.  Multiple small bvec inside the same page (e.g.
> 512 byte buffer_heads) will still be merged.
> 
> > What we want to do is just to avoid to merge bvecs to segment, which
> > should have been done by NO_SG_MERGE simply. However, after multi-page
> > is enabled, two adjacent bvecs won't be merged any more, I just forget
> > to remove the bvec merge code in V11.
> > 
> > So seems we can simply avoid to use virt boundary limit for non-cluster
> > after multipage bvec is enabled?
> 
> No, we can't just remove it.  As explained in the patch there is one very
> visible difference of setting the flag amd that is no segment will span a
> page boundary, and at least the iSCSI code seems to rely on that.

IMO, we should use queue_segment_boundary() to enforce the rule when splitting
segments after multi-page bvec is enabled.

It seems we miss the segment boundary limit in bvec_split_segs().
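
As a sketch of what the missing check could look like when splitting a
multi-page bvec into segments (helper name and placement are mine, not the
eventual patch):

static bool seg_fits_boundary(struct request_queue *q,
			      phys_addr_t seg_start, unsigned seg_len)
{
	unsigned long mask = queue_segment_boundary(q);

	/* a single segment must not straddle the boundary mask */
	return (seg_start & ~mask) == ((seg_start + seg_len - 1) & ~mask);
}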

Thanks,
Ming



Re: [Cluster-devel] [PATCH V11 14/19] block: handle non-cluster bio out of blk_bio_segment_split

2018-11-22 Thread Ming Lei
On Thu, Nov 22, 2018 at 11:04:28AM +0100, Christoph Hellwig wrote:
> On Thu, Nov 22, 2018 at 05:33:00PM +0800, Ming Lei wrote:
> > However, using virt boundary limit on non-cluster seems over-kill,
> > because the bio will be over-split(each small bvec may be split as one bio)
> > if it includes lots of small segment.
> 
> The combination of the virt boundary of PAGE_SIZE - 1 and a
> max_segment_size of PAGE_SIZE will only split if the to me merged
> segment is in a different page than the previous one, which is exactly
> what we need here.  Multiple small bvec inside the same page (e.g.
> 512 byte buffer_heads) will still be merged.

Suppose one bio includes (pg0, 0, 512) and (pg1, 512, 512):

The split is introduced by the following code in blk_bio_segment_split():

	if (bvprvp && bvec_gap_to_prev(q, bvprvp, bv.bv_offset))
		goto split;

Without this patch, for non-cluster, the two bvecs are just in different
segments but are still handled by the same bio. Now you convert them into
two bios.
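
For reference, the gap test behind that code boils down to something like the
following, simplified from the kernel's __bvec_gap_to_prev() (treat the exact
shape as approximate):

static inline bool gap_to_prev(struct request_queue *q,
			       struct bio_vec *bprv, unsigned int offset)
{
	/*
	 * With virt_boundary = PAGE_SIZE - 1 this splits unless the
	 * previous bvec ends on a page boundary and the next one starts
	 * on one; (pg0, 0, 512) followed by (pg1, 512, 512) fails both.
	 */
	return (offset & queue_virt_boundary(q)) ||
		((bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q));
}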

Thanks,
Ming



Re: [Cluster-devel] [PATCH V11 03/19] block: introduce bio_for_each_bvec()

2018-11-22 Thread Ming Lei
On Wed, Nov 21, 2018 at 06:12:17PM +0100, Christoph Hellwig wrote:
> On Wed, Nov 21, 2018 at 05:10:25PM +0100, Christoph Hellwig wrote:
> > No - I think we can always use the code without any segment in
> > bvec_iter_advance.  Because bvec_iter_advance only operates on the
> > iteractor, the generation of an actual single-page or multi-page
> > bvec is left to the caller using the bvec_iter_bvec or segment_iter_bvec
> > helpers.  The only difference is how many bytes you can move the
> > iterator forward in a single loop iteration - so if you pass in
> > PAGE_SIZE as the max_seg_len you just will have to loop more often
> > for a large enough bytes, but not actually do anything different.
> 
> FYI, this patch reverts the max_seg_len related changes back to where
> we are in mainline, and as expected everything works fine for me:
> 
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index e5b975fa0558..926550ce2d21 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -137,24 +137,18 @@ static inline bool bio_full(struct bio *bio)
>   for (i = 0, iter_all.idx = 0; iter_all.idx < (bio)->bi_vcnt; 
> iter_all.idx++)\
>   bvec_for_each_segment(bvl, &((bio)->bi_io_vec[iter_all.idx]), 
> i, iter_all)
>  
> -static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter 
> *iter,
> -   unsigned bytes, unsigned max_seg_len)
> +static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
> + unsigned bytes)
>  {
>   iter->bi_sector += bytes >> 9;
>  
>   if (bio_no_advance_iter(bio))
>   iter->bi_size -= bytes;
>   else
> - __bvec_iter_advance(bio->bi_io_vec, iter, bytes, max_seg_len);
> + bvec_iter_advance(bio->bi_io_vec, iter, bytes);
>   /* TODO: It is reasonable to complete bio with error here. */
>  }
>  
> -static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
> - unsigned bytes)
> -{
> - __bio_advance_iter(bio, iter, bytes, PAGE_SIZE);
> -}
> -
>  #define __bio_for_each_segment(bvl, bio, iter, start)
> \
>   for (iter = (start);\
>(iter).bi_size &&  \
> @@ -168,7 +162,7 @@ static inline void bio_advance_iter(struct bio *bio, 
> struct bvec_iter *iter,
>   for (iter = (start);\
>(iter).bi_size &&  \
>   ((bvl = bio_iter_mp_iovec((bio), (iter))), 1);  \
> -  __bio_advance_iter((bio), &(iter), (bvl).bv_len, BVEC_MAX_LEN))
> +  bio_advance_iter((bio), &(iter), (bvl).bv_len))
>  
>  /* returns one real segment(multi-page bvec) each time */
>  #define bio_for_each_bvec(bvl, bio, iter)\
> diff --git a/include/linux/bvec.h b/include/linux/bvec.h
> index cab36d838ed0..138b4007b8f2 100644
> --- a/include/linux/bvec.h
> +++ b/include/linux/bvec.h
> @@ -25,8 +25,6 @@
>  #include 
>  #include 
>  
> -#define BVEC_MAX_LEN  ((unsigned int)-1)
> -
>  /*
>   * was unsigned short, but we might as well be ready for > 64kB I/O pages
>   */
> @@ -102,8 +100,8 @@ struct bvec_iter_all {
>   .bv_offset  = segment_iter_offset((bvec), (iter)),  \
>  })
>  
> -static inline bool __bvec_iter_advance(const struct bio_vec *bv,
> - struct bvec_iter *iter, unsigned bytes, unsigned max_seg_len)
> +static inline bool bvec_iter_advance(const struct bio_vec *bv,
> + struct bvec_iter *iter, unsigned bytes)
>  {
>   if (WARN_ONCE(bytes > iter->bi_size,
>"Attempted to advance past end of bvec iter\n")) {
> @@ -112,18 +110,12 @@ static inline bool __bvec_iter_advance(const struct 
> bio_vec *bv,
>   }
>  
>   while (bytes) {
> - unsigned segment_len = segment_iter_len(bv, *iter);
> -
> - if (max_seg_len < BVEC_MAX_LEN)
> - segment_len = min_t(unsigned, segment_len,
> - max_seg_len -
> - bvec_iter_offset(bv, *iter));
> + unsigned iter_len = bvec_iter_len(bv, *iter);
> + unsigned len = min(bytes, iter_len);

It may not work to always use bvec_iter_len() here: 'segment_len' should be
the max length of the passed 'bv', but we don't know whether it is a
single-page or multi-page bvec unless someone tells us.
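
To illustrate the point, the two iteration flavours need different per-step
caps, roughly as below (a hedged sketch, not the patchset's exact helpers):

/* multi-page walk: remaining bytes of the current bvec */
static inline unsigned mp_iter_len(const struct bio_vec *bv,
				   const struct bvec_iter *iter)
{
	return min(iter->bi_size,
		   bv[iter->bi_idx].bv_len - iter->bi_bvec_done);
}

/* single-page walk: additionally capped at the next page boundary */
static inline unsigned sp_iter_len(const struct bio_vec *bv,
				   const struct bvec_iter *iter)
{
	unsigned len = mp_iter_len(bv, iter);
	unsigned off = bv[iter->bi_idx].bv_offset + iter->bi_bvec_done;

	return min_t(unsigned, len, PAGE_SIZE - off % PAGE_SIZE);
}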

Thanks,
Ming



Re: [Cluster-devel] [PATCH V11 14/19] block: handle non-cluster bio out of blk_bio_segment_split

2018-11-22 Thread Ming Lei
On Wed, Nov 21, 2018 at 06:46:21PM +0100, Christoph Hellwig wrote:
> Actually..
> 
> I think we can kill this code entirely.  If we look at what the
> clustering setting is really about it is to avoid ever merging a
> segment that spans a page boundary.  And we should be able to do
> that with something like this before your series:
> 
> ---
> From 0d46fa76c376493a74ea0dbe77305bd5fa2cf011 Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig 
> Date: Wed, 21 Nov 2018 18:39:47 +0100
> Subject: block: remove the "cluster" flag
> 
> The cluster flag implements some very old SCSI behavior.  As far as I
> can tell the original intent was to enable or disable any kind of
> segment merging.  But the actually visible effect to the LLDD is that
> it limits each segments to be inside a single page, which we can
> also affect by setting the maximum segment size and the virt
> boundary.

This approach is pretty good, given we can do a post-split when mapping
the sg.

However, using the virt boundary limit for non-cluster seems overkill,
because the bio will be over-split (each small bvec may be split into its
own bio) if it includes lots of small segments.

What we want to do is just to avoid merging bvecs into segments, which
should have been done simply by NO_SG_MERGE. However, after multi-page
is enabled, two adjacent bvecs won't be merged any more; I just forgot
to remove the bvec merge code in V11.

So it seems we can simply avoid using the virt boundary limit for
non-cluster after multi-page bvec is enabled?
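
For reference, the emulation the quoted patch description refers to amounts
to roughly the following two limits; this is the virt-boundary variant that
the reply above pushes back on (a sketch; the exact call sites live in the
LLDD setup paths):

	/* old !cluster behaviour: every segment confined to a single page */
	blk_queue_max_segment_size(q, PAGE_SIZE);
	blk_queue_virt_boundary(q, PAGE_SIZE - 1);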


thanks,
Ming


