Re: [Cluster-devel] [PATCH V10 09/19] block: introduce bio_bvecs()

2018-11-20 Thread Sagi Grimberg




Wait, I see that the bvec is still a single array per bio. When you said
a table I thought you meant a 2-dimensional array...


I mean a new 1-d table A has to be created for multiple bios in one rq,
and built in the following way:

rq_for_each_bvec(tmp, rq, rq_iter)
 *A++ = tmp;

Then you can pass A to iov_iter_bvec() & send().

Given it is over TCP, I guess it should be doable for you to preallocate one
256-bvec table in one page for each request, then set the max segment size to
(unsigned int)-1 and the max segment count to 256; the preallocated table
should work anytime.


A 256-bvec table is really a lot to preallocate, especially when it's not
needed; I can easily initialize the bvec_iter on the bio bvec. If this
involves preallocating for the worst case, then I don't consider this to
be an improvement.



Re: [Cluster-devel] [PATCH V10 09/19] block: introduce bio_bvecs()

2018-11-20 Thread Ming Lei
On Tue, Nov 20, 2018 at 08:42:04PM -0800, Sagi Grimberg wrote:
> 
> > > Yeah, that is the most common example, given merge is enabled
> > > in most cases. If the driver or device doesn't care about merging,
> > > you can disable it and always get single-bio requests; then the
> > > bio's bvec table can be reused for send().
> > 
> > Does bvec_iter span bvecs with your patches? I didn't see that change?
> 
> Wait, I see that the bvec is still a single array per bio. When you said
> a table I thought you meant a 2-dimensional array...

I mean a new 1-d table A has to be created for multiple bios in one rq,
and built in the following way:

   rq_for_each_bvec(tmp, rq, rq_iter)
	*A++ = tmp;

Then you can pass A to iov_iter_bvec() & send().

Given it is over TCP, I guess it should be doable for you to preallocate one
256-bvec table in one page for each request, then set the max segment size to
(unsigned int)-1 and the max segment count to 256; the preallocated table
should work anytime.
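
A minimal sketch of that suggestion, assuming the per-request table has
already been preallocated; the function name and the error handling are
illustrative, not part of the posted patches:

static int xxx_rq_to_iter(struct request *rq, struct bio_vec *table,
			  unsigned int max_bvecs, struct iov_iter *iter)
{
	struct req_iterator rq_iter;
	struct bio_vec tmp;
	unsigned int nr = 0;

	/* flatten all bios of the request into the one bvec table */
	rq_for_each_bvec(tmp, rq, rq_iter) {
		if (nr == max_bvecs)
			return -EOVERFLOW;	/* table too small */
		table[nr++] = tmp;
	}

	/* the whole request can now go out in a single send() */
	iov_iter_bvec(iter, WRITE, table, nr, blk_rq_bytes(rq));
	return 0;
}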


Thanks,
Ming



Re: [Cluster-devel] [PATCH v1 0/4] fs: fix race between llseek SEEK_END and write

2018-11-20 Thread Al Viro
On Wed, Nov 21, 2018 at 11:43:56AM +0900, Eiichi Tsukata wrote:
> Some file systems (including ext4, xfs, ramfs ...) have the following
> problem as I've described in the commit message of the 1/4 patch.
> 
>   The commit ef3d0fd27e90 ("vfs: do (nearly) lockless generic_file_llseek")
>   removed almost all locks in llseek() including SEEK_END. It was based on
>   the idea that write() updates size atomically. But in fact, write() can be
>   divided into two or more parts in generic_perform_write() when pos
>   straddles a PAGE_SIZE boundary, which results in updating size multiple
>   times in one write(). It means that llseek() can see the size being
>   updated during write().

And?  Who has ever promised anything that insane?  write(2) can take an
arbitrary amount of time; another process doing lseek() on an independently
opened descriptor is *not* going to wait for that (e.g. page-in of the buffer
being written, which just happens to be mmapped from a file on NFS over an
RFC1149 link).



Re: [Cluster-devel] [PATCH V10 09/19] block: introduce bio_bvecs()

2018-11-20 Thread Sagi Grimberg




Yeah, that is the most common example, given merge is enabled
in most cases. If the driver or device doesn't care about merging,
you can disable it and always get single-bio requests; then the
bio's bvec table can be reused for send().


Does bvec_iter span bvecs with your patches? I didn't see that change?


Wait, I see that the bvec is still a single array per bio. When you said
a table I thought you meant a 2-dimensional array...

Unless I'm mistaken, I think that the change is pretty simple then.
However, nvme-tcp still needs to be bio aware unless we have some
abstraction in place, which will mean that nvme-tcp will need to
open-code bio_bvecs().



Re: [Cluster-devel] [PATCH V10 09/19] block: introduce bio_bvecs()

2018-11-20 Thread Sagi Grimberg




I would like to avoid growing bvec tables and keep everything
preallocated. Plus, a bvec_iter operates on a bvec which means
we'll need a table there as well... Not liking it so far...


In the case of multiple bios in one request, we can't know how many bvecs
there are without calling rq_bvecs(), so it may not be suitable to
preallocate the table. If you have to send the IO request in one send(),
runtime allocation may be inevitable.


I don't want to do that, I want to work on a single bvec at a time like
the current implementation does.


If you don't require sending the IO request in one send(), you may send
one bio at a time, and just use the bio's bvec table directly,
as in the single-bio case in lo_rw_aio().
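
As a reference sketch of that single-bio case (the function name is made
up for illustration; it mirrors the 'rq->bio == rq->biotail' path of
lo_rw_aio()):

static void xxx_iter_from_single_bio(struct iov_iter *iter, struct bio *bio,
				     int rw)
{
	/* point at the current bvec, honouring any partial progress */
	struct bio_vec *bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);

	iov_iter_bvec(iter, rw, bvec, bio_segments(bio), bio->bi_iter.bi_size);
	iter->iov_offset = bio->bi_iter.bi_bvec_done;
}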


We'd need some indication that we need to reinit my iter with the
new bvec; today we do:

static inline void nvme_tcp_advance_req(struct nvme_tcp_request *req,
		int len)
{
	req->snd.data_sent += len;
	req->pdu_sent += len;
	iov_iter_advance(&req->snd.iter, len);
	if (!iov_iter_count(&req->snd.iter) &&
	    req->snd.data_sent < req->data_len) {
		req->snd.curr_bio = req->snd.curr_bio->bi_next;
		nvme_tcp_init_send_iter(req);
	}
}

and initialize the send iter. I imagine that now I will need to
switch to the next bvec, and only if I'm on the last one will I need to
move to the next bio...

Do you offer an API for that?



Can this way avoid your blocking issue? You may see this
example in the 'rq->bio != rq->biotail' branch of lo_rw_aio().


This is exactly an example of not ignoring the bios...


Yeah, that is the most common example, given merge is enabled
in most cases. If the driver or device doesn't care about merging,
you can disable it and always get single-bio requests; then the
bio's bvec table can be reused for send().


Does bvec_iter span bvecs with your patches? I didn't see that change?


I'm not sure how this helps me either. Unless we can set a bvec_iter to
span bvecs or have an abstract bio crossing when we re-initialize the
bvec_iter I don't see how I can ignore bios completely...


rq_for_each_bvec() will iterate over all bvecs from all bios, so you
needn't see any bio in this req.


But I don't need this iteration, I need a transparent API like:
bvec2 = rq_bvec_next(rq, bvec)

This way I can simply always reinit my iter without thinking about how
the request/bios/bvecs are constructed...
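
For concreteness, a hypothetical sketch of such a cursor-style helper,
built only from pieces introduced in this series (bio_iter_mp_iovec(),
__bio_advance_iter(), BVEC_MAX_LEN); it is not part of the posted patches:

struct rq_bvec_cursor {
	struct bio *bio;
	struct bvec_iter iter;
};

static inline void rq_bvec_cursor_init(struct rq_bvec_cursor *c,
				       struct request *rq)
{
	c->bio = rq->bio;
	c->iter = c->bio->bi_iter;
}

/* return the next multi-page bvec, crossing bio boundaries transparently */
static inline bool rq_bvec_next(struct rq_bvec_cursor *c, struct bio_vec *bv)
{
	while (c->bio) {
		if (c->iter.bi_size) {
			*bv = bio_iter_mp_iovec(c->bio, c->iter);
			__bio_advance_iter(c->bio, &c->iter, bv->bv_len,
					   BVEC_MAX_LEN);
			return true;
		}
		c->bio = c->bio->bi_next;
		if (c->bio)
			c->iter = c->bio->bi_iter;
	}
	return false;
}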


rq_bvecs() will return how many bvecs there are in this request (covering
all bios in this req)


Still not very useful given that I don't want to use a table...


So it looks like the nvme-tcp host driver might be the 2nd driver which
benefits from multi-page bvec directly.

The multi-page bvec V11 has passed my tests and addressed almost
all the comments during review on V10. I removed bio_vecs() in V11,
but it won't be a big deal; we can introduce them anytime when there
is a requirement.


multipage-bvecs and nvme-tcp are going to conflict, so it would be good
to coordinate on this. I think that the nvme-tcp host needs some adjustments,
such as setting a bvec_iter. I'm under the impression that the change is
rather small and self-contained, but I'm not sure I have the full
picture here.


I guess I may not be getting your exact requirement on the block IO iterator
from nvme-tcp either, :-(


They are pretty much listed above. Today nvme-tcp sets an iterator with:

vec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
nsegs = bio_segments(bio);
size = bio->bi_iter.bi_size;
offset = bio->bi_iter.bi_bvec_done;
iov_iter_bvec(&req->snd.iter, WRITE, vec, nsegs, size);

and when done, iterate to the next bio and do the same.

With multi-page bvec it would be great if we could simply have
something like rq_bvec_next() that would pretty much satisfy
the requirements from the nvme-tcp side...



Re: [Cluster-devel] [PATCH V10 09/19] block: introduce bio_bvecs()

2018-11-20 Thread Ming Lei
On Tue, Nov 20, 2018 at 07:20:45PM -0800, Sagi Grimberg wrote:
> 
> > Not sure I understand the 'blocking' problem in this case.
> > 
> > We can build a bvec table from this req, and send them all
> > in send(),
> 
> I would like to avoid growing bvec tables and keep everything
> preallocated. Plus, a bvec_iter operates on a bvec which means
> we'll need a table there as well... Not liking it so far...

In the case of multiple bios in one request, we can't know how many bvecs
there are without calling rq_bvecs(), so it may not be suitable to
preallocate the table. If you have to send the IO request in one send(),
runtime allocation may be inevitable.

If you don't require sending the IO request in one send(), you may send
one bio at a time, and just use the bio's bvec table directly,
as in the single-bio case in lo_rw_aio().

> 
> > Can this way avoid your blocking issue? You may see this
> > example in the 'rq->bio != rq->biotail' branch of lo_rw_aio().
> 
> This is exactly an example of not ignoring the bios...

Yeah, that is the most common example, given merge is enabled
in most cases. If the driver or device doesn't care about merging,
you can disable it and always get single-bio requests; then the
bio's bvec table can be reused for send().

> 
> > If this way is what you need, I think you are right, even we may
> > introduce the following helpers:
> > 
> > rq_for_each_bvec()
> > rq_bvecs()
> 
> I'm not sure how this helps me either. Unless we can set a bvec_iter to
> span bvecs or have an abstract bio crossing when we re-initialize the
> bvec_iter I don't see how I can ignore bios completely...

rq_for_each_bvec() will iterate over all bvecs from all bios, so you
needn't see any bio in this req.

rq_bvecs() will return how many bvecs there are in this request (covering
all bios in this req)

> 
> > So it looks like the nvme-tcp host driver might be the 2nd driver which
> > benefits from multi-page bvec directly.
> > 
> > The multi-page bvec V11 has passed my tests and addressed almost
> > all the comments during review on V10. I removed bio_vecs() in V11,
> > but it won't be a big deal; we can introduce them anytime when there
> > is a requirement.
> 
> multipage-bvecs and nvme-tcp are going to conflict, so it would be good
> to coordinate on this. I think that the nvme-tcp host needs some adjustments,
> such as setting a bvec_iter. I'm under the impression that the change is
> rather small and self-contained, but I'm not sure I have the full
> picture here.

I guess I may not be getting your exact requirement on the block IO iterator
from nvme-tcp either, :-(

thanks,
Ming



[Cluster-devel] [PATCH V11 19/19] block: kill BLK_MQ_F_SG_MERGE

2018-11-20 Thread Ming Lei
QUEUE_FLAG_NO_SG_MERGE has been killed, so kill BLK_MQ_F_SG_MERGE too.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 block/blk-mq-debugfs.c   | 1 -
 drivers/block/loop.c | 2 +-
 drivers/block/nbd.c  | 2 +-
 drivers/block/rbd.c  | 2 +-
 drivers/block/skd_main.c | 1 -
 drivers/block/xen-blkfront.c | 2 +-
 drivers/md/dm-rq.c   | 2 +-
 drivers/mmc/core/queue.c | 3 +--
 drivers/scsi/scsi_lib.c  | 2 +-
 include/linux/blk-mq.h   | 1 -
 10 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index d752fe4461af..a6ec055b54fa 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -249,7 +249,6 @@ static const char *const alloc_policy_name[] = {
 static const char *const hctx_flag_name[] = {
HCTX_FLAG_NAME(SHOULD_MERGE),
HCTX_FLAG_NAME(TAG_SHARED),
-   HCTX_FLAG_NAME(SG_MERGE),
HCTX_FLAG_NAME(BLOCKING),
HCTX_FLAG_NAME(NO_SCHED),
 };
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index e3683211f12d..4cf5486689de 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1906,7 +1906,7 @@ static int loop_add(struct loop_device **l, int i)
lo->tag_set.queue_depth = 128;
lo->tag_set.numa_node = NUMA_NO_NODE;
lo->tag_set.cmd_size = sizeof(struct loop_cmd);
-   lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+   lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
lo->tag_set.driver_data = lo;
 
	err = blk_mq_alloc_tag_set(&lo->tag_set);
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 08696f5f00bb..999c94de78e5 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -1570,7 +1570,7 @@ static int nbd_dev_add(int index)
nbd->tag_set.numa_node = NUMA_NO_NODE;
nbd->tag_set.cmd_size = sizeof(struct nbd_cmd);
nbd->tag_set.flags = BLK_MQ_F_SHOULD_MERGE |
-   BLK_MQ_F_SG_MERGE | BLK_MQ_F_BLOCKING;
+   BLK_MQ_F_BLOCKING;
nbd->tag_set.driver_data = nbd;
 
	err = blk_mq_alloc_tag_set(&nbd->tag_set);
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 8e5140bbf241..3dfd300b5283 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3988,7 +3988,7 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
	rbd_dev->tag_set.ops = &rbd_mq_ops;
rbd_dev->tag_set.queue_depth = rbd_dev->opts->queue_depth;
rbd_dev->tag_set.numa_node = NUMA_NO_NODE;
-   rbd_dev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+   rbd_dev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
rbd_dev->tag_set.nr_hw_queues = 1;
rbd_dev->tag_set.cmd_size = sizeof(struct work_struct);
 
diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c
index a10d5736d8f7..a7040f9a1b1b 100644
--- a/drivers/block/skd_main.c
+++ b/drivers/block/skd_main.c
@@ -2843,7 +2843,6 @@ static int skd_cons_disk(struct skd_device *skdev)
skdev->sgs_per_request * sizeof(struct scatterlist);
skdev->tag_set.numa_node = NUMA_NO_NODE;
skdev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE |
-   BLK_MQ_F_SG_MERGE |
BLK_ALLOC_POLICY_TO_MQ_FLAG(BLK_TAG_ALLOC_FIFO);
skdev->tag_set.driver_data = skdev;
	rc = blk_mq_alloc_tag_set(&skdev->tag_set);
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 0ed4b200fa58..d43a5677ccbc 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -977,7 +977,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
} else
info->tag_set.queue_depth = BLK_RING_SIZE(info);
info->tag_set.numa_node = NUMA_NO_NODE;
-   info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+   info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
info->tag_set.cmd_size = sizeof(struct blkif_req);
info->tag_set.driver_data = info;
 
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 1f1fe9a618ea..afbac62a02a2 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -536,7 +536,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
	md->tag_set->ops = &dm_mq_ops;
md->tag_set->queue_depth = dm_get_blk_mq_queue_depth();
md->tag_set->numa_node = md->numa_node_id;
-   md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+   md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE;
md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
md->tag_set->driver_data = md;
 
diff --git a/drivers/mmc/core/queue.c b/drivers/mmc/core/queue.c
index 35cc138b096d..cc19e71c71d4 100644
--- a/drivers/mmc/core/queue.c
+++ b/drivers/mmc/core/queue.c
@@ -410,8 +410,7 @@ int mmc_init_queue(struct mmc_queue *mq, struct mmc_card *card)
else
mq->tag_set.queue_depth = MMC_QUEUE_DEPTH;
mq->tag_set.numa_node = NUMA_NO_NODE;
-  

[Cluster-devel] [PATCH V11 18/19] block: kill QUEUE_FLAG_NO_SG_MERGE

2018-11-20 Thread Ming Lei
Since bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting"),
the physical segment number is mainly figured out in blk_queue_split() for
the fast path, and the BIO_SEG_VALID flag is set there too.

Now only blk_recount_segments() and blk_recalc_rq_segments() use this
flag.

Basically blk_recount_segments() is bypassed in the fast path given
BIO_SEG_VALID is set in blk_queue_split().

For the other users of blk_recalc_rq_segments():

- it runs in the partial completion branch of blk_update_request(), which is
an unusual case

- it runs in blk_cloned_rq_check_limits(); still not a big problem if the
flag is killed, since dm-rq is the only user

Multi-page bvec is enabled now; not doing S/G merging is rather pointless
with the current setup of the I/O path, as it isn't going to save you a
significant amount of cycles.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 block/blk-merge.c  | 31 ++-
 block/blk-mq-debugfs.c |  1 -
 block/blk-mq.c |  3 ---
 drivers/md/dm-table.c  | 13 -
 include/linux/blkdev.h |  1 -
 5 files changed, 6 insertions(+), 43 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 7c44216c1b58..8fcac7855a45 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -343,8 +343,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
 EXPORT_SYMBOL(blk_queue_split);
 
 static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
-struct bio *bio,
-bool no_sg_merge)
+struct bio *bio)
 {
struct bio_vec bv, bvprv = { NULL };
int cluster, prev = 0;
@@ -371,13 +370,6 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
nr_phys_segs = 0;
for_each_bio(bio) {
bio_for_each_bvec(bv, bio, iter) {
-   /*
-* If SG merging is disabled, each bio vector is
-* a segment
-*/
-   if (no_sg_merge)
-   goto new_segment;
-
if (prev && cluster) {
if (seg_size + bv.bv_len
> queue_max_segment_size(q))
@@ -412,27 +404,16 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 
 void blk_recalc_rq_segments(struct request *rq)
 {
-   bool no_sg_merge = !!test_bit(QUEUE_FLAG_NO_SG_MERGE,
-   &rq->q->queue_flags);
-
-   rq->nr_phys_segments = __blk_recalc_rq_segments(rq->q, rq->bio,
-   no_sg_merge);
+   rq->nr_phys_segments = __blk_recalc_rq_segments(rq->q, rq->bio);
 }
 
 void blk_recount_segments(struct request_queue *q, struct bio *bio)
 {
-   unsigned short seg_cnt = bio_segments(bio);
-
-   if (test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags) &&
-   (seg_cnt < queue_max_segments(q)))
-   bio->bi_phys_segments = seg_cnt;
-   else {
-   struct bio *nxt = bio->bi_next;
+   struct bio *nxt = bio->bi_next;
 
-   bio->bi_next = NULL;
-   bio->bi_phys_segments = __blk_recalc_rq_segments(q, bio, false);
-   bio->bi_next = nxt;
-   }
+   bio->bi_next = NULL;
+   bio->bi_phys_segments = __blk_recalc_rq_segments(q, bio);
+   bio->bi_next = nxt;
 
bio_set_flag(bio, BIO_SEG_VALID);
 }
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index a32bb79d6c95..d752fe4461af 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -127,7 +127,6 @@ static const char *const blk_queue_flag_name[] = {
QUEUE_FLAG_NAME(SAME_FORCE),
QUEUE_FLAG_NAME(DEAD),
QUEUE_FLAG_NAME(INIT_DONE),
-   QUEUE_FLAG_NAME(NO_SG_MERGE),
QUEUE_FLAG_NAME(POLL),
QUEUE_FLAG_NAME(WC),
QUEUE_FLAG_NAME(FUA),
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 32b246ed44c0..0375c3bd410e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2755,9 +2755,6 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 
q->queue_flags |= QUEUE_FLAG_MQ_DEFAULT;
 
-   if (!(set->flags & BLK_MQ_F_SG_MERGE))
-   blk_queue_flag_set(QUEUE_FLAG_NO_SG_MERGE, q);
-
q->sg_reserved_size = INT_MAX;
 
	INIT_DELAYED_WORK(&q->requeue_work, blk_mq_requeue_work);
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 844f7d0f2ef8..a41832cf0c98 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1698,14 +1698,6 @@ static int device_is_not_random(struct dm_target *ti, struct dm_dev *dev,
return q && !blk_queue_add_random(q);
 }
 
-static int queue_supports_sg_merge(struct dm_target *ti, struct dm_dev *dev,
-  sector_t start, sector_t len, void *data)
-{
-   struct 

[Cluster-devel] [PATCH V11 17/19] block: document usage of bio iterator helpers

2018-11-20 Thread Ming Lei
Now that multi-page bvec is supported, some helpers may return page by
page, while others may return segment by segment; this patch
documents the usage.

Signed-off-by: Ming Lei 
---
 Documentation/block/biovecs.txt | 24 
 1 file changed, 24 insertions(+)

diff --git a/Documentation/block/biovecs.txt b/Documentation/block/biovecs.txt
index 25689584e6e0..bb008f7afb05 100644
--- a/Documentation/block/biovecs.txt
+++ b/Documentation/block/biovecs.txt
@@ -117,3 +117,27 @@ Other implications:
size limitations and the limitations of the underlying devices. Thus
there's no need to define ->merge_bvec_fn() callbacks for individual block
drivers.
+
+Usage of helpers:
+=
+
+* The following helpers whose names have the suffix of "_all" can only be used
+on non-BIO_CLONED bio. They are usually used by filesystem code. Drivers
+shouldn't use them because the bio may have been split before it reached the
+driver.
+
+   bio_for_each_segment_all()
+   bio_first_bvec_all()
+   bio_first_page_all()
+   bio_last_bvec_all()
+
+* The following helpers iterate over single-page bvecs. The passed 'struct
+bio_vec' will contain a single-page IO vector during the iteration
+
+   bio_for_each_segment()
+   bio_for_each_segment_all()
+
+* The following helpers iterate over multi-page bvecs. The passed 'struct
+bio_vec' will contain a multi-page IO vector during the iteration
+
+   bio_for_each_bvec()
-- 
2.9.5



[Cluster-devel] [PATCH V11 16/19] block: always define BIO_MAX_PAGES as 256

2018-11-20 Thread Ming Lei
Now multi-page bvec can cover CONFIG_THP_SWAP, so we don't need to
increase BIO_MAX_PAGES for it.

CONFIG_THP_SWAP needs to split one THP into normal pages and add
them all to one bio. With multi-page bvec, it just takes one bvec to
hold them all.

Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 include/linux/bio.h | 8 
 1 file changed, 8 deletions(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7edad188568a..e5b975fa0558 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -34,15 +34,7 @@
 #define BIO_BUG_ON
 #endif
 
-#ifdef CONFIG_THP_SWAP
-#if HPAGE_PMD_NR > 256
-#define BIO_MAX_PAGES  HPAGE_PMD_NR
-#else
 #define BIO_MAX_PAGES  256
-#endif
-#else
-#define BIO_MAX_PAGES  256
-#endif
 
 #define bio_prio(bio)  (bio)->bi_ioprio
 #define bio_set_prio(bio, prio)((bio)->bi_ioprio = prio)
-- 
2.9.5



[Cluster-devel] [PATCH V11 15/19] block: enable multipage bvecs

2018-11-20 Thread Ming Lei
This patch pulls the trigger for multi-page bvecs.

Signed-off-by: Ming Lei 
---
 block/bio.c   | 32 +++-
 fs/iomap.c|  2 +-
 fs/xfs/xfs_aops.c |  2 +-
 3 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 0f1635b9ec50..854676edc438 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -823,7 +823,7 @@ EXPORT_SYMBOL(bio_add_pc_page);
  * @len: length of the data to add
  * @off: offset of the data in @page
  *
- * Try to add the data at @page + @off to the last bvec of @bio.  This is a
+ * Try to add the data at @page + @off to the last page of @bio.  This is a
  * a useful optimisation for file systems with a block size smaller than the
  * page size.
  *
@@ -836,10 +836,13 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
return false;
 
if (bio->bi_vcnt > 0) {
-   struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+   struct bio_vec bv;
+   struct bio_vec *seg = &bio->bi_io_vec[bio->bi_vcnt - 1];
 
-   if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len) {
-   bv->bv_len += len;
+   bvec_last_segment(seg, &bv);
+
+   if (page == bv.bv_page && off == bv.bv_offset + bv.bv_len) {
+   seg->bv_len += len;
bio->bi_iter.bi_size += len;
return true;
}
@@ -848,6 +851,25 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
 }
 EXPORT_SYMBOL_GPL(__bio_try_merge_page);
 
+static bool bio_try_merge_segment(struct bio *bio, struct page *page,
+ unsigned int len, unsigned int off)
+{
+   if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
+   return false;
+
+   if (bio->bi_vcnt > 0) {
+   struct bio_vec *seg = &bio->bi_io_vec[bio->bi_vcnt - 1];
+
+   if (page_to_phys(seg->bv_page) + seg->bv_offset + seg->bv_len ==
+   page_to_phys(page) + off) {
+   seg->bv_len += len;
+   bio->bi_iter.bi_size += len;
+   return true;
+   }
+   }
+   return false;
+}
+
 /**
  * __bio_add_page - add page to a bio in a new segment
  * @bio: destination bio
@@ -888,7 +910,7 @@ EXPORT_SYMBOL_GPL(__bio_add_page);
 int bio_add_page(struct bio *bio, struct page *page,
 unsigned int len, unsigned int offset)
 {
-   if (!__bio_try_merge_page(bio, page, len, offset)) {
+   if (!bio_try_merge_segment(bio, page, len, offset)) {
if (bio_full(bio))
return 0;
__bio_add_page(bio, page, len, offset);
diff --git a/fs/iomap.c b/fs/iomap.c
index f5fb8bf75cc8..ccc2ba115f4d 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -344,7 +344,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
ctx->bio->bi_end_io = iomap_read_end_io;
}
 
-   __bio_add_page(ctx->bio, page, plen, poff);
+   bio_add_page(ctx->bio, page, plen, poff);
 done:
/*
 * Move the caller beyond our range so that it keeps making progress.
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 1f1829e506e8..5c2190216614 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -621,7 +621,7 @@ xfs_add_to_ioend(
	atomic_inc(&iop->write_count);
if (bio_full(wpc->ioend->io_bio))
xfs_chain_bio(wpc->ioend, wbc, bdev, sector);
-   __bio_add_page(wpc->ioend->io_bio, page, len, poff);
+   bio_add_page(wpc->ioend->io_bio, page, len, poff);
}
 
wpc->ioend->io_size += len;
-- 
2.9.5



[Cluster-devel] [PATCH V11 12/19] block: allow bio_for_each_segment_all() to iterate over multi-page bvec

2018-11-20 Thread Ming Lei
This patch introduces one extra iterator variable to bio_for_each_segment_all(),
so that we can allow bio_for_each_segment_all() to iterate over multi-page bvecs.

Given it is just one mechanical & simple change on all
bio_for_each_segment_all() users, this patch does the tree-wide change in one
single patch, so that we can avoid using a temporary helper for this
conversion.

Signed-off-by: Ming Lei 
---
 block/bio.c   | 27 ++-
 block/bounce.c|  6 --
 drivers/md/bcache/btree.c |  3 ++-
 drivers/md/dm-crypt.c |  3 ++-
 drivers/md/raid1.c|  3 ++-
 drivers/staging/erofs/data.c  |  3 ++-
 drivers/staging/erofs/unzip_vle.c |  3 ++-
 fs/block_dev.c|  6 --
 fs/btrfs/compression.c|  3 ++-
 fs/btrfs/disk-io.c|  3 ++-
 fs/btrfs/extent_io.c  | 12 
 fs/btrfs/inode.c  |  6 --
 fs/btrfs/raid56.c |  3 ++-
 fs/crypto/bio.c   |  3 ++-
 fs/direct-io.c|  4 +++-
 fs/exofs/ore.c|  3 ++-
 fs/exofs/ore_raid.c   |  3 ++-
 fs/ext4/page-io.c |  3 ++-
 fs/ext4/readpage.c|  3 ++-
 fs/f2fs/data.c|  9 ++---
 fs/gfs2/lops.c|  6 --
 fs/gfs2/meta_io.c |  3 ++-
 fs/iomap.c|  6 --
 fs/mpage.c|  3 ++-
 fs/xfs/xfs_aops.c |  5 +++--
 include/linux/bio.h   | 11 +--
 include/linux/bvec.h  | 31 +++
 27 files changed, 128 insertions(+), 46 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 4f4d9884443b..2680aa42a625 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1073,8 +1073,9 @@ static int bio_copy_from_iter(struct bio *bio, struct iov_iter *iter)
 {
int i;
struct bio_vec *bvec;
+   struct bvec_iter_all iter_all;
 
-   bio_for_each_segment_all(bvec, bio, i) {
+   bio_for_each_segment_all(bvec, bio, i, iter_all) {
ssize_t ret;
 
ret = copy_page_from_iter(bvec->bv_page,
@@ -1104,8 +1105,9 @@ static int bio_copy_to_iter(struct bio *bio, struct iov_iter iter)
 {
int i;
struct bio_vec *bvec;
+   struct bvec_iter_all iter_all;
 
-   bio_for_each_segment_all(bvec, bio, i) {
+   bio_for_each_segment_all(bvec, bio, i, iter_all) {
ssize_t ret;
 
ret = copy_page_to_iter(bvec->bv_page,
@@ -1127,8 +1129,9 @@ void bio_free_pages(struct bio *bio)
 {
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all iter_all;
 
-   bio_for_each_segment_all(bvec, bio, i)
+   bio_for_each_segment_all(bvec, bio, i, iter_all)
__free_page(bvec->bv_page);
 }
 EXPORT_SYMBOL(bio_free_pages);
@@ -1295,6 +1298,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
struct bio *bio;
int ret;
struct bio_vec *bvec;
+   struct bvec_iter_all iter_all;
 
if (!iov_iter_count(iter))
return ERR_PTR(-EINVAL);
@@ -1368,7 +1372,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
return bio;
 
  out_unmap:
-   bio_for_each_segment_all(bvec, bio, j) {
+   bio_for_each_segment_all(bvec, bio, j, iter_all) {
put_page(bvec->bv_page);
}
bio_put(bio);
@@ -1379,11 +1383,12 @@ static void __bio_unmap_user(struct bio *bio)
 {
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all iter_all;
 
/*
 * make sure we dirty pages we wrote to
 */
-   bio_for_each_segment_all(bvec, bio, i) {
+   bio_for_each_segment_all(bvec, bio, i, iter_all) {
if (bio_data_dir(bio) == READ)
set_page_dirty_lock(bvec->bv_page);
 
@@ -1475,8 +1480,9 @@ static void bio_copy_kern_endio_read(struct bio *bio)
char *p = bio->bi_private;
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all iter_all;
 
-   bio_for_each_segment_all(bvec, bio, i) {
+   bio_for_each_segment_all(bvec, bio, i, iter_all) {
memcpy(p, page_address(bvec->bv_page), bvec->bv_len);
p += bvec->bv_len;
}
@@ -1585,8 +1591,9 @@ void bio_set_pages_dirty(struct bio *bio)
 {
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all iter_all;
 
-   bio_for_each_segment_all(bvec, bio, i) {
+   bio_for_each_segment_all(bvec, bio, i, iter_all) {
if (!PageCompound(bvec->bv_page))
set_page_dirty_lock(bvec->bv_page);
}
@@ -1597,8 +1604,9 @@ static void bio_release_pages(struct bio *bio)
 {
struct bio_vec *bvec;
int i;
+   struct bvec_iter_all iter_all;
 
-   bio_for_each_segment_all(bvec, bio, i)
+   bio_for_each_segment_all(bvec, bio, i, 

[Cluster-devel] [PATCH V11 14/19] block: handle non-cluster bio out of blk_bio_segment_split

2018-11-20 Thread Ming Lei
We will enable multi-page bvec soon, but a non-cluster queue can't
handle multi-page bvecs at all. This patch borrows bounce's
idea to clone a new single-page bio for the non-cluster queue, and moves
its handling out of blk_bio_segment_split().

Signed-off-by: Ming Lei 
---
 block/Makefile  |  3 ++-
 block/blk-merge.c   |  6 -
 block/blk.h |  2 ++
 block/non-cluster.c | 70 +
 4 files changed, 79 insertions(+), 2 deletions(-)
 create mode 100644 block/non-cluster.c

diff --git a/block/Makefile b/block/Makefile
index eee1b4ceecf9..e07d59438c4b 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -9,7 +9,8 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o blk-sysfs.o \
blk-lib.o blk-mq.o blk-mq-tag.o blk-stat.o \
blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \
genhd.o partition-generic.o ioprio.o \
-   badblocks.o partitions/ blk-rq-qos.o
+   badblocks.o partitions/ blk-rq-qos.o \
+   non-cluster.o
 
 obj-$(CONFIG_BOUNCE)   += bounce.o
 obj-$(CONFIG_BLK_SCSI_REQUEST) += scsi_ioctl.o
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 8829c51b4e75..7c44216c1b58 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -247,7 +247,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
goto split;
}
 
-   if (bvprvp && blk_queue_cluster(q)) {
+   if (bvprvp) {
if (seg_size + bv.bv_len > queue_max_segment_size(q))
goto new_segment;
	if (!biovec_phys_mergeable(q, bvprvp, &bv))
@@ -307,6 +307,10 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
	split = blk_bio_write_same_split(q, *bio, &q->bio_split, &nsegs);
break;
default:
+   if (!blk_queue_cluster(q)) {
+   blk_queue_non_cluster_bio(q, bio);
+   return;
+   }
	split = blk_bio_segment_split(q, *bio, &q->bio_split, &nsegs);
break;
}
diff --git a/block/blk.h b/block/blk.h
index 31c0e45aba3a..6fc5821ced55 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -338,6 +338,8 @@ struct bio *blk_next_bio(struct bio *bio, unsigned int nr_pages, gfp_t gfp);
 
 struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask, struct bio_set *bs);
 
+void blk_queue_non_cluster_bio(struct request_queue *q, struct bio **bio_orig);
+
 #ifdef CONFIG_BLK_DEV_ZONED
 void blk_queue_free_zone_bitmaps(struct request_queue *q);
 #else
diff --git a/block/non-cluster.c b/block/non-cluster.c
new file mode 100644
index ..9c2910be9404
--- /dev/null
+++ b/block/non-cluster.c
@@ -0,0 +1,70 @@
+// SPDX-License-Identifier: GPL-2.0
+/* non-cluster handling for block devices */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "blk.h"
+
+static struct bio_set non_cluster_bio_set, non_cluster_bio_split;
+
+static __init int init_non_cluster_bioset(void)
+{
+   WARN_ON(bioset_init(&non_cluster_bio_set, BIO_POOL_SIZE, 0,
+  BIOSET_NEED_BVECS));
+   WARN_ON(bioset_integrity_create(&non_cluster_bio_set, BIO_POOL_SIZE));
+   WARN_ON(bioset_init(&non_cluster_bio_split, BIO_POOL_SIZE, 0, 0));
+
+   return 0;
+}
+__initcall(init_non_cluster_bioset);
+
+static void non_cluster_end_io(struct bio *bio)
+{
+   struct bio *bio_orig = bio->bi_private;
+
+   bio_orig->bi_status = bio->bi_status;
+   bio_endio(bio_orig);
+   bio_put(bio);
+}
+
+void blk_queue_non_cluster_bio(struct request_queue *q, struct bio **bio_orig)
+{
+   struct bio *bio;
+   struct bvec_iter iter;
+   struct bio_vec from;
+   unsigned i = 0;
+   unsigned sectors = 0;
+   unsigned short max_segs = min_t(unsigned short, BIO_MAX_PAGES,
+   queue_max_segments(q));
+
+   bio_for_each_segment(from, *bio_orig, iter) {
+   if (i++ < max_segs)
+   sectors += from.bv_len >> 9;
+   else
+   break;
+   }
+
+   if (sectors < bio_sectors(*bio_orig)) {
+   bio = bio_split(*bio_orig, sectors, GFP_NOIO,
+   &non_cluster_bio_split);
+   bio_chain(bio, *bio_orig);
+   generic_make_request(*bio_orig);
+   *bio_orig = bio;
+   }
+   bio = bio_clone_bioset(*bio_orig, GFP_NOIO, &non_cluster_bio_set);
+
+   bio->bi_phys_segments = bio_segments(bio);
+   bio_set_flag(bio, BIO_SEG_VALID);
+   bio->bi_end_io = non_cluster_end_io;
+
+   bio->bi_private = *bio_orig;
+   *bio_orig = bio;
+}
-- 
2.9.5



[Cluster-devel] [PATCH V11 13/19] block: move bounce_clone_bio into bio.c

2018-11-20 Thread Ming Lei
We will reuse bounce_clone_bio() for cloning bio in case of
!blk_queue_cluster(q), so move this helper into bio.c and
rename it as bio_clone_bioset().

No function change.

Signed-off-by: Ming Lei 
---
 block/bio.c| 69 +
 block/blk.h|  2 ++
 block/bounce.c | 70 +-
 3 files changed, 72 insertions(+), 69 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 2680aa42a625..0f1635b9ec50 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -647,6 +647,75 @@ struct bio *bio_clone_fast(struct bio *bio, gfp_t gfp_mask, struct bio_set *bs)
 }
 EXPORT_SYMBOL(bio_clone_fast);
 
+/* block core only helper */
+struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask,
+struct bio_set *bs)
+{
+   struct bvec_iter iter;
+   struct bio_vec bv;
+   struct bio *bio;
+
+   /*
+* Pre immutable biovecs, __bio_clone() used to just do a memcpy from
+* bio_src->bi_io_vec to bio->bi_io_vec.
+*
+* We can't do that anymore, because:
+*
+*  - The point of cloning the biovec is to produce a bio with a biovec
+*the caller can modify: bi_idx and bi_bvec_done should be 0.
+*
+*  - The original bio could've had more than BIO_MAX_PAGES biovecs; if
+*we tried to clone the whole thing bio_alloc_bioset() would fail.
+*But the clone should succeed as long as the number of biovecs we
+*actually need to allocate is fewer than BIO_MAX_PAGES.
+*
+*  - Lastly, bi_vcnt should not be looked at or relied upon by code
+*that does not own the bio - reason being drivers don't use it for
+*iterating over the biovec anymore, so expecting it to be kept up
+*to date (i.e. for clones that share the parent biovec) is just
+*asking for trouble and would force extra work on
+*__bio_clone_fast() anyways.
+*/
+
+   bio = bio_alloc_bioset(gfp_mask, bio_segments(bio_src), bs);
+   if (!bio)
+   return NULL;
+   bio->bi_disk= bio_src->bi_disk;
+   bio->bi_opf = bio_src->bi_opf;
+   bio->bi_ioprio  = bio_src->bi_ioprio;
+   bio->bi_write_hint  = bio_src->bi_write_hint;
+   bio->bi_iter.bi_sector  = bio_src->bi_iter.bi_sector;
+   bio->bi_iter.bi_size= bio_src->bi_iter.bi_size;
+
+   switch (bio_op(bio)) {
+   case REQ_OP_DISCARD:
+   case REQ_OP_SECURE_ERASE:
+   case REQ_OP_WRITE_ZEROES:
+   break;
+   case REQ_OP_WRITE_SAME:
+   bio->bi_io_vec[bio->bi_vcnt++] = bio_src->bi_io_vec[0];
+   break;
+   default:
+   bio_for_each_segment(bv, bio_src, iter)
+   bio->bi_io_vec[bio->bi_vcnt++] = bv;
+   break;
+   }
+
+   if (bio_integrity(bio_src)) {
+   int ret;
+
+   ret = bio_integrity_clone(bio, bio_src, gfp_mask);
+   if (ret < 0) {
+   bio_put(bio);
+   return NULL;
+   }
+   }
+
+   bio_clone_blkcg_association(bio, bio_src);
+
+   return bio;
+}
+
 /**
  * bio_add_pc_page -   attempt to add page to bio
  * @q: the target queue
diff --git a/block/blk.h b/block/blk.h
index 816a9abb87cd..31c0e45aba3a 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -336,6 +336,8 @@ static inline int blk_iolatency_init(struct request_queue *q) { return 0; }
 
 struct bio *blk_next_bio(struct bio *bio, unsigned int nr_pages, gfp_t gfp);
 
+struct bio *bio_clone_bioset(struct bio *bio_src, gfp_t gfp_mask, struct bio_set *bs);
+
 #ifdef CONFIG_BLK_DEV_ZONED
 void blk_queue_free_zone_bitmaps(struct request_queue *q);
 #else
diff --git a/block/bounce.c b/block/bounce.c
index 7338041e3042..4947c36173b2 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -215,74 +215,6 @@ static void bounce_end_io_read_isa(struct bio *bio)
	__bounce_end_io_read(bio, &isa_page_pool);
 }
 
-static struct bio *bounce_clone_bio(struct bio *bio_src, gfp_t gfp_mask,
-   struct bio_set *bs)
-{
-   struct bvec_iter iter;
-   struct bio_vec bv;
-   struct bio *bio;
-
-   /*
-* Pre immutable biovecs, __bio_clone() used to just do a memcpy from
-* bio_src->bi_io_vec to bio->bi_io_vec.
-*
-* We can't do that anymore, because:
-*
-*  - The point of cloning the biovec is to produce a bio with a biovec
-*the caller can modify: bi_idx and bi_bvec_done should be 0.
-*
-*  - The original bio could've had more than BIO_MAX_PAGES biovecs; if
-*we tried to clone the whole thing bio_alloc_bioset() would fail.
-*But the clone should succeed as long as the number of biovecs we
-*actually need to allocate is fewer than BIO_MAX_PAGES.
- 

[Cluster-devel] [PATCH V11 11/19] bcache: avoid to use bio_for_each_segment_all() in bch_bio_alloc_pages()

2018-11-20 Thread Ming Lei
bch_bio_alloc_pages() is always called on a new bio, so it is safe
to access the bvec table directly. Given it is the only case of this
kind, open-code the bvec table access, since bio_for_each_segment_all()
will be changed to support iterating over multi-page bvecs.

Acked-by: Coly Li 
Signed-off-by: Ming Lei 
---
 drivers/md/bcache/util.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
index 20eddeac1531..62fb917f7a4f 100644
--- a/drivers/md/bcache/util.c
+++ b/drivers/md/bcache/util.c
@@ -270,7 +270,11 @@ int bch_bio_alloc_pages(struct bio *bio, gfp_t gfp_mask)
int i;
struct bio_vec *bv;
 
-   bio_for_each_segment_all(bv, bio, i) {
+   /*
+* This is called on freshly new bio, so it is safe to access the
+* bvec table directly.
+*/
+   for (i = 0, bv = bio->bi_io_vec; i < bio->bi_vcnt; bv++, i++) {
bv->bv_page = alloc_page(gfp_mask);
if (!bv->bv_page) {
while (--bv >= bio->bi_io_vec)
-- 
2.9.5



[Cluster-devel] [PATCH V11 10/19] block: loop: pass multi-page bvec to iov_iter

2018-11-20 Thread Ming Lei
iov_iter is implemented on bvec iterator helpers, so it is safe to pass
a multi-page bvec to it, and this way is much more efficient than passing one
page in each bvec.

Signed-off-by: Ming Lei 
---
 drivers/block/loop.c   | 20 ++--
 include/linux/blkdev.h |  4 
 2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 176ab1f28eca..e3683211f12d 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -510,21 +510,22 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
 loff_t pos, bool rw)
 {
struct iov_iter iter;
+   struct req_iterator rq_iter;
struct bio_vec *bvec;
struct request *rq = blk_mq_rq_from_pdu(cmd);
struct bio *bio = rq->bio;
struct file *file = lo->lo_backing_file;
+   struct bio_vec tmp;
unsigned int offset;
-   int segments = 0;
+   int nr_bvec = 0;
int ret;
 
+   rq_for_each_bvec(tmp, rq, rq_iter)
+   nr_bvec++;
+
if (rq->bio != rq->biotail) {
-   struct req_iterator iter;
-   struct bio_vec tmp;
 
-   __rq_for_each_bio(bio, rq)
-   segments += bio_segments(bio);
-   bvec = kmalloc_array(segments, sizeof(struct bio_vec),
+   bvec = kmalloc_array(nr_bvec, sizeof(struct bio_vec),
 GFP_NOIO);
if (!bvec)
return -EIO;
@@ -533,10 +534,10 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
/*
 * The bios of the request may be started from the middle of
 * the 'bvec' because of bio splitting, so we can't directly
-* copy bio->bi_iov_vec to new bvec. The rq_for_each_segment
+* copy bio->bi_iov_vec to new bvec. The rq_for_each_bvec
 * API will take care of all details for us.
 */
-   rq_for_each_segment(tmp, rq, iter) {
+   rq_for_each_bvec(tmp, rq, rq_iter) {
*bvec = tmp;
bvec++;
}
@@ -550,11 +551,10 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
 */
offset = bio->bi_iter.bi_bvec_done;
bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
-   segments = bio_segments(bio);
}
	atomic_set(&cmd->ref, 2);
 
-   iov_iter_bvec(&iter, rw, bvec, segments, blk_rq_bytes(rq));
+   iov_iter_bvec(&iter, rw, bvec, nr_bvec, blk_rq_bytes(rq));
iter.iov_offset = offset;
 
cmd->iocb.ki_pos = pos;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1ad6eafc43f2..a281b6737b61 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -805,6 +805,10 @@ struct req_iterator {
__rq_for_each_bio(_iter.bio, _rq)   \
bio_for_each_segment(bvl, _iter.bio, _iter.iter)
 
+#define rq_for_each_bvec(bvl, _rq, _iter)  \
+   __rq_for_each_bio(_iter.bio, _rq)   \
+   bio_for_each_bvec(bvl, _iter.bio, _iter.iter)
+
 #define rq_iter_last(bvec, _iter)  \
(_iter.bio->bi_next == NULL &&  \
 bio_iter_last(bvec, _iter.iter))
-- 
2.9.5



[Cluster-devel] [PATCH V11 09/19] btrfs: move bio_pages_all() to btrfs

2018-11-20 Thread Ming Lei
BTRFS is the only user of this helper, so move it into
BTRFS and implement it via bio_for_each_segment_all(), since
bio->bi_vcnt may not equal the number of pages after multi-page bvec
is enabled.

Signed-off-by: Ming Lei 
---
 fs/btrfs/extent_io.c | 14 +-
 include/linux/bio.h  |  6 --
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 5d5965297e7e..874bb9aeebdc 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2348,6 +2348,18 @@ struct bio *btrfs_create_repair_bio(struct inode *inode, struct bio *failed_bio,
return bio;
 }
 
+static unsigned btrfs_bio_pages_all(struct bio *bio)
+{
+   unsigned i;
+   struct bio_vec *bv;
+
+   WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
+
+   bio_for_each_segment_all(bv, bio, i)
+   ;
+   return i;
+}
+
 /*
  * this is a generic handler for readpage errors (default
  * readpage_io_failed_hook). if other copies exist, read those and write back
@@ -2368,7 +2380,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
int read_mode = 0;
blk_status_t status;
int ret;
-   unsigned failed_bio_pages = bio_pages_all(failed_bio);
+   unsigned failed_bio_pages = btrfs_bio_pages_all(failed_bio);
 
BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
 
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7560209d6a8a..9d6284f53c07 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -282,12 +282,6 @@ static inline void bio_get_last_bvec(struct bio *bio, struct bio_vec *bv)
bv->bv_len = iter.bi_bvec_done;
 }
 
-static inline unsigned bio_pages_all(struct bio *bio)
-{
-   WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
-   return bio->bi_vcnt;
-}
-
 static inline struct bio_vec *bio_first_bvec_all(struct bio *bio)
 {
WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
-- 
2.9.5



[Cluster-devel] [PATCH V11 07/19] fs/buffer.c: use bvec iterator to truncate the bio

2018-11-20 Thread Ming Lei
Once multi-page bvec is enabled, the last bvec may include more than one
page; this patch uses bvec_last_segment() to truncate the bio.

Reviewed-by: Omar Sandoval 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 fs/buffer.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 1286c2b95498..fa37ad52e962 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3032,7 +3032,10 @@ void guard_bio_eod(int op, struct bio *bio)
 
/* ..and clear the end of the buffer for reads */
if (op == REQ_OP_READ) {
-   zero_user(bvec->bv_page, bvec->bv_offset + bvec->bv_len,
+   struct bio_vec bv;
+
+   bvec_last_segment(bvec, &bv);
+   zero_user(bv.bv_page, bv.bv_offset + bv.bv_len,
truncated_bytes);
}
 }
-- 
2.9.5



[Cluster-devel] [PATCH V11 08/19] btrfs: use bvec_last_segment to get bio's last page

2018-11-20 Thread Ming Lei
Preparing for supporting multi-page bvec.

Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 fs/btrfs/extent_io.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d228f706ff3e..5d5965297e7e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2720,11 +2720,12 @@ static int __must_check submit_one_bio(struct bio *bio, int mirror_num,
 {
blk_status_t ret = 0;
struct bio_vec *bvec = bio_last_bvec_all(bio);
-   struct page *page = bvec->bv_page;
+   struct bio_vec bv;
struct extent_io_tree *tree = bio->bi_private;
u64 start;
 
-   start = page_offset(page) + bvec->bv_offset;
+   bvec_last_segment(bvec, &bv);
+   start = page_offset(bv.bv_page) + bv.bv_offset;
 
bio->bi_private = NULL;
 
-- 
2.9.5



[Cluster-devel] [PATCH V11 06/19] block: introduce bvec_last_segment()

2018-11-20 Thread Ming Lei
BTRFS and guard_bio_eod() need to get the last single-page segment
from one multi-page bvec, so introduce this helper to make them happy.

Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 include/linux/bvec.h | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index b279218c5c4d..b37d13a79a7d 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -173,4 +173,26 @@ static inline bool bvec_iter_advance(const struct bio_vec *bv,
.bi_bvec_done   = 0,\
 }
 
+/*
+ * Get the last single-page segment from the multi-page bvec and store it
+ * in @seg
+ */
+static inline void bvec_last_segment(const struct bio_vec *bvec,
+struct bio_vec *seg)
+{
+   unsigned total = bvec->bv_offset + bvec->bv_len;
+   unsigned last_page = (total - 1) / PAGE_SIZE;
+
+   seg->bv_page = nth_page(bvec->bv_page, last_page);
+
+   /* the whole segment is inside the last page */
+   if (bvec->bv_offset >= last_page * PAGE_SIZE) {
+   seg->bv_offset = bvec->bv_offset % PAGE_SIZE;
+   seg->bv_len = bvec->bv_len;
+   } else {
+   seg->bv_offset = 0;
+   seg->bv_len = total - last_page * PAGE_SIZE;
+   }
+}
+
 #endif /* __LINUX_BVEC_ITER_H */
-- 
2.9.5



[Cluster-devel] [PATCH V11 05/19] block: use bio_for_each_bvec() to map sg

2018-11-20 Thread Ming Lei
It is more efficient to use bio_for_each_bvec() to map sg; meanwhile
we have to consider splitting the multi-page bvec as done in blk_bio_segment_split().

Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 block/blk-merge.c | 68 +++
 1 file changed, 48 insertions(+), 20 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index ec0b93fa1ff8..8829c51b4e75 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -455,6 +455,52 @@ static int blk_phys_contig_segment(struct request_queue *q, struct bio *bio,
	return biovec_phys_mergeable(q, &end_bv, &nxt_bv);
 }
 
+static struct scatterlist *blk_next_sg(struct scatterlist **sg,
+   struct scatterlist *sglist)
+{
+   if (!*sg)
+   return sglist;
+
+   /*
+* If the driver previously mapped a shorter list, we could see a
+* termination bit prematurely unless it fully inits the sg table
+* on each mapping. We KNOW that there must be more entries here
+* or the driver would be buggy, so force clear the termination bit
+* to avoid doing a full sg_init_table() in drivers for each command.
+*/
+   sg_unmark_end(*sg);
+   return sg_next(*sg);
+}
+
+static unsigned blk_bvec_map_sg(struct request_queue *q,
+   struct bio_vec *bvec, struct scatterlist *sglist,
+   struct scatterlist **sg)
+{
+   unsigned nbytes = bvec->bv_len;
+   unsigned nsegs = 0, total = 0;
+
+   while (nbytes > 0) {
+   unsigned seg_size;
+   struct page *pg;
+   unsigned offset, idx;
+
+   *sg = blk_next_sg(sg, sglist);
+
+   seg_size = min(nbytes, queue_max_segment_size(q));
+   offset = (total + bvec->bv_offset) % PAGE_SIZE;
+   idx = (total + bvec->bv_offset) / PAGE_SIZE;
+   pg = nth_page(bvec->bv_page, idx);
+
+   sg_set_page(*sg, pg, seg_size, offset);
+
+   total += seg_size;
+   nbytes -= seg_size;
+   nsegs++;
+   }
+
+   return nsegs;
+}
+
 static inline void
 __blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
 struct scatterlist *sglist, struct bio_vec *bvprv,
@@ -472,25 +518,7 @@ __blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
(*sg)->length += nbytes;
} else {
 new_segment:
-   if (!*sg)
-   *sg = sglist;
-   else {
-   /*
-* If the driver previously mapped a shorter
-* list, we could see a termination bit
-* prematurely unless it fully inits the sg
-* table on each mapping. We KNOW that there
-* must be more entries here or the driver
-* would be buggy, so force clear the
-* termination bit to avoid doing a full
-* sg_init_table() in drivers for each command.
-*/
-   sg_unmark_end(*sg);
-   *sg = sg_next(*sg);
-   }
-
-   sg_set_page(*sg, bvec->bv_page, nbytes, bvec->bv_offset);
-   (*nsegs)++;
+   (*nsegs) += blk_bvec_map_sg(q, bvec, sglist, sg);
}
*bvprv = *bvec;
 }
@@ -512,7 +540,7 @@ static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
int cluster = blk_queue_cluster(q), nsegs = 0;
 
for_each_bio(bio)
-   bio_for_each_segment(bvec, bio, iter)
+   bio_for_each_bvec(bvec, bio, iter)
	__blk_segment_map_sg(q, &bvec, sglist, &bvprv, sg,
			     &nsegs, &cluster);
 
-- 
2.9.5



[Cluster-devel] [PATCH V11 04/19] block: use bio_for_each_bvec() to compute multi-page bvec count

2018-11-20 Thread Ming Lei
First, it is more efficient to use bio_for_each_bvec() in both
blk_bio_segment_split() and __blk_recalc_rq_segments() to compute how
many multi-page bvecs there are in the bio.

Secondly, once bio_for_each_bvec() is used, the bvec may need to be
split because its length can be much longer than the max segment size,
so we have to split the big bvec into several segments.

Thirdly, when splitting a multi-page bvec into segments, the max segment
limit may be reached, so the bio split needs to be considered under
this situation too.

Signed-off-by: Ming Lei 
---
 block/blk-merge.c | 87 +++
 1 file changed, 68 insertions(+), 19 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index f52400ce2187..ec0b93fa1ff8 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -161,6 +161,54 @@ static inline unsigned get_max_io_size(struct request_queue *q,
return sectors;
 }
 
+/*
+ * Split the bvec @bv into segments, and update all kinds of
+ * variables.
+ */
+static bool bvec_split_segs(struct request_queue *q, struct bio_vec *bv,
+   unsigned *nsegs, unsigned *last_seg_size,
+   unsigned *front_seg_size, unsigned *sectors)
+{
+   unsigned len = bv->bv_len;
+   unsigned total_len = 0;
+   unsigned new_nsegs = 0, seg_size = 0;
+
+   /*
+* Multipage bvec may be too big to hold in one segment,
+* so the current bvec has to be split into multiple
+* segments.
+*/
+   while (len && new_nsegs + *nsegs < queue_max_segments(q)) {
+   seg_size = min(queue_max_segment_size(q), len);
+
+   new_nsegs++;
+   total_len += seg_size;
+   len -= seg_size;
+
+   if ((bv->bv_offset + total_len) & queue_virt_boundary(q))
+   break;
+   }
+
+   /* update front segment size */
+   if (!*nsegs) {
+   unsigned first_seg_size = seg_size;
+
+   if (new_nsegs > 1)
+   first_seg_size = queue_max_segment_size(q);
+   if (*front_seg_size < first_seg_size)
+   *front_seg_size = first_seg_size;
+   }
+
+   /* update other variables */
+   *last_seg_size = seg_size;
+   *nsegs += new_nsegs;
+   if (sectors)
+   *sectors += total_len >> 9;
+
+   /* split in the middle of the bvec if len != 0 */
+   return !!len;
+}
+
 static struct bio *blk_bio_segment_split(struct request_queue *q,
 struct bio *bio,
 struct bio_set *bs,
@@ -174,7 +222,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
struct bio *new = NULL;
const unsigned max_sectors = get_max_io_size(q, bio);
 
-   bio_for_each_segment(bv, bio, iter) {
+   bio_for_each_bvec(bv, bio, iter) {
/*
 * If the queue doesn't support SG gaps and adding this
 * offset would create a gap, disallow it.
@@ -189,8 +237,12 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 */
if (nsegs < queue_max_segments(q) &&
sectors < max_sectors) {
-   nsegs++;
-   sectors = max_sectors;
+   /* split in the middle of bvec */
+   bv.bv_len = (max_sectors - sectors) << 9;
+   bvec_split_segs(q, &bv, &nsegs,
+   &seg_size,
+   &front_seg_size,
+   &sectors);
}
goto split;
}
@@ -212,14 +264,12 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
if (nsegs == queue_max_segments(q))
goto split;
 
-   if (nsegs == 1 && seg_size > front_seg_size)
-   front_seg_size = seg_size;
-
-   nsegs++;
bvprv = bv;
	bvprvp = &bvprv;
-   seg_size = bv.bv_len;
-   sectors += bv.bv_len >> 9;
+
+   if (bvec_split_segs(q, &bv, &nsegs, &seg_size,
+   &front_seg_size, &sectors))
+   goto split;
 
}
 
@@ -233,8 +283,6 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
bio = new;
}
 
-   if (nsegs == 1 && seg_size > front_seg_size)
-   front_seg_size = seg_size;
bio->bi_seg_front_size = front_seg_size;
if (seg_size > bio->bi_seg_back_size)
bio->bi_seg_back_size = seg_size;
@@ -297,6 +345,7 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
struct bio_vec bv, bvprv = { NULL };
int cluster, prev = 0;

[Cluster-devel] [PATCH V11 03/19] block: introduce bio_for_each_bvec()

2018-11-20 Thread Ming Lei
This helper is used for iterating over multi-page bvec for bio
split & merge code.
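
As a hedged usage sketch (the function name is illustrative only): the new
iterator yields one full multi-page bvec per iteration, so counting the
real segments of a bio, as the split & merge code needs to, becomes:

static unsigned bio_count_mp_bvecs(struct bio *bio)
{
	struct bio_vec bv;
	struct bvec_iter iter;
	unsigned nr = 0;

	/* each iteration returns a whole multi-page bvec, not one page */
	bio_for_each_bvec(bv, bio, iter)
		nr++;
	return nr;
}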

Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 include/linux/bio.h  | 25 ++---
 include/linux/bvec.h | 36 +---
 2 files changed, 51 insertions(+), 10 deletions(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 056fb627edb3..7560209d6a8a 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -76,6 +76,9 @@
 #define bio_data_dir(bio) \
(op_is_write(bio_op(bio)) ? WRITE : READ)
 
+#define bio_iter_mp_iovec(bio, iter)   \
+   segment_iter_bvec((bio)->bi_io_vec, (iter))
+
 /*
  * Check whether this bio carries any data or not. A NULL bio is allowed.
  */
@@ -135,18 +138,24 @@ static inline bool bio_full(struct bio *bio)
 #define bio_for_each_segment_all(bvl, bio, i)  \
for (i = 0, bvl = (bio)->bi_io_vec; i < (bio)->bi_vcnt; i++, bvl++)
 
-static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
-   unsigned bytes)
+static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
+ unsigned bytes, unsigned max_seg_len)
 {
iter->bi_sector += bytes >> 9;
 
if (bio_no_advance_iter(bio))
iter->bi_size -= bytes;
else
-   bvec_iter_advance(bio->bi_io_vec, iter, bytes);
+   __bvec_iter_advance(bio->bi_io_vec, iter, bytes, max_seg_len);
/* TODO: It is reasonable to complete bio with error here. */
 }
 
+static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
+   unsigned bytes)
+{
+   __bio_advance_iter(bio, iter, bytes, PAGE_SIZE);
+}
+
 #define __bio_for_each_segment(bvl, bio, iter, start)  \
for (iter = (start);\
 (iter).bi_size &&  \
@@ -156,6 +165,16 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
 #define bio_for_each_segment(bvl, bio, iter)   \
__bio_for_each_segment(bvl, bio, iter, (bio)->bi_iter)
 
+#define __bio_for_each_bvec(bvl, bio, iter, start) \
+   for (iter = (start);\
+(iter).bi_size &&  \
+   ((bvl = bio_iter_mp_iovec((bio), (iter))), 1);  \
+__bio_advance_iter((bio), &(iter), (bvl).bv_len, BVEC_MAX_LEN))
+
+/* returns one real segment(multi-page bvec) each time */
+#define bio_for_each_bvec(bvl, bio, iter)  \
+   __bio_for_each_bvec(bvl, bio, iter, (bio)->bi_iter)
+
 #define bio_iter_last(bvec, iter) ((iter).bi_size == (bvec).bv_len)
 
 static inline unsigned bio_segments(struct bio *bio)
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index ed90bbf4c9c9..b279218c5c4d 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -25,6 +25,8 @@
 #include 
 #include 
 
+#define BVEC_MAX_LEN  ((unsigned int)-1)
+
 /*
  * was unsigned short, but we might as well be ready for > 64kB I/O pages
  */
@@ -87,8 +89,15 @@ struct bvec_iter {
.bv_offset  = bvec_iter_offset((bvec), (iter)), \
 })
 
-static inline bool bvec_iter_advance(const struct bio_vec *bv,
-   struct bvec_iter *iter, unsigned bytes)
+#define segment_iter_bvec(bvec, iter)  \
+((struct bio_vec) {\
+   .bv_page= segment_iter_page((bvec), (iter)),\
+   .bv_len = segment_iter_len((bvec), (iter)), \
+   .bv_offset  = segment_iter_offset((bvec), (iter)),  \
+})
+
+static inline bool __bvec_iter_advance(const struct bio_vec *bv,
+   struct bvec_iter *iter, unsigned bytes, unsigned max_seg_len)
 {
if (WARN_ONCE(bytes > iter->bi_size,
 "Attempted to advance past end of bvec iter\n")) {
@@ -97,12 +106,18 @@ static inline bool bvec_iter_advance(const struct bio_vec *bv,
}
 
while (bytes) {
-   unsigned iter_len = bvec_iter_len(bv, *iter);
-   unsigned len = min(bytes, iter_len);
+   unsigned segment_len = segment_iter_len(bv, *iter);
 
-   bytes -= len;
-   iter->bi_size -= len;
-   iter->bi_bvec_done += len;
+   if (max_seg_len < BVEC_MAX_LEN)
+   segment_len = min_t(unsigned, segment_len,
+   max_seg_len -
+   bvec_iter_offset(bv, *iter));
+
+   segment_len = min(bytes, segment_len);
+
+   bytes -= segment_len;
+   iter->bi_size -= segment_len;
+   iter->bi_bvec_done += segment_len;
 
if 
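
For readers skimming the series, here is a minimal usage sketch of the
new iterator (count_multipage_segments() is a hypothetical helper for
illustration, not part of the patch):

static unsigned count_multipage_segments(struct bio *bio)
{
	struct bio_vec bv;
	struct bvec_iter iter;
	unsigned nr_segs = 0;

	/* bio_for_each_segment() yields one single-page bvec per pass;
	 * bio_for_each_bvec() yields one whole multi-page segment, so
	 * for the same bio nr_segs <= number of pages. */
	bio_for_each_bvec(bv, bio, iter)
		nr_segs++;

	return nr_segs;
}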

[Cluster-devel] [PATCH V11 02/19] block: introduce multi-page bvec helpers

2018-11-20 Thread Ming Lei
This patch introduces helpers of 'segment_iter_*' for multipage
bvec support.

The introduced helpers treat one bvec as a real multi-page segment,
which may include more than one page.

The existing bvec_iter_* helpers are interfaces for the current bvec
iterator, which drivers, fs, dm and so on assume to be single-page.
Those helpers now build single-page bvecs in flight, so current
bio/bvec users are not broken and need no change.

Follows some multi-page bvec background:

- bvecs stored in bio->bi_io_vec is always multi-page style

- a bvec (struct bio_vec) represents one physically contiguous I/O
  buffer; after multi-page bvec support, the buffer may include more
  than one page, and all pages represented by one bvec are physically
  contiguous. Before multi-page bvec support, at most one page was
  included in one bvec; we call that a single-page bvec.

- .bv_page of the bvec points to the 1st page in the multi-page bvec

- .bv_offset of the bvec is the offset of the buffer in the bvec

The effect on the current drivers/filesystem/dm/bcache/...:

- almost everyone assumes that one bvec includes only a single page,
  so we keep the single-page interface unchanged; for example,
  bio_for_each_segment() still returns single-page bvecs

- bio_for_each_segment_all() will return single-page bvecs too

- during iteration, the iterator variable (struct bvec_iter) is always
  updated in multi-page bvec style, and bvec_iter_advance() is left
  unchanged

- the returned (copied) single-page bvec is built in flight by bvec
  helpers from the stored multi-page bvec

Reviewed-by: Omar Sandoval 
Signed-off-by: Ming Lei 
---
 include/linux/bvec.h | 26 +++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 02c73c6aa805..ed90bbf4c9c9 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * was unsigned short, but we might as well be ready for > 64kB I/O pages
@@ -50,16 +51,35 @@ struct bvec_iter {
  */
 #define __bvec_iter_bvec(bvec, iter)   (&(bvec)[(iter).bi_idx])
 
-#define bvec_iter_page(bvec, iter) \
+#define segment_iter_page(bvec, iter)  \
(__bvec_iter_bvec((bvec), (iter))->bv_page)
 
-#define bvec_iter_len(bvec, iter)  \
+#define segment_iter_len(bvec, iter)   \
min((iter).bi_size, \
__bvec_iter_bvec((bvec), (iter))->bv_len - (iter).bi_bvec_done)
 
-#define bvec_iter_offset(bvec, iter)   \
+#define segment_iter_offset(bvec, iter)\
(__bvec_iter_bvec((bvec), (iter))->bv_offset + (iter).bi_bvec_done)
 
+#define segment_iter_page_idx(bvec, iter)  \
+   (segment_iter_offset((bvec), (iter)) / PAGE_SIZE)
+
+/*
+ * <page, offset, length> of single-page segment.
+ *
+ * This helpers are for building single-page bvec in flight.
+ */
+#define bvec_iter_offset(bvec, iter)   \
+   (segment_iter_offset((bvec), (iter)) % PAGE_SIZE)
+
+#define bvec_iter_len(bvec, iter)  \
+   min_t(unsigned, segment_iter_len((bvec), (iter)),   \
+ PAGE_SIZE - bvec_iter_offset((bvec), (iter)))
+
+#define bvec_iter_page(bvec, iter) \
+   nth_page(segment_iter_page((bvec), (iter)), \
+segment_iter_page_idx((bvec), (iter)))
+
 #define bvec_iter_bvec(bvec, iter) \
 ((struct bio_vec) {\
.bv_page= bvec_iter_page((bvec), (iter)),   \
-- 
2.9.5
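
A minimal sketch of what "building single-page bvecs in flight" means
for callers; the helper below is illustrative only and assumes a bio
whose stored bvecs are multi-page:

static void flush_each_page(struct bio *bio)
{
	struct bio_vec bv;
	struct bvec_iter iter;

	/* Even when bi_io_vec holds multi-page bvecs, this loop still
	 * sees one page at a time: bvec_iter_len() is clamped to the
	 * page boundary and bvec_iter_page() picks the right sub-page
	 * via nth_page(), so existing callers need no change. */
	bio_for_each_segment(bv, bio, iter)
		flush_dcache_page(bv.bv_page);
}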



[Cluster-devel] [PATCH V11 01/19] block: don't use bio->bi_vcnt to figure out segment number

2018-11-20 Thread Ming Lei
It is wrong to use bio->bi_vcnt to figure out how many segments
there are in the bio even though the CLONED flag isn't set on this bio,
because the bio may have been split or advanced.

So always use bio_segments() in blk_recount_segments(), and it shouldn't
cause any performance loss now because the physical segment number is figured
out in blk_queue_split() and BIO_SEG_VALID is set meantime since
bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting").

Reviewed-by: Christoph Hellwig 
Fixes: 76d8137a3113 ("blk-merge: recaculate segment if it isn't less than max segments")
Signed-off-by: Ming Lei 
---
 block/blk-merge.c | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index b1df622cbd85..f52400ce2187 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -368,13 +368,7 @@ void blk_recalc_rq_segments(struct request *rq)
 
 void blk_recount_segments(struct request_queue *q, struct bio *bio)
 {
-   unsigned short seg_cnt;
-
-   /* estimate segment number by bi_vcnt for non-cloned bio */
-   if (bio_flagged(bio, BIO_CLONED))
-   seg_cnt = bio_segments(bio);
-   else
-   seg_cnt = bio->bi_vcnt;
+   unsigned short seg_cnt = bio_segments(bio);
 
	if (test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags) &&
(seg_cnt < queue_max_segments(q)))
-- 
2.9.5
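
To see why bi_vcnt over-counts, a minimal sketch of a driver-side
split (segments_after_split() is hypothetical; bio_split() and
bio_chain() are the stock block-layer helpers):

static unsigned segments_after_split(struct bio *bio, int sectors,
				     struct bio_set *bs)
{
	struct bio *head = bio_split(bio, sectors, GFP_NOIO, bs);

	bio_chain(head, bio);	/* head completes before the remainder */

	/* 'bio' is now the advanced remainder: bi_iter has moved
	 * forward, but bi_vcnt still describes the whole original
	 * bvec table. bio_segments() walks the iterator and returns
	 * the real count; bi_vcnt would over-count. */
	return bio_segments(bio);
}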



Re: [Cluster-devel] [PATCH V10 09/19] block: introduce bio_bvecs()

2018-11-20 Thread Sagi Grimberg




Not sure I understand the 'blocking' problem in this case.

We can build a bvec table from this req, and send them all
in send(),


I would like to avoid growing bvec tables and keep everything
preallocated. Plus, a bvec_iter operates on a bvec which means
we'll need a table there as well... Not liking it so far...


can this way avoid your blocking issue? You may see this
example in branch 'rq->bio != rq->biotail' of lo_rw_aio().


This is exactly an example of not ignoring the bios...


If this way is what you need, I think you are right, even we may
introduce the following helpers:

rq_for_each_bvec()
rq_bvecs()


I'm not sure how this helps me either. Unless we can set a bvec_iter to
span bvecs or have an abstract bio crossing when we re-initialize the
bvec_iter I don't see how I can ignore bios completely...


So looks nvme-tcp host driver might be the 2nd driver which benefits
from multi-page bvec directly.

The multi-page bvec V11 has passed my tests and addressed almost
all the comments during review on V10. I removed bio_vecs() in V11,
but it won't be big deal, we can introduce them anytime when there
is the requirement.


multipage-bvecs and nvme-tcp are going to conflict, so it would be good
to coordinate on this. I think that the nvme-tcp host needs some
adjustments, such as setting a bvec_iter. I'm under the impression
that the change is rather small and self-contained, but I'm not sure
I have the full picture here.



Re: [Cluster-devel] [PATCH V10 09/19] block: introduce bio_bvecs()

2018-11-20 Thread Ming Lei
On Tue, Nov 20, 2018 at 12:11:35PM -0800, Sagi Grimberg wrote:
> 
> > > > The only user in your final tree seems to be the loop driver, and
> > > > even that one only uses the helper for read/write bios.
> > > > 
> > > > I think something like this would be much simpler in the end:
> > > 
> > > The recently submitted nvme-tcp host driver should also be a user
> > > of this. Does it make sense to keep it as a helper then?
> > 
> > I did take a brief look at the code, and I really don't understand
> > why the heck it even deals with bios to start with.  Like all the
> > other nvme transports it is a blk-mq driver and should iterate
> > over segments in a request and more or less ignore bios.  Something
> > is horribly wrong in the design.
> 
> Can you explain a little more? I'm more than happy to change that but
> I'm not completely clear how...
> 
> Before we begin a data transfer, we need to set our own iterator that
> will advance with the progression of the data transfer. We also need to
> keep in mind that all the data transfer (both send and recv) are
> completely non blocking (and zero-copy when we send).
> 
> That means that every data movement needs to be able to suspend
> and resume asynchronously. i.e. we cannot use the following pattern:
> rq_for_each_segment(bvec, rq, rq_iter) {
>   iov_iter_bvec(&iov_iter, WRITE, &bvec, 1, bvec.bv_len);
>   send(sock, iov_iter);
> }

Not sure I understand the 'blocking' problem in this case.

We can build a bvec table from this req, and send them all
in send(), can this way avoid your blocking issue? You may see this
example in branch 'rq->bio != rq->biotail' of lo_rw_aio().

If this way is what you need, I think you are right, even we may
introduce the following helpers:

rq_for_each_bvec()
rq_bvecs()

So looks nvme-tcp host driver might be the 2nd driver which benefits
from multi-page bvec directly.

The multi-page bvec V11 has passed my tests and addressed almost
all the comments during review on V10. I removed bio_vecs() in V11,
but it won't be big deal, we can introduce them anytime when there
is the requirement.

Thanks,
Ming
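
A sketch of the table-building approach described above, modeled on
the rq->bio != rq->biotail branch of lo_rw_aio(); note that
rq_for_each_bvec() is only proposed here, and the surrounding names
are illustrative:

static int prep_rq_table(struct request *rq, struct iov_iter *iter)
{
	struct req_iterator rq_iter;
	struct bio_vec tmp, *bvec, *b;
	unsigned nr_bvec = 0;

	rq_for_each_bvec(tmp, rq, rq_iter)	/* proposed helper */
		nr_bvec++;

	bvec = kmalloc_array(nr_bvec, sizeof(*bvec), GFP_NOIO);
	if (!bvec)
		return -ENOMEM;

	b = bvec;
	rq_for_each_bvec(tmp, rq, rq_iter)
		*b++ = tmp;	/* flatten all bios into one table */

	iov_iter_bvec(iter, WRITE, bvec, nr_bvec, blk_rq_bytes(rq));
	/* hand 'iter' to one non-blocking send(); free 'bvec' when the
	 * transfer completes */
	return 0;
}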



[Cluster-devel] FYI: dlm.service possibly silenty overwritten with DisplayLink driver installer

2018-11-20 Thread Jan Pokorný
Accidentally, when searching for something systemd related, dlm.service
caught my eye, and surprisingly, it was in the context of HW support
and Linux SW enablement.  Briefly looking into the Ubuntu driver that
allegedly contained that file (or a recipe to create it, actually), I
realized the respective installer would simply overwrite the regular
cluster DLM related service file without any hesitation
(unless I miss something).

So take this as a heads-up to watch out for that circumstance, three
letter acronyms are apparently not very namespace-collision-proof.

I logged this with their "Feature Suggestion" asking for more
carefulness:
https://support.displaylink.com/forums/287786-displaylink-feature-suggestions/suggestions/36068896

-- 
Jan (Poki)




Re: [Cluster-devel] [PATCH V10 09/19] block: introduce bio_bvecs()

2018-11-20 Thread Sagi Grimberg




The only user in your final tree seems to be the loop driver, and
even that one only uses the helper for read/write bios.

I think something like this would be much simpler in the end:


The recently submitted nvme-tcp host driver should also be a user
of this. Does it make sense to keep it as a helper then?


I did take a brief look at the code, and I really don't understand
why the heck it even deals with bios to start with.  Like all the
other nvme transports it is a blk-mq driver and should iterate
over segments in a request and more or less ignore bios.  Something
is horribly wrong in the design.


Can you explain a little more? I'm more than happy to change that but
I'm not completely clear how...

Before we begin a data transfer, we need to set our own iterator that
will advance with the progression of the data transfer. We also need to
keep in mind that all the data transfer (both send and recv) are
completely non blocking (and zero-copy when we send).

That means that every data movement needs to be able to suspend
and resume asynchronously. i.e. we cannot use the following pattern:
rq_for_each_segment(bvec, rq, rq_iter) {
	iov_iter_bvec(&iov_iter, WRITE, &bvec, 1, bvec.bv_len);
	send(sock, iov_iter);
}

Given that a request can hold more than a single bio, I'm not clear on
how we can achieve that without iterating over the bios in the request
ourselves.

Any useful insight?
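
For the suspend/resume requirement, a rough sketch of per-request
iterator state that can cross bio boundaries; every name here is
illustrative, not from the nvme-tcp patches:

struct xfer_state {
	struct bio *bio;	/* bio currently being transferred */
	struct bvec_iter iter;	/* progress within that bio */
};

/* advance by 'sent' bytes; false once the whole request is consumed */
static bool xfer_advance(struct xfer_state *x, unsigned sent)
{
	bio_advance_iter(x->bio, &x->iter, sent);
	if (x->iter.bi_size)
		return true;	/* more data left in this bio */

	x->bio = x->bio->bi_next;	/* cross into the next bio... */
	if (!x->bio)
		return false;

	x->iter = x->bio->bi_iter;	/* ...restarting its iterator */
	return true;
}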



Re: [Cluster-devel] [PATCH 0/2] gfs2: improvements to recovery and withdraw process

2018-11-20 Thread Steven Whitehouse

Hi,


On 08/11/18 20:25, Bob Peterson wrote:

Hi,

This is a first draft of a two-patch set to fix some of the nasty
journal recovery problems I've found lately.

The problems have to do with file system corruption caused when recovery
replays a journal after the resource group blocks have been unlocked
by the recovery process. In other words, when no cluster node takes
responsibility to replay the journal of a withdrawing node, then it
gets replayed later on, after the blocks contents have been changed.

The first patch prevents gfs2 from attempting recovery if the file system
is withdrawn or has journal IO errors. Trying to recover your own journal
from either of these unstable conditions is dangerous and likely to corrupt
the file system.

That sounds sensible to me.
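
In rough pseudo-C, the guard being described amounts to something like
this (a sketch only; SDF_JOURNAL_IO_ERROR is a made-up name standing
in for whatever flag the patch actually tests):

static bool gfs2_recovery_safe(struct gfs2_sbd *sdp)
{
	/* replaying our own journal while withdrawn or after journal
	 * IO errors is likely to corrupt the fs, so refuse */
	if (test_bit(SDF_SHUTDOWN, &sdp->sd_flags) ||
	    test_bit(SDF_JOURNAL_IO_ERROR, &sdp->sd_flags))
		return false;
	return true;
}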


The second patch is more extensive. When a node withdraws from a file system
it first empties out all outstanding pages in the ail lists, then it
How are we doing this? Since the disk can no longer be written to, there 
are two cases we need to cover. One is for dirty but not yet written 
pages. The other is for pages in flight - these will need to either time
out or complete somehow.



signals all other nodes with the file system mounted to perform recovery
on its journal since it cannot safely recover its own journal. This is
accomplished by a new non-disk callback glop used exclusively by the
"live" glock, which sets up an lvb in the glock to indicate which
journal(s) need to be replayed. This system makes it necessary to prevent
recursion, since the journal operations themselves (i.e. the ones that
empty out the ail list on withdraw) can also withdraw. Thus, the withdraw
We should ignore any further I/O errors after we have withdrawn I think, 
since we know that no further disk writes can take place anyway. These 
will be completed as EIO by dm. As you say we definitely don't want the 
node that is withdrawing replaying its own journal. That should be done 
by the remaining nodes in the cluster.


The other question is whether we should just use the "normal" recovery
process which would fence the withdrawn node, or whether we should have a
different system which avoids the fencing, since we have effectively 
self-fenced from the storage. Looking at the patch I assume that perhaps 
this implements the latter?


Steve.



system is now separated into "journal" and "non-journal" withdraws.
Also, the "withdraw" flag is now replaced by a superblock bit because
once the file system withdraws in this way, it needs to remember that from
that point on.

Regards,

Bob Peterson
---
Bob Peterson (2):
   gfs2: Ignore recovery attempts if gfs2 has io error or is withdrawn
   gfs2: initiate journal recovery as soon as a node withdraws

  fs/gfs2/glock.c    |  5 ++-
  fs/gfs2/glops.c    | 47 +++
  fs/gfs2/incore.h   |  3 ++
  fs/gfs2/lock_dlm.c | 95 ++
  fs/gfs2/log.c      | 62 --
  fs/gfs2/super.c    |  5 ++-
  fs/gfs2/super.h    |  1 +
  fs/gfs2/util.c     | 84 
  fs/gfs2/util.h     | 13 +++
  9 files changed, 282 insertions(+), 33 deletions(-)





Re: [Cluster-devel] [PATCH V10 09/19] block: introduce bio_bvecs()

2018-11-20 Thread Christoph Hellwig
On Mon, Nov 19, 2018 at 04:49:27PM -0800, Sagi Grimberg wrote:
>
>> The only user in your final tree seems to be the loop driver, and
>> even that one only uses the helper for read/write bios.
>>
>> I think something like this would be much simpler in the end:
>
> The recently submitted nvme-tcp host driver should also be a user
> of this. Does it make sense to keep it as a helper then?

I did take a brief look at the code, and I really don't understand
why the heck it even deals with bios to start with.  Like all the
other nvme transports it is a blk-mq driver and should iterate
over segments in a request and more or less ignore bios.  Something
is horribly wrong in the design.



Re: [Cluster-devel] [PATCH 03/13] GFS2: Eliminate a goto in finish_xmote

2018-11-20 Thread Steven Whitehouse

Hi,


On 19/11/18 21:26, Bob Peterson wrote:

Hi,

- Original Message -


On 19/11/18 13:29, Bob Peterson wrote:

This is another baby step toward a better glock state machine.
This patch eliminates a goto in function finish_xmote so we can
begin unraveling the cryptic logic with later patches.

Signed-off-by: Bob Peterson 
---
   fs/gfs2/glock.c | 11 +--
   1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 5f2156f15f05..6e9d53583b73 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -472,11 +472,11 @@ static void finish_xmote(struct gfs2_glock *gl, unsigned int ret)
 	list_move_tail(&gh->gh_list, &gl->gl_holders);
gh = find_first_waiter(gl);
gl->gl_target = gh->gh_state;
-   goto retry;
-   }
-   /* Some error or failed "try lock" - report it */
-   if ((ret & LM_OUT_ERROR) ||
-   (gh->gh_flags & (LM_FLAG_TRY | LM_FLAG_TRY_1CB))) {
+   state = LM_ST_UNLOCKED;

I'm not sure what you are trying to achieve here, but setting the state
to LM_ST_UNLOCKED when it is quite possible that it is not that state,
doesn't really seem to improve anything. Indeed, it looks more confusing
to me, at least it was fairly clear before that the intent was to retry
the operation which has been canceled.

When finish_xmote hits this affected section of code, it's because the dlm
returned a state different from the intended state. Before this patch, it
did "goto retry" which jumps to the label inside the switch state that
handles LM_ST_UNLOCKED, after which it simply unlocks and returns.

Changing local variable "state" merely forces the code to take the same
codepath in which it calls do_xmote, unlocking and returning as it does today,
but without the goto. This makes the function more suitable to the new
autonomous state machine, which is added in a later patch.

The addition of "else if" is needed so it doesn't go down the wrong code path
at the comment: /* Some error or failed "try lock" - report it */
The logic is a bit tricky here, but is preserved from the original.

Most of the subsequent patches aren't quite as mind-bending, I promise. :)

Regards,

Bob Peterson
I can see that it is doing the same thing as before, but it is less 
clear why. The point about the retry label is that it is telling us what
it is going to do. Setting the state to LM_ST_UNLOCKED is more confusing,
because the state might not be LM_ST_UNLOCKED at this point, and you are 
now forcing that state in order to get the same code path as before. 
There is no real advantage compared with the previous code that I can 
see, except that it is more confusing,


Steve.



Re: [Cluster-devel] [PATCH 02/13] GFS2: Make do_xmote determine its own gh parameter

2018-11-20 Thread Steven Whitehouse

Hi,


On 19/11/18 21:06, Bob Peterson wrote:

Hi Steve,

- Original Message -


On 19/11/18 13:29, Bob Peterson wrote:

This is another baby step toward a better glock state machine.
Before this patch, do_xmote was called with a gh parameter, but
only for promotes, not demotes. This patch allows do_xmote to
determine the gh autonomously.

Signed-off-by: Bob Peterson 

(snip)


Since gh is apparently only used to get the lock flags, it would make
more sense just to pass the lock flags rather than add in an additional
find_first_waiter() call,

Steve.

Perhaps I didn't put enough info into the comments for this patch.

I need to get rid of the gh parameter in order to make the glock
state machine fully autonomous. In other words, function do_xmote will
become a state in the (stand alone) state machine, which itself does not
require a gh parameter and may be called from several places under
several conditions. The state of the glock will determine that it needs
to call do_xmote, but do_xmote needs to figure it out on its own.
A function can't become a state in this sense. The state in this case is 
the content of struct gfs2_glock, and the functions define how you get
from one state to another,



Before this patch, the caller does indeed know the gh pointer, but in
the future, it will replaced by a generic call to the state machine
which will not know it.

Regards,

Bob Peterson


That is not relevant to the point I was making though. The point is that 
if the flags are passed to do_xmote rather than the gh, then that 
resolves the issue of needing to pass the gh and reduces the amount of 
code, since you can pass 0 flags instead of NULL gh,


Steve.
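
In diff form, the suggestion amounts to something like this sketch
(not a submitted patch; the parameter name is illustrative):

-static void do_xmote(struct gfs2_glock *gl, struct gfs2_holder *gh,
-		     unsigned int target)
+static void do_xmote(struct gfs2_glock *gl, unsigned int gh_flags,
+		     unsigned int target)

so callers that previously passed a NULL gh can pass 0 flags instead.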



Re: [Cluster-devel] gfs2: Remove vestigial bd_ops (version 2)

2018-11-20 Thread Steven Whitehouse




On 20/11/18 14:37, Bob Peterson wrote:

Hi,

Here is a new and improved version of the patch I posted on
16 November. Since the field is no longer needed, neither are
the function parameters used to allocate a bd.
---
Field bd_ops was set but never used, so I removed it, and all
code supporting it.

Signed-off-by: Bob Peterson 

Acked-by: Steven Whitehouse 

Steve.


---
  fs/gfs2/incore.h | 1 -
  fs/gfs2/log.c    | 1 -
  fs/gfs2/trans.c  | 8 +++-
  3 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 888b62cfd6d1..663759abe60d 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -165,7 +165,6 @@ struct gfs2_bufdata {
u64 bd_blkno;
  
  	struct list_head bd_list;

-   const struct gfs2_log_operations *bd_ops;
  
  	struct gfs2_trans *bd_tr;

struct list_head bd_ail_st_list;
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 4dcd2b48189e..5bfaf381921a 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -605,7 +605,6 @@ void gfs2_add_revoke(struct gfs2_sbd *sdp, struct gfs2_bufdata *bd)
bd->bd_blkno = bh->b_blocknr;
gfs2_remove_from_ail(bd); /* drops ref on bh */
bd->bd_bh = NULL;
-   bd->bd_ops = &gfs2_revoke_lops;
sdp->sd_log_num_revoke++;
	atomic_inc(&gl->gl_revokes);
	set_bit(GLF_LFLUSH, &gl->gl_flags);
diff --git a/fs/gfs2/trans.c b/fs/gfs2/trans.c
index 423bc2d03dd8..cd9a94a6b5bb 100644
--- a/fs/gfs2/trans.c
+++ b/fs/gfs2/trans.c
@@ -124,15 +124,13 @@ void gfs2_trans_end(struct gfs2_sbd *sdp)
  }
  
 static struct gfs2_bufdata *gfs2_alloc_bufdata(struct gfs2_glock *gl,
-					       struct buffer_head *bh,
-					       const struct gfs2_log_operations *lops)
+					       struct buffer_head *bh)
  {
struct gfs2_bufdata *bd;
  
  	bd = kmem_cache_zalloc(gfs2_bufdata_cachep, GFP_NOFS | __GFP_NOFAIL);

bd->bd_bh = bh;
bd->bd_gl = gl;
-   bd->bd_ops = lops;
	INIT_LIST_HEAD(&bd->bd_list);
bh->b_private = bd;
return bd;
@@ -169,7 +167,7 @@ void gfs2_trans_add_data(struct gfs2_glock *gl, struct buffer_head *bh)
gfs2_log_unlock(sdp);
unlock_buffer(bh);
if (bh->b_private == NULL)
-   bd = gfs2_alloc_bufdata(gl, bh, &gfs2_databuf_lops);
+   bd = gfs2_alloc_bufdata(gl, bh);
else
bd = bh->b_private;
lock_buffer(bh);
@@ -210,7 +208,7 @@ void gfs2_trans_add_meta(struct gfs2_glock *gl, struct buffer_head *bh)
unlock_buffer(bh);
lock_page(bh->b_page);
if (bh->b_private == NULL)
-   bd = gfs2_alloc_bufdata(gl, bh, &gfs2_buf_lops);
+   bd = gfs2_alloc_bufdata(gl, bh);
else
bd = bh->b_private;
unlock_page(bh->b_page);





[Cluster-devel] gfs2: Remove vestigial bd_ops (version 2)

2018-11-20 Thread Bob Peterson
Hi,

Here is a new and improved version of the patch I posted on
16 November. Since the field is no longer needed, neither are
the function parameters used to allocate a bd.
---
Field bd_ops was set but never used, so I removed it, and all
code supporting it.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/incore.h | 1 -
 fs/gfs2/log.c    | 1 -
 fs/gfs2/trans.c  | 8 +++-
 3 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 888b62cfd6d1..663759abe60d 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -165,7 +165,6 @@ struct gfs2_bufdata {
u64 bd_blkno;
 
struct list_head bd_list;
-   const struct gfs2_log_operations *bd_ops;
 
struct gfs2_trans *bd_tr;
struct list_head bd_ail_st_list;
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 4dcd2b48189e..5bfaf381921a 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -605,7 +605,6 @@ void gfs2_add_revoke(struct gfs2_sbd *sdp, struct gfs2_bufdata *bd)
bd->bd_blkno = bh->b_blocknr;
gfs2_remove_from_ail(bd); /* drops ref on bh */
bd->bd_bh = NULL;
-   bd->bd_ops = &gfs2_revoke_lops;
sdp->sd_log_num_revoke++;
	atomic_inc(&gl->gl_revokes);
	set_bit(GLF_LFLUSH, &gl->gl_flags);
diff --git a/fs/gfs2/trans.c b/fs/gfs2/trans.c
index 423bc2d03dd8..cd9a94a6b5bb 100644
--- a/fs/gfs2/trans.c
+++ b/fs/gfs2/trans.c
@@ -124,15 +124,13 @@ void gfs2_trans_end(struct gfs2_sbd *sdp)
 }
 
 static struct gfs2_bufdata *gfs2_alloc_bufdata(struct gfs2_glock *gl,
-					       struct buffer_head *bh,
-					       const struct gfs2_log_operations *lops)
+					       struct buffer_head *bh)
 {
struct gfs2_bufdata *bd;
 
bd = kmem_cache_zalloc(gfs2_bufdata_cachep, GFP_NOFS | __GFP_NOFAIL);
bd->bd_bh = bh;
bd->bd_gl = gl;
-   bd->bd_ops = lops;
	INIT_LIST_HEAD(&bd->bd_list);
bh->b_private = bd;
return bd;
@@ -169,7 +167,7 @@ void gfs2_trans_add_data(struct gfs2_glock *gl, struct buffer_head *bh)
gfs2_log_unlock(sdp);
unlock_buffer(bh);
if (bh->b_private == NULL)
-   bd = gfs2_alloc_bufdata(gl, bh, &gfs2_databuf_lops);
+   bd = gfs2_alloc_bufdata(gl, bh);
else
bd = bh->b_private;
lock_buffer(bh);
@@ -210,7 +208,7 @@ void gfs2_trans_add_meta(struct gfs2_glock *gl, struct buffer_head *bh)
unlock_buffer(bh);
lock_page(bh->b_page);
if (bh->b_private == NULL)
-   bd = gfs2_alloc_bufdata(gl, bh, &gfs2_buf_lops);
+   bd = gfs2_alloc_bufdata(gl, bh);
else
bd = bh->b_private;
unlock_page(bh->b_page);



Re: [Cluster-devel] [PATCH V10 09/19] block: introduce bio_bvecs()

2018-11-20 Thread Sagi Grimberg




The only user in your final tree seems to be the loop driver, and
even that one only uses the helper for read/write bios.

I think something like this would be much simpler in the end:


The recently submitted nvme-tcp host driver should also be a user
of this. Does it make sense to keep it as a helper then?



Re: [Cluster-devel] [PATCH V10 15/19] block: always define BIO_MAX_PAGES as 256

2018-11-20 Thread Huang, Ying
Ming Lei  writes:

> On Thu, Nov 15, 2018 at 05:59:36PM -0800, Omar Sandoval wrote:
>> On Thu, Nov 15, 2018 at 04:53:02PM +0800, Ming Lei wrote:
>> > Now multi-page bvec can cover CONFIG_THP_SWAP, so we don't need to
>> > increase BIO_MAX_PAGES for it.
>> 
>> You mentioned it in the cover letter, but this needs more explanation
>> in the commit message. Why did CONFIG_THP_SWAP require > 256? Why does
>> multipage bvecs remove that requirement?
>
> CONFIG_THP_SWAP needs to split one THP into normal pages and add
> them all to one bio. With multipage-bvec, it just takes one bvec to
> hold them all.

Yes.  CONFIG_THP_SWAP needs to put 512 normal sub-pages into one bio to
write the 512 sub-pages together.  With the help of multipage-bvec, it
needs just one bvec to hold the 512 normal sub-pages.

Best Regards,
Huang, Ying

> thanks,
> Ming
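
For concreteness, the arithmetic above on x86_64, as a sketch
(thp_head_page stands in for the real THP page pointer):

/*
 * One 2MB THP = 512 x 4KB sub-pages. Before multi-page bvecs, each
 * sub-page took one bvec, which is why THP_SWAP pushed BIO_MAX_PAGES
 * up to 512. With multi-page bvecs, the whole physically contiguous
 * THP fits in a single bvec:
 */
struct bio_vec bv = {
	.bv_page   = thp_head_page,	/* hypothetical: THP's first page */
	.bv_offset = 0,
	.bv_len    = HPAGE_PMD_SIZE,	/* 512 * PAGE_SIZE == 2MB */
};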