Re: [dm-devel] [PATCH v2] dm: Check for device sector overflow if CONFIG_LBDAF is not set

2018-11-15 Thread Mikulas Patocka



On Wed, 7 Nov 2018, Milan Broz wrote:

> A reference to a device in a device-mapper table contains an offset in sectors.
> 
> If sector_t is a 32-bit integer (CONFIG_LBDAF is not set), then
> several device-mapper targets can overflow this offset; the validity
> check is then performed on a wrong offset and a wrong table is activated.
> 
> See for example (on 32bit without CONFIG_LBDAF) this overflow:
> 
>   # dmsetup create test --table "0 2048 linear /dev/sdg 4294967297"
>   # dmsetup table test
>   0 2048 linear 8:96 1
> 
> This patch adds an explicit check for overflow when the offset is stored in a sector_t.
> 
> Signed-off-by: Milan Broz 

Reviewed-by: Mikulas Patocka 
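
As a concrete illustration (not part of the patch), a minimal userspace
program reproducing the truncation from the dmsetup example above, where
the 64-bit offset 4294967297 (2^32 + 1) silently becomes 1 in a 32-bit
sector_t:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        unsigned long long offset = 4294967297ULL; /* 2^32 + 1 */
        uint32_t sector = (uint32_t)offset;        /* 32-bit sector_t */

        printf("%u\n", sector);      /* prints 1, matching "dmsetup table" */
        printf("%d\n", offset != (unsigned long long)sector); /* prints 1 */
        return 0;
    }

The added "tmpll != (sector_t)tmpll" test is exactly this cast-and-compare,
so the table is rejected instead of being activated with a wrapped offset.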

> ---
>  drivers/md/dm-crypt.c| 2 +-
>  drivers/md/dm-delay.c| 2 +-
>  drivers/md/dm-flakey.c   | 2 +-
>  drivers/md/dm-linear.c   | 2 +-
>  drivers/md/dm-raid1.c| 3 ++-
>  drivers/md/dm-unstripe.c | 2 +-
>  6 files changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index 49be7a6a2e81..a41fe7975dc6 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -2781,7 +2781,7 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
>   }
>  
>   ret = -EINVAL;
> - if (sscanf(argv[4], "%llu%c", &tmpll, &dummy) != 1) {
> + if (sscanf(argv[4], "%llu%c", &tmpll, &dummy) != 1 || tmpll != (sector_t)tmpll) {
>   ti->error = "Invalid device sector";
>   goto bad;
>   }
> diff --git a/drivers/md/dm-delay.c b/drivers/md/dm-delay.c
> index 2fb7bb4304ad..fddffe251bf6 100644
> --- a/drivers/md/dm-delay.c
> +++ b/drivers/md/dm-delay.c
> @@ -141,7 +141,7 @@ static int delay_class_ctr(struct dm_target *ti, struct delay_class *c, char **a
>   unsigned long long tmpll;
>   char dummy;
>  
> - if (sscanf(argv[1], "%llu%c", &tmpll, &dummy) != 1) {
> + if (sscanf(argv[1], "%llu%c", &tmpll, &dummy) != 1 || tmpll != (sector_t)tmpll) {
>   ti->error = "Invalid device sector";
>   return -EINVAL;
>   }
> diff --git a/drivers/md/dm-flakey.c b/drivers/md/dm-flakey.c
> index 3cb97fa4c11d..8261aa8c7fe1 100644
> --- a/drivers/md/dm-flakey.c
> +++ b/drivers/md/dm-flakey.c
> @@ -213,7 +213,7 @@ static int flakey_ctr(struct dm_target *ti, unsigned int argc, char **argv)
> devname = dm_shift_arg(&as);
>  
>   r = -EINVAL;
> - if (sscanf(dm_shift_arg(&as), "%llu%c", &tmpll, &dummy) != 1) {
> + if (sscanf(dm_shift_arg(&as), "%llu%c", &tmpll, &dummy) != 1 || tmpll != (sector_t)tmpll) {
>   ti->error = "Invalid device sector";
>   goto bad;
>   }
> diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c
> index 8d7ddee6ac4d..ad980a38fb1e 100644
> --- a/drivers/md/dm-linear.c
> +++ b/drivers/md/dm-linear.c
> @@ -45,7 +45,7 @@ static int linear_ctr(struct dm_target *ti, unsigned int argc, char **argv)
>   }
>  
>   ret = -EINVAL;
> - if (sscanf(argv[1], "%llu%c", &tmp, &dummy) != 1) {
> + if (sscanf(argv[1], "%llu%c", &tmp, &dummy) != 1 || tmp != (sector_t)tmp) {
>   ti->error = "Invalid device sector";
>   goto bad;
>   }
> diff --git a/drivers/md/dm-raid1.c b/drivers/md/dm-raid1.c
> index 79eab1071ec2..5a51151f680d 100644
> --- a/drivers/md/dm-raid1.c
> +++ b/drivers/md/dm-raid1.c
> @@ -943,7 +943,8 @@ static int get_mirror(struct mirror_set *ms, struct dm_target *ti,
>   char dummy;
>   int ret;
>  
> - if (sscanf(argv[1], "%llu%c", &offset, &dummy) != 1) {
> + if (sscanf(argv[1], "%llu%c", &offset, &dummy) != 1 ||
> + offset != (sector_t)offset) {
>   ti->error = "Invalid offset";
>   return -EINVAL;
>   }
> diff --git a/drivers/md/dm-unstripe.c b/drivers/md/dm-unstripe.c
> index 954b7ab4e684..e673dacf6418 100644
> --- a/drivers/md/dm-unstripe.c
> +++ b/drivers/md/dm-unstripe.c
> @@ -78,7 +78,7 @@ static int unstripe_ctr(struct dm_target *ti, unsigned int argc, char **argv)
>   goto err;
>   }
>  
> - if (sscanf(argv[4], "%llu%c", &start, &dummy) != 1) {
> + if (sscanf(argv[4], "%llu%c", &start, &dummy) != 1 || start != (sector_t)start) {
>   ti->error = "Invalid striped device offset";
>   goto err;
>   }
> -- 
> 2.19.1
> 



[dm-devel] [PATCH] nvme: allow ANA support to be independent of native multipathing

2018-11-15 Thread Mike Snitzer
Whether or not ANA is present is a choice of the target implementation;
the host (and whether it supports multipathing) has _zero_ influence on
this.  If the target declares a path as 'inaccessible' the path _is_
inaccessible to the host.  As such, ANA support should be functional
even if native multipathing is not.

Introduce ability to always re-read ANA log page as required due to ANA
error and make current ANA state available via sysfs -- even if native
multipathing is disabled on the host (e.g. nvme_core.multipath=N).

This affords userspace access to the current ANA state independent of
which layer might be doing multipathing.  It also allows multipath-tools
to rely on the NVMe driver for ANA support while dm-multipath takes care
of multipathing.

While implementing these changes care was taken to preserve the exact
ANA functionality and code sequence native multipathing has provided.
This manifests as native multipathing's nvme_failover_req() being
tweaked to call __nvme_update_ana() which was factored out to allow
nvme_update_ana() to be called independent of nvme_failover_req().

And as always, if embedded NVMe users do not want any performance
overhead associated with ANA or native NVMe multipathing they can
disable CONFIG_NVME_MULTIPATH.
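
For example, userspace can now poll the per-path state with something as
simple as the sketch below; the sysfs path is illustrative (it depends on
how the namespace's block device is named) and the program is not part of
this patch:

    #include <stdio.h>

    int main(void)
    {
        char state[32];
        /* illustrative path to the ana_state attribute */
        FILE *f = fopen("/sys/block/nvme0c0n1/ana_state", "r");

        if (f && fgets(state, sizeof(state), f))
            printf("ANA state: %s", state); /* e.g. "optimized" */
        if (f)
            fclose(f);
        return 0;
    }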

Signed-off-by: Mike Snitzer 
---
 drivers/nvme/host/core.c  | 10 +
 drivers/nvme/host/multipath.c | 49 +--
 drivers/nvme/host/nvme.h  |  4 
 3 files changed, 48 insertions(+), 15 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index fe957166c4a9..3df607905628 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -255,10 +255,12 @@ void nvme_complete_rq(struct request *req)
nvme_req(req)->ctrl->comp_seen = true;
 
if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) {
-   if ((req->cmd_flags & REQ_NVME_MPATH) &&
-   blk_path_error(status)) {
-   nvme_failover_req(req);
-   return;
+   if (blk_path_error(status)) {
+   if (req->cmd_flags & REQ_NVME_MPATH) {
+   nvme_failover_req(req);
+   return;
+   }
+   nvme_update_ana(req);
}
 
if (!blk_queue_dying(req->q)) {
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 8e03cda770c5..0adbcff5fba2 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -22,7 +22,7 @@ MODULE_PARM_DESC(multipath,
 
 inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)
 {
-   return multipath && ctrl->subsys && (ctrl->subsys->cmic & (1 << 3));
+   return ctrl->subsys && (ctrl->subsys->cmic & (1 << 3));
 }
 
 /*
@@ -47,6 +47,35 @@ void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns,
}
 }
 
+static bool nvme_ana_error(u16 status)
+{
+   switch (status & 0x7ff) {
+   case NVME_SC_ANA_TRANSITION:
+   case NVME_SC_ANA_INACCESSIBLE:
+   case NVME_SC_ANA_PERSISTENT_LOSS:
+   return true;
+   }
+   return false;
+}
+
+static void __nvme_update_ana(struct nvme_ns *ns)
+{
+   if (!ns->ctrl->ana_log_buf)
+   return;
+
+   set_bit(NVME_NS_ANA_PENDING, &ns->flags);
+   queue_work(nvme_wq, &ns->ctrl->ana_work);
+}
+
+void nvme_update_ana(struct request *req)
+{
+   struct nvme_ns *ns = req->q->queuedata;
+   u16 status = nvme_req(req)->status;
+
+   if (nvme_ana_error(status))
+   __nvme_update_ana(ns);
+}
+
 void nvme_failover_req(struct request *req)
 {
struct nvme_ns *ns = req->q->queuedata;
@@ -58,25 +87,22 @@ void nvme_failover_req(struct request *req)
spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
blk_mq_end_request(req, 0);
 
-   switch (status & 0x7ff) {
-   case NVME_SC_ANA_TRANSITION:
-   case NVME_SC_ANA_INACCESSIBLE:
-   case NVME_SC_ANA_PERSISTENT_LOSS:
+   if (nvme_ana_error(status)) {
/*
 * If we got back an ANA error we know the controller is alive,
 * but not ready to serve this namespaces.  The spec suggests
 * we should update our general state here, but due to the fact
 * that the admin and I/O queues are not serialized that is
 * fundamentally racy.  So instead just clear the current path,
-* mark the the path as pending and kick of a re-read of the ANA
+* mark the path as pending and kick off a re-read of the ANA
 * log page ASAP.
 */
nvme_mpath_clear_current_path(ns);
-   if (ns->ctrl->ana_log_buf) {
-   set_bit(NVME_NS_ANA_PENDING, &ns->flags);
-   queue_work(nvme_wq, &ns->ctrl->ana_work);
-   }
-   break;
+   

Re: [dm-devel] [PATCH V10 03/19] block: use bio_for_each_bvec() to compute multi-page bvec count

2018-11-15 Thread Mike Snitzer
On Thu, Nov 15 2018 at  3:20pm -0500,
Omar Sandoval  wrote:

> On Thu, Nov 15, 2018 at 04:52:50PM +0800, Ming Lei wrote:
> > First it is more efficient to use bio_for_each_bvec() in both
> > blk_bio_segment_split() and __blk_recalc_rq_segments() to compute how
> > many multi-page bvecs there are in the bio.
> > 
> > Secondly once bio_for_each_bvec() is used, the bvec may need to be
> > split because its length can be much longer than the max segment size,
> > so we have to split the big bvec into several segments.
> > 
> > Thirdly when splitting multi-page bvec into segments, the max segment
> > limit may be reached, so the bio split needs to be considered under
> > this situation too.
> > 
> > Cc: Dave Chinner 
> > Cc: Kent Overstreet 
> > Cc: Mike Snitzer 
> > Cc: dm-devel@redhat.com
> > Cc: Alexander Viro 
> > Cc: linux-fsde...@vger.kernel.org
> > Cc: Shaohua Li 
> > Cc: linux-r...@vger.kernel.org
> > Cc: linux-er...@lists.ozlabs.org
> > Cc: David Sterba 
> > Cc: linux-bt...@vger.kernel.org
> > Cc: Darrick J. Wong 
> > Cc: linux-...@vger.kernel.org
> > Cc: Gao Xiang 
> > Cc: Christoph Hellwig 
> > Cc: Theodore Ts'o 
> > Cc: linux-e...@vger.kernel.org
> > Cc: Coly Li 
> > Cc: linux-bca...@vger.kernel.org
> > Cc: Boaz Harrosh 
> > Cc: Bob Peterson 
> > Cc: cluster-de...@redhat.com
> > Signed-off-by: Ming Lei 
> > ---
> >  block/blk-merge.c | 90 ++-
> >  1 file changed, 76 insertions(+), 14 deletions(-)
> > 
> > diff --git a/block/blk-merge.c b/block/blk-merge.c
> > index 91b2af332a84..6f7deb94a23f 100644
> > --- a/block/blk-merge.c
> > +++ b/block/blk-merge.c
> > @@ -160,6 +160,62 @@ static inline unsigned get_max_io_size(struct request_queue *q,
> > return sectors;
> >  }
> >  
> > +/*
> > + * Split the bvec @bv into segments, and update all kinds of
> > + * variables.
> > + */
> > +static bool bvec_split_segs(struct request_queue *q, struct bio_vec *bv,
> > +   unsigned *nsegs, unsigned *last_seg_size,
> > +   unsigned *front_seg_size, unsigned *sectors)
> > +{
> > +   bool need_split = false;
> > +   unsigned len = bv->bv_len;
> > +   unsigned total_len = 0;
> > +   unsigned new_nsegs = 0, seg_size = 0;
> 
> "unsigned int" here and everywhere else.

Curious why?  I've wondered what governs the use of "unsigned" vs "unsigned
int" recently and haven't found _the_ reason to pick one over the other.

Thanks,
Mike



[dm-devel] [PATCH 0/3] device mapper percpu counter patches

2018-11-15 Thread Mikulas Patocka
Hi

These are the device mapper percpu counter patches.
They are on the top of
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.21

Mikulas



[dm-devel] [PATCH 1/3] dm: move dm_stats_account_io before generic_end_io_acct

2018-11-15 Thread Mikulas Patocka
Make sure that the statistics are not updated while the device is
suspended: move the statistics update before generic_end_io_acct, so it
finishes before the in-flight count is dropped and anyone waiting on the
suspend queue is woken.

Signed-off-by: Mikulas Patocka 

---
 drivers/md/dm.c |   10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

Index: linux-dm/drivers/md/dm.c
===
--- linux-dm.orig/drivers/md/dm.c   2018-11-15 21:56:36.0 +0100
+++ linux-dm/drivers/md/dm.c 2018-11-15 21:56:36.0 +0100
@@ -674,6 +674,11 @@ static void end_io_acct(struct dm_io *io
struct bio *bio = io->orig_bio;
unsigned long duration = jiffies - io->start_time;
 
+   if (unlikely(dm_stats_used(&md->stats)))
+   dm_stats_account_io(&md->stats, bio_data_dir(bio),
+   bio->bi_iter.bi_sector, bio_sectors(bio),
+   true, duration, &io->stats_aux);
+
/*
 * make sure that atomic_dec in generic_end_io_acct is not reordered
 * with previous writes
@@ -687,11 +692,6 @@ static void end_io_acct(struct dm_io *io
 */
smp_mb__after_atomic();
 
-   if (unlikely(dm_stats_used(&md->stats)))
-   dm_stats_account_io(&md->stats, bio_data_dir(bio),
-   bio->bi_iter.bi_sector, bio_sectors(bio),
-   true, duration, &io->stats_aux);
-
/* nudge anyone waiting on suspend queue */
if (unlikely(waitqueue_active(&md->wait))) {
if (!md_in_flight(md))



[dm-devel] [PATCH 3/3] block: use a driver-specific handler for the "inflight" value

2018-11-15 Thread Mikulas Patocka
Device mapper was converted to percpu inflight counters. In order to
display the correct values in the "inflight" sysfs file and in
/proc/diskstats, we need a custom callback that sums the percpu counters.

The function part_round_stats calculates the number of in-flight I/Os
every jiffy and uses this to calculate the counters time_in_queue and
io_ticks. In order to avoid excessive memory traffic on systems with a high
number of CPUs, this functionality is disabled when percpu inflight values
are used, and the values time_in_queue and io_ticks are calculated
differently - the result is less precise.

We add the duration of an I/O to time_in_queue when the I/O finishes (the
value is almost the same as previously, except for the time of in-flight
I/Os).

If an I/O starts or finishes and the "jiffies" value has changed, we add
one to io_ticks. If the I/Os take less than a jiffy, the value is as exact
as the previous value. If the I/Os take more than a jiffy, the value may
lag behind the previous value.
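
A sketch of how a driver would wire this up: the registration uses the
blk_queue_get_inflight() API added by this patch, while the dm-side
callback shown here is an assumption (the dm.c hunk below is truncated),
inferred from the percpu counters introduced in patch 2/3:

    /* hypothetical dm callback: sum the percpu counters from patch 2/3 */
    static void dm_get_inflight(struct request_queue *q,
                                unsigned int inflight[2])
    {
        struct mapped_device *md = q->queuedata;
        int cpu;

        inflight[READ] = inflight[WRITE] = 0;
        for_each_possible_cpu(cpu) {
            struct dm_percpu *p = per_cpu_ptr(md->counters, cpu);
            inflight[READ]  += p->inflight[READ];
            inflight[WRITE] += p->inflight[WRITE];
        }
    }

    /* during device setup, e.g. in alloc_dev() */
    blk_queue_get_inflight(md->queue, dm_get_inflight);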

Signed-off-by: Mikulas Patocka 

---
 block/blk-core.c   |7 ++-
 block/blk-settings.c   |6 ++
 block/genhd.c  |   12 
 drivers/md/dm.c|   37 +++--
 include/linux/blkdev.h |3 +++
 5 files changed, 62 insertions(+), 3 deletions(-)

Index: linux-dm/block/genhd.c
===
--- linux-dm.orig/block/genhd.c 2018-11-15 22:11:51.0 +0100
+++ linux-dm/block/genhd.c  2018-11-15 22:11:51.0 +0100
@@ -68,6 +68,13 @@ void part_dec_in_flight(struct request_q
 void part_in_flight(struct request_queue *q, struct hd_struct *part,
unsigned int inflight[2])
 {
+   if (q->get_inflight_fn) {
+   q->get_inflight_fn(q, inflight);
+   inflight[0] += inflight[1];
+   inflight[1] = 0;
+   return;
+   }
+
if (q->mq_ops) {
blk_mq_in_flight(q, part, inflight);
return;
@@ -85,6 +92,11 @@ void part_in_flight(struct request_queue
 void part_in_flight_rw(struct request_queue *q, struct hd_struct *part,
   unsigned int inflight[2])
 {
+   if (q->get_inflight_fn) {
+   q->get_inflight_fn(q, inflight);
+   return;
+   }
+
if (q->mq_ops) {
blk_mq_in_flight_rw(q, part, inflight);
return;
Index: linux-dm/include/linux/blkdev.h
===
--- linux-dm.orig/include/linux/blkdev.h 2018-11-15 22:11:51.0 +0100
+++ linux-dm/include/linux/blkdev.h 2018-11-15 22:11:51.0 +0100
@@ -286,6 +286,7 @@ struct blk_queue_ctx;
 
 typedef blk_qc_t (make_request_fn) (struct request_queue *q, struct bio *bio);
 typedef bool (poll_q_fn) (struct request_queue *q, blk_qc_t);
+typedef void (get_inflight_fn)(struct request_queue *, unsigned int [2]);
 
 struct bio_vec;
 typedef int (dma_drain_needed_fn)(struct request *);
@@ -405,6 +406,7 @@ struct request_queue {
make_request_fn *make_request_fn;
poll_q_fn   *poll_fn;
dma_drain_needed_fn *dma_drain_needed;
+   get_inflight_fn *get_inflight_fn;
 
const struct blk_mq_ops *mq_ops;
 
@@ -1099,6 +1101,7 @@ extern void blk_queue_update_dma_alignme
 extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
 extern void blk_queue_flush_queueable(struct request_queue *q, bool queueable);
extern void blk_queue_write_cache(struct request_queue *q, bool enabled, bool fua);
+extern void blk_queue_get_inflight(struct request_queue *, get_inflight_fn *);
 
 /*
  * Number of physical segments as sent to the device.
Index: linux-dm/block/blk-settings.c
===
--- linux-dm.orig/block/blk-settings.c  2018-11-15 22:11:51.0 +0100
+++ linux-dm/block/blk-settings.c   2018-11-15 22:11:51.0 +0100
@@ -849,6 +849,12 @@ void blk_queue_write_cache(struct reques
 }
 EXPORT_SYMBOL_GPL(blk_queue_write_cache);
 
+void blk_queue_get_inflight(struct request_queue *q, get_inflight_fn *fn)
+{
+   q->get_inflight_fn = fn;
+}
+EXPORT_SYMBOL_GPL(blk_queue_get_inflight);
+
 static int __init blk_settings_init(void)
 {
blk_max_low_pfn = max_low_pfn - 1;
Index: linux-dm/drivers/md/dm.c
===
--- linux-dm.orig/drivers/md/dm.c   2018-11-15 22:11:51.0 +0100
+++ linux-dm/drivers/md/dm.c 2018-11-15 22:18:44.0 +0100
@@ -657,18 +657,30 @@ int md_in_flight(struct mapped_device *m
return (int)sum;
 }
 
+static void test_io_ticks(int cpu, struct hd_struct *part, unsigned long now)
+{
+   unsigned long stamp = READ_ONCE(part->stamp);
+   if (unlikely(stamp != now)) {
+   if (likely(cmpxchg(&part->stamp, stamp, now) == stamp)) {
+   

Re: [dm-devel] [patch 0/5] device mapper percpu patches

2018-11-15 Thread Mikulas Patocka



On Wed, 7 Nov 2018, Jens Axboe wrote:

> On 11/7/18 3:47 PM, Mikulas Patocka wrote:
> > 
> > I'd like to know - which kernel part needs to sum the percpu IO counters 
> > frequently?
> > 
> > My impression was that the counters need to be summed only when the user 
> > is reading the files in sysfs and that is not frequent at all.
> 
> part_round_stats() does it on IO completion - only every jiffy, but it's
> enough that previous attempts at percpu inflight counters only worked
> for some cases, and were worse for others.

I see.

I thought about it - part_round_stats() is used to calculate two counters 
- time_in_queue and io_ticks.

time_in_queue can be calculated by adding the duration of the I/O when 
the I/O ends (the value will be the same except for in-progress I/Os).

io_ticks could be approximated - if an I/O is started or finished and the 
"jiffies" value changes, we add 1. This is an approximation, but if the I/Os 
take less than a jiffy, the value will be the same.
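
In code, the approximation could look like the sketch below, which mirrors
the shape of the (truncated) test_io_ticks() helper in patch 3/3 of the
percpu counter series; treat it as illustrative, not the final version:

    static void test_io_ticks(int cpu, struct hd_struct *part,
                              unsigned long now)
    {
        unsigned long stamp = READ_ONCE(part->stamp);

        if (unlikely(stamp != now)) {
            /* only one CPU wins the race to account this jiffy */
            if (likely(cmpxchg(&part->stamp, stamp, now) == stamp))
                __part_stat_add(cpu, part, io_ticks, 1);
        }
    }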


These are the benchmarks for the patches for IOPS on ramdisk (request 
size 512 bytes) and ramdisk with dm-linear attached:

fio --ioengine=psync --iodepth=1 --rw=read --bs=512 --direct=1 --numjobs=12 
--time_based --runtime=10 --group_reporting --name=/dev/ram0

a system with 2 6-core processors:
/dev/ram0   6656445 IOPS
/dev/mapper/linear  2061914 IOPS
/dev/mapper/linear with percpu counters 5500976 IOPS

a system with 1 6-core processor:
/dev/ram0   4019921 IOPS
/dev/mapper/linear  2104687 IOPS
/dev/mapper/linear with percpu counters 3050195 IOPS

a virtual machine (12 virtual cores and 12 physical cores):
/dev/ram0   5304687 IOPS
/dev/mapper/linear  2115234 IOPS
/dev/mapper/linear with percpu counters 4457031 IOPS


My point of view is that we shouldn't degrade I/O throughput just to keep 
the counters accurate, so I suggest switching the counters to a 
less-accurate mode. I'll send patches for that.

Device mapper has a separate dm-stats functionality that can provide 
accurate I/O counters for the whole device and for any range - it is off 
by default, so it doesn't degrade performance if the user doesn't need it.

Mikulas



Re: [dm-devel] [RFC] dm-bow working prototype

2018-11-15 Thread Mikulas Patocka



On Mon, 29 Oct 2018, Paul Lawrence wrote:

> 
> > The snapshot target could be hacked so that it remembers space trimmed
> > with REQ_OP_DISCARD and won't reallocate these blocks.
> > 
> > But I suspect that running discard over the whole device would degrade
> > performance more than copying some unneeded data.
> > 
> > How much data do you intend to backup with this solution?
> > 
> > 
> We are space-constrained - we will have to free up space for the backup before
> we apply the update, so we have to predict the size and keeping usage as low
> as possible is thus very important.
> 
> Also, we've discussed the resizing requirement of the dm-snap solution and
> that part is not attractive at all - it seems it would be impossible to
> guarantee that the resizing happens in a timely fashion during the (very busy)
> update cycle.
> 
> Thanks everyone for the insights, especially into how dm-snap works, which I
> hadn't fully appreciated. At the moment, and for the above reasons, we intend
> to continue with the dm-bow solution, but do want to keep this discussion
> open. If anyone is going to be at Linux Plumbers, I'll be presenting this work
> and would love to chat about it more.

dm-snapshot took 9 years to fix the last data corruption bug (2004-2013 - 
the commit e9c6a182649f4259db704ae15a91ac820e63b0ca).

And with the new target duplicating the snapshot functionality, it may be 
the same.

Mikulas



[dm-devel] [PATCH 2/3] dm: use percpu counters

2018-11-15 Thread Mikulas Patocka
Use percpu inflight counters to avoid cache line bouncing and improve
performance.

Signed-off-by: Mikulas Patocka 

---
 drivers/md/dm-core.h |5 +
 drivers/md/dm.c  |   50 ++
 2 files changed, 39 insertions(+), 16 deletions(-)

Index: linux-dm/drivers/md/dm-core.h
===
--- linux-dm.orig/drivers/md/dm-core.h  2018-11-15 22:06:37.0 +0100
+++ linux-dm/drivers/md/dm-core.h   2018-11-15 22:06:37.0 +0100
@@ -24,6 +24,10 @@ struct dm_kobject_holder {
struct completion completion;
 };
 
+struct dm_percpu {
+   unsigned inflight[2];
+};
+
 /*
  * DM core internal structure that used directly by dm.c and dm-rq.c
 * DM targets must _not_ deference a mapped_device to directly access its members!
@@ -63,6 +67,7 @@ struct mapped_device {
/*
 * A list of ios that arrived while we were suspended.
 */
+   struct dm_percpu __percpu *counters;
struct work_struct work;
wait_queue_head_t wait;
spinlock_t deferred_lock;
Index: linux-dm/drivers/md/dm.c
===
--- linux-dm.orig/drivers/md/dm.c   2018-11-15 22:06:37.0 +0100
+++ linux-dm/drivers/md/dm.c 2018-11-15 22:09:31.0 +0100
@@ -648,19 +648,32 @@ static void free_tio(struct dm_target_io
 
 int md_in_flight(struct mapped_device *md)
 {
-   return atomic_read(&dm_disk(md)->part0.in_flight[READ]) +
-  atomic_read(&dm_disk(md)->part0.in_flight[WRITE]);
+   int cpu;
+   unsigned sum = 0;
+   for_each_possible_cpu(cpu) {
+   struct dm_percpu *p = per_cpu_ptr(md->counters, cpu);
+   sum += p->inflight[READ] + p->inflight[WRITE];
+   }
+   return (int)sum;
 }
 
 static void start_io_acct(struct dm_io *io)
 {
struct mapped_device *md = io->md;
struct bio *bio = io->orig_bio;
+   struct hd_struct *part;
+   int sgrp, cpu;
 
io->start_time = jiffies;
 
-   generic_start_io_acct(md->queue, bio_op(bio), bio_sectors(bio),
- &dm_disk(md)->part0);
+   part = &dm_disk(md)->part0;
+   sgrp = op_stat_group(bio_op(bio));
+   cpu = part_stat_lock();
+   __part_stat_add(cpu, part, ios[sgrp], 1);
+   __part_stat_add(cpu, part, sectors[sgrp], bio_sectors(bio));
+   part_stat_unlock();
+
+   this_cpu_inc(md->counters->inflight[bio_data_dir(bio)]);
 
if (unlikely(dm_stats_used(&md->stats)))
dm_stats_account_io(&md->stats, bio_data_dir(bio),
@@ -673,25 +686,24 @@ static void end_io_acct(struct dm_io *io
struct mapped_device *md = io->md;
struct bio *bio = io->orig_bio;
unsigned long duration = jiffies - io->start_time;
+   struct hd_struct *part;
+   int sgrp, cpu;
 
if (unlikely(dm_stats_used(&md->stats)))
dm_stats_account_io(&md->stats, bio_data_dir(bio),
bio->bi_iter.bi_sector, bio_sectors(bio),
true, duration, &io->stats_aux);
 
-   /*
-* make sure that atomic_dec in generic_end_io_acct is not reordered
-* with previous writes
-*/
-   smp_mb__before_atomic();
-   generic_end_io_acct(md->queue, bio_op(bio), &dm_disk(md)->part0,
-   io->start_time);
-   /*
-* generic_end_io_acct does atomic_dec, this barrier makes sure that
-* atomic_dec is not reordered with waitqueue_active
-*/
-   smp_mb__after_atomic();
+   part = &dm_disk(md)->part0;
+   sgrp = op_stat_group(bio_op(bio));
+   cpu = part_stat_lock();
+   __part_stat_add(cpu, part, nsecs[sgrp], jiffies_to_nsecs(duration));
+   part_stat_unlock();
+
+   smp_wmb();
+   this_cpu_dec(md->counters->inflight[bio_data_dir(bio)]);
 
+   smp_mb();
/* nudge anyone waiting on suspend queue */
if (unlikely(waitqueue_active(&md->wait))) {
if (!md_in_flight(md))
@@ -1822,6 +1834,8 @@ static void cleanup_mapped_device(struct
if (md->queue)
blk_cleanup_queue(md->queue);
 
+   free_percpu(md->counters);
+
cleanup_srcu_struct(&md->io_barrier);
 
if (md->bdev) {
@@ -1892,6 +1906,10 @@ static struct mapped_device *alloc_dev(i
if (!md->disk)
goto bad;
 
+   md->counters = alloc_percpu(struct dm_percpu);
+   if (!md->counters)
+   goto bad;
+
init_waitqueue_head(&md->wait);
INIT_WORK(&md->work, dm_wq_work);
init_waitqueue_head(&md->eventq);



Re: [dm-devel] [PATCH] nvme: allow ANA support to be independent of native multipathing

2018-11-15 Thread Hannes Reinecke

On 11/15/18 6:46 PM, Mike Snitzer wrote:

Whether or not ANA is present is a choice of the target implementation;
the host (and whether it supports multipathing) has _zero_ influence on
this.  If the target declares a path as 'inaccessible' the path _is_
inaccessible to the host.  As such, ANA support should be functional
even if native multipathing is not.

Introduce ability to always re-read ANA log page as required due to ANA
error and make current ANA state available via sysfs -- even if native
multipathing is disabled on the host (e.g. nvme_core.multipath=N).

This affords userspace access to the current ANA state independent of
which layer might be doing multipathing.  It also allows multipath-tools
to rely on the NVMe driver for ANA support while dm-multipath takes care
of multipathing.

While implementing these changes care was taken to preserve the exact
ANA functionality and code sequence native multipathing has provided.
This manifests as native multipathing's nvme_failover_req() being
tweaked to call __nvme_update_ana() which was factored out to allow
nvme_update_ana() to be called independent of nvme_failover_req().

And as always, if embedded NVMe users do not want any performance
overhead associated with ANA or native NVMe multipathing they can
disable CONFIG_NVME_MULTIPATH.

Signed-off-by: Mike Snitzer 
---
  drivers/nvme/host/core.c  | 10 +
  drivers/nvme/host/multipath.c | 49 +--
  drivers/nvme/host/nvme.h  |  4 
  3 files changed, 48 insertions(+), 15 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index fe957166c4a9..3df607905628 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -255,10 +255,12 @@ void nvme_complete_rq(struct request *req)
nvme_req(req)->ctrl->comp_seen = true;
  
  	if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) {

-   if ((req->cmd_flags & REQ_NVME_MPATH) &&
-   blk_path_error(status)) {
-   nvme_failover_req(req);
-   return;
+   if (blk_path_error(status)) {
+   if (req->cmd_flags & REQ_NVME_MPATH) {
+   nvme_failover_req(req);
+   return;
+   }
+   nvme_update_ana(req);
}
  
  		if (!blk_queue_dying(req->q)) {

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 8e03cda770c5..0adbcff5fba2 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -22,7 +22,7 @@ MODULE_PARM_DESC(multipath,
  
  inline bool nvme_ctrl_use_ana(struct nvme_ctrl *ctrl)

  {
-   return multipath && ctrl->subsys && (ctrl->subsys->cmic & (1 << 3));
+   return ctrl->subsys && (ctrl->subsys->cmic & (1 << 3));
  }
  
  /*

@@ -47,6 +47,35 @@ void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns,
}
  }
  
+static bool nvme_ana_error(u16 status)

+{
+   switch (status & 0x7ff) {
+   case NVME_SC_ANA_TRANSITION:
+   case NVME_SC_ANA_INACCESSIBLE:
+   case NVME_SC_ANA_PERSISTENT_LOSS:
+   return true;
+   }
+   return false;
+}
+
+static void __nvme_update_ana(struct nvme_ns *ns)
+{
+   if (!ns->ctrl->ana_log_buf)
+   return;
+
+   set_bit(NVME_NS_ANA_PENDING, &ns->flags);
+   queue_work(nvme_wq, &ns->ctrl->ana_work);
+}
+
+void nvme_update_ana(struct request *req)
+{
+   struct nvme_ns *ns = req->q->queuedata;
+   u16 status = nvme_req(req)->status;
+
+   if (nvme_ana_error(status))
+   __nvme_update_ana(ns);
+}
+
  void nvme_failover_req(struct request *req)
  {
struct nvme_ns *ns = req->q->queuedata;
@@ -58,25 +87,22 @@ void nvme_failover_req(struct request *req)
spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
blk_mq_end_request(req, 0);
  
-	switch (status & 0x7ff) {

-   case NVME_SC_ANA_TRANSITION:
-   case NVME_SC_ANA_INACCESSIBLE:
-   case NVME_SC_ANA_PERSISTENT_LOSS:
+   if (nvme_ana_error(status)) {
/*
 * If we got back an ANA error we know the controller is alive,
 * but not ready to serve this namespaces.  The spec suggests
 * we should update our general state here, but due to the fact
 * that the admin and I/O queues are not serialized that is
 * fundamentally racy.  So instead just clear the current path,
-* mark the the path as pending and kick of a re-read of the ANA
+* mark the path as pending and kick off a re-read of the ANA
 * log page ASAP.
 */
nvme_mpath_clear_current_path(ns);
-   if (ns->ctrl->ana_log_buf) {
-   set_bit(NVME_NS_ANA_PENDING, &ns->flags);
-   queue_work(nvme_wq, &ns->ctrl->ana_work);
-  

[dm-devel] [PATCH V10 00/19] block: support multi-page bvec

2018-11-15 Thread Ming Lei
Hi,

This patchset brings multi-page bvec into block layer:

1) what is multi-page bvec?

Multipage bvecs mean that one 'struct bio_vec' can hold multiple pages
which are physically contiguous, instead of the single page per vector
used in the linux kernel for a long time.

2) why is multi-page bvec introduced?

Kent proposed the idea[1] first. 

As system RAM becomes much bigger than before, and huge pages, transparent
huge pages and memory compaction are widely used, it is now quite easy
to see physically contiguous pages from fs in I/O. On the other hand, from
the block layer's view, it isn't necessary to store intermediate pages in the
bvec; it is enough to just store the physically contiguous 'segment' in each
io vector.

Also, huge pages are being brought to filesystems and swap [2][6], so we can
do IO on a hugepage each time[3], which requires that one bio can transfer
at least one huge page at a time. It turns out it isn't flexible to simply
change BIO_MAX_PAGES[3][5]. Multipage bvec fits this case very well.
As we saw, if CONFIG_THP_SWAP is enabled, BIO_MAX_PAGES can be configured
much bigger, such as 512, which requires at least two 4K pages for holding
the bvec table.

With multi-page bvec:

- Inside block layer, both bio splitting and sg map can become more
efficient than before by just traversing the physically contiguous
'segment' instead of each page.

- segment handling in the block layer can be much improved in future, since it
should be quite easy to convert a multipage bvec into a segment. For
example, we might just store the segment in each bvec directly in future.

- bio size can be increased and it should improve some high-bandwidth IO
case in theory[4].

- there is opportunity in future to improve memory footprint of bvecs. 
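
To make the iterator distinction concrete, a short illustrative sketch
(using the helpers named in this series; the counting is purely for
demonstration):

    static void count_units(struct bio *bio)
    {
        struct bio_vec bv;
        struct bvec_iter iter;
        unsigned pages = 0, bvecs = 0;

        /* unchanged semantics: yields one single-page bvec per page */
        bio_for_each_segment(bv, bio, iter)
            pages++;

        /* new in this series: yields whole physically contiguous
         * multi-page bvecs */
        bio_for_each_bvec(bv, bio, iter)
            bvecs++;

        /* bvecs <= pages; equal only when no pages were merged */
    }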

3) how is multi-page bvec implemented in this patchset?

The patches of 1 ~ 14 implement multipage bvec in block layer:

- put all tricks into bvec/bio/rq iterators, and as far as
drivers and fs use these standard iterators, they are happy
with multipage bvec

- introduce bio_for_each_bvec() to iterate over multipage bvec for
splitting bio and mapping sg

- keep current bio_for_each_segment*() to iterate over singlepage bvec and
make sure current users won't be broken; especially, convert to this
new helper prototype in single patch 21 given it is basically a
mechanical conversion

- deal with iomap & xfs's sub-pagesize io vec in patch 13

- enable multipage bvec in patch 14

Patch 15 redefines BIO_MAX_PAGES as 256.

Patch 16 documents usages of bio iterator helpers.

Patch 17~19 kills NO_SG_MERGE.

These patches can be found in the following git tree:

git:  https://github.com/ming1/linux.git  for-4.21-block-mp-bvec-V10

Lots of tests (blktest, xfstests, ltp io, ...) have been run with this patchset,
and no regression was seen.

Thanks Christoph for reviewing the early version and providing very good
suggestions, such as: introduce bio_init_with_vec_table(), remove another
unnecessary helpers for cleanup and so on.

Any comments are welcome!

V10:
- no any code change, just add more guys and list into patch's CC list,
as suggested by Christoph and Dave Chinner
V9:
- fix regression on iomap's sub-pagesize io vec, covered by patch 13
V8:
- remove prepare patches which all are merged to linus tree
- rebase on for-4.21/block
- address comments on V7
- add patches of killing NO_SG_MERGE

V7:
- include Christoph and Mike's bio_clone_bioset() patches, which is
  actually prepare patches for multipage bvec
- address Christoph's comments

V6:
- avoid introducing lots of renaming, follow Jens' suggestion of
using the name of chunk for multipage io vector
- include Christoph's three prepare patches
- decrease stack usage for using bio_for_each_chunk_segment_all()
- address Kent's comment

V5:
- remove some of prepare patches, which have been merged already
- add bio_clone_seg_bioset() to fix DM's bio clone, which
is introduced by 18a25da84354c6b (dm: ensure bio submission follows
a depth-first tree walk)
- rebase on the latest block for-v4.18

V4:
- rename bio_for_each_segment*() as bio_for_each_page*(), rename
bio_segments() as bio_pages(), rename rq_for_each_segment() as
rq_for_each_pages(), because these helpers never return real
segment, and they always return single page bvec

- introducing segment_for_each_page_all()

- introduce new 
bio_for_each_segment*()/rq_for_each_segment()/bio_segments()
for returning real multipage segment

- rewrite segment_last_page()

- rename bvec iterator helper as suggested by Christoph

- replace comment with applying bio helpers as suggested by Christoph

- document usage of bio iterator helpers

- 

[dm-devel] [PATCH V10 03/19] block: use bio_for_each_bvec() to compute multi-page bvec count

2018-11-15 Thread Ming Lei
First it is more efficient to use bio_for_each_bvec() in both
blk_bio_segment_split() and __blk_recalc_rq_segments() to compute how
many multi-page bvecs there are in the bio.

Secondly once bio_for_each_bvec() is used, the bvec may need to be
split because its length can be much longer than the max segment size,
so we have to split the big bvec into several segments.

Thirdly when splitting multi-page bvec into segments, the max segment
limit may be reached, so the bio split needs to be considered under
this situation too.

Cc: Dave Chinner 
Cc: Kent Overstreet 
Cc: Mike Snitzer 
Cc: dm-devel@redhat.com
Cc: Alexander Viro 
Cc: linux-fsde...@vger.kernel.org
Cc: Shaohua Li 
Cc: linux-r...@vger.kernel.org
Cc: linux-er...@lists.ozlabs.org
Cc: David Sterba 
Cc: linux-bt...@vger.kernel.org
Cc: Darrick J. Wong 
Cc: linux-...@vger.kernel.org
Cc: Gao Xiang 
Cc: Christoph Hellwig 
Cc: Theodore Ts'o 
Cc: linux-e...@vger.kernel.org
Cc: Coly Li 
Cc: linux-bca...@vger.kernel.org
Cc: Boaz Harrosh 
Cc: Bob Peterson 
Cc: cluster-de...@redhat.com
Signed-off-by: Ming Lei 
---
 block/blk-merge.c | 90 ++-
 1 file changed, 76 insertions(+), 14 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 91b2af332a84..6f7deb94a23f 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -160,6 +160,62 @@ static inline unsigned get_max_io_size(struct request_queue *q,
return sectors;
 }
 
+/*
+ * Split the bvec @bv into segments, and update all kinds of
+ * variables.
+ */
+static bool bvec_split_segs(struct request_queue *q, struct bio_vec *bv,
+   unsigned *nsegs, unsigned *last_seg_size,
+   unsigned *front_seg_size, unsigned *sectors)
+{
+   bool need_split = false;
+   unsigned len = bv->bv_len;
+   unsigned total_len = 0;
+   unsigned new_nsegs = 0, seg_size = 0;
+
+   if ((*nsegs >= queue_max_segments(q)) || !len)
+   return need_split;
+
+   /*
+* Multipage bvec may be too big to hold in one segment,
+* so the current bvec has to be splitted as multiple
+* segments.
+*/
+   while (new_nsegs + *nsegs < queue_max_segments(q)) {
+   seg_size = min(queue_max_segment_size(q), len);
+
+   new_nsegs++;
+   total_len += seg_size;
+   len -= seg_size;
+
+   if ((queue_virt_boundary(q) && ((bv->bv_offset +
+   total_len) & queue_virt_boundary(q))) || !len)
+   break;
+   }
+
+   /* split in the middle of the bvec */
+   if (len)
+   need_split = true;
+
+   /* update front segment size */
+   if (!*nsegs) {
+   unsigned first_seg_size = seg_size;
+
+   if (new_nsegs > 1)
+   first_seg_size = queue_max_segment_size(q);
+   if (*front_seg_size < first_seg_size)
+   *front_seg_size = first_seg_size;
+   }
+
+   /* update other varibles */
+   *last_seg_size = seg_size;
+   *nsegs += new_nsegs;
+   if (sectors)
+   *sectors += total_len >> 9;
+
+   return need_split;
+}
+
 static struct bio *blk_bio_segment_split(struct request_queue *q,
 struct bio *bio,
 struct bio_set *bs,
@@ -173,7 +229,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
struct bio *new = NULL;
const unsigned max_sectors = get_max_io_size(q, bio);
 
-   bio_for_each_segment(bv, bio, iter) {
+   bio_for_each_bvec(bv, bio, iter) {
/*
 * If the queue doesn't support SG gaps and adding this
 * offset would create a gap, disallow it.
@@ -188,8 +244,12 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 */
if (nsegs < queue_max_segments(q) &&
sectors < max_sectors) {
-   nsegs++;
-   sectors = max_sectors;
+   /* split in the middle of bvec */
+   bv.bv_len = (max_sectors - sectors) << 9;
+   bvec_split_segs(q, &bv, &nsegs,
+   &seg_size,
+   &front_seg_size,
+   &sectors);
}
goto split;
}
@@ -214,11 +274,12 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
if (nsegs == 1 && seg_size > front_seg_size)
front_seg_size = seg_size;
 
-   nsegs++;
bvprv = bv;
bvprvp = &bv;
-   seg_size = bv.bv_len;
-   sectors += bv.bv_len >> 9;
+
+   if (bvec_split_segs(q, &bv, &nsegs, &seg_size,
+ 

[dm-devel] [PATCH V10 04/19] block: use bio_for_each_bvec() to map sg

2018-11-15 Thread Ming Lei
It is more efficient to use bio_for_each_bvec() to map sg; meanwhile
we have to consider splitting the multipage bvec as done in blk_bio_segment_split().

Cc: Dave Chinner 
Cc: Kent Overstreet 
Cc: Mike Snitzer 
Cc: dm-devel@redhat.com
Cc: Alexander Viro 
Cc: linux-fsde...@vger.kernel.org
Cc: Shaohua Li 
Cc: linux-r...@vger.kernel.org
Cc: linux-er...@lists.ozlabs.org
Cc: David Sterba 
Cc: linux-bt...@vger.kernel.org
Cc: Darrick J. Wong 
Cc: linux-...@vger.kernel.org
Cc: Gao Xiang 
Cc: Christoph Hellwig 
Cc: Theodore Ts'o 
Cc: linux-e...@vger.kernel.org
Cc: Coly Li 
Cc: linux-bca...@vger.kernel.org
Cc: Boaz Harrosh 
Cc: Bob Peterson 
Cc: cluster-de...@redhat.com
Signed-off-by: Ming Lei 
---
 block/blk-merge.c | 72 +++
 1 file changed, 52 insertions(+), 20 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 6f7deb94a23f..cb9f49bcfd36 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -473,6 +473,56 @@ static int blk_phys_contig_segment(struct request_queue *q, struct bio *bio,
return biovec_phys_mergeable(q, &end_bv, &nxt_bv);
 }
 
+static struct scatterlist *blk_next_sg(struct scatterlist **sg,
+   struct scatterlist *sglist)
+{
+   if (!*sg)
+   return sglist;
+   else {
+   /*
+* If the driver previously mapped a shorter
+* list, we could see a termination bit
+* prematurely unless it fully inits the sg
+* table on each mapping. We KNOW that there
+* must be more entries here or the driver
+* would be buggy, so force clear the
+* termination bit to avoid doing a full
+* sg_init_table() in drivers for each command.
+*/
+   sg_unmark_end(*sg);
+   return sg_next(*sg);
+   }
+}
+
+static unsigned blk_bvec_map_sg(struct request_queue *q,
+   struct bio_vec *bvec, struct scatterlist *sglist,
+   struct scatterlist **sg)
+{
+   unsigned nbytes = bvec->bv_len;
+   unsigned nsegs = 0, total = 0;
+
+   while (nbytes > 0) {
+   unsigned seg_size;
+   struct page *pg;
+   unsigned offset, idx;
+
+   *sg = blk_next_sg(sg, sglist);
+
+   seg_size = min(nbytes, queue_max_segment_size(q));
+   offset = (total + bvec->bv_offset) % PAGE_SIZE;
+   idx = (total + bvec->bv_offset) / PAGE_SIZE;
+   pg = nth_page(bvec->bv_page, idx);
+
+   sg_set_page(*sg, pg, seg_size, offset);
+
+   total += seg_size;
+   nbytes -= seg_size;
+   nsegs++;
+   }
+
+   return nsegs;
+}
+
 static inline void
 __blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
 struct scatterlist *sglist, struct bio_vec *bvprv,
@@ -490,25 +540,7 @@ __blk_segment_map_sg(struct request_queue *q, struct bio_vec *bvec,
(*sg)->length += nbytes;
} else {
 new_segment:
-   if (!*sg)
-   *sg = sglist;
-   else {
-   /*
-* If the driver previously mapped a shorter
-* list, we could see a termination bit
-* prematurely unless it fully inits the sg
-* table on each mapping. We KNOW that there
-* must be more entries here or the driver
-* would be buggy, so force clear the
-* termination bit to avoid doing a full
-* sg_init_table() in drivers for each command.
-*/
-   sg_unmark_end(*sg);
-   *sg = sg_next(*sg);
-   }
-
-   sg_set_page(*sg, bvec->bv_page, nbytes, bvec->bv_offset);
-   (*nsegs)++;
+   (*nsegs) += blk_bvec_map_sg(q, bvec, sglist, sg);
}
*bvprv = *bvec;
 }
@@ -530,7 +562,7 @@ static int __blk_bios_map_sg(struct request_queue *q, struct bio *bio,
int cluster = blk_queue_cluster(q), nsegs = 0;
 
for_each_bio(bio)
-   bio_for_each_segment(bvec, bio, iter)
+   bio_for_each_bvec(bvec, bio, iter)
__blk_segment_map_sg(q, &bvec, sglist, &bvprv, sg,
 &nsegs, &cluster);
 
-- 
2.9.5



[dm-devel] [PATCH V10 06/19] fs/buffer.c: use bvec iterator to truncate the bio

2018-11-15 Thread Ming Lei
Once multi-page bvec is enabled, the last bvec may include more than one
page; this patch uses bvec_last_segment() to truncate the bio.
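
For context, the assumed semantics of bvec_last_segment() (introduced
earlier in this series) are to extract the final single-page segment of a
possibly multi-page bvec; the body below is an illustration, not a copy of
the patch:

    static inline void bvec_last_segment(const struct bio_vec *bvec,
                                         struct bio_vec *seg)
    {
        unsigned total = bvec->bv_offset + bvec->bv_len;
        unsigned last = (total - 1) / PAGE_SIZE; /* index of last page */

        seg->bv_page = nth_page(bvec->bv_page, last);
        seg->bv_offset = last ? 0 : bvec->bv_offset;
        seg->bv_len = total - last * PAGE_SIZE - seg->bv_offset;
    }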

Cc: Dave Chinner 
Cc: Kent Overstreet 
Cc: Mike Snitzer 
Cc: dm-devel@redhat.com
Cc: Alexander Viro 
Cc: linux-fsde...@vger.kernel.org
Cc: Shaohua Li 
Cc: linux-r...@vger.kernel.org
Cc: linux-er...@lists.ozlabs.org
Cc: David Sterba 
Cc: linux-bt...@vger.kernel.org
Cc: Darrick J. Wong 
Cc: linux-...@vger.kernel.org
Cc: Gao Xiang 
Cc: Christoph Hellwig 
Cc: Theodore Ts'o 
Cc: linux-e...@vger.kernel.org
Cc: Coly Li 
Cc: linux-bca...@vger.kernel.org
Cc: Boaz Harrosh 
Cc: Bob Peterson 
Cc: cluster-de...@redhat.com
Signed-off-by: Ming Lei 
---
 fs/buffer.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 1286c2b95498..fa37ad52e962 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3032,7 +3032,10 @@ void guard_bio_eod(int op, struct bio *bio)
 
/* ..and clear the end of the buffer for reads */
if (op == REQ_OP_READ) {
-   zero_user(bvec->bv_page, bvec->bv_offset + bvec->bv_len,
+   struct bio_vec bv;
+
+   bvec_last_segment(bvec, &bv);
+   zero_user(bv.bv_page, bv.bv_offset + bv.bv_len,
truncated_bytes);
}
 }
-- 
2.9.5



[dm-devel] [PATCH V10 15/19] block: always define BIO_MAX_PAGES as 256

2018-11-15 Thread Ming Lei
Now multi-page bvec can cover CONFIG_THP_SWAP, so we don't need to
increase BIO_MAX_PAGES for it.

Cc: Dave Chinner 
Cc: Kent Overstreet 
Cc: Mike Snitzer 
Cc: dm-devel@redhat.com
Cc: Alexander Viro 
Cc: linux-fsde...@vger.kernel.org
Cc: Shaohua Li 
Cc: linux-r...@vger.kernel.org
Cc: linux-er...@lists.ozlabs.org
Cc: David Sterba 
Cc: linux-bt...@vger.kernel.org
Cc: Darrick J. Wong 
Cc: linux-...@vger.kernel.org
Cc: Gao Xiang 
Cc: Christoph Hellwig 
Cc: Theodore Ts'o 
Cc: linux-e...@vger.kernel.org
Cc: Coly Li 
Cc: linux-bca...@vger.kernel.org
Cc: Boaz Harrosh 
Cc: Bob Peterson 
Cc: cluster-de...@redhat.com
Signed-off-by: Ming Lei 
---
 include/linux/bio.h | 8 
 1 file changed, 8 deletions(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 5040e9a2eb09..277921ad42e7 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -34,15 +34,7 @@
 #define BIO_BUG_ON
 #endif
 
-#ifdef CONFIG_THP_SWAP
-#if HPAGE_PMD_NR > 256
-#define BIO_MAX_PAGES  HPAGE_PMD_NR
-#else
 #define BIO_MAX_PAGES  256
-#endif
-#else
-#define BIO_MAX_PAGES  256
-#endif
 
 #define bio_prio(bio)  (bio)->bi_ioprio
 #define bio_set_prio(bio, prio)((bio)->bi_ioprio = prio)
-- 
2.9.5



[dm-devel] [PATCH V10 07/19] btrfs: use bvec_last_segment to get bio's last page

2018-11-15 Thread Ming Lei
Preparing for supporting multi-page bvec.

Cc: Dave Chinner 
Cc: Kent Overstreet 
Cc: Mike Snitzer 
Cc: dm-devel@redhat.com
Cc: Alexander Viro 
Cc: linux-fsde...@vger.kernel.org
Cc: Shaohua Li 
Cc: linux-r...@vger.kernel.org
Cc: linux-er...@lists.ozlabs.org
Cc: David Sterba 
Cc: linux-bt...@vger.kernel.org
Cc: Darrick J. Wong 
Cc: linux-...@vger.kernel.org
Cc: Gao Xiang 
Cc: Christoph Hellwig 
Cc: Theodore Ts'o 
Cc: linux-e...@vger.kernel.org
Cc: Coly Li 
Cc: linux-bca...@vger.kernel.org
Cc: Boaz Harrosh 
Cc: Bob Peterson 
Cc: cluster-de...@redhat.com
Signed-off-by: Ming Lei 
---
 fs/btrfs/compression.c | 5 -
 fs/btrfs/extent_io.c   | 5 +++--
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 2955a4ea2fa8..161e14b8b180 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -400,8 +400,11 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 static u64 bio_end_offset(struct bio *bio)
 {
struct bio_vec *last = bio_last_bvec_all(bio);
+   struct bio_vec bv;
 
-   return page_offset(last->bv_page) + last->bv_len + last->bv_offset;
+   bvec_last_segment(last, &bv);
+
+   return page_offset(bv.bv_page) + bv.bv_len + bv.bv_offset;
 }
 
 static noinline int add_ra_bio_pages(struct inode *inode,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d228f706ff3e..5d5965297e7e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2720,11 +2720,12 @@ static int __must_check submit_one_bio(struct bio *bio, int mirror_num,
 {
blk_status_t ret = 0;
struct bio_vec *bvec = bio_last_bvec_all(bio);
-   struct page *page = bvec->bv_page;
+   struct bio_vec bv;
struct extent_io_tree *tree = bio->bi_private;
u64 start;
 
-   start = page_offset(page) + bvec->bv_offset;
+   bvec_last_segment(bvec, &bv);
+   start = page_offset(bv.bv_page) + bv.bv_offset;
 
bio->bi_private = NULL;
 
-- 
2.9.5



[dm-devel] [PATCH V10 09/19] block: introduce bio_bvecs()

2018-11-15 Thread Ming Lei
There are still cases in which we need to use bio_bvecs() to get the
number of multi-page segments, so introduce it.

Cc: Dave Chinner 
Cc: Kent Overstreet 
Cc: Mike Snitzer 
Cc: dm-devel@redhat.com
Cc: Alexander Viro 
Cc: linux-fsde...@vger.kernel.org
Cc: Shaohua Li 
Cc: linux-r...@vger.kernel.org
Cc: linux-er...@lists.ozlabs.org
Cc: David Sterba 
Cc: linux-bt...@vger.kernel.org
Cc: Darrick J. Wong 
Cc: linux-...@vger.kernel.org
Cc: Gao Xiang 
Cc: Christoph Hellwig 
Cc: Theodore Ts'o 
Cc: linux-e...@vger.kernel.org
Cc: Coly Li 
Cc: linux-bca...@vger.kernel.org
Cc: Boaz Harrosh 
Cc: Bob Peterson 
Cc: cluster-de...@redhat.com
Signed-off-by: Ming Lei 
---
 include/linux/bio.h | 30 +-
 1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 1f0dcf109841..3496c816946e 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -196,7 +196,6 @@ static inline unsigned bio_segments(struct bio *bio)
 * We special case discard/write same/write zeroes, because they
 * interpret bi_size differently:
 */
-
switch (bio_op(bio)) {
case REQ_OP_DISCARD:
case REQ_OP_SECURE_ERASE:
@@ -205,13 +204,34 @@ static inline unsigned bio_segments(struct bio *bio)
case REQ_OP_WRITE_SAME:
return 1;
default:
-   break;
+   bio_for_each_segment(bv, bio, iter)
+   segs++;
+   return segs;
}
+}
 
-   bio_for_each_segment(bv, bio, iter)
-   segs++;
+static inline unsigned bio_bvecs(struct bio *bio)
+{
+   unsigned bvecs = 0;
+   struct bio_vec bv;
+   struct bvec_iter iter;
 
-   return segs;
+   /*
+* We special case discard/write same/write zeroes, because they
+* interpret bi_size differently:
+*/
+   switch (bio_op(bio)) {
+   case REQ_OP_DISCARD:
+   case REQ_OP_SECURE_ERASE:
+   case REQ_OP_WRITE_ZEROES:
+   return 0;
+   case REQ_OP_WRITE_SAME:
+   return 1;
+   default:
+   bio_for_each_bvec(bv, bio, iter)
+   bvecs++;
+   return bvecs;
+   }
 }
 
 /*
-- 
2.9.5



[dm-devel] [PATCH V10 10/19] block: loop: pass multi-page bvec to iov_iter

2018-11-15 Thread Ming Lei
iov_iter is implemented on top of the bvec iterator, so it is safe to pass
a multipage bvec to it, and this way is much more efficient than
passing one page in each bvec.

Cc: Dave Chinner 
Cc: Kent Overstreet 
Cc: Mike Snitzer 
Cc: dm-devel@redhat.com
Cc: Alexander Viro 
Cc: linux-fsde...@vger.kernel.org
Cc: Shaohua Li 
Cc: linux-r...@vger.kernel.org
Cc: linux-er...@lists.ozlabs.org
Cc: David Sterba 
Cc: linux-bt...@vger.kernel.org
Cc: Darrick J. Wong 
Cc: linux-...@vger.kernel.org
Cc: Gao Xiang 
Cc: Christoph Hellwig 
Cc: Theodore Ts'o 
Cc: linux-e...@vger.kernel.org
Cc: Coly Li 
Cc: linux-bca...@vger.kernel.org
Cc: Boaz Harrosh 
Cc: Bob Peterson 
Cc: cluster-de...@redhat.com
Signed-off-by: Ming Lei 
---
 drivers/block/loop.c | 23 ---
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index bf6bc35aaf88..a3fd418ec637 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -515,16 +515,16 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
struct bio *bio = rq->bio;
struct file *file = lo->lo_backing_file;
unsigned int offset;
-   int segments = 0;
+   int nr_bvec = 0;
int ret;
 
if (rq->bio != rq->biotail) {
-   struct req_iterator iter;
+   struct bvec_iter iter;
struct bio_vec tmp;
 
__rq_for_each_bio(bio, rq)
-   segments += bio_segments(bio);
-   bvec = kmalloc_array(segments, sizeof(struct bio_vec),
+   nr_bvec += bio_bvecs(bio);
+   bvec = kmalloc_array(nr_bvec, sizeof(struct bio_vec),
 GFP_NOIO);
if (!bvec)
return -EIO;
@@ -533,13 +533,14 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
/*
 * The bios of the request may be started from the middle of
 * the 'bvec' because of bio splitting, so we can't directly
-* copy bio->bi_iov_vec to new bvec. The rq_for_each_segment
+* copy bio->bi_iov_vec to new bvec. The bio_for_each_bvec
 * API will take care of all details for us.
 */
-   rq_for_each_segment(tmp, rq, iter) {
-   *bvec = tmp;
-   bvec++;
-   }
+   __rq_for_each_bio(bio, rq)
+   bio_for_each_bvec(tmp, bio, iter) {
+   *bvec = tmp;
+   bvec++;
+   }
bvec = cmd->bvec;
offset = 0;
} else {
@@ -550,11 +551,11 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
 */
offset = bio->bi_iter.bi_bvec_done;
bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
-   segments = bio_segments(bio);
+   nr_bvec = bio_bvecs(bio);
}
atomic_set(&cmd->ref, 2);
 
-   iov_iter_bvec(&iter, rw, bvec, nr_bvec, blk_rq_bytes(rq));
+   iov_iter_bvec(, rw, bvec, nr_bvec, blk_rq_bytes(rq));
iter.iov_offset = offset;
 
cmd->iocb.ki_pos = pos;
-- 
2.9.5



[dm-devel] [PATCH V10 18/19] block: kill QUEUE_FLAG_NO_SG_MERGE

2018-11-15 Thread Ming Lei
Since bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting"),
physical segment number is mainly figured out in blk_queue_split() for
fast path, and the flag of BIO_SEG_VALID is set there too.

Now only blk_recount_segments() and blk_recalc_rq_segments() use this
flag.

Basically blk_recount_segments() is bypassed in fast path given BIO_SEG_VALID
is set in blk_queue_split().

For another user of blk_recalc_rq_segments():

- run in partial completion branch of blk_update_request, which is an unusual 
case

- run in blk_cloned_rq_check_limits(), still not a big problem if the flag is 
killed
since dm-rq is the only user.

Multi-page bvec is enabled now, so QUEUE_FLAG_NO_SG_MERGE doesn't make sense any more.

Cc: Dave Chinner 
Cc: Kent Overstreet 
Cc: Mike Snitzer 
Cc: dm-devel@redhat.com
Cc: Alexander Viro 
Cc: linux-fsde...@vger.kernel.org
Cc: Shaohua Li 
Cc: linux-r...@vger.kernel.org
Cc: linux-er...@lists.ozlabs.org
Cc: David Sterba 
Cc: linux-bt...@vger.kernel.org
Cc: Darrick J. Wong 
Cc: linux-...@vger.kernel.org
Cc: Gao Xiang 
Cc: Christoph Hellwig 
Cc: Theodore Ts'o 
Cc: linux-e...@vger.kernel.org
Cc: Coly Li 
Cc: linux-bca...@vger.kernel.org
Cc: Boaz Harrosh 
Cc: Bob Peterson 
Cc: cluster-de...@redhat.com
Signed-off-by: Ming Lei 
---
 block/blk-merge.c  | 31 ++-
 block/blk-mq-debugfs.c |  1 -
 block/blk-mq.c |  3 ---
 drivers/md/dm-table.c  | 13 -
 include/linux/blkdev.h |  1 -
 5 files changed, 6 insertions(+), 43 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 153a659fde74..06be298be332 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -351,8 +351,7 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
 EXPORT_SYMBOL(blk_queue_split);
 
 static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
-struct bio *bio,
-bool no_sg_merge)
+struct bio *bio)
 {
struct bio_vec bv, bvprv = { NULL };
int cluster, prev = 0;
@@ -379,13 +378,6 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
nr_phys_segs = 0;
for_each_bio(bio) {
bio_for_each_bvec(bv, bio, iter) {
-   /*
-* If SG merging is disabled, each bio vector is
-* a segment
-*/
-   if (no_sg_merge)
-   goto new_segment;
-
if (prev && cluster) {
if (seg_size + bv.bv_len
> queue_max_segment_size(q))
@@ -420,27 +412,16 @@ static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
 
 void blk_recalc_rq_segments(struct request *rq)
 {
-   bool no_sg_merge = !!test_bit(QUEUE_FLAG_NO_SG_MERGE,
-   &rq->q->queue_flags);
-
-   rq->nr_phys_segments = __blk_recalc_rq_segments(rq->q, rq->bio,
-   no_sg_merge);
+   rq->nr_phys_segments = __blk_recalc_rq_segments(rq->q, rq->bio);
 }
 
 void blk_recount_segments(struct request_queue *q, struct bio *bio)
 {
-   unsigned short seg_cnt = bio_segments(bio);
-
-   if (test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags) &&
-   (seg_cnt < queue_max_segments(q)))
-   bio->bi_phys_segments = seg_cnt;
-   else {
-   struct bio *nxt = bio->bi_next;
+   struct bio *nxt = bio->bi_next;
 
-   bio->bi_next = NULL;
-   bio->bi_phys_segments = __blk_recalc_rq_segments(q, bio, false);
-   bio->bi_next = nxt;
-   }
+   bio->bi_next = NULL;
+   bio->bi_phys_segments = __blk_recalc_rq_segments(q, bio);
+   bio->bi_next = nxt;
 
bio_set_flag(bio, BIO_SEG_VALID);
 }
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index f021f4817b80..e188b1090759 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -128,7 +128,6 @@ static const char *const blk_queue_flag_name[] = {
QUEUE_FLAG_NAME(SAME_FORCE),
QUEUE_FLAG_NAME(DEAD),
QUEUE_FLAG_NAME(INIT_DONE),
-   QUEUE_FLAG_NAME(NO_SG_MERGE),
QUEUE_FLAG_NAME(POLL),
QUEUE_FLAG_NAME(WC),
QUEUE_FLAG_NAME(FUA),
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 411be60d0cb6..ed484af5744b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2755,9 +2755,6 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 
q->queue_flags |= QUEUE_FLAG_MQ_DEFAULT;
 
-   if (!(set->flags & BLK_MQ_F_SG_MERGE))
-   queue_flag_set_unlocked(QUEUE_FLAG_NO_SG_MERGE, q);
-
q->sg_reserved_size = INT_MAX;
 
INIT_DELAYED_WORK(&q->requeue_work, blk_mq_requeue_work);
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 9038c302d5c2..22fed6987aea 100644
--- 

[dm-devel] [PATCH V10 13/19] iomap & xfs: only account for new added page

2018-11-15 Thread Ming Lei
After multi-page bvec is enabled, a new page may be merged into an existing
segment even though it is a newly added page.

This patch deals with this issue by post-checking in the merge case, so
that only a freshly added page needs to be accounted for in iomap & xfs.

Cc: Dave Chinner 
Cc: Kent Overstreet 
Cc: Mike Snitzer 
Cc: dm-devel@redhat.com
Cc: Alexander Viro 
Cc: linux-fsde...@vger.kernel.org
Cc: Shaohua Li 
Cc: linux-r...@vger.kernel.org
Cc: linux-er...@lists.ozlabs.org
Cc: David Sterba 
Cc: linux-bt...@vger.kernel.org
Cc: Darrick J. Wong 
Cc: linux-...@vger.kernel.org
Cc: Gao Xiang 
Cc: Christoph Hellwig 
Cc: Theodore Ts'o 
Cc: linux-e...@vger.kernel.org
Cc: Coly Li 
Cc: linux-bca...@vger.kernel.org
Cc: Boaz Harrosh 
Cc: Bob Peterson 
Cc: cluster-de...@redhat.com
Signed-off-by: Ming Lei 
---
 fs/iomap.c  | 22 ++
 fs/xfs/xfs_aops.c   | 10 --
 include/linux/bio.h | 11 +++
 3 files changed, 33 insertions(+), 10 deletions(-)

diff --git a/fs/iomap.c b/fs/iomap.c
index df0212560b36..a1b97a5c726a 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -288,6 +288,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
loff_t orig_pos = pos;
unsigned poff, plen;
sector_t sector;
+   bool need_account = false;
 
if (iomap->type == IOMAP_INLINE) {
WARN_ON_ONCE(pos);
@@ -313,18 +314,15 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 */
sector = iomap_sector(iomap, pos);
if (ctx->bio && bio_end_sector(ctx->bio) == sector) {
-   if (__bio_try_merge_page(ctx->bio, page, plen, poff))
+   if (__bio_try_merge_page(ctx->bio, page, plen, poff)) {
+   need_account = iop && bio_is_last_segment(ctx->bio,
+   page, plen, poff);
goto done;
+   }
is_contig = true;
}
 
-   /*
-* If we start a new segment we need to increase the read count, and we
-* need to do so before submitting any previous full bio to make sure
-* that we don't prematurely unlock the page.
-*/
-   if (iop)
-   atomic_inc(&iop->read_count);
+   need_account = true;
 
if (!ctx->bio || !is_contig || bio_full(ctx->bio)) {
gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL);
@@ -347,6 +345,14 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
__bio_add_page(ctx->bio, page, plen, poff);
 done:
/*
+* If we add a new page we need to increase the read count, and we
+* need to do so before submitting any previous full bio to make sure
+* that we don't prematurely unlock the page.
+*/
+   if (iop && need_account)
+   atomic_inc(&iop->read_count);
+
+   /*
 * Move the caller beyond our range so that it keeps making progress.
 * For that we have to include any leading non-uptodate ranges, but
 * we can skip trailing ones as they will be handled in the next
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 1f1829e506e8..d8e9cc9f751a 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -603,6 +603,7 @@ xfs_add_to_ioend(
unsignedlen = i_blocksize(inode);
unsignedpoff = offset & (PAGE_SIZE - 1);
sector_tsector;
+   boolneed_account;
 
sector = xfs_fsb_to_db(ip, wpc->imap.br_startblock) +
((offset - XFS_FSB_TO_B(mp, wpc->imap.br_startoff)) >> 9);
@@ -617,13 +618,18 @@ xfs_add_to_ioend(
}
 
if (!__bio_try_merge_page(wpc->ioend->io_bio, page, len, poff)) {
-   if (iop)
-   atomic_inc(&iop->write_count);
+   need_account = true;
if (bio_full(wpc->ioend->io_bio))
xfs_chain_bio(wpc->ioend, wbc, bdev, sector);
__bio_add_page(wpc->ioend->io_bio, page, len, poff);
+   } else {
+   need_account = iop && bio_is_last_segment(wpc->ioend->io_bio,
+   page, len, poff);
}
 
+   if (iop && need_account)
+   atomic_inc(&iop->write_count);
+
wpc->ioend->io_size += len;
 }
 
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 1a2430a8b89d..5040e9a2eb09 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -341,6 +341,17 @@ static inline struct bio_vec *bio_last_bvec_all(struct bio *bio)
return &bio->bi_io_vec[bio->bi_vcnt - 1];
 }
 
+/* iomap needs this helper to deal with sub-pagesize bvec */
+static inline bool bio_is_last_segment(struct bio *bio, struct page *page,
+   unsigned int len, unsigned int off)
+{
+   struct bio_vec bv;
+
+   bvec_last_segment(bio_last_bvec_all(bio), &bv);
+
+   return bv.bv_page == page && bv.bv_len == len && bv.bv_offset == off;
+}

[dm-devel] [PATCH V10 14/19] block: enable multipage bvecs

2018-11-15 Thread Ming Lei
This patch pulls the trigger for multi-page bvecs.

Now any request queue that supports queue clustering will see multi-page
bvecs.
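
The new merge rule in __bio_try_merge_page() below reduces to two cases;
a condensed restatement under a hypothetical name, with the same checks
as the hunk:

    static bool mp_bvec_can_merge(struct request_queue *q,
                                  const struct bio_vec *bv,
                                  struct page *page,
                                  unsigned int len, unsigned int off)
    {
        /* case 1: contiguous inside the very same page */
        if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len &&
            off + len <= PAGE_SIZE)
            return true;

        /* case 2: physically contiguous across pages, allowed only
         * when the queue supports clustering */
        return q && blk_queue_cluster(q) &&
               page_to_phys(bv->bv_page) + bv->bv_offset + bv->bv_len ==
               page_to_phys(page) + off;
    }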

Cc: Dave Chinner 
Cc: Kent Overstreet 
Cc: Mike Snitzer 
Cc: dm-devel@redhat.com
Cc: Alexander Viro 
Cc: linux-fsde...@vger.kernel.org
Cc: Shaohua Li 
Cc: linux-r...@vger.kernel.org
Cc: linux-er...@lists.ozlabs.org
Cc: David Sterba 
Cc: linux-bt...@vger.kernel.org
Cc: Darrick J. Wong 
Cc: linux-...@vger.kernel.org
Cc: Gao Xiang 
Cc: Christoph Hellwig 
Cc: Theodore Ts'o 
Cc: linux-e...@vger.kernel.org
Cc: Coly Li 
Cc: linux-bca...@vger.kernel.org
Cc: Boaz Harrosh 
Cc: Bob Peterson 
Cc: cluster-de...@redhat.com
Signed-off-by: Ming Lei 
---
 block/bio.c | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 6486722d4d4b..ed6df6f8e63d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -767,12 +767,24 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
 
if (bio->bi_vcnt > 0) {
struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
-
-   if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len) {
-   bv->bv_len += len;
-   bio->bi_iter.bi_size += len;
-   return true;
-   }
+   struct request_queue *q = NULL;
+
+   if (page == bv->bv_page && off == (bv->bv_offset + bv->bv_len)
+   && (off + len) <= PAGE_SIZE)
+   goto merge;
+
+   if (bio->bi_disk)
+   q = bio->bi_disk->queue;
+
+   /* disable multi-page bvec too if cluster isn't enabled */
+   if (!q || !blk_queue_cluster(q) ||
+   ((page_to_phys(bv->bv_page) + bv->bv_offset + bv->bv_len) !=
+(page_to_phys(page) + off)))
+   return false;
+ merge:
+   bv->bv_len += len;
+   bio->bi_iter.bi_size += len;
+   return true;
}
return false;
 }
-- 
2.9.5

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


[dm-devel] [PATCH V10 17/19] block: don't use bio->bi_vcnt to figure out segment number

2018-11-15 Thread Ming Lei
It is wrong to use bio->bi_vcnt to figure out how many segments
there are in the bio, even when the CLONED flag isn't set on it,
because the bio may have been split or advanced.

So always use bio_segments() in blk_recount_segments(). This shouldn't
cause any performance loss now, because the physical segment number is
figured out in blk_queue_split() and BIO_SEG_VALID is set at the same
time, since commit bdced438acd83ad83a6c ("block: setup bi_phys_segments
after splitting").
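
bio_segments() stays correct because it walks the range still described
by bi_iter instead of trusting the bvec table size; roughly (a sketch
that ignores the special-cased no-data operations):

    static unsigned short segments_by_iter(struct bio *bio)
    {
        struct bio_vec bv;
        struct bvec_iter iter;
        unsigned short segs = 0;

        /* bi_iter reflects splits and advances; bi_vcnt does not */
        bio_for_each_segment(bv, bio, iter)
            segs++;

        return segs;
    }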

Cc: Dave Chinner 
Cc: Kent Overstreet 
Fixes: 7f60dcaaf91 ("block: blk-merge: fix blk_recount_segments()")
Cc: Mike Snitzer 
Cc: dm-devel@redhat.com
Cc: Alexander Viro 
Cc: linux-fsde...@vger.kernel.org
Cc: Shaohua Li 
Cc: linux-r...@vger.kernel.org
Cc: linux-er...@lists.ozlabs.org
Cc: David Sterba 
Cc: linux-bt...@vger.kernel.org
Cc: Darrick J. Wong 
Cc: linux-...@vger.kernel.org
Cc: Gao Xiang 
Cc: Christoph Hellwig 
Cc: Theodore Ts'o 
Cc: linux-e...@vger.kernel.org
Cc: Coly Li 
Cc: linux-bca...@vger.kernel.org
Cc: Boaz Harrosh 
Cc: Bob Peterson 
Cc: cluster-de...@redhat.com
Signed-off-by: Ming Lei 
---
 block/blk-merge.c | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index cb9f49bcfd36..153a659fde74 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -429,13 +429,7 @@ void blk_recalc_rq_segments(struct request *rq)
 
 void blk_recount_segments(struct request_queue *q, struct bio *bio)
 {
-   unsigned short seg_cnt;
-
-   /* estimate segment number by bi_vcnt for non-cloned bio */
-   if (bio_flagged(bio, BIO_CLONED))
-   seg_cnt = bio_segments(bio);
-   else
-   seg_cnt = bio->bi_vcnt;
+   unsigned short seg_cnt = bio_segments(bio);
 
if (test_bit(QUEUE_FLAG_NO_SG_MERGE, &q->queue_flags) &&
(seg_cnt < queue_max_segments(q)))
-- 
2.9.5

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


[dm-devel] [PATCH V10 19/19] block: kill BLK_MQ_F_SG_MERGE

2018-11-15 Thread Ming Lei
QUEUE_FLAG_NO_SG_MERGE has been killed, so kill BLK_MQ_F_SG_MERGE too.

Cc: Dave Chinner 
Cc: Kent Overstreet 
Cc: Mike Snitzer 
Cc: dm-devel@redhat.com
Cc: Alexander Viro 
Cc: linux-fsde...@vger.kernel.org
Cc: Shaohua Li 
Cc: linux-r...@vger.kernel.org
Cc: linux-er...@lists.ozlabs.org
Cc: David Sterba 
Cc: linux-bt...@vger.kernel.org
Cc: Darrick J. Wong 
Cc: linux-...@vger.kernel.org
Cc: Gao Xiang 
Cc: Christoph Hellwig 
Cc: Theodore Ts'o 
Cc: linux-e...@vger.kernel.org
Cc: Coly Li 
Cc: linux-bca...@vger.kernel.org
Cc: Boaz Harrosh 
Cc: Bob Peterson 
Cc: cluster-de...@redhat.com
Signed-off-by: Ming Lei 
---
 block/blk-mq-debugfs.c   | 1 -
 drivers/block/loop.c | 2 +-
 drivers/block/nbd.c  | 2 +-
 drivers/block/rbd.c  | 2 +-
 drivers/block/skd_main.c | 1 -
 drivers/block/xen-blkfront.c | 2 +-
 drivers/md/dm-rq.c   | 2 +-
 drivers/mmc/core/queue.c | 3 +--
 drivers/scsi/scsi_lib.c  | 2 +-
 include/linux/blk-mq.h   | 1 -
 10 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index e188b1090759..e1c12358391a 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -250,7 +250,6 @@ static const char *const alloc_policy_name[] = {
 static const char *const hctx_flag_name[] = {
HCTX_FLAG_NAME(SHOULD_MERGE),
HCTX_FLAG_NAME(TAG_SHARED),
-   HCTX_FLAG_NAME(SG_MERGE),
HCTX_FLAG_NAME(BLOCKING),
HCTX_FLAG_NAME(NO_SCHED),
 };
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index a3fd418ec637..d509902a8046 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1907,7 +1907,7 @@ static int loop_add(struct loop_device **l, int i)
lo->tag_set.queue_depth = 128;
lo->tag_set.numa_node = NUMA_NO_NODE;
lo->tag_set.cmd_size = sizeof(struct loop_cmd);
-   lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+   lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
lo->tag_set.driver_data = lo;
 
err = blk_mq_alloc_tag_set(&lo->tag_set);
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 08696f5f00bb..999c94de78e5 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -1570,7 +1570,7 @@ static int nbd_dev_add(int index)
nbd->tag_set.numa_node = NUMA_NO_NODE;
nbd->tag_set.cmd_size = sizeof(struct nbd_cmd);
nbd->tag_set.flags = BLK_MQ_F_SHOULD_MERGE |
-   BLK_MQ_F_SG_MERGE | BLK_MQ_F_BLOCKING;
+   BLK_MQ_F_BLOCKING;
nbd->tag_set.driver_data = nbd;
 
err = blk_mq_alloc_tag_set(&nbd->tag_set);
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 8e5140bbf241..3dfd300b5283 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3988,7 +3988,7 @@ static int rbd_init_disk(struct rbd_device *rbd_dev)
rbd_dev->tag_set.ops = _mq_ops;
rbd_dev->tag_set.queue_depth = rbd_dev->opts->queue_depth;
rbd_dev->tag_set.numa_node = NUMA_NO_NODE;
-   rbd_dev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+   rbd_dev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
rbd_dev->tag_set.nr_hw_queues = 1;
rbd_dev->tag_set.cmd_size = sizeof(struct work_struct);
 
diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c
index a10d5736d8f7..a7040f9a1b1b 100644
--- a/drivers/block/skd_main.c
+++ b/drivers/block/skd_main.c
@@ -2843,7 +2843,6 @@ static int skd_cons_disk(struct skd_device *skdev)
skdev->sgs_per_request * sizeof(struct scatterlist);
skdev->tag_set.numa_node = NUMA_NO_NODE;
skdev->tag_set.flags = BLK_MQ_F_SHOULD_MERGE |
-   BLK_MQ_F_SG_MERGE |
BLK_ALLOC_POLICY_TO_MQ_FLAG(BLK_TAG_ALLOC_FIFO);
skdev->tag_set.driver_data = skdev;
rc = blk_mq_alloc_tag_set(&skdev->tag_set);
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 56452cabce5b..297412bf23e1 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -977,7 +977,7 @@ static int xlvbd_init_blk_queue(struct gendisk *gd, u16 sector_size,
} else
info->tag_set.queue_depth = BLK_RING_SIZE(info);
info->tag_set.numa_node = NUMA_NO_NODE;
-   info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+   info->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
info->tag_set.cmd_size = sizeof(struct blkif_req);
info->tag_set.driver_data = info;
 
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 7cd36e4d1310..140ada0b99fc 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -536,7 +536,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
md->tag_set->ops = _mq_ops;
md->tag_set->queue_depth = dm_get_blk_mq_queue_depth();
md->tag_set->numa_node = md->numa_node_id;
-   md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_SG_MERGE;
+   md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE;

[dm-devel] [PATCH V10 16/19] block: document usage of bio iterator helpers

2018-11-15 Thread Ming Lei
Now that multi-page bvecs are supported, some helpers return data page
by page while others return it segment by segment. This patch documents
the usage.
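
As a usage contrast (process_page() and map_segment() are hypothetical
consumers, not kernel APIs):

    struct bio_vec bv;
    struct bvec_iter iter;

    /* page by page: bv never crosses a page boundary */
    bio_for_each_segment(bv, bio, iter)
        process_page(bv.bv_page, bv.bv_offset, bv.bv_len);

    /* segment by segment: bv may span physically contiguous pages */
    bio_for_each_bvec(bv, bio, iter)
        map_segment(page_to_phys(bv.bv_page) + bv.bv_offset, bv.bv_len);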

Cc: Dave Chinner 
Cc: Kent Overstreet 
Cc: Mike Snitzer 
Cc: dm-devel@redhat.com
Cc: Alexander Viro 
Cc: linux-fsde...@vger.kernel.org
Cc: Shaohua Li 
Cc: linux-r...@vger.kernel.org
Cc: linux-er...@lists.ozlabs.org
Cc: David Sterba 
Cc: linux-bt...@vger.kernel.org
Cc: Darrick J. Wong 
Cc: linux-...@vger.kernel.org
Cc: Gao Xiang 
Cc: Christoph Hellwig 
Cc: Theodore Ts'o 
Cc: linux-e...@vger.kernel.org
Cc: Coly Li 
Cc: linux-bca...@vger.kernel.org
Cc: Boaz Harrosh 
Cc: Bob Peterson 
Cc: cluster-de...@redhat.com
Signed-off-by: Ming Lei 
---
 Documentation/block/biovecs.txt | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/Documentation/block/biovecs.txt b/Documentation/block/biovecs.txt
index 25689584e6e0..bfafb70d0d9e 100644
--- a/Documentation/block/biovecs.txt
+++ b/Documentation/block/biovecs.txt
@@ -117,3 +117,29 @@ Other implications:
size limitations and the limitations of the underlying devices. Thus
there's no need to define ->merge_bvec_fn() callbacks for individual block
drivers.
+
+Usage of helpers:
+=
+
+* The following helpers, whose names have the "_all" suffix, can only be
+used on a non-BIO_CLONED bio; they are usually used by filesystem code,
+and drivers shouldn't use them because the bio may have been split before
+it reached the driver:
+
+   bio_for_each_segment_all()
+   bio_first_bvec_all()
+   bio_first_page_all()
+   bio_last_bvec_all()
+
+* The following helpers iterate over single-page bvecs; the local
+'struct bio_vec' variable or reference records a single-page IO
+vector during the iteration:
+
+   bio_for_each_segment()
+   bio_for_each_segment_all()
+
+* The following helper iterates over multi-page bvecs; each bvec may
+include multiple physically contiguous pages, and the local
+'struct bio_vec' variable or reference records a multi-page IO vector
+during the iteration:
+
+   bio_for_each_bvec()
-- 
2.9.5

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


[dm-devel] [PATCH V10 11/19] bcache: avoid to use bio_for_each_segment_all() in bch_bio_alloc_pages()

2018-11-15 Thread Ming Lei
bch_bio_alloc_pages() is always called on a new bio, so it is safe
to access the bvec table directly. Given this is the only case of its
kind, open-code the bvec table access, since bio_for_each_segment_all()
will be changed to iterate over multi-page bvecs.

Cc: Dave Chinner 
Cc: Kent Overstreet 
Acked-by: Coly Li 
Cc: Mike Snitzer 
Cc: dm-devel@redhat.com
Cc: Alexander Viro 
Cc: linux-fsde...@vger.kernel.org
Cc: Shaohua Li 
Cc: linux-r...@vger.kernel.org
Cc: linux-er...@lists.ozlabs.org
Cc: David Sterba 
Cc: linux-bt...@vger.kernel.org
Cc: Darrick J. Wong 
Cc: linux-...@vger.kernel.org
Cc: Gao Xiang 
Cc: Christoph Hellwig 
Cc: Theodore Ts'o 
Cc: linux-e...@vger.kernel.org
Cc: Coly Li 
Cc: linux-bca...@vger.kernel.org
Cc: Boaz Harrosh 
Cc: Bob Peterson 
Cc: cluster-de...@redhat.com
Signed-off-by: Ming Lei 
---
 drivers/md/bcache/util.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
index 20eddeac1531..8517aebcda2d 100644
--- a/drivers/md/bcache/util.c
+++ b/drivers/md/bcache/util.c
@@ -270,7 +270,7 @@ int bch_bio_alloc_pages(struct bio *bio, gfp_t gfp_mask)
int i;
struct bio_vec *bv;
 
-   bio_for_each_segment_all(bv, bio, i) {
+   for (i = 0, bv = bio->bi_io_vec; i < bio->bi_vcnt; bv++, i++) {
bv->bv_page = alloc_page(gfp_mask);
if (!bv->bv_page) {
while (--bv >= bio->bi_io_vec)
-- 
2.9.5

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


[dm-devel] [PATCH V10 08/19] btrfs: move bio_pages_all() to btrfs

2018-11-15 Thread Ming Lei
BTRFS is the only user of this helper, so move it into BTRFS and
implement it via bio_for_each_segment_all(), since bio->bi_vcnt may
not equal the number of pages once multi-page bvecs are enabled.

Cc: Dave Chinner 
Cc: Kent Overstreet 
Cc: Mike Snitzer 
Cc: dm-devel@redhat.com
Cc: Alexander Viro 
Cc: linux-fsde...@vger.kernel.org
Cc: Shaohua Li 
Cc: linux-r...@vger.kernel.org
Cc: linux-er...@lists.ozlabs.org
Cc: David Sterba 
Cc: linux-bt...@vger.kernel.org
Cc: Darrick J. Wong 
Cc: linux-...@vger.kernel.org
Cc: Gao Xiang 
Cc: Christoph Hellwig 
Cc: Theodore Ts'o 
Cc: linux-e...@vger.kernel.org
Cc: Coly Li 
Cc: linux-bca...@vger.kernel.org
Cc: Boaz Harrosh 
Cc: Bob Peterson 
Cc: cluster-de...@redhat.com
Signed-off-by: Ming Lei 
---
 fs/btrfs/extent_io.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 5d5965297e7e..874bb9aeebdc 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2348,6 +2348,18 @@ struct bio *btrfs_create_repair_bio(struct inode *inode, struct bio *failed_bio,
return bio;
 }
 
+static unsigned btrfs_bio_pages_all(struct bio *bio)
+{
+   unsigned i;
+   struct bio_vec *bv;
+
+   WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED));
+
+   bio_for_each_segment_all(bv, bio, i)
+   ;
+   return i;
+}
+
 /*
  * this is a generic handler for readpage errors (default
  * readpage_io_failed_hook). if other copies exist, read those and write back
@@ -2368,7 +2380,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
int read_mode = 0;
blk_status_t status;
int ret;
-   unsigned failed_bio_pages = bio_pages_all(failed_bio);
+   unsigned failed_bio_pages = btrfs_bio_pages_all(failed_bio);
 
BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
 
-- 
2.9.5

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


[dm-devel] [PATCH V10 01/19] block: introduce multi-page page bvec helpers

2018-11-15 Thread Ming Lei
This patch introduces 'mp_bvec_iter_*' helpers for multi-page
bvec support.

The introduced helpers treat one bvec as a real multi-page segment,
which may include more than one page.

The existing bvec_iter_* helpers are the interfaces behind the current
bvec iterator, which drivers, filesystems, dm and others assume to be
single-page; they now build single-page bvecs in flight, so this
approach won't break current bio/bvec users, which need no changes.
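
To see how the single-page view is derived from one stored multi-page
bvec, here is a small user-space model of the arithmetic behind the
macros below (4 KiB pages assumed, nothing kernel-specific):

    #include <stdio.h>

    #define PAGE_SIZE 4096u

    int main(void)
    {
        /* one mp bvec: starts 512 bytes into its first page and is
         * 10000 bytes long, so it spans three contiguous pages */
        unsigned bv_offset = 512, bv_len = 10000;
        unsigned done = 0;              /* models iter.bi_bvec_done */

        while (done < bv_len) {
            unsigned mp_off = bv_offset + done;     /* mp_bvec_iter_offset */
            unsigned page_idx = mp_off / PAGE_SIZE; /* mp_bvec_iter_page_idx */
            unsigned sp_off = mp_off % PAGE_SIZE;   /* bvec_iter_offset */
            unsigned sp_len = bv_len - done;

            if (sp_len > PAGE_SIZE - sp_off)        /* bvec_iter_len */
                sp_len = PAGE_SIZE - sp_off;

            printf("sp bvec: page +%u, offset %u, len %u\n",
                   page_idx, sp_off, sp_len);
            done += sp_len;
        }
        return 0;
    }

It prints three single-page bvecs, (+0, 512, 3584), (+1, 0, 4096) and
(+2, 0, 2320), which together cover the 10000-byte multi-page bvec.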

Cc: Dave Chinner 
Cc: Kent Overstreet 
Cc: Mike Snitzer 
Cc: dm-devel@redhat.com
Cc: Alexander Viro 
Cc: linux-fsde...@vger.kernel.org
Cc: Shaohua Li 
Cc: linux-r...@vger.kernel.org
Cc: linux-er...@lists.ozlabs.org
Cc: David Sterba 
Cc: linux-bt...@vger.kernel.org
Cc: Darrick J. Wong 
Cc: linux-...@vger.kernel.org
Cc: Gao Xiang 
Cc: Christoph Hellwig 
Cc: Theodore Ts'o 
Cc: linux-e...@vger.kernel.org
Cc: Coly Li 
Cc: linux-bca...@vger.kernel.org
Cc: Boaz Harrosh 
Cc: Bob Peterson 
Cc: cluster-de...@redhat.com
Signed-off-by: Ming Lei 
---
 include/linux/bvec.h | 63 +---
 1 file changed, 60 insertions(+), 3 deletions(-)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 02c73c6aa805..8ef904a50577 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -23,6 +23,44 @@
 #include 
 #include 
 #include 
+#include 
+
+/*
+ * What are multi-page bvecs?
+ *
+ * - bvecs stored in bio->bi_io_vec are always multi-page (mp) style
+ *
+ * - a bvec (struct bio_vec) represents one physically contiguous I/O
+ *   buffer; the buffer may now include more than one page after
+ *   multi-page (mp) bvecs are supported, and all the pages represented
+ *   by one bvec are physically contiguous. Before mp support, at most
+ *   one page was included in one bvec; we call that a single-page (sp)
+ *   bvec.
+ *
+ * - .bv_page of the bvec represents the 1st page in the mp bvec
+ *
+ * - .bv_offset of the bvec represents the offset of the buffer in the bvec
+ *
+ * The effect on current drivers/filesystems/dm/bcache/...:
+ *
+ * - almost everyone assumes that one bvec includes only a single
+ *   page, so we keep the sp interfaces unchanged; for example,
+ *   bio_for_each_segment() still returns bvecs with a single page
+ *
+ * - bio_for_each_segment*() will be changed to return single-page
+ *   bvecs too
+ *
+ * - during iteration, the iterator variable (struct bvec_iter) is
+ *   always updated in multi-page bvec style, which means
+ *   bvec_iter_advance() is left unchanged
+ *
+ * - the returned (copied) single-page bvec is built in flight by the
+ *   bvec helpers from the stored multi-page bvec
+ *
+ * - in case some components (such as iov_iter) need to support
+ *   multi-page bvecs, we introduce new helpers (mp_bvec_iter_*) for
+ *   them.
+ */
 
 /*
  * was unsigned short, but we might as well be ready for > 64kB I/O pages
@@ -50,16 +88,35 @@ struct bvec_iter {
  */
 #define __bvec_iter_bvec(bvec, iter)   (&(bvec)[(iter).bi_idx])
 
-#define bvec_iter_page(bvec, iter) \
+#define mp_bvec_iter_page(bvec, iter)  \
(__bvec_iter_bvec((bvec), (iter))->bv_page)
 
-#define bvec_iter_len(bvec, iter)  \
+#define mp_bvec_iter_len(bvec, iter)   \
min((iter).bi_size, \
__bvec_iter_bvec((bvec), (iter))->bv_len - (iter).bi_bvec_done)
 
-#define bvec_iter_offset(bvec, iter)   \
+#define mp_bvec_iter_offset(bvec, iter)\
(__bvec_iter_bvec((bvec), (iter))->bv_offset + (iter).bi_bvec_done)
 
+#define mp_bvec_iter_page_idx(bvec, iter)  \
+   (mp_bvec_iter_offset((bvec), (iter)) / PAGE_SIZE)
+
+/*
+ *  of single-page(sp) segment.
+ *
+ * This helpers are for building sp bvec in flight.
+ */
+#define bvec_iter_offset(bvec, iter)   \
+   (mp_bvec_iter_offset((bvec), (iter)) % PAGE_SIZE)
+
+#define bvec_iter_len(bvec, iter)  \
+   min_t(unsigned, mp_bvec_iter_len((bvec), (iter)),   \
+   (PAGE_SIZE - (bvec_iter_offset((bvec), (iter)))))
+
+#define bvec_iter_page(bvec, iter) \
+   nth_page(mp_bvec_iter_page((bvec), (iter)), \
+mp_bvec_iter_page_idx((bvec), (iter)))
+
 #define bvec_iter_bvec(bvec, iter) \
 ((struct bio_vec) {\
.bv_page= bvec_iter_page((bvec), (iter)),   \
-- 
2.9.5

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


[dm-devel] [PATCH V10 02/19] block: introduce bio_for_each_bvec()

2018-11-15 Thread Ming Lei
This helper is used by the bio split & merge code to iterate over
multi-page bvecs.
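
A minimal usage sketch (hypothetical function; the real users are the
bio split and merge paths later in the series):

    static unsigned count_multipage_segments(struct bio *bio)
    {
        struct bio_vec bv;
        struct bvec_iter iter;
        unsigned segs = 0;

        /* one iteration per multi-page bvec; unlike with
         * bio_for_each_segment(), bv.bv_len may exceed PAGE_SIZE */
        bio_for_each_bvec(bv, bio, iter)
            segs++;

        return segs;
    }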

Cc: Dave Chinner 
Cc: Kent Overstreet 
Cc: Mike Snitzer 
Cc: dm-devel@redhat.com
Cc: Alexander Viro 
Cc: linux-fsde...@vger.kernel.org
Cc: Shaohua Li 
Cc: linux-r...@vger.kernel.org
Cc: linux-er...@lists.ozlabs.org
Cc: David Sterba 
Cc: linux-bt...@vger.kernel.org
Cc: Darrick J. Wong 
Cc: linux-...@vger.kernel.org
Cc: Gao Xiang 
Cc: Christoph Hellwig 
Cc: Theodore Ts'o 
Cc: linux-e...@vger.kernel.org
Cc: Coly Li 
Cc: linux-bca...@vger.kernel.org
Cc: Boaz Harrosh 
Cc: Bob Peterson 
Cc: cluster-de...@redhat.com
Signed-off-by: Ming Lei 
---
 include/linux/bio.h  | 34 +++---
 include/linux/bvec.h | 36 
 2 files changed, 63 insertions(+), 7 deletions(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 056fb627edb3..1f0dcf109841 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -76,6 +76,9 @@
 #define bio_data_dir(bio) \
(op_is_write(bio_op(bio)) ? WRITE : READ)
 
+#define bio_iter_mp_iovec(bio, iter)   \
+   mp_bvec_iter_bvec((bio)->bi_io_vec, (iter))
+
 /*
  * Check whether this bio carries any data or not. A NULL bio is allowed.
  */
@@ -135,18 +138,33 @@ static inline bool bio_full(struct bio *bio)
 #define bio_for_each_segment_all(bvl, bio, i)  \
for (i = 0, bvl = (bio)->bi_io_vec; i < (bio)->bi_vcnt; i++, bvl++)
 
-static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
-   unsigned bytes)
+static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
+ unsigned bytes, bool mp)
 {
iter->bi_sector += bytes >> 9;
 
if (bio_no_advance_iter(bio))
iter->bi_size -= bytes;
else
-   bvec_iter_advance(bio->bi_io_vec, iter, bytes);
+   if (!mp)
+   bvec_iter_advance(bio->bi_io_vec, iter, bytes);
+   else
+   mp_bvec_iter_advance(bio->bi_io_vec, iter, bytes);
/* TODO: It is reasonable to complete bio with error here. */
 }
 
+static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
+   unsigned bytes)
+{
+   __bio_advance_iter(bio, iter, bytes, false);
+}
+
+static inline void bio_advance_mp_iter(struct bio *bio, struct bvec_iter *iter,
+  unsigned bytes)
+{
+   __bio_advance_iter(bio, iter, bytes, true);
+}
+
 #define __bio_for_each_segment(bvl, bio, iter, start)  \
for (iter = (start);\
 (iter).bi_size &&  \
@@ -156,6 +174,16 @@ static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
 #define bio_for_each_segment(bvl, bio, iter)   \
__bio_for_each_segment(bvl, bio, iter, (bio)->bi_iter)
 
+#define __bio_for_each_bvec(bvl, bio, iter, start) \
+   for (iter = (start);\
+(iter).bi_size &&  \
+   ((bvl = bio_iter_mp_iovec((bio), (iter))), 1);  \
+bio_advance_mp_iter((bio), &(iter), (bvl).bv_len))
+
+/* returns one real segment(multipage bvec) each time */
+#define bio_for_each_bvec(bvl, bio, iter)  \
+   __bio_for_each_bvec(bvl, bio, iter, (bio)->bi_iter)
+
 #define bio_iter_last(bvec, iter) ((iter).bi_size == (bvec).bv_len)
 
 static inline unsigned bio_segments(struct bio *bio)
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 8ef904a50577..3d61352cd8cf 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -124,8 +124,16 @@ struct bvec_iter {
.bv_offset  = bvec_iter_offset((bvec), (iter)), \
 })
 
-static inline bool bvec_iter_advance(const struct bio_vec *bv,
-   struct bvec_iter *iter, unsigned bytes)
+#define mp_bvec_iter_bvec(bvec, iter)  \
+((struct bio_vec) {\
+   .bv_page= mp_bvec_iter_page((bvec), (iter)),\
+   .bv_len = mp_bvec_iter_len((bvec), (iter)), \
+   .bv_offset  = mp_bvec_iter_offset((bvec), (iter)),  \
+})
+
+static inline bool __bvec_iter_advance(const struct bio_vec *bv,
+  struct bvec_iter *iter,
+  unsigned bytes, bool mp)
 {
if (WARN_ONCE(bytes > iter->bi_size,
 "Attempted to advance past end of bvec iter\n")) {
@@ -134,8 +142,14 @@ static inline bool bvec_iter_advance(const struct bio_vec *bv,
}
 
while (bytes) {
-   unsigned iter_len = bvec_iter_len(bv, *iter);
-   unsigned len = min(bytes, iter_len);

[dm-devel] [PATCH V10 05/19] block: introduce bvec_last_segment()

2018-11-15 Thread Ming Lei
BTRFS and guard_bio_eod() need to get the last single-page segment
from a multi-page bvec, so introduce this helper to make them happy.
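
The arithmetic is easiest to see with numbers; a small user-space model
of the helper below (4 KiB pages assumed):

    #include <stdio.h>

    #define PAGE_SIZE 4096u

    int main(void)
    {
        /* mp bvec with bv_offset = 512, bv_len = 10000 */
        unsigned bv_offset = 512, bv_len = 10000;
        unsigned total = bv_offset + bv_len;    /* 10512 */
        unsigned last_page = total / PAGE_SIZE; /* page index 2 */

        if (last_page * PAGE_SIZE == total)
            last_page--;    /* bvec ends exactly on a page boundary */

        if (bv_offset >= last_page * PAGE_SIZE)
            /* the whole bvec sits inside its last page */
            printf("page +%u, offset %u, len %u\n", last_page,
                   bv_offset % PAGE_SIZE, bv_len);
        else
            printf("page +%u, offset 0, len %u\n", last_page,
                   total - last_page * PAGE_SIZE);
        return 0;
    }

This prints "page +2, offset 0, len 2320", the last single-page segment
of that bvec.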

Cc: Dave Chinner 
Cc: Kent Overstreet 
Cc: Mike Snitzer 
Cc: dm-devel@redhat.com
Cc: Alexander Viro 
Cc: linux-fsde...@vger.kernel.org
Cc: Shaohua Li 
Cc: linux-r...@vger.kernel.org
Cc: linux-er...@lists.ozlabs.org
Cc: David Sterba 
Cc: linux-bt...@vger.kernel.org
Cc: Darrick J. Wong 
Cc: linux-...@vger.kernel.org
Cc: Gao Xiang 
Cc: Christoph Hellwig 
Cc: Theodore Ts'o 
Cc: linux-e...@vger.kernel.org
Cc: Coly Li 
Cc: linux-bca...@vger.kernel.org
Cc: Boaz Harrosh 
Cc: Bob Peterson 
Cc: cluster-de...@redhat.com
Signed-off-by: Ming Lei 
---
 include/linux/bvec.h | 25 +
 1 file changed, 25 insertions(+)

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 3d61352cd8cf..01616a0b6220 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -216,4 +216,29 @@ static inline bool mp_bvec_iter_advance(const struct bio_vec *bv,
.bi_bvec_done   = 0,\
 }
 
+/*
+ * Get the last singlepage segment from the multipage bvec and store it
+ * in @seg
+ */
+static inline void bvec_last_segment(const struct bio_vec *bvec,
+   struct bio_vec *seg)
+{
+   unsigned total = bvec->bv_offset + bvec->bv_len;
+   unsigned last_page = total / PAGE_SIZE;
+
+   if (last_page * PAGE_SIZE == total)
+   last_page--;
+
+   seg->bv_page = nth_page(bvec->bv_page, last_page);
+
+   /* the whole segment is inside the last page */
+   if (bvec->bv_offset >= last_page * PAGE_SIZE) {
+   seg->bv_offset = bvec->bv_offset % PAGE_SIZE;
+   seg->bv_len = bvec->bv_len;
+   } else {
+   seg->bv_offset = 0;
+   seg->bv_len = total - last_page * PAGE_SIZE;
+   }
+}
+
 #endif /* __LINUX_BVEC_ITER_H */
-- 
2.9.5

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel