date:20170531

Re: [PATCH v2 5/6] cdrom: Check SCSI passthrough support before reading audio

2017-05-31 Thread Hannes Reinecke

On 05/31/2017 11:43 PM, Bart Van Assche wrote:
> The CDROMREADAUDIO ioctl uses SCSI passthrough when the .disk
> pointer has been set in struct cdrom_device_info. Hence check
> whether SCSI passthrough is supported before submitting a SCSI
> command. Note: both the ide-cd and sr drivers set the disk
> pointer in struct cdrom_device_info but neither the pcd nor
> the gdrom driver sets that pointer.
> 
> References: commit 82ed4db499b8 ("block: split scsi_request out of struct 
> request")
> Signed-off-by: Bart Van Assche 
> Cc: Christoph Hellwig 
> Cc: Hannes Reinecke 
> Cc: Omar Sandoval 
> Cc: linux-block@vger.kernel.org
> ---
>  drivers/cdrom/cdrom.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/cdrom/cdrom.c b/drivers/cdrom/cdrom.c
> index 76c952fd9ab9..ff19cfc587f0 100644
> --- a/drivers/cdrom/cdrom.c
> +++ b/drivers/cdrom/cdrom.c
> @@ -2178,6 +2178,12 @@ static int cdrom_read_cdda_bpc(struct 
> cdrom_device_info *cdi, __u8 __user *ubuf,
>   if (!q)
>   return -ENXIO;
>  
> + if (!blk_queue_scsi_passthrough(q)) {
> + WARN_ONCE(true,
> +   "Attempt read CDDA info through a non-SCSI queue\n");
> + return -EINVAL;
> + }
> +
>   cdi->last_sense = 0;
>  
>   while (nframes) {
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
h...@suse.de                   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
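
The guard added by this patch can be sketched in user space as follows. This is only a model under stated assumptions: struct model_queue and model_read_cdda() are hypothetical names, and the boolean field stands in for what blk_queue_scsi_passthrough(q) reports in the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct request_queue; not the kernel type. */
struct model_queue {
	bool scsi_passthrough;	/* models blk_queue_scsi_passthrough(q) */
};

/*
 * Models the start of cdrom_read_cdda_bpc(): refuse to build a SCSI
 * command unless the queue advertises passthrough support.
 */
static int model_read_cdda(const struct model_queue *q, int nframes)
{
	if (!q)
		return -ENXIO;
	if (!q->scsi_passthrough)
		return -EINVAL;	/* the patch also emits WARN_ONCE() here */
	assert(nframes >= 0);
	return nframes;		/* pretend every frame was read */
}
```

The point of the ordering is that the capability check runs before any per-request state is touched, so a queue without passthrough support fails early with -EINVAL.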

Re: [PATCH v2 4/4] blk-mq-debugfs: Add 'kick' operation

2017-05-31 Thread Hannes Reinecke

On 05/31/2017 11:30 PM, Bart Van Assche wrote:
> Running a queue causes the block layer to examine the per-CPU and
> hw queues but not the requeue list. Hence add a 'kick' operation
> that also examines the requeue list.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Christoph Hellwig 
> Cc: Hannes Reinecke 
> Cc: Omar Sandoval 
> Cc: Ming Lei 
> ---
>  block/blk-mq-debugfs.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index fa0f624dfccd..962c8417809d 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -114,10 +114,12 @@ static ssize_t queue_state_write(void *data, const char 
> __user *buf,
>   blk_mq_run_hw_queues(q, true);
>   } else if (strcmp(op, "start") == 0) {
>   blk_mq_start_stopped_hw_queues(q, true);
> + } else if (strcmp(op, "kick") == 0) {
> + blk_mq_kick_requeue_list(q);
>   } else {
>   pr_err("%s: unsupported operation '%s'\n", __func__, op);
>  inval:
> - pr_err("%s: use either 'run' or 'start'\n", __func__);
> + pr_err("%s: use 'run', 'start' or 'kick'\n", __func__);
>   return -EINVAL;
>   }
>   return count;
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes

Re: [PATCH v2 3/4] blk-mq-debugfs: Show busy requests

2017-05-31 Thread Hannes Reinecke

On 05/31/2017 11:30 PM, Bart Van Assche wrote:
> Requests that got stuck in a block driver are neither on
> blk_mq_ctx.rq_list nor on any hw dispatch queue. Make these
> visible in debugfs through the "busy" attribute.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Christoph Hellwig 
> Cc: Hannes Reinecke 
> Cc: Omar Sandoval 
> Cc: Ming Lei 
> ---
>  block/blk-mq-debugfs.c | 26 ++
>  1 file changed, 26 insertions(+)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes

blk-mq: free callback if error occurs in blk_mq_init_allocated_queue

2017-05-31 Thread Joseph Qi

From: Joseph Qi 

If an error occurs, we have to free the allocated callback in
blk_mq_init_allocated_queue() to avoid leaking memory.

Fixes: 34dbad5d26e2 ("blk-stat: convert to callback-based statistics reporting")
Signed-off-by: Joseph Qi 
---
 block/blk-mq.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f2224ffd..50b5d7b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2273,7 +2273,7 @@ struct request_queue *blk_mq_init_allocated_queue(struct 
blk_mq_tag_set *set,
 
q->queue_ctx = alloc_percpu(struct blk_mq_ctx);
if (!q->queue_ctx)
-   goto err_exit;
+   goto err_poll_cb;
 
/* init q->mq_kobj and sw queues' kobjects */
blk_mq_sysfs_init(q);
@@ -2346,6 +2346,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct 
blk_mq_tag_set *set,
kfree(q->queue_hw_ctx);
 err_percpu:
free_percpu(q->queue_ctx);
+err_poll_cb:
+   blk_stat_free_callback(q->poll_cb);
 err_exit:
q->mq_ops = NULL;
return ERR_PTR(-ENOMEM);
-- 
1.8.3.1
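
The unwinding pattern this fix relies on can be sketched in user space as follows. It is a minimal model, not the kernel function: struct model_queue, model_alloc(), model_init() and the fail_step test hook are all hypothetical names, and malloc() stands in for the kernel allocators.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical object with two allocations, standing in for
 * q->poll_cb and q->queue_ctx. */
struct model_queue {
	void *poll_cb;
	void *queue_ctx;
};

/* Test hook: which allocation step should fail (0 = none). */
static int fail_step;

static void *model_alloc(int step)
{
	return step == fail_step ? NULL : malloc(16);
}

/* Returns 0 on success, -1 on failure; on failure nothing stays allocated. */
static int model_init(struct model_queue *q)
{
	assert(q != NULL);

	q->poll_cb = model_alloc(1);
	if (!q->poll_cb)
		goto err_exit;

	q->queue_ctx = model_alloc(2);
	if (!q->queue_ctx)
		goto err_poll_cb;	/* poll_cb exists and must be released */

	return 0;

err_poll_cb:
	free(q->poll_cb);
	q->poll_cb = NULL;
err_exit:
	return -1;
}
```

Each label releases exactly what had been acquired before the jump, in reverse order of acquisition, which is why the failing percpu allocation has to jump to the new label rather than to err_exit.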

Re: [PATCH v2 4/4] blk-mq-debugfs: Add 'kick' operation

2017-05-31 Thread Ming Lei

On Wed, May 31, 2017 at 02:30:50PM -0700, Bart Van Assche wrote:
> Running a queue causes the block layer to examine the per-CPU and
> hw queues but not the requeue list. Hence add a 'kick' operation
> that also examines the requeue list.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Christoph Hellwig 
> Cc: Hannes Reinecke 
> Cc: Omar Sandoval 
> Cc: Ming Lei 
> ---
>  block/blk-mq-debugfs.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index fa0f624dfccd..962c8417809d 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -114,10 +114,12 @@ static ssize_t queue_state_write(void *data, const char 
> __user *buf,
>   blk_mq_run_hw_queues(q, true);
>   } else if (strcmp(op, "start") == 0) {
>   blk_mq_start_stopped_hw_queues(q, true);
> + } else if (strcmp(op, "kick") == 0) {
> + blk_mq_kick_requeue_list(q);
>   } else {
>   pr_err("%s: unsupported operation '%s'\n", __func__, op);
>  inval:
> - pr_err("%s: use either 'run' or 'start'\n", __func__);
> + pr_err("%s: use 'run', 'start' or 'kick'\n", __func__);
>   return -EINVAL;
>   }
>   return count;
> -- 
> 2.12.2

Reviewed-by: Ming Lei 

Thanks,
Ming
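
The string dispatch that the 'kick' operation extends can be sketched in user space as follows. This is an assumption-laden model: enum model_op and model_parse_op() are hypothetical, and the way the word is cut at whitespace is my simplification, not necessarily how queue_state_write() prepares its buffer.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

enum model_op { OP_NONE, OP_RUN, OP_START, OP_KICK };

/* Map a written string to an operation; returns 0 or -EINVAL. */
static int model_parse_op(const char *buf, enum model_op *op)
{
	char word[16];
	size_t len;

	assert(buf != NULL && op != NULL);
	len = strcspn(buf, " \t\n");	/* ignore a trailing newline from echo */
	if (len == 0 || len >= sizeof(word))
		return -EINVAL;
	memcpy(word, buf, len);
	word[len] = '\0';

	if (strcmp(word, "run") == 0)
		*op = OP_RUN;
	else if (strcmp(word, "start") == 0)
		*op = OP_START;
	else if (strcmp(word, "kick") == 0)
		*op = OP_KICK;
	else
		return -EINVAL;	/* the patch reports the accepted words here */
	return 0;
}
```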

Re: [PATCH v2 3/4] blk-mq-debugfs: Show busy requests

2017-05-31 Thread Ming Lei

On Wed, May 31, 2017 at 02:30:49PM -0700, Bart Van Assche wrote:
> Requests that got stuck in a block driver are neither on
> blk_mq_ctx.rq_list nor on any hw dispatch queue. Make these
> visible in debugfs through the "busy" attribute.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Christoph Hellwig 
> Cc: Hannes Reinecke 
> Cc: Omar Sandoval 
> Cc: Ming Lei 
> ---
>  block/blk-mq-debugfs.c | 26 ++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 8b06a12c1461..fa0f624dfccd 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -370,6 +370,31 @@ static const struct seq_operations hctx_dispatch_seq_ops 
> = {
>   .show   = blk_mq_debugfs_rq_show,
>  };
>  
> +struct show_busy_params {
> + struct seq_file *m;
> + struct blk_mq_hw_ctx*hctx;
> +};
> +
> +static void hctx_show_busy(struct request *rq, void *data, bool reserved)
> +{
> + const struct show_busy_params *params = data;
> +
> + if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx &&
> + test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
> + __blk_mq_debugfs_rq_show(params->m,
> +  list_entry_rq(&rq->queuelist));
> +}

Unlike the requests dumped from the ctx and requeue lists, the requests
dumped here may already have been released, so the result may not be 100%
reliable. I suggest adding a comment that states this.

Otherwise, looks fine to me.

Thanks,
Ming

Re: [PATCH v2 2/4] blk-mq-debugfs: Show requeue list

2017-05-31 Thread Ming Lei

On Wed, May 31, 2017 at 02:30:48PM -0700, Bart Van Assche wrote:
> When verifying whether or not a blk-mq driver forgot to kick the
> requeue list after having requeued a request it is important to
> be able to verify the contents of the requeue list. Hence export
> that list through debugfs.
> 
> Signed-off-by: Bart Van Assche 
> Reviewed-by: Hannes Reinecke 
> Cc: Christoph Hellwig 
> Cc: Omar Sandoval 
> Cc: Ming Lei 
> ---
>  block/blk-mq-debugfs.c | 32 
>  1 file changed, 32 insertions(+)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index d56ddd7a1285..8b06a12c1461 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -308,6 +308,37 @@ int blk_mq_debugfs_rq_show(struct seq_file *m, void *v)
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_debugfs_rq_show);
>  
> +static void *queue_requeue_list_start(struct seq_file *m, loff_t *pos)
> + __acquires(&q->requeue_lock)
> +{
> + struct request_queue *q = m->private;
> +
> + spin_lock_irq(&q->requeue_lock);
> + return seq_list_start(&q->requeue_list, *pos);
> +}
> +
> +static void *queue_requeue_list_next(struct seq_file *m, void *v, loff_t 
> *pos)
> +{
> + struct request_queue *q = m->private;
> +
> + return seq_list_next(v, &q->requeue_list, pos);
> +}
> +
> +static void queue_requeue_list_stop(struct seq_file *m, void *v)
> + __releases(&q->requeue_lock)
> +{
> + struct request_queue *q = m->private;
> +
> + spin_unlock_irq(&q->requeue_lock);
> +}
> +
> +static const struct seq_operations queue_requeue_list_seq_ops = {
> + .start  = queue_requeue_list_start,
> + .next   = queue_requeue_list_next,
> + .stop   = queue_requeue_list_stop,
> + .show   = blk_mq_debugfs_rq_show,
> +};
> +
>  static void *hctx_dispatch_start(struct seq_file *m, loff_t *pos)
>   __acquires(&hctx->lock)
>  {
> @@ -665,6 +696,7 @@ const struct file_operations blk_mq_debugfs_fops = {
>  
>  static const struct blk_mq_debugfs_attr blk_mq_debugfs_queue_attrs[] = {
>   {"poll_stat", 0400, queue_poll_stat_show},
> + {"requeue_list", 0400, .seq_ops = &queue_requeue_list_seq_ops},
>   {"state", 0600, queue_state_show, queue_state_write},
>   {},
>  };
> -- 
> 2.12.2
> 

Reviewed-by: Ming Lei 

thanks,
Ming

Re: [PATCH v2 1/4] blk-mq-debugfs: Show atomic request flags

2017-05-31 Thread Ming Lei

On Wed, May 31, 2017 at 02:30:47PM -0700, Bart Van Assche wrote:
> When analyzing e.g. queue lockups it is important to know whether
> or not a request has already been started. Hence also show the
> atomic request flags.
> 
> Signed-off-by: Bart Van Assche 
> Reviewed-by: Hannes Reinecke 
> Cc: Omar Sandoval 
> Cc: Christoph Hellwig 
> Cc: Ming Lei 
> ---
>  block/blk-mq-debugfs.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 803aed4d7221..d56ddd7a1285 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -267,6 +267,14 @@ static const char *const rqf_name[] = {
>  };
>  #undef RQF_NAME
>  
> +#define RQAF_NAME(name) [REQ_ATOM_##name] = #name
> +static const char *const rqaf_name[] = {
> + RQAF_NAME(COMPLETE),
> + RQAF_NAME(STARTED),
> + RQAF_NAME(POLL_SLEPT),
> +};
> +#undef RQAF_NAME
> +
>  int __blk_mq_debugfs_rq_show(struct seq_file *m, struct request *rq)
>  {
>   const struct blk_mq_ops *const mq_ops = rq->q->mq_ops;
> @@ -283,6 +291,8 @@ int __blk_mq_debugfs_rq_show(struct seq_file *m, struct 
> request *rq)
>   seq_puts(m, ", .rq_flags=");
>   blk_flags_show(m, (__force unsigned int)rq->rq_flags, rqf_name,
>  ARRAY_SIZE(rqf_name));
> + seq_puts(m, ", .atomic_flags=");
> + blk_flags_show(m, rq->atomic_flags, rqaf_name, ARRAY_SIZE(rqaf_name));
>   seq_printf(m, ", .tag=%d, .internal_tag=%d", rq->tag,
>  rq->internal_tag);
>   if (mq_ops->show_rq)

Reviewed-by: Ming Lei 

Thanks,
Ming
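
The name-table technique used by this patch can be sketched in user space as follows. The enum values and flags_to_string() are hypothetical stand-ins: the kernel uses the REQ_ATOM_* constants and its own blk_flags_show() helper, whose exact output format is not reproduced here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical bit numbers mirroring the three flags named in the patch. */
enum { ATOM_COMPLETE, ATOM_STARTED, ATOM_POLL_SLEPT };

/* Designated initializers keep each name at the index of its bit number. */
#define RQAF_NAME(name) [ATOM_##name] = #name
static const char *const rqaf_name[] = {
	RQAF_NAME(COMPLETE),
	RQAF_NAME(STARTED),
	RQAF_NAME(POLL_SLEPT),
};
#undef RQAF_NAME

#define NAME_COUNT (sizeof(rqaf_name) / sizeof(rqaf_name[0]))

/* Write the names of all set bits into buf, separated by '|'. */
static void flags_to_string(unsigned long flags, char *buf, size_t len)
{
	size_t used = 0;

	assert(len > 0);
	buf[0] = '\0';
	for (size_t i = 0; i < 8 * sizeof(flags); i++) {
		if (!(flags & (1UL << i)))
			continue;
		if (i < NAME_COUNT && rqaf_name[i])
			used += snprintf(buf + used, len - used, "%s%s",
					 used ? "|" : "", rqaf_name[i]);
		else	/* unnamed bit: fall back to its number */
			used += snprintf(buf + used, len - used, "%s%zu",
					 used ? "|" : "", i);
		if (used >= len)
			break;	/* output was truncated */
	}
}
```

Indexing the table by bit number means a newly added flag only needs one new RQAF_NAME() line, and an unnamed bit still shows up as a number instead of being silently dropped.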

Re: [PATCH v3 5/9] blk-mq: fix blk_mq_quiesce_queue

2017-05-31 Thread Ming Lei

On Wed, May 31, 2017 at 03:37:30PM +, Bart Van Assche wrote:
> On Wed, 2017-05-31 at 20:37 +0800, Ming Lei wrote:
> > 
> > +   /* wait until queue is unquiesced */
> > +   wait_event_cmd(q->quiesce_wq, !blk_queue_quiesced(q),
> > +   may_sleep ?
> > +   srcu_read_unlock(&hctx->queue_rq_srcu, *srcu_idx) :
> > +   rcu_read_unlock(),
> > +   may_sleep ?
> > +   *srcu_idx = srcu_read_lock(&hctx->queue_rq_srcu) :
> > +   rcu_read_lock());
> > +
> > if (q->elevator)
> > goto insert;
> 
> What I see is that in this patch a new waitqueue has been introduced
> (quiesce_wq) and also that an explanation of why you think this new waitqueue
> is needed is missing completely. Why is it that you think that the
> synchronize_srcu() and synchronize_rcu() calls in blk_mq_quiesce_queue() are
> not sufficient? If this new waitqueue is not needed, please remove that
> waitqueue again.

OK, the reason is simple, and it only concerns direct issue.

In that case, when the queue is quiesced, we have to do one of the following:

- insert the current request into the sw queue (scheduler queue)
OR
- wait until the queue becomes unquiesced, which is what this patch does

The disadvantage of the 1st way is that blk_mq_unquiesce_queue() then has to
run the queue again for the requests that were queued while quiescing.

For the 2nd way (what this patch does), one benefit is that applications
avoid submitting I/O to a quiesced queue. Another benefit is that we needn't
run the queue in blk_mq_unquiesce_queue(). The cost is one waitqueue, which
should be cheap, but if you insist on the 1st approach, I am fine with
changing to that.

Thanks,
Ming
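
The drawback of the first approach can be sketched with a single-threaded user-space model. Everything here is hypothetical (struct model_queue and the model_* helpers are not kernel API); the only thing it illustrates is that requests parked while the queue is quiesced stay parked unless unquiescing also reruns the queue.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-threaded model; none of this is kernel API. */
struct model_queue {
	bool quiesced;
	int queued;	/* requests parked in the software queue */
	int dispatched;	/* requests handed to the driver */
};

/* Dispatch everything that is parked, unless the queue is quiesced. */
static void model_run_queue(struct model_queue *q)
{
	if (q->quiesced)
		return;
	q->dispatched += q->queued;
	q->queued = 0;
}

/* First approach: park the request instead of issuing it directly. */
static void model_submit(struct model_queue *q)
{
	assert(q->queued >= 0);
	q->queued++;
	model_run_queue(q);
}

/* Unquiesce; rerun controls whether parked requests are picked up. */
static void model_unquiesce(struct model_queue *q, bool rerun)
{
	q->quiesced = false;
	if (rerun)
		model_run_queue(q);
}
```

With rerun set to false the two parked requests are never dispatched, which is the extra work in the unquiesce path that the reply describes. The waitqueue variant avoids it by not parking anything in the first place.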

Re: [PATCH v3 3/9] blk-mq: use the introduced blk_mq_unquiesce_queue()

2017-05-31 Thread Ming Lei

On Wed, May 31, 2017 at 03:21:41PM +, Bart Van Assche wrote:
> On Wed, 2017-05-31 at 20:37 +0800, Ming Lei wrote:
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > index 99e16ac479e3..ffcf05765e2b 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -3031,7 +3031,10 @@ scsi_internal_device_unblock(struct scsi_device 
> > *sdev,
> > return -EINVAL;
> >  
> > if (q->mq_ops) {
> > -   blk_mq_start_stopped_hw_queues(q, false);
> > +   if (blk_queue_quiesced(q))
> > +   blk_mq_unquiesce_queue(q);
> > +   else
> > +   blk_mq_start_stopped_hw_queues(q, false);
> > } else {
> > spin_lock_irqsave(q->queue_lock, flags);
> > blk_start_queue(q);
> 
> As I commented on v2, this change is really wrong. All that's needed here is
> a call to blk_mq_unquiesce_queue() and nothing else. Adding a call to
> blk_mq_start_stopped_hw_queues() is wrong because it makes it impossible to
> use the STOPPED flag in the SCSI core to make the block layer core stop
> calling .queue_rq() if a SCSI LLD returns "busy".

I am not sure I understand your point. Could you explain a bit why it is
wrong?

Let's look at scsi_internal_device_block():

if (q->mq_ops) {
if (wait)
blk_mq_quiesce_queue(q);
else
blk_mq_stop_hw_queues(q);
}

So the queue is quiesced if 'wait' is true, or stopped if 'wait' is false.

This patch just makes the two SCSI APIs symmetrical.

Since blk_mq_quiesce_queue() will no longer stop the queue later on,
I have to unquiesce a queue only if it is quiesced.

So suppose the queue was stopped in scsi_internal_device_block(): is it
expected that scsi_internal_device_unblock() does not restart it via
blk_mq_unquiesce_queue()?

Thanks,
Ming
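
The symmetry argued for in this reply can be sketched with a small user-space model. It only models the v3 proposal as described above and does not settle the disagreement in the thread; the struct and helpers are hypothetical, and the two booleans stand in for the quiesced and stopped queue states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two independent queue states. */
struct model_queue {
	bool quiesced;
	bool stopped;
};

/* Models scsi_internal_device_block(): 'wait' selects the mechanism. */
static void model_device_block(struct model_queue *q, bool wait)
{
	assert(q != NULL);
	if (wait)
		q->quiesced = true;
	else
		q->stopped = true;
}

/* Models the proposed unblock: undo whichever mechanism was used. */
static void model_device_unblock(struct model_queue *q)
{
	assert(q != NULL);
	if (q->quiesced)
		q->quiesced = false;
	else
		q->stopped = false;
}

static bool model_can_dispatch(const struct model_queue *q)
{
	return !q->quiesced && !q->stopped;
}
```

In this model an unblock that only cleared the quiesced state would leave a queue blocked with wait == false stuck in the stopped state, which is the case the closing question asks about.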

Re: [PATCH v2 3/4] blk-mq-debugfs: Show busy requests

2017-05-31 Thread Eduardo Valentin

On Wed, May 31, 2017 at 09:54:11PM +, Bart Van Assche wrote:
> On Wed, 2017-05-31 at 14:49 -0700, Eduardo Valentin wrote:
> > On Wed, May 31, 2017 at 09:45:54PM +, Bart Van Assche wrote:
> > > On Wed, 2017-05-31 at 14:43 -0700, Eduardo Valentin wrote:
> > > > On Wed, May 31, 2017 at 02:30:49PM -0700, Bart Van Assche wrote:
> > > > > +static void hctx_show_busy(struct request *rq, void *data, bool 
> > > > > reserved)
> > > > > +{
> > > > > + const struct show_busy_params *params = data;
> > > > > +
> > > > > + if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx &&
> > > > > + test_bit(REQ_ATOM_STARTED, >atomic_flags))
> > > > > + __blk_mq_debugfs_rq_show(params->m,
> > > > > +  list_entry_rq(>queuelist));
> > > > > +}
> > > > > +
> > > > > +static int hctx_busy_show(void *data, struct seq_file *m)
> > > > > +{
> > > > > + struct blk_mq_hw_ctx *hctx = data;
> > > > > + struct show_busy_params params = { .m = m, .hctx = hctx };
> > > > > +
> > > > > + blk_mq_tagset_busy_iter(hctx->queue->tag_set, hctx_show_busy, 
> > > > > );
> > > > > +
> > > > > + return 0;
> > > > > +}
> > > > 
> > > > Why not make the two above a single function?
> > > > hctx_busy_show vs. hctx_show_busy seems a bit confusing, and I could
> > > > not see where they get reused in your patch set.
> > > 
> > > Hello Eduardo,
> > > 
> > > If I would open-code blk_mq_tagset_busy_iter() then I would be able to
> > > implement the above two functions as a single function. However,
> > > blk_mq_tagset_busy_iter() expects a function pointer as second argument.
> > > That's why the above functionality has been split over two functions.
> > 
> > Yeah, my bad here. I misread the functions. But the naming still doesn't
> > seem very suggestive. How about s/hctx_show_busy/hctx_busy_entry/g?
> 
> Hello Eduardo,
> 
> Since that function shows information about a single request, how about
> hctx_show_busy_rq()?

Sounds good to me.

> 
> Bart.

-- 
All the best,
Eduardo Valentin
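
The split discussed in this thread can be sketched in user space as follows. All names are hypothetical except where a comment points at the thread: model_busy_iter() stands in for blk_mq_tagset_busy_iter(), and the counter replaces the seq_file output so the result is easy to check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request: the hardware queue it maps to, and whether it started. */
struct model_rq {
	int hctx;
	bool started;
};

typedef void (*busy_iter_fn)(struct model_rq *rq, void *data);

/* Stand-in for blk_mq_tagset_busy_iter(): the loop lives here, not in the caller. */
static void model_busy_iter(struct model_rq *rqs, size_t n,
			    busy_iter_fn fn, void *data)
{
	for (size_t i = 0; i < n; i++)
		fn(&rqs[i], data);
}

struct show_busy_params {
	int hctx;	/* filter: only requests of this hardware queue */
	int shown;	/* result accumulated across callbacks */
};

/* Per-request callback (hctx_show_busy_rq in the naming agreed above). */
static void show_busy_rq(struct model_rq *rq, void *data)
{
	struct show_busy_params *params = data;

	if (rq->hctx == params->hctx && rq->started)
		params->shown++;
}

/* Per-queue entry point (hctx_busy_show): packs the context, starts the walk. */
static int busy_show(struct model_rq *rqs, size_t n, int hctx)
{
	struct show_busy_params params = { .hctx = hctx, .shown = 0 };

	assert(rqs != NULL || n == 0);
	model_busy_iter(rqs, n, show_busy_rq, &params);
	return params.shown;
}
```

Because the iterator owns the loop, the per-request logic has to be a separate function, and the params struct is the only channel for passing context into it.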

[PATCH v2 05/12] blk-mq: Initialize a request before assigning a tag

2017-05-31 Thread Bart Van Assche

Initialization of blk-mq requests is a bit weird: blk_mq_rq_ctx_init()
is called after a value has been assigned to .rq_flags and .rq_flags
is initialized in __blk_mq_finish_request(). Call blk_mq_rq_ctx_init()
before modifying any struct request members. Initialize .rq_flags in
blk_mq_rq_ctx_init() instead of relying on __blk_mq_finish_request().
Moving the initialization of .rq_flags is fine because all changes
and tests of .rq_flags occur between blk_get_request() and finishing
a request.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Omar Sandoval 
Cc: Ming Lei 
---
 block/blk-mq.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9aa1754e938b..488c6ca2ad91 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -212,6 +212,7 @@ void blk_mq_rq_ctx_init(struct request_queue *q, struct 
blk_mq_ctx *ctx,
rq->q = q;
rq->mq_ctx = ctx;
rq->cmd_flags = op;
+   rq->rq_flags = 0;
if (blk_queue_io_stat(q))
rq->rq_flags |= RQF_IO_STAT;
/* do not touch atomic flags, it needs atomic ops against the timer */
@@ -231,7 +232,7 @@ void blk_mq_rq_ctx_init(struct request_queue *q, struct 
blk_mq_ctx *ctx,
rq->nr_integrity_segments = 0;
 #endif
rq->special = NULL;
-   /* tag was already set */
+   /* tag will be set by caller */
rq->extra_len = 0;
 
INIT_LIST_HEAD(&rq->timeout_list);
@@ -257,12 +258,14 @@ struct request *__blk_mq_alloc_request(struct 
blk_mq_alloc_data *data,
 
rq = tags->static_rqs[tag];
 
+   blk_mq_rq_ctx_init(data->q, data->ctx, rq, op);
+
if (data->flags & BLK_MQ_REQ_INTERNAL) {
rq->tag = -1;
rq->internal_tag = tag;
} else {
if (blk_mq_tag_busy(data->hctx)) {
-   rq->rq_flags = RQF_MQ_INFLIGHT;
+   rq->rq_flags |= RQF_MQ_INFLIGHT;
atomic_inc(&data->hctx->nr_active);
}
rq->tag = tag;
@@ -270,7 +273,6 @@ struct request *__blk_mq_alloc_request(struct 
blk_mq_alloc_data *data,
data->hctx->tags->rqs[rq->tag] = rq;
}
 
-   blk_mq_rq_ctx_init(data->q, data->ctx, rq, op);
return rq;
}
 
@@ -361,7 +363,6 @@ void __blk_mq_finish_request(struct blk_mq_hw_ctx *hctx, 
struct blk_mq_ctx *ctx,
atomic_dec(&hctx->nr_active);
 
wbt_done(q->rq_wb, &rq->issue_stat);
-   rq->rq_flags = 0;
 
clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
clear_bit(REQ_ATOM_POLL_SLEPT, &rq->atomic_flags);
-- 
2.12.2
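
The ordering this patch establishes can be sketched in user space as follows. The struct, the helper names and the bit values are hypothetical; the sketch only shows why the flag assignment has to become an OR once initialization runs first.

```c
#include <assert.h>

#define RQF_IO_STAT     (1u << 0)	/* hypothetical bit values */
#define RQF_MQ_INFLIGHT (1u << 1)

struct model_rq {
	unsigned int rq_flags;
	int tag;
};

/* Models blk_mq_rq_ctx_init(): resets the flags itself. */
static void model_rq_init(struct model_rq *rq, int io_stat)
{
	rq->rq_flags = 0;
	if (io_stat)
		rq->rq_flags |= RQF_IO_STAT;
	/* tag will be set by caller */
}

/* Models __blk_mq_alloc_request(): initialize first, then adjust. */
static void model_alloc_request(struct model_rq *rq, int tag,
				int io_stat, int busy)
{
	assert(tag >= 0);
	model_rq_init(rq, io_stat);
	if (busy)
		rq->rq_flags |= RQF_MQ_INFLIGHT;	/* must not be '=' */
	rq->tag = tag;
}
```

A plain assignment after the init call would discard RQF_IO_STAT, and because the init call resets the flags, stale bits from a previous use of the request no longer depend on the finish path having cleared them.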

[PATCH v2 01/12] block: Make request operation type argument declarations consistent

2017-05-31 Thread Bart Van Assche

Instead of declaring the second argument of blk_*_get_request()
as int and passing it to functions that expect an unsigned int,
declare that second argument as unsigned int. Also because of
consistency, rename that second argument from 'rw' into 'op'.
This patch does not change any functionality.

Signed-off-by: Bart Van Assche 
Reviewed-by: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Omar Sandoval 
Cc: Ming Lei 
---
 block/blk-core.c   | 13 +++--
 block/blk-mq.c | 10 +-
 include/linux/blk-mq.h |  6 +++---
 include/linux/blkdev.h |  3 ++-
 4 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index a7421b772d0e..3bc431a77309 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1283,8 +1283,8 @@ static struct request *get_request(struct request_queue 
*q, unsigned int op,
goto retry;
 }
 
-static struct request *blk_old_get_request(struct request_queue *q, int rw,
-   gfp_t gfp_mask)
+static struct request *blk_old_get_request(struct request_queue *q,
+  unsigned int op, gfp_t gfp_mask)
 {
struct request *rq;
 
@@ -1292,7 +1292,7 @@ static struct request *blk_old_get_request(struct 
request_queue *q, int rw,
create_io_context(gfp_mask, q->node);
 
spin_lock_irq(q->queue_lock);
-   rq = get_request(q, rw, NULL, gfp_mask);
+   rq = get_request(q, op, NULL, gfp_mask);
if (IS_ERR(rq)) {
spin_unlock_irq(q->queue_lock);
return rq;
@@ -1305,14 +1305,15 @@ static struct request *blk_old_get_request(struct 
request_queue *q, int rw,
return rq;
 }
 
-struct request *blk_get_request(struct request_queue *q, int rw, gfp_t 
gfp_mask)
+struct request *blk_get_request(struct request_queue *q, unsigned int op,
+   gfp_t gfp_mask)
 {
if (q->mq_ops)
-   return blk_mq_alloc_request(q, rw,
+   return blk_mq_alloc_request(q, op,
(gfp_mask & __GFP_DIRECT_RECLAIM) ?
0 : BLK_MQ_REQ_NOWAIT);
else
-   return blk_old_get_request(q, rw, gfp_mask);
+   return blk_old_get_request(q, op, gfp_mask);
 }
 EXPORT_SYMBOL(blk_get_request);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e068a26173fc..9aa1754e938b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -278,7 +278,7 @@ struct request *__blk_mq_alloc_request(struct 
blk_mq_alloc_data *data,
 }
 EXPORT_SYMBOL_GPL(__blk_mq_alloc_request);
 
-struct request *blk_mq_alloc_request(struct request_queue *q, int rw,
+struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
unsigned int flags)
 {
struct blk_mq_alloc_data alloc_data = { .flags = flags };
@@ -289,7 +289,7 @@ struct request *blk_mq_alloc_request(struct request_queue 
*q, int rw,
if (ret)
return ERR_PTR(ret);
 
-   rq = blk_mq_sched_get_request(q, NULL, rw, &alloc_data);
+   rq = blk_mq_sched_get_request(q, NULL, op, &alloc_data);
 
blk_mq_put_ctx(alloc_data.ctx);
blk_queue_exit(q);
@@ -304,8 +304,8 @@ struct request *blk_mq_alloc_request(struct request_queue 
*q, int rw,
 }
 EXPORT_SYMBOL(blk_mq_alloc_request);
 
-struct request *blk_mq_alloc_request_hctx(struct request_queue *q, int rw,
-   unsigned int flags, unsigned int hctx_idx)
+struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
+   unsigned int op, unsigned int flags, unsigned int hctx_idx)
 {
struct blk_mq_alloc_data alloc_data = { .flags = flags };
struct request *rq;
@@ -340,7 +340,7 @@ struct request *blk_mq_alloc_request_hctx(struct 
request_queue *q, int rw,
cpu = cpumask_first(alloc_data.hctx->cpumask);
alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
 
-   rq = blk_mq_sched_get_request(q, NULL, rw, &alloc_data);
+   rq = blk_mq_sched_get_request(q, NULL, op, &alloc_data);
 
blk_queue_exit(q);
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index c534ec64e214..a4759fd34e7e 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -205,10 +205,10 @@ enum {
BLK_MQ_REQ_INTERNAL = (1 << 2), /* allocate internal/sched tag */
 };
 
-struct request *blk_mq_alloc_request(struct request_queue *q, int rw,
+struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
unsigned int flags);
-struct request *blk_mq_alloc_request_hctx(struct request_queue *q, int op,
-   unsigned int flags, unsigned int hctx_idx);
+struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
+   unsigned int op, unsigned int flags, unsigned int hctx_idx);
 struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag);
 
 enum {
diff --git a/include/linux/blkdev.h

[PATCH v2 12/12] block: Rename blk_mq_rq_{to,from}_pdu()

2017-05-31 Thread Bart Van Assche

Commit 6d247d7f71d1 ("block: allow specifying size for extra command
data") added support for .cmd_size to blk-sq. Due to that patch the
blk_mq_rq_{to,from}_pdu() functions are also useful for single-queue
block drivers. Hence remove "_mq" from the name of these functions.
This patch does not change any functionality. Most of this patch has
been generated by running the following shell command:

sed -i 's/blk_mq_rq_to_pdu/blk_rq_to_pdu/g;
s/blk_mq_rq_from_pdu/blk_rq_from_pdu/g' \
$(git grep -lE 'blk_mq_rq_(to|from)_pdu')

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Omar Sandoval 
---
 drivers/block/loop.c  |  8 
 drivers/block/mtip32xx/mtip32xx.c | 28 ++--
 drivers/block/nbd.c   | 18 +-
 drivers/block/null_blk.c  |  4 ++--
 drivers/block/rbd.c   |  6 +++---
 drivers/block/virtio_blk.c| 12 ++--
 drivers/block/xen-blkfront.c  |  2 +-
 drivers/ide/ide-probe.c   |  2 +-
 drivers/md/dm-rq.c|  6 +++---
 drivers/mtd/ubi/block.c   |  8 
 drivers/nvme/host/fc.c| 20 ++--
 drivers/nvme/host/nvme.h  |  2 +-
 drivers/nvme/host/pci.c   | 22 +++---
 drivers/nvme/host/rdma.c  | 18 +-
 drivers/nvme/target/loop.c| 10 +-
 drivers/scsi/scsi_lib.c   | 18 +-
 include/linux/blk-mq.h| 13 -
 include/linux/blkdev.h| 13 +
 include/linux/ide.h   |  2 +-
 include/scsi/scsi_request.h   |  2 +-
 20 files changed, 107 insertions(+), 107 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 28d932906f24..42e18601daa2 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -447,7 +447,7 @@ static int lo_req_flush(struct loop_device *lo, struct 
request *rq)
 
 static void lo_complete_rq(struct request *rq)
 {
-   struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
+   struct loop_cmd *cmd = blk_rq_to_pdu(rq);
 
if (unlikely(req_op(cmd->rq) == REQ_OP_READ && cmd->use_aio &&
 cmd->ret >= 0 && cmd->ret < blk_rq_bytes(cmd->rq))) {
@@ -507,7 +507,7 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
 
 static int do_req_filebacked(struct loop_device *lo, struct request *rq)
 {
-   struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
+   struct loop_cmd *cmd = blk_rq_to_pdu(rq);
loff_t pos = ((loff_t) blk_rq_pos(rq) << 9) + lo->lo_offset;
 
/*
@@ -1645,7 +1645,7 @@ EXPORT_SYMBOL(loop_unregister_transfer);
 static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
const struct blk_mq_queue_data *bd)
 {
-   struct loop_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
+   struct loop_cmd *cmd = blk_rq_to_pdu(bd->rq);
struct loop_device *lo = cmd->rq->q->queuedata;
 
blk_mq_start_request(bd->rq);
@@ -1700,7 +1700,7 @@ static void loop_queue_work(struct kthread_work *work)
 static int loop_init_request(struct blk_mq_tag_set *set, struct request *rq,
unsigned int hctx_idx, unsigned int numa_node)
 {
-   struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
+   struct loop_cmd *cmd = blk_rq_to_pdu(rq);
 
cmd->rq = rq;
kthread_init_work(&cmd->work, loop_queue_work);
diff --git a/drivers/block/mtip32xx/mtip32xx.c 
b/drivers/block/mtip32xx/mtip32xx.c
index 3a779a4f5653..7b58a5a16324 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -173,7 +173,7 @@ static bool mtip_check_surprise_removal(struct pci_dev 
*pdev)
 static void mtip_init_cmd_header(struct request *rq)
 {
struct driver_data *dd = rq->q->queuedata;
-   struct mtip_cmd *cmd = blk_mq_rq_to_pdu(rq);
+   struct mtip_cmd *cmd = blk_rq_to_pdu(rq);
u32 host_cap_64 = readl(dd->mmio + HOST_CAP) & HOST_CAP_64;
 
/* Point the command headers at the command tables. */
@@ -202,7 +202,7 @@ static struct mtip_cmd *mtip_get_int_command(struct 
driver_data *dd)
/* Internal cmd isn't submitted via .queue_rq */
mtip_init_cmd_header(rq);
 
-   return blk_mq_rq_to_pdu(rq);
+   return blk_rq_to_pdu(rq);
 }
 
 static struct mtip_cmd *mtip_cmd_from_tag(struct driver_data *dd,
@@ -210,7 +210,7 @@ static struct mtip_cmd *mtip_cmd_from_tag(struct 
driver_data *dd,
 {
struct blk_mq_hw_ctx *hctx = dd->queue->queue_hw_ctx[0];
 
-   return blk_mq_rq_to_pdu(blk_mq_tag_to_rq(hctx->tags, tag));
+   return blk_rq_to_pdu(blk_mq_tag_to_rq(hctx->tags, tag));
 }
 
 /*
@@ -534,7 +534,7 @@ static int mtip_get_smart_attr(struct mtip_port *port, 
unsigned int id,
 
 static void mtip_complete_command(struct mtip_cmd *cmd, int status)
 {
-   struct request *req = blk_mq_rq_from_pdu(cmd);
+   struct request *req =
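
The memory layout behind the renamed helpers can be sketched in user space as follows. To my understanding the kernel helpers treat the driver-private data (PDU) as living directly behind struct request; the model_* names, the tiny structs and the calloc()-based allocation are assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

struct model_request { int tag; };	/* stand-in for struct request */
struct model_cmd { int ret; };		/* stand-in for a driver's per-request data */

/* The PDU starts right after the request structure. */
static void *model_rq_to_pdu(struct model_request *rq)
{
	assert(rq != NULL);
	return rq + 1;
}

/* Step back over the request structure to recover its address. */
static struct model_request *model_rq_from_pdu(void *pdu)
{
	assert(pdu != NULL);
	return (struct model_request *)((char *)pdu -
					sizeof(struct model_request));
}

/* Allocate a request with cmd_size extra bytes behind it (models .cmd_size). */
static struct model_request *model_alloc_rq(size_t cmd_size)
{
	return calloc(1, sizeof(struct model_request) + cmd_size);
}
```

Nothing in this layout depends on how many hardware queues exist, which is the reason the commit message gives for dropping "_mq" from the names.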

[PATCH v2 04/12] block: Change argument type of scsi_req_init()

2017-05-31 Thread Bart Van Assche

Since scsi_req_init() works on a struct scsi_request, change the
argument type into struct scsi_request *.

Signed-off-by: Bart Van Assche 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Martin K. Petersen 
---
 block/scsi_ioctl.c| 10 +++---
 drivers/ide/ide-atapi.c   |  2 +-
 drivers/ide/ide-probe.c   |  2 +-
 drivers/scsi/scsi_lib.c   |  4 +++-
 drivers/scsi/scsi_transport_sas.c |  2 +-
 include/scsi/scsi_request.h   |  2 +-
 6 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
index f96c51f5df40..7440de44dd85 100644
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -741,10 +741,14 @@ int scsi_cmd_blk_ioctl(struct block_device *bd, fmode_t 
mode,
 }
 EXPORT_SYMBOL(scsi_cmd_blk_ioctl);
 
-void scsi_req_init(struct request *rq)
+/**
+ * scsi_req_init - initialize certain fields of a scsi_request structure
+ * @req: Pointer to a scsi_request structure.
+ * Initializes .__cmd[], .cmd, .cmd_len and .sense_len but no other members
+ * of struct scsi_request.
+ */
+void scsi_req_init(struct scsi_request *req)
 {
-   struct scsi_request *req = scsi_req(rq);
-
memset(req->__cmd, 0, sizeof(req->__cmd));
req->cmd = req->__cmd;
req->cmd_len = BLK_MAX_CDB;
diff --git a/drivers/ide/ide-atapi.c b/drivers/ide/ide-atapi.c
index 98e78b520417..5ffecef8b910 100644
--- a/drivers/ide/ide-atapi.c
+++ b/drivers/ide/ide-atapi.c
@@ -199,7 +199,7 @@ void ide_prep_sense(ide_drive_t *drive, struct request *rq)
memset(sense, 0, sizeof(*sense));
 
blk_rq_init(rq->q, sense_rq);
-   scsi_req_init(sense_rq);
+   scsi_req_init(req);
 
err = blk_rq_map_kern(drive->queue, sense_rq, sense, sense_len,
  GFP_NOIO);
diff --git a/drivers/ide/ide-probe.c b/drivers/ide/ide-probe.c
index c60e5ffc9231..01b2adfd8226 100644
--- a/drivers/ide/ide-probe.c
+++ b/drivers/ide/ide-probe.c
@@ -745,7 +745,7 @@ static void ide_initialize_rq(struct request *rq)
 {
struct ide_request *req = blk_mq_rq_to_pdu(rq);
 
-   scsi_req_init(rq);
+   scsi_req_init(&req->sreq);
req->sreq.sense = req->sense;
 }
 
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index e96ffd187558..b629d8cbf0d1 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1136,7 +1136,9 @@ EXPORT_SYMBOL(scsi_init_io);
 /* Called from inside blk_get_request() */
 static void scsi_initialize_rq(struct request *rq)
 {
-   scsi_req_init(rq);
+   struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);
+
+   scsi_req_init(&cmd->req);
 }
 
 /* Called after a request has been started. */
diff --git a/drivers/scsi/scsi_transport_sas.c 
b/drivers/scsi/scsi_transport_sas.c
index f5449da6fcad..35598905d785 100644
--- a/drivers/scsi/scsi_transport_sas.c
+++ b/drivers/scsi/scsi_transport_sas.c
@@ -215,7 +215,7 @@ static void sas_host_release(struct device *dev)
 
 static void sas_initialize_rq(struct request *rq)
 {
-   scsi_req_init(rq);
+   scsi_req_init(scsi_req(rq));
 }
 
 static int sas_bsg_initialize(struct Scsi_Host *shost, struct sas_rphy *rphy)
diff --git a/include/scsi/scsi_request.h b/include/scsi/scsi_request.h
index f0c76f9dc285..e0afa445ee4e 100644
--- a/include/scsi/scsi_request.h
+++ b/include/scsi/scsi_request.h
@@ -27,6 +27,6 @@ static inline void scsi_req_free_cmd(struct scsi_request *req)
kfree(req->cmd);
 }
 
-void scsi_req_init(struct request *);
+void scsi_req_init(struct scsi_request *req);
 
 #endif /* _SCSI_SCSI_REQUEST_H */
-- 
2.12.2
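
The effect of the signature change in the patch above can be sketched in userspace C. This is a minimal model, not the kernel code: the `model_*` types and functions are hypothetical stand-ins, and the fields that are reset here do not reproduce what the real scsi_req_init() initializes. The point is only that the initializer now receives the embedded SCSI request directly, so the caller decides how to reach it from its private data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct scsi_request; not the real layout. */
struct model_scsi_request {
	unsigned char cmd[16];
	unsigned short cmd_len;
	void *sense;
};

/* Driver-private data with the SCSI request embedded as first member. */
struct model_ide_request {
	struct model_scsi_request sreq;
	unsigned char sense[18];
};

/* New-style initializer: takes the SCSI request, not the block request. */
void model_scsi_req_init(struct model_scsi_request *req)
{
	assert(req != NULL);
	memset(req->cmd, 0, sizeof(req->cmd));
	req->cmd_len = 0;
	req->sense = NULL;
}

/* Counterpart of ide_initialize_rq(): init, then attach the sense buffer. */
void model_ide_initialize_rq(struct model_ide_request *req)
{
	model_scsi_req_init(&req->sreq);
	req->sreq.sense = req->sense;
}
```

With this shape, a caller that already holds a pointer to the SCSI request (as in ide_prep_sense()) does not need to go back through the block request.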

[PATCH v2 06/12] block: Add a comment above queue_lockdep_assert_held()

2017-05-31 Thread Bart Van Assche

Add a comment above the queue_lockdep_assert_held() macro that
explains the purpose of the q->queue_lock test.

Signed-off-by: Bart Van Assche 
Reviewed-by: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Omar Sandoval 
Cc: Ming Lei 
---
 include/linux/blkdev.h | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index cbc0028290e4..1e73b4df13a9 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -634,6 +634,13 @@ struct request_queue {
 (1 << QUEUE_FLAG_SAME_COMP)|   \
 (1 << QUEUE_FLAG_POLL))
 
+/*
+ * @q->queue_lock is set while a queue is being initialized. Since we know
+ * that no other threads access the queue object before @q->queue_lock has
+ * been set, it is safe to manipulate queue flags without holding the
+ * queue_lock if @q->queue_lock == NULL. See also blk_alloc_queue_node() and
+ * blk_init_allocated_queue().
+ */
 static inline void queue_lockdep_assert_held(struct request_queue *q)
 {
if (q->queue_lock)
-- 
2.12.2
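
The rule described in the new comment can be illustrated with a small userspace model. Everything named `model_*` below is an assumption made for illustration; the real code uses lockdep rather than a `held` boolean. The sketch shows the two cases: no lock pointer assigned yet (the queue is still private to its creator, so flag updates are allowed) and a lock pointer assigned (the lock must be held).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock that records whether it is held (lockdep stand-in). */
struct model_lock {
	bool held;
};

struct model_queue {
	struct model_lock *queue_lock;	/* NULL until initialization sets it */
	unsigned long queue_flags;
};

/* True when manipulating the queue flags is permitted right now. */
bool model_queue_flags_access_ok(const struct model_queue *q)
{
	if (q->queue_lock)
		return q->queue_lock->held;
	/* No lock assigned yet: no other thread can see the queue. */
	return true;
}

void model_queue_flag_set(unsigned int flag, struct model_queue *q)
{
	assert(model_queue_flags_access_ok(q));
	q->queue_flags |= 1UL << flag;
}
```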

[PATCH v2 08/12] block: Document what queue type each function is intended for

2017-05-31 Thread Bart Van Assche

Some functions in block/blk-core.c must only be used on blk-sq queues
while others are safe to use against any queue type. Document which
functions are intended for blk-sq queues and issue a warning if the
blk-sq API is misused.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Omar Sandoval 
Cc: Ming Lei 
---
 block/blk-core.c | 33 +
 1 file changed, 33 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index f3ad963eccdd..4689c20943fb 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -182,6 +182,7 @@ static void blk_delay_work(struct work_struct *work)
 void blk_delay_queue(struct request_queue *q, unsigned long msecs)
 {
lockdep_assert_held(q->queue_lock);
+   WARN_ON_ONCE(q->mq_ops);
 
if (likely(!blk_queue_dead(q)))
queue_delayed_work(kblockd_workqueue, &q->delay_work,
@@ -201,6 +202,7 @@ EXPORT_SYMBOL(blk_delay_queue);
 void blk_start_queue_async(struct request_queue *q)
 {
lockdep_assert_held(q->queue_lock);
+   WARN_ON_ONCE(q->mq_ops);
 
queue_flag_clear(QUEUE_FLAG_STOPPED, q);
blk_run_queue_async(q);
@@ -220,6 +222,7 @@ void blk_start_queue(struct request_queue *q)
 {
lockdep_assert_held(q->queue_lock);
WARN_ON(!irqs_disabled());
+   WARN_ON_ONCE(q->mq_ops);
 
queue_flag_clear(QUEUE_FLAG_STOPPED, q);
__blk_run_queue(q);
@@ -243,6 +246,7 @@ EXPORT_SYMBOL(blk_start_queue);
 void blk_stop_queue(struct request_queue *q)
 {
lockdep_assert_held(q->queue_lock);
+   WARN_ON_ONCE(q->mq_ops);
 
cancel_delayed_work(&q->delay_work);
queue_flag_set(QUEUE_FLAG_STOPPED, q);
@@ -297,6 +301,7 @@ EXPORT_SYMBOL(blk_sync_queue);
 inline void __blk_run_queue_uncond(struct request_queue *q)
 {
lockdep_assert_held(q->queue_lock);
+   WARN_ON_ONCE(q->mq_ops);
 
if (unlikely(blk_queue_dead(q)))
return;
@@ -324,6 +329,7 @@ EXPORT_SYMBOL_GPL(__blk_run_queue_uncond);
 void __blk_run_queue(struct request_queue *q)
 {
lockdep_assert_held(q->queue_lock);
+   WARN_ON_ONCE(q->mq_ops);
 
if (unlikely(blk_queue_stopped(q)))
return;
@@ -348,6 +354,7 @@ EXPORT_SYMBOL(__blk_run_queue);
 void blk_run_queue_async(struct request_queue *q)
 {
lockdep_assert_held(q->queue_lock);
+   WARN_ON_ONCE(q->mq_ops);
 
if (likely(!blk_queue_stopped(q) && !blk_queue_dead(q)))
mod_delayed_work(kblockd_workqueue, &q->delay_work, 0);
@@ -366,6 +373,8 @@ void blk_run_queue(struct request_queue *q)
 {
unsigned long flags;
 
+   WARN_ON_ONCE(q->mq_ops);
+
spin_lock_irqsave(q->queue_lock, flags);
__blk_run_queue(q);
spin_unlock_irqrestore(q->queue_lock, flags);
@@ -394,6 +403,7 @@ static void __blk_drain_queue(struct request_queue *q, bool 
drain_all)
int i;
 
lockdep_assert_held(q->queue_lock);
+   WARN_ON_ONCE(q->mq_ops);
 
while (true) {
bool drain = false;
@@ -472,6 +482,8 @@ static void __blk_drain_queue(struct request_queue *q, bool 
drain_all)
  */
 void blk_queue_bypass_start(struct request_queue *q)
 {
+   WARN_ON_ONCE(q->mq_ops);
+
spin_lock_irq(q->queue_lock);
q->bypass_depth++;
queue_flag_set(QUEUE_FLAG_BYPASS, q);
@@ -498,6 +510,9 @@ EXPORT_SYMBOL_GPL(blk_queue_bypass_start);
  * @q: queue of interest
  *
  * Leave bypass mode and restore the normal queueing behavior.
+ *
+ * Note: although blk_queue_bypass_start() is only called for blk-sq queues,
+ * this function is called for both blk-sq and blk-mq queues.
  */
 void blk_queue_bypass_end(struct request_queue *q)
 {
@@ -895,6 +910,8 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, 
struct bio *bio);
 
 int blk_init_allocated_queue(struct request_queue *q)
 {
+   WARN_ON_ONCE(q->mq_ops);
+
q->fq = blk_alloc_flush_queue(q, NUMA_NO_NODE, q->cmd_size);
if (!q->fq)
return -ENOMEM;
@@ -1032,6 +1049,8 @@ int blk_update_nr_requests(struct request_queue *q, 
unsigned int nr)
struct request_list *rl;
int on_thresh, off_thresh;
 
+   WARN_ON_ONCE(q->mq_ops);
+
spin_lock_irq(q->queue_lock);
q->nr_requests = nr;
blk_queue_congestion_threshold(q);
@@ -1270,6 +1289,7 @@ static struct request *get_request(struct request_queue 
*q, unsigned int op,
struct request *rq;
 
lockdep_assert_held(q->queue_lock);
+   WARN_ON_ONCE(q->mq_ops);
 
rl = blk_get_rl(q, bio);/* transferred to @rq on success */
 retry:
@@ -1309,6 +1329,8 @@ static struct request *blk_old_get_request(struct 
request_queue *q,
 {
struct request *rq;
 
+   WARN_ON_ONCE(q->mq_ops);
+
/* create ioc upfront */
create_io_context(gfp_mask, q->node);
 
@@ -1358,6 +1380,7 @@
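
The warn-once checks added throughout this patch can be approximated in userspace as follows. This is a sketch under stated assumptions: `model_warn_on_once()`, `model_blk_run_queue()` and the counter are hypothetical names, and the kernel's WARN_ON_ONCE() additionally prints a backtrace. What the model preserves is that each call site warns at most once and that the function carries on after warning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Number of warnings printed so far (only for observing the behaviour). */
int model_warnings_emitted;

struct model_queue {
	const void *mq_ops;	/* non-NULL marks a blk-mq queue */
	unsigned int runs;
};

/* Warn the first time cond is true for this call site; return cond. */
bool model_warn_on_once(bool cond, bool *warned, const char *what)
{
	assert(warned != NULL);
	if (cond && !*warned) {
		*warned = true;
		model_warnings_emitted++;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}

/* Model of a blk-sq-only entry point guarded by the check. */
void model_blk_run_queue(struct model_queue *q)
{
	static bool warned;	/* one flag per call site */

	model_warn_on_once(q->mq_ops != NULL, &warned, "q->mq_ops");
	q->runs++;	/* the warning does not abort the call */
}
```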

[PATCH v2 09/12] blk-mq: Document locking assumptions

2017-05-31 Thread Bart Van Assche

Document the locking assumptions in functions that modify
blk_mq_ctx.rq_list to make it easier for humans to verify
this code.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Omar Sandoval 
Cc: Ming Lei 
---
 block/blk-mq-sched.c | 2 ++
 block/blk-mq.c   | 4 
 2 files changed, 6 insertions(+)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index c4e2afb9d12d..88aa460b2e8a 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -232,6 +232,8 @@ static bool blk_mq_attempt_merge(struct request_queue *q,
struct request *rq;
int checked = 8;
 
+   lockdep_assert_held(&ctx->lock);
+
list_for_each_entry_reverse(rq, &ctx->rq_list, queuelist) {
bool merged = false;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 488c6ca2ad91..b56cb3d9060f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1274,6 +1274,8 @@ static inline void __blk_mq_insert_req_list(struct 
blk_mq_hw_ctx *hctx,
 {
struct blk_mq_ctx *ctx = rq->mq_ctx;
 
+   lockdep_assert_held(&ctx->lock);
+
trace_block_rq_insert(hctx->queue, rq);
 
if (at_head)
@@ -1287,6 +1289,8 @@ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, 
struct request *rq,
 {
struct blk_mq_ctx *ctx = rq->mq_ctx;
 
+   lockdep_assert_held(&ctx->lock);
+
__blk_mq_insert_req_list(hctx, rq, at_head);
blk_mq_hctx_mark_pending(hctx, ctx);
 }
-- 
2.12.2

[PATCH v2 00/12] More patches for kernel v4.13

2017-05-31 Thread Bart Van Assche

Hello Jens,

The changes compared to v1 of this patch series are:
* Addressed Christoph's comment about moving the .initialize_rq_fn() call
  from blk_rq_init() / blk_mq_rq_ctx_init() into blk_get_request().
* Left out patch "scsi: Make scsi_ioctl_reset() pass the request queue pointer
  to blk_rq_init()" since it's no longer needed.
* Restored the scsi_req_init() call in ide_prep_sense().
* Combined the two patches that reduce the blk_mq_hw_ctx size into a single
  patch.
* Modified patch "blk-mq: Initialize a request before assigning a tag" such
  that .tag and .internal_tag are no longer initialized twice.
* Removed WARN_ON_ONCE(q->mq_ops) from blk_queue_bypass_end() because this
  function is used by both blk-sq and blk-mq.
* Added several new patches, e.g. "block: Rename blk_mq_rq_{to,from}_pdu()".

Please consider these patches for kernel v4.13.

Thanks,

Bart.

Bart Van Assche (12):
  block: Make request operation type argument declarations consistent
  block: Introduce request_queue.initialize_rq_fn()
  block: Make most scsi_req_init() calls implicit
  block: Change argument type of scsi_req_init()
  blk-mq: Initialize a request before assigning a tag
  block: Add a comment above queue_lockdep_assert_held()
  block: Check locking assumptions at runtime
  block: Document what queue type each function is intended for
  blk-mq: Document locking assumptions
  block: Constify disk_type
  blk-mq: Warn when attempting to run a hardware queue that is not
mapped
  block: Rename blk_mq_rq_{to,from}_pdu()

 block/blk-core.c   | 124 -
 block/blk-flush.c  |   8 ++-
 block/blk-merge.c  |   3 +
 block/blk-mq-sched.c   |   2 +
 block/blk-mq.c |  30 +
 block/blk-tag.c|  15 ++---
 block/blk-timeout.c|   4 +-
 block/bsg.c|   1 -
 block/genhd.c  |   4 +-
 block/scsi_ioctl.c |  13 ++--
 drivers/block/loop.c   |   8 +--
 drivers/block/mtip32xx/mtip32xx.c  |  28 -
 drivers/block/nbd.c|  18 +++---
 drivers/block/null_blk.c   |   4 +-
 drivers/block/pktcdvd.c|   1 -
 drivers/block/rbd.c|   6 +-
 drivers/block/virtio_blk.c |  12 ++--
 drivers/block/xen-blkfront.c   |   2 +-
 drivers/cdrom/cdrom.c  |   1 -
 drivers/ide/ide-atapi.c|   3 +-
 drivers/ide/ide-cd.c   |   1 -
 drivers/ide/ide-cd_ioctl.c |   1 -
 drivers/ide/ide-devsets.c  |   1 -
 drivers/ide/ide-disk.c |   1 -
 drivers/ide/ide-ioctls.c   |   2 -
 drivers/ide/ide-park.c |   2 -
 drivers/ide/ide-pm.c   |   2 -
 drivers/ide/ide-probe.c|   8 +--
 drivers/ide/ide-tape.c |   1 -
 drivers/ide/ide-taskfile.c |   1 -
 drivers/md/dm-rq.c |   6 +-
 drivers/mtd/ubi/block.c|   8 +--
 drivers/nvme/host/fc.c |  20 +++---
 drivers/nvme/host/nvme.h   |   2 +-
 drivers/nvme/host/pci.c|  22 +++
 drivers/nvme/host/rdma.c   |  18 +++---
 drivers/nvme/target/loop.c |  10 +--
 drivers/scsi/osd/osd_initiator.c   |   2 -
 drivers/scsi/osst.c|   1 -
 drivers/scsi/scsi_error.c  |   1 -
 drivers/scsi/scsi_lib.c|  28 ++---
 drivers/scsi/scsi_transport_sas.c  |   6 ++
 drivers/scsi/sg.c  |   2 -
 drivers/scsi/st.c  |   1 -
 drivers/target/target_core_pscsi.c |   2 -
 fs/nfsd/blocklayout.c  |   1 -
 include/linux/blk-mq.h |  19 +-
 include/linux/blkdev.h |  27 +++-
 include/linux/ide.h|   2 +-
 include/scsi/scsi_request.h|   4 +-
 50 files changed, 284 insertions(+), 205 deletions(-)

-- 
2.12.2

[PATCH v2 10/12] block: Constify disk_type

2017-05-31 Thread Bart Van Assche

The variable 'disk_type' is never modified so constify it.

Signed-off-by: Bart Van Assche 
Reviewed-by: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Omar Sandoval 
Cc: Ming Lei 
---
 block/genhd.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index d252d29fe837..7f520fa25d16 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -36,7 +36,7 @@ struct kobject *block_depr;
 static DEFINE_SPINLOCK(ext_devt_lock);
 static DEFINE_IDR(ext_devt_idr);
 
-static struct device_type disk_type;
+static const struct device_type disk_type;
 
 static void disk_check_events(struct disk_events *ev,
  unsigned int *clearing_ptr);
@@ -1183,7 +1183,7 @@ static char *block_devnode(struct device *dev, umode_t 
*mode,
return NULL;
 }
 
-static struct device_type disk_type = {
+static const struct device_type disk_type = {
.name   = "disk",
.groups = disk_attr_groups,
.release= disk_release,
-- 
2.12.2

[PATCH v2 03/12] block: Make most scsi_req_init() calls implicit

2017-05-31 Thread Bart Van Assche

Instead of explicitly calling scsi_req_init() after blk_get_request(),
call that function from inside blk_get_request(). Add an
.initialize_rq_fn() callback function to the block drivers that need
it. Merge the IDE .init_rq_fn() function into .initialize_rq_fn()
because it is too small to keep it as a separate function. Keep the
scsi_req_init() call in ide_prep_sense() because it follows a
blk_rq_init() call.

References: commit 82ed4db499b8 ("block: split scsi_request out of struct 
request")
Signed-off-by: Bart Van Assche 
Cc: Hannes Reinecke 
Cc: Christoph Hellwig 
Cc: Omar Sandoval 
Cc: Nicholas Bellinger 
---
 block/bsg.c|  1 -
 block/scsi_ioctl.c |  3 ---
 drivers/block/pktcdvd.c|  1 -
 drivers/cdrom/cdrom.c  |  1 -
 drivers/ide/ide-atapi.c|  1 -
 drivers/ide/ide-cd.c   |  1 -
 drivers/ide/ide-cd_ioctl.c |  1 -
 drivers/ide/ide-devsets.c  |  1 -
 drivers/ide/ide-disk.c |  1 -
 drivers/ide/ide-ioctls.c   |  2 --
 drivers/ide/ide-park.c |  2 --
 drivers/ide/ide-pm.c   |  2 --
 drivers/ide/ide-probe.c|  6 +++---
 drivers/ide/ide-tape.c |  1 -
 drivers/ide/ide-taskfile.c |  1 -
 drivers/scsi/osd/osd_initiator.c   |  2 --
 drivers/scsi/osst.c|  1 -
 drivers/scsi/scsi_error.c  |  1 -
 drivers/scsi/scsi_lib.c| 10 +-
 drivers/scsi/scsi_transport_sas.c  |  6 ++
 drivers/scsi/sg.c  |  2 --
 drivers/scsi/st.c  |  1 -
 drivers/target/target_core_pscsi.c |  2 --
 fs/nfsd/blocklayout.c  |  1 -
 24 files changed, 18 insertions(+), 33 deletions(-)

diff --git a/block/bsg.c b/block/bsg.c
index 40db8ff4c618..84ec1b19d516 100644
--- a/block/bsg.c
+++ b/block/bsg.c
@@ -236,7 +236,6 @@ bsg_map_hdr(struct bsg_device *bd, struct sg_io_v4 *hdr, 
fmode_t has_write_perm)
rq = blk_get_request(q, op, GFP_KERNEL);
if (IS_ERR(rq))
return rq;
-   scsi_req_init(rq);
 
ret = blk_fill_sgv4_hdr_rq(q, rq, hdr, bd, has_write_perm);
if (ret)
diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
index 4a294a5f7fab..f96c51f5df40 100644
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -326,7 +326,6 @@ static int sg_io(struct request_queue *q, struct gendisk 
*bd_disk,
if (IS_ERR(rq))
return PTR_ERR(rq);
req = scsi_req(rq);
-   scsi_req_init(rq);
 
if (hdr->cmd_len > BLK_MAX_CDB) {
req->cmd = kzalloc(hdr->cmd_len, GFP_KERNEL);
@@ -456,7 +455,6 @@ int sg_scsi_ioctl(struct request_queue *q, struct gendisk 
*disk, fmode_t mode,
goto error_free_buffer;
}
req = scsi_req(rq);
-   scsi_req_init(rq);
 
cmdlen = COMMAND_SIZE(opcode);
 
@@ -542,7 +540,6 @@ static int __blk_send_generic(struct request_queue *q, 
struct gendisk *bd_disk,
rq = blk_get_request(q, REQ_OP_SCSI_OUT, __GFP_RECLAIM);
if (IS_ERR(rq))
return PTR_ERR(rq);
-   scsi_req_init(rq);
rq->timeout = BLK_DEFAULT_SG_TIMEOUT;
scsi_req(rq)->cmd[0] = cmd;
scsi_req(rq)->cmd[4] = data;
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 42e3c880a8a5..2ea332c9438a 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -707,7 +707,6 @@ static int pkt_generic_packet(struct pktcdvd_device *pd, 
struct packet_command *
 REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM);
if (IS_ERR(rq))
return PTR_ERR(rq);
-   scsi_req_init(rq);
 
if (cgc->buflen) {
ret = blk_rq_map_kern(q, rq, cgc->buffer, cgc->buflen,
diff --git a/drivers/cdrom/cdrom.c b/drivers/cdrom/cdrom.c
index ff19cfc587f0..e36d160c458f 100644
--- a/drivers/cdrom/cdrom.c
+++ b/drivers/cdrom/cdrom.c
@@ -2201,7 +2201,6 @@ static int cdrom_read_cdda_bpc(struct cdrom_device_info 
*cdi, __u8 __user *ubuf,
break;
}
req = scsi_req(rq);
-   scsi_req_init(rq);
 
ret = blk_rq_map_user(q, rq, NULL, ubuf, len, GFP_KERNEL);
if (ret) {
diff --git a/drivers/ide/ide-atapi.c b/drivers/ide/ide-atapi.c
index 5901937284e7..98e78b520417 100644
--- a/drivers/ide/ide-atapi.c
+++ b/drivers/ide/ide-atapi.c
@@ -93,7 +93,6 @@ int ide_queue_pc_tail(ide_drive_t *drive, struct gendisk 
*disk,
int error;
 
rq = blk_get_request(drive->queue, REQ_OP_DRV_IN, __GFP_RECLAIM);
-   scsi_req_init(rq);
ide_req(rq)->type = ATA_PRIV_MISC;
rq->special = (char *)pc;
 
diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c
index 07e5ff3a64c3..a14ccb34c923 100644
--- a/drivers/ide/ide-cd.c
+++ b/drivers/ide/ide-cd.c
@@ -438,7 +438,6 @@ int ide_cd_queue_pc(ide_drive_t

[PATCH v2 02/12] block: Introduce request_queue.initialize_rq_fn()

2017-05-31 Thread Bart Van Assche

Several block drivers need to initialize the driver-private data
after having called blk_get_request() and before .prep_rq_fn() is
called, e.g. when submitting a REQ_OP_SCSI_* request. Avoid having to
repeat that initialization code after every blk_get_request() call by
adding a new callback function to struct request_queue.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Omar Sandoval 
---
 block/blk-core.c   | 11 +--
 include/linux/blkdev.h |  4 
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 3bc431a77309..3f68bc1f044c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1308,12 +1308,19 @@ static struct request *blk_old_get_request(struct 
request_queue *q,
 struct request *blk_get_request(struct request_queue *q, unsigned int op,
gfp_t gfp_mask)
 {
+   struct request *req;
+
if (q->mq_ops)
-   return blk_mq_alloc_request(q, op,
+   req = blk_mq_alloc_request(q, op,
(gfp_mask & __GFP_DIRECT_RECLAIM) ?
0 : BLK_MQ_REQ_NOWAIT);
else
-   return blk_old_get_request(q, op, gfp_mask);
+   req = blk_old_get_request(q, op, gfp_mask);
+
+   if (!IS_ERR(req) && q->initialize_rq_fn)
+   q->initialize_rq_fn(req);
+
+   return req;
 }
 EXPORT_SYMBOL(blk_get_request);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6c4235018b49..cbc0028290e4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -410,8 +410,12 @@ struct request_queue {
rq_timed_out_fn *rq_timed_out_fn;
dma_drain_needed_fn *dma_drain_needed;
lld_busy_fn *lld_busy_fn;
+   /* Called just after a request is allocated */
init_rq_fn  *init_rq_fn;
+   /* Called just before a request is freed */
exit_rq_fn  *exit_rq_fn;
+   /* Called from inside blk_get_request() */
+   void (*initialize_rq_fn)(struct request *rq);
 
const struct blk_mq_ops *mq_ops;
 
-- 
2.12.2
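
The control flow this patch adds to blk_get_request() can be sketched in plain C. The `model_*` names are hypothetical, and the model signals allocation failure with NULL where the kernel uses ERR_PTR()/IS_ERR(). The callback runs only when allocation succeeded and the queue owner registered one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct model_queue;

struct model_request {
	struct model_queue *q;
	int initialized;	/* set by the driver callback */
};

struct model_queue {
	int fail_alloc;		/* test knob: force allocation failure */
	/* Called from inside model_get_request(), may be NULL. */
	void (*initialize_rq_fn)(struct model_request *rq);
};

struct model_request *model_alloc_request(struct model_queue *q)
{
	struct model_request *rq;

	if (q->fail_alloc)
		return NULL;
	rq = calloc(1, sizeof(*rq));
	if (rq)
		rq->q = q;
	return rq;
}

/* Allocate, then let the driver initialize its private data once. */
struct model_request *model_get_request(struct model_queue *q)
{
	struct model_request *rq = model_alloc_request(q);

	if (rq && q->initialize_rq_fn)
		q->initialize_rq_fn(rq);
	return rq;
}

/* Example driver callback, standing in for e.g. scsi_initialize_rq(). */
void model_scsi_initialize_rq(struct model_request *rq)
{
	assert(rq != NULL);
	rq->initialized = 1;
}
```

Callers then no longer need an explicit initialization call after every allocation, which is what the follow-up patch 03/12 relies on.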

Re: [PATCH v5 00/17] fs: introduce new writeback error reporting and convert ext2 and ext4 to use it

2017-05-31 Thread Jeff Layton

On Wed, 2017-05-31 at 14:37 -0700, Andrew Morton wrote:
> On Wed, 31 May 2017 17:31:49 -0400 Jeff Layton  wrote:
> 
> > On Wed, 2017-05-31 at 13:27 -0700, Andrew Morton wrote:
> > > On Wed, 31 May 2017 08:45:23 -0400 Jeff Layton  wrote:
> > > 
> > > > This is v5 of the patchset to improve how we're tracking and reporting
> > > > errors that occur during pagecache writeback.
> > > 
> > > I'm curious to know how you've been testing this? Is that testing
> > > strong enough for us to be confident that all nature of I/O errors
> > > will be reported to userspace?
> > > 
> > 
> > That's a tall order. This is a difficult thing to test as these sorts of
> > errors are pretty rare by nature.
> > 
> > I have an xfstest that I posted just after this set that demonstrates
> > that it works correctly, at least on ext2/3/4 when run by the ext4
> > driver (ext2 legacy driver reports too many errors currently). I had
> > btrfs and xfs working on that test too in an earlier incarnation of this
> > set, so I think we can fix this in them as well without too much
> > difficulty.
> > 
> > I'm happy to run other tests if someone wants to suggest them.
> > 
> > Now, all that said, I don't think this will make things any worse than
> > they are today as far as reporting errors properly to userland goes.
> > It's rather easy for an incidental synchronous writeback request from an
> > internal caller to clear the AS_* flags today. This will at least ensure
> > that we're reporting errors since a well-defined point in time when you
> > call fsync.
> 
> Were you using error injection of some form?  If so, how was that all
> set up?
> 

Yes, it uses dm-error for fault injection.

The test basically does:

1) set up a dm-error device in a working configuration

2) build a scratch filesystem on it, with the log on a different device
in some fashion so metadata writeback will still succeed.

3) open the same file several times

4) flip dm-error device to non-working mode

5) write to each fd

6) fsync each fd

...do you get back an error on each fsync?

It then does a bit more to make sure they're cleared afterward as you'd
expect. That works for most block device based filesystems. I also have
a second xfstest that opens a block device and does the same basic
thing. That also works correctly with this patch series.

I still need to come up with a way to simulate errors on other fs'
though. We may need to plumb in some kernel-level fault injection on
some fs' to do that correctly. Suggestions welcome there.

With this series though, the idea is to convert one filesystem at a
time, so I think that should help mitigate some of the risk.

-- 
Jeff Layton
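
Steps 3, 5 and 6 of the test described above (open the same file several times, write to each fd, fsync each fd) can be sketched as a small POSIX C helper. This is an illustration only, not the xfstest itself: `model_count_fsync_errors()` is a hypothetical name, and the dm-error setup and the switch to non-working mode are not performed here, so on a healthy filesystem the helper reports zero failures.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

#define MODEL_MAX_FDS 16

/*
 * Open path nfds times, dirty the file through each fd, then fsync each
 * fd. Returns how many fsync() calls failed, or -1 on setup failure.
 */
int model_count_fsync_errors(const char *path, int nfds)
{
	int fds[MODEL_MAX_FDS];
	int i, opened = 0, errors = 0, setup_failed = 0;

	assert(nfds > 0 && nfds <= MODEL_MAX_FDS);

	for (i = 0; i < nfds; i++) {
		fds[i] = open(path, O_WRONLY);
		if (fds[i] < 0)
			break;
		opened++;
	}
	if (opened != nfds)
		setup_failed = 1;

	/* The real test flips the dm-error device to failing mode here. */
	for (i = 0; !setup_failed && i < opened; i++)
		if (write(fds[i], "x", 1) != 1)
			setup_failed = 1;

	/* With the new reporting, every fd should see the error once. */
	for (i = 0; !setup_failed && i < opened; i++)
		if (fsync(fds[i]) < 0)
			errors++;

	for (i = 0; i < opened; i++)
		close(fds[i]);

	return setup_failed ? -1 : errors;
}
```

In the scenario described in the mail, the expected result after the device has been switched to failing mode would be a return value equal to `nfds`.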

Re: [PATCH v2 3/4] blk-mq-debugfs: Show busy requests

2017-05-31 Thread Bart Van Assche

On Wed, 2017-05-31 at 14:49 -0700, Eduardo Valentin wrote:
> On Wed, May 31, 2017 at 09:45:54PM +, Bart Van Assche wrote:
> > On Wed, 2017-05-31 at 14:43 -0700, Eduardo Valentin wrote:
> > > On Wed, May 31, 2017 at 02:30:49PM -0700, Bart Van Assche wrote:
> > > > +static void hctx_show_busy(struct request *rq, void *data, bool 
> > > > reserved)
> > > > +{
> > > > +   const struct show_busy_params *params = data;
> > > > +
> > > > +   if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx &&
> > > > +   test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
> > > > +   __blk_mq_debugfs_rq_show(params->m,
> > > > +list_entry_rq(&rq->queuelist));
> > > > +}
> > > > +
> > > > +static int hctx_busy_show(void *data, struct seq_file *m)
> > > > +{
> > > > +   struct blk_mq_hw_ctx *hctx = data;
> > > > +   struct show_busy_params params = { .m = m, .hctx = hctx };
> > > > +
> > > > +   blk_mq_tagset_busy_iter(hctx->queue->tag_set, hctx_show_busy, 
> > > > &params);
> > > > +
> > > > +   return 0;
> > > > +}
> > > 
> > > Why not making the two above one single function?
> > > hctx_busy_show vs. hctx_show_busy seems a bit confusing, and I could not 
> > > see
> > > where they get reused in your patch set..
> > 
> > Hello Eduardo,
> > 
> > If I would open-code blk_mq_tagset_busy_iter() then I would be able to 
> > implement
> > the above two functions as a single function. However, 
> > blk_mq_tagset_busy_iter()
> > expects a function pointer as third argument. That's why the above 
> > functionality
> > has been split over two functions.
> 
> Yeah, my bad here. I misread the functions. But still the naming doesn't seem
> too suggestive? How about s/hctx_show_busy/hctx_busy_entry/g?

Hello Eduardo,

Since that function shows information about a single request, how about
hctx_show_busy_rq()?

Bart.

Re: [PATCH v2 3/4] blk-mq-debugfs: Show busy requests

2017-05-31 Thread Eduardo Valentin

On Wed, May 31, 2017 at 09:45:54PM +, Bart Van Assche wrote:
> On Wed, 2017-05-31 at 14:43 -0700, Eduardo Valentin wrote:
> > On Wed, May 31, 2017 at 02:30:49PM -0700, Bart Van Assche wrote:
> > > +static void hctx_show_busy(struct request *rq, void *data, bool reserved)
> > > +{
> > > + const struct show_busy_params *params = data;
> > > +
> > > + if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx &&
> > > + test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
> > > + __blk_mq_debugfs_rq_show(params->m,
> > > +  list_entry_rq(&rq->queuelist));
> > > +}
> > > +
> > > +static int hctx_busy_show(void *data, struct seq_file *m)
> > > +{
> > > + struct blk_mq_hw_ctx *hctx = data;
> > > + struct show_busy_params params = { .m = m, .hctx = hctx };
> > > +
> > > + blk_mq_tagset_busy_iter(hctx->queue->tag_set, hctx_show_busy, &params);
> > > +
> > > + return 0;
> > > +}
> > 
> > Why not making the two above one single function?
> > hctx_busy_show vs. hctx_show_busy seems a bit confusing, and I could not see
> > where they get reused in your patch set..
> 
> Hello Eduardo,
> 
> If I would open-code blk_mq_tagset_busy_iter() then I would be able to 
> implement
> the above two functions as a single function. However, 
> blk_mq_tagset_busy_iter()
> expects a function pointer as third argument. That's why the above 
> functionality
> has been split over two functions.

Yeah, my bad here. I misread the functions. But still the naming doesn't seem
too suggestive? How about s/hctx_show_busy/hctx_busy_entry/g?

> 
> Bart.

-- 
All the best,
Eduardo Valentin

Re: [PATCH v2 3/4] blk-mq-debugfs: Show busy requests

2017-05-31 Thread Bart Van Assche

On Wed, 2017-05-31 at 14:43 -0700, Eduardo Valentin wrote:
> On Wed, May 31, 2017 at 02:30:49PM -0700, Bart Van Assche wrote:
> > +static void hctx_show_busy(struct request *rq, void *data, bool reserved)
> > +{
> > +   const struct show_busy_params *params = data;
> > +
> > +   if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx &&
> > +   test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
> > +   __blk_mq_debugfs_rq_show(params->m,
> > +list_entry_rq(&rq->queuelist));
> > +}
> > +
> > +static int hctx_busy_show(void *data, struct seq_file *m)
> > +{
> > +   struct blk_mq_hw_ctx *hctx = data;
> > +   struct show_busy_params params = { .m = m, .hctx = hctx };
> > +
> > +   blk_mq_tagset_busy_iter(hctx->queue->tag_set, hctx_show_busy, &params);
> > +
> > +   return 0;
> > +}
> 
> Why not making the two above one single function?
> hctx_busy_show vs. hctx_show_busy seems a bit confusing, and I could not see
> where they get reused in your patch set..

Hello Eduardo,

If I would open-code blk_mq_tagset_busy_iter() then I would be able to implement
the above two functions as a single function. However, blk_mq_tagset_busy_iter()
expects a function pointer as third argument. That's why the above functionality
has been split over two functions.

Bart.
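
The reason for the two-function split can be shown with a small userspace model of a callback-driven iterator. All `model_*` names are hypothetical, and the real blk_mq_tagset_busy_iter() walks busy tags of a tag set and passes an extra `reserved` argument; the sketch keeps only the shape: the iterator accepts a function pointer plus an opaque data pointer, so the per-request logic must live in its own function and the parameters travel through a small struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_request {
	int hctx_id;	/* hardware queue this request maps to */
	bool started;	/* handed to the driver and not yet completed */
};

typedef void (model_busy_iter_fn)(struct model_request *rq, void *data);

/* Iterator: calls fn for every request, passing data through untouched. */
void model_tagset_busy_iter(struct model_request *rqs, size_t n,
			    model_busy_iter_fn *fn, void *data)
{
	size_t i;

	assert(fn != NULL);
	for (i = 0; i < n; i++)
		fn(&rqs[i], data);
}

/* Parameters bundled for the callback, like struct show_busy_params. */
struct model_show_busy_params {
	int hctx_id;
	int shown;
};

/* Per-request callback: handles a single request. */
void model_show_busy_rq(struct model_request *rq, void *data)
{
	struct model_show_busy_params *params = data;

	if (rq->hctx_id == params->hctx_id && rq->started)
		params->shown++;
}

/* Attribute handler: sets up the parameters and starts the iteration. */
int model_hctx_busy_show(struct model_request *rqs, size_t n, int hctx_id)
{
	struct model_show_busy_params params = { .hctx_id = hctx_id, .shown = 0 };

	model_tagset_busy_iter(rqs, n, model_show_busy_rq, &params);
	return params.shown;
}
```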

[PATCH v2 2/6] block: Introduce queue flag QUEUE_FLAG_SCSI_PASSTHROUGH

2017-05-31 Thread Bart Van Assche

From the context where a SCSI command is submitted it is not always
possible to figure out whether or not the queue the command is
submitted to has struct scsi_request as the first member of its
private data. Hence introduce the flag QUEUE_FLAG_SCSI_PASSTHROUGH.

Signed-off-by: Bart Van Assche 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Martin K. Petersen 
Cc: Omar Sandoval 
Cc: Don Brace 
---
 block/bsg-lib.c   | 1 +
 drivers/block/cciss.c | 1 +
 drivers/ide/ide-probe.c   | 1 +
 drivers/scsi/scsi_lib.c   | 2 ++
 drivers/scsi/scsi_transport_sas.c | 1 +
 include/linux/blkdev.h| 3 +++
 6 files changed, 9 insertions(+)

diff --git a/block/bsg-lib.c b/block/bsg-lib.c
index 0a23dbba2d30..9b91daefcd9b 100644
--- a/block/bsg-lib.c
+++ b/block/bsg-lib.c
@@ -246,6 +246,7 @@ struct request_queue *bsg_setup_queue(struct device *dev, 
char *name,
q->bsg_job_size = dd_job_size;
q->bsg_job_fn = job_fn;
queue_flag_set_unlocked(QUEUE_FLAG_BIDI, q);
+   queue_flag_set_unlocked(QUEUE_FLAG_SCSI_PASSTHROUGH, q);
blk_queue_softirq_done(q, bsg_softirq_done);
blk_queue_rq_timeout(q, BLK_DEFAULT_SG_TIMEOUT);
 
diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
index cd375503f7b0..3761066fe89d 100644
--- a/drivers/block/cciss.c
+++ b/drivers/block/cciss.c
@@ -1956,6 +1956,7 @@ static int cciss_add_disk(ctlr_info_t *h, struct gendisk 
*disk,
disk->queue->cmd_size = sizeof(struct scsi_request);
disk->queue->request_fn = do_cciss_request;
disk->queue->queue_lock = &h->lock;
+   queue_flag_set_unlocked(QUEUE_FLAG_SCSI_PASSTHROUGH, disk->queue);
if (blk_init_allocated_queue(disk->queue) < 0)
goto cleanup_queue;
 
diff --git a/drivers/ide/ide-probe.c b/drivers/ide/ide-probe.c
index 023562565d11..b3f85250dea9 100644
--- a/drivers/ide/ide-probe.c
+++ b/drivers/ide/ide-probe.c
@@ -773,6 +773,7 @@ static int ide_init_queue(ide_drive_t *drive)
q->request_fn = do_ide_request;
q->init_rq_fn = ide_init_rq;
q->cmd_size = sizeof(struct ide_request);
+   queue_flag_set_unlocked(QUEUE_FLAG_SCSI_PASSTHROUGH, q);
if (blk_init_allocated_queue(q) < 0) {
blk_cleanup_queue(q);
return 1;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 99e16ac479e3..884aaa84c2dd 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -2057,6 +2057,8 @@ void __scsi_init_queue(struct Scsi_Host *shost, struct 
request_queue *q)
 {
struct device *dev = shost->dma_dev;
 
+   queue_flag_set_unlocked(QUEUE_FLAG_SCSI_PASSTHROUGH, q);
+
/*
 * this limit is imposed by hardware restrictions
 */
diff --git a/drivers/scsi/scsi_transport_sas.c b/drivers/scsi/scsi_transport_sas.c
index 0ebe2f1bb908..d16414bfe2ef 100644
--- a/drivers/scsi/scsi_transport_sas.c
+++ b/drivers/scsi/scsi_transport_sas.c
@@ -264,6 +264,7 @@ static int sas_bsg_initialize(struct Scsi_Host *shost, 
struct sas_rphy *rphy)
q->queuedata = shost;
 
queue_flag_set_unlocked(QUEUE_FLAG_BIDI, q);
+   queue_flag_set_unlocked(QUEUE_FLAG_SCSI_PASSTHROUGH, q);
return 0;
 
 out_cleanup_queue:
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ab92c4ea138b..019f18c65098 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -618,6 +618,7 @@ struct request_queue {
 #define QUEUE_FLAG_STATS   27  /* track rq completion times */
 #define QUEUE_FLAG_POLL_STATS  28  /* collecting stats for hybrid polling */
 #define QUEUE_FLAG_REGISTERED  29  /* queue has been registered to a disk */
+#define QUEUE_FLAG_SCSI_PASSTHROUGH 30 /* queue supports SCSI commands */
 
 #define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) |\
 (1 << QUEUE_FLAG_STACKABLE)|   \
@@ -708,6 +709,8 @@ static inline void queue_flag_clear(unsigned int flag, 
struct request_queue *q)
 #define blk_queue_secure_erase(q) \
(test_bit(QUEUE_FLAG_SECERASE, &(q)->queue_flags))
 #define blk_queue_dax(q)   test_bit(QUEUE_FLAG_DAX, &(q)->queue_flags)
+#define blk_queue_scsi_passthrough(q)  \
+   test_bit(QUEUE_FLAG_SCSI_PASSTHROUGH, &(q)->queue_flags)
 
 #define blk_noretry_request(rq) \
((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
-- 
2.12.2
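
How a submitter is expected to use the new flag can be sketched in userspace C. The `model_*` names are assumptions for illustration; only the bit number (30) and the -EINVAL result used by the cdrom patch in this series come from the posted patches. The kernel uses test_bit() on q->queue_flags, which the model replaces with plain shifts.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_QUEUE_FLAG_SCSI_PASSTHROUGH 30	/* bit number from the patch */

struct model_queue {
	unsigned long queue_flags;
};

/* Set by the queue owner when the private data starts with a SCSI request. */
void model_queue_flag_set(unsigned int flag, struct model_queue *q)
{
	assert(q != NULL);
	q->queue_flags |= 1UL << flag;
}

bool model_queue_scsi_passthrough(const struct model_queue *q)
{
	return (q->queue_flags >> MODEL_QUEUE_FLAG_SCSI_PASSTHROUGH) & 1UL;
}

/* Submitter side: refuse to send a SCSI command to a non-SCSI queue. */
int model_submit_scsi_command(const struct model_queue *q)
{
	if (!model_queue_scsi_passthrough(q))
		return -EINVAL;
	return 0;
}
```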

Re: [PATCH v2 3/4] blk-mq-debugfs: Show busy requests

2017-05-31 Thread Eduardo Valentin

Hello,

On Wed, May 31, 2017 at 02:30:49PM -0700, Bart Van Assche wrote:
> Requests that got stuck in a block driver are neither on
> blk_mq_ctx.rq_list nor on any hw dispatch queue. Make these
> visible in debugfs through the "busy" attribute.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Christoph Hellwig 
> Cc: Hannes Reinecke 
> Cc: Omar Sandoval 
> Cc: Ming Lei 
> ---
>  block/blk-mq-debugfs.c | 26 ++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 8b06a12c1461..fa0f624dfccd 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -370,6 +370,31 @@ static const struct seq_operations hctx_dispatch_seq_ops 
> = {
>   .show   = blk_mq_debugfs_rq_show,
>  };
>  
> +struct show_busy_params {
> + struct seq_file *m;
> + struct blk_mq_hw_ctx*hctx;
> +};
> +
> +static void hctx_show_busy(struct request *rq, void *data, bool reserved)
> +{
> + const struct show_busy_params *params = data;
> +
> + if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx &&
> + test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
> + __blk_mq_debugfs_rq_show(params->m,
> +  list_entry_rq(&rq->queuelist));
> +}
> +
> +static int hctx_busy_show(void *data, struct seq_file *m)
> +{
> + struct blk_mq_hw_ctx *hctx = data;
> + struct show_busy_params params = { .m = m, .hctx = hctx };
> +
> + blk_mq_tagset_busy_iter(hctx->queue->tag_set, hctx_show_busy, &params);
> +
> + return 0;
> +}

Why not making the two above one single function?
hctx_busy_show vs. hctx_show_busy seems a bit confusing, and I could not see
where they get reused in your patch set..


> +
>  static int hctx_ctx_map_show(void *data, struct seq_file *m)
>  {
>   struct blk_mq_hw_ctx *hctx = data;
> @@ -705,6 +730,7 @@ static const struct blk_mq_debugfs_attr 
> blk_mq_debugfs_hctx_attrs[] = {
>   {"state", 0400, hctx_state_show},
>   {"flags", 0400, hctx_flags_show},
>   {"dispatch", 0400, .seq_ops = &hctx_dispatch_seq_ops},
> + {"busy", 0400, hctx_busy_show},
>   {"ctx_map", 0400, hctx_ctx_map_show},
>   {"tags", 0400, hctx_tags_show},
>   {"tags_bitmap", 0400, hctx_tags_bitmap_show},
> -- 
> 2.12.2
> 
> 

-- 
All the best,
Eduardo Valentin

[PATCH v2 0/6] Split scsi passthrough fields out of struct request sequel

2017-05-31 Thread Bart Van Assche

Hello Jens,

The patches in this series are a sequel of Christoph's "Split scsi passthrough
fields out of struct request" patch series. The changes compared to v1 of this
patch series are:
- Renamed QUEUE_FLAG_SCSI_PDU into QUEUE_FLAG_SCSI_PASSTHROUGH and
  blk_queue_scsi_pdu() into blk_queue_scsi_passthrough().
- In the cdrom driver, moved the SCSI passthrough support test from
  register_cdrom() to cdrom_read_cdda_bpc().

Please consider these patches for kernel v4.13.

Thanks,

Bart.

Bart Van Assche (6):
  block: Avoid that blk_exit_rl() triggers a use-after-free
  block: Introduce queue flag QUEUE_FLAG_SCSI_PASSTHROUGH
  bsg: Check queue type before attaching to a queue
  pktcdvd: Check queue type before attaching to a queue
  cdrom: Check SCSI passthrough support before reading audio
  nfsd: Check queue type before submitting a SCSI request

 block/blk-cgroup.c|  2 +-
 block/blk-core.c  | 10 --
 block/blk-sysfs.c |  2 +-
 block/blk.h   |  2 +-
 block/bsg-lib.c   |  1 +
 block/bsg.c   |  6 ++
 drivers/block/cciss.c |  1 +
 drivers/block/pktcdvd.c   |  5 +
 drivers/cdrom/cdrom.c |  6 ++
 drivers/ide/ide-probe.c   |  1 +
 drivers/scsi/scsi_lib.c   |  2 ++
 drivers/scsi/scsi_transport_sas.c |  1 +
 fs/nfsd/blocklayout.c |  3 +++
 include/linux/blkdev.h|  3 +++
 14 files changed, 40 insertions(+), 5 deletions(-)

-- 
2.12.2

[PATCH v2 5/6] cdrom: Check SCSI passthrough support before reading audio

2017-05-31 Thread Bart Van Assche

The CDROMREADAUDIO ioctl uses SCSI passthrough when the .disk
pointer has been set in struct cdrom_device_info. Hence check
whether SCSI passthrough is supported before submitting a SCSI
command. Note: both the ide-cd and sr drivers set the disk
pointer in struct cdrom_device_info but neither the pcd nor
the gdrom driver sets that pointer.

References: commit 82ed4db499b8 ("block: split scsi_request out of struct 
request")
Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Omar Sandoval 
Cc: linux-block@vger.kernel.org
---
 drivers/cdrom/cdrom.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/cdrom/cdrom.c b/drivers/cdrom/cdrom.c
index 76c952fd9ab9..ff19cfc587f0 100644
--- a/drivers/cdrom/cdrom.c
+++ b/drivers/cdrom/cdrom.c
@@ -2178,6 +2178,12 @@ static int cdrom_read_cdda_bpc(struct cdrom_device_info 
*cdi, __u8 __user *ubuf,
if (!q)
return -ENXIO;
 
+   if (!blk_queue_scsi_passthrough(q)) {
+   WARN_ONCE(true,
+ "Attempt read CDDA info through a non-SCSI queue\n");
+   return -EINVAL;
+   }
+
cdi->last_sense = 0;
 
while (nframes) {
-- 
2.12.2

[PATCH v2 1/6] block: Avoid that blk_exit_rl() triggers a use-after-free

2017-05-31 Thread Bart Van Assche

Since the introduction of .init_rq_fn() and .exit_rq_fn() it is
essential that the memory allocated for struct request_queue
stays around until all blk_exit_rl() calls have finished. Hence
make blk_init_rl() take a reference on struct request_queue.

This patch fixes the following crash:

general protection fault:  [#2] SMP
CPU: 3 PID: 28 Comm: ksoftirqd/3 Tainted: G  D 4.12.0-rc2-dbg+ #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.0.0-prebuilt.qemu-project.org 04/01/2014
task: 88013a108040 task.stack: c971c000
RIP: 0010:free_request_size+0x1a/0x30
RSP: 0018:c971fd38 EFLAGS: 00010202
RAX: 6b6b6b6b6b6b6b6b RBX: 880067362a88 RCX: 0003
RDX: 880067464178 RSI: 880067362a88 RDI: 880135ea4418
RBP: c971fd40 R08:  R09: 000100180009
R10: c971fd38 R11: 81110800 R12: 88006752d3d8
R13: 88006752d3d8 R14: 88013a108040 R15: 000a
FS:  () GS:88013fd8() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7fa8ec1edb00 CR3: 000138ee8000 CR4: 001406e0
Call Trace:
 mempool_destroy.part.10+0x21/0x40
 mempool_destroy+0xe/0x10
 blk_exit_rl+0x12/0x20
 blkg_free+0x4d/0xa0
 __blkg_release_rcu+0x59/0x170
 rcu_process_callbacks+0x260/0x4e0
 __do_softirq+0x116/0x250
 smpboot_thread_fn+0x123/0x1e0
 kthread+0x109/0x140
 ret_from_fork+0x31/0x40

Fixes: commit e9c787e65c0c ("scsi: allocate scsi_cmnd structures as part of 
struct request")
Signed-off-by: Bart Van Assche 
Acked-by: Tejun Heo 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Christoph Hellwig 
Cc: Jan Kara 
Cc:  # v4.11+
---
 block/blk-cgroup.c |  2 +-
 block/blk-core.c   | 10 --
 block/blk-sysfs.c  |  2 +-
 block/blk.h|  2 +-
 4 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 7c2947128f58..0480892e97e5 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -74,7 +74,7 @@ static void blkg_free(struct blkcg_gq *blkg)
blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
 
if (blkg->blkcg != &blkcg_root)
-   blk_exit_rl(&blkg->rl);
+   blk_exit_rl(blkg->q, &blkg->rl);
 
blkg_rwstat_exit(&blkg->stat_ios);
blkg_rwstat_exit(&blkg->stat_bytes);
diff --git a/block/blk-core.c b/block/blk-core.c
index c7068520794b..a7421b772d0e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -648,13 +648,19 @@ int blk_init_rl(struct request_list *rl, struct 
request_queue *q,
if (!rl->rq_pool)
return -ENOMEM;
 
+   if (rl != &q->root_rl)
+   WARN_ON_ONCE(!blk_get_queue(q));
+
return 0;
 }
 
-void blk_exit_rl(struct request_list *rl)
+void blk_exit_rl(struct request_queue *q, struct request_list *rl)
 {
-   if (rl->rq_pool)
+   if (rl->rq_pool) {
mempool_destroy(rl->rq_pool);
+   if (rl != &q->root_rl)
+   blk_put_queue(q);
+   }
 }
 
 struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 712b018e9f54..283da7fbe034 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -809,7 +809,7 @@ static void blk_release_queue(struct kobject *kobj)
 
blk_free_queue_stats(q->stats);
 
-   blk_exit_rl(&q->root_rl);
+   blk_exit_rl(q, &q->root_rl);
 
if (q->queue_tags)
__blk_queue_free_tags(q);
diff --git a/block/blk.h b/block/blk.h
index 2ed70228e44f..83c8e1100525 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -59,7 +59,7 @@ void blk_free_flush_queue(struct blk_flush_queue *q);
 
 int blk_init_rl(struct request_list *rl, struct request_queue *q,
gfp_t gfp_mask);
-void blk_exit_rl(struct request_list *rl);
+void blk_exit_rl(struct request_queue *q, struct request_list *rl);
 void blk_rq_bio_prep(struct request_queue *q, struct request *rq,
struct bio *bio);
 void blk_queue_bypass_start(struct request_queue *q);
-- 
2.12.2
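
The lifetime rule described in the commit message above (a request list pins its queue until the list has been torn down) can be illustrated with a small userspace sketch. This is not the kernel code: struct queue, struct req_list and all function names below are hypothetical stand-ins, and the sketch leaves out the special case in the patch where the root list, which lives inside the queue itself, takes no reference.

```c
#include <assert.h>

/* Hypothetical stand-ins for struct request_queue and struct request_list. */
struct queue {
	int refcount;	/* number of holders keeping the queue alive */
	int freed;	/* set once the last reference has been dropped */
};

struct req_list {
	struct queue *q;	/* queue this list depends on */
	int pool_live;		/* nonzero while the list's pool exists */
};

static void queue_get(struct queue *q)
{
	q->refcount++;
}

static void queue_put(struct queue *q)
{
	assert(q->refcount > 0);
	if (--q->refcount == 0)
		q->freed = 1;	/* real code would release the memory here */
}

/* Setting up a list pins the queue ... */
static void list_init(struct req_list *rl, struct queue *q)
{
	rl->q = q;
	rl->pool_live = 1;
	queue_get(q);
}

/* ... and the pin is only dropped once the pool has been destroyed. */
static void list_exit(struct req_list *rl)
{
	if (rl->pool_live) {
		rl->pool_live = 0;	/* stands in for mempool_destroy() */
		queue_put(rl->q);
	}
}
```

With this arrangement the owner of the queue can drop its own reference early, and the queue still outlives every list that was initialised against it.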

[PATCH v2 4/6] pktcdvd: Check queue type before attaching to a queue

2017-05-31 Thread Bart Van Assche

Since the pktcdvd driver only supports request queues for which
struct scsi_request is the first member of their private request
data, refuse to register block layer queues for which struct
scsi_request is not the first member of the private data.

References: commit 82ed4db499b8 ("block: split scsi_request out of struct 
request")
Signed-off-by: Bart Van Assche 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Christoph Hellwig 
Cc: Omar Sandoval 
---
 drivers/block/pktcdvd.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 205b865ebeb9..42e3c880a8a5 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2583,6 +2583,11 @@ static int pkt_new_dev(struct pktcdvd_device *pd, dev_t 
dev)
bdev = bdget(dev);
if (!bdev)
return -ENOMEM;
+   if (!blk_queue_scsi_passthrough(bdev_get_queue(bdev))) {
+   WARN_ONCE(true, "Attempt to register a non-SCSI queue\n");
+   bdput(bdev);
+   return -EINVAL;
+   }
ret = blkdev_get(bdev, FMODE_READ | FMODE_NDELAY, NULL);
if (ret)
return ret;
-- 
2.12.2
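
A minimal userspace sketch of the check described above, assuming a single flag bit on the queue marks it as passthrough-capable. QF_SCSI_PASSTHROUGH, struct queue, struct scsi_req and attach() are illustrative names, not the kernel definitions. The static assertion shows the layout property the commit message relies on: the SCSI request must sit at offset 0 of the per-request private data, so that a pointer to that data can be treated as a pointer to the SCSI request.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define QF_SCSI_PASSTHROUGH (1u << 0)	/* hypothetical flag bit */

struct scsi_req {
	unsigned char cmd[16];
	unsigned int cmd_len;
};

/* Private per-request data of a driver that supports passthrough. */
struct scsi_payload {
	struct scsi_req sreq;	/* must be the first member */
	int driver_state;
};

/* Compile-time check of the "first member" requirement. */
_Static_assert(offsetof(struct scsi_payload, sreq) == 0,
	       "scsi_req must be the first member");

struct queue {
	unsigned int flags;
};

static int queue_is_passthrough(const struct queue *q)
{
	return (q->flags & QF_SCSI_PASSTHROUGH) != 0;
}

/* Refuse to attach to a queue that has not advertised passthrough support. */
static int attach(const struct queue *q)
{
	if (!queue_is_passthrough(q))
		return -EINVAL;
	return 0;
}
```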

Re: [PATCH v5 00/17] fs: introduce new writeback error reporting and convert ext2 and ext4 to use it

2017-05-31 Thread Andrew Morton

On Wed, 31 May 2017 17:31:49 -0400 Jeff Layton  wrote:

> On Wed, 2017-05-31 at 13:27 -0700, Andrew Morton wrote:
> > On Wed, 31 May 2017 08:45:23 -0400 Jeff Layton  wrote:
> > 
> > > This is v5 of the patchset to improve how we're tracking and reporting
> > > errors that occur during pagecache writeback.
> > 
> > I'm curious to know how you've been testing this?  Is that testing
> > strong enough for us to be confident that all nature of I/O errors
> > will be reported to userspace?
> > 
> 
> That's a tall order. This is a difficult thing to test as these sorts of
> errors are pretty rare by nature.
> 
> I have an xfstest that I posted just after this set that demonstrates
> that it works correctly, at least on ext2/3/4 when run by the ext4
> driver (ext2 legacy driver reports too many errors currently). I had
> btrfs and xfs working on that test too in an earlier incarnation of this
> set, so I think we can fix this in them as well without too much
> difficulty.
> 
> I'm happy to run other tests if someone wants to suggest them.
> 
> Now, all that said, I don't think this will make things any worse than
> they are today as far as reporting errors properly to userland goes.
> It's rather easy for an incidental synchronous writeback request from an
> internal caller to clear the AS_* flags today. This will at least ensure
> that we're reporting errors since a well-defined point in time when you
> call fsync.

Were you using error injection of some form?  If so, how was that all
set up?

Re: [PATCH v5 00/17] fs: introduce new writeback error reporting and convert ext2 and ext4 to use it

2017-05-31 Thread Jeff Layton

On Wed, 2017-05-31 at 13:27 -0700, Andrew Morton wrote:
> On Wed, 31 May 2017 08:45:23 -0400 Jeff Layton  wrote:
> 
> > This is v5 of the patchset to improve how we're tracking and reporting
> > errors that occur during pagecache writeback.
> 
> I'm curious to know how you've been testing this?  Is that testing
> strong enough for us to be confident that all nature of I/O errors
> will be reported to userspace?
> 

That's a tall order. This is a difficult thing to test as these sorts of
errors are pretty rare by nature.

I have an xfstest that I posted just after this set that demonstrates
that it works correctly, at least on ext2/3/4 when run by the ext4
driver (ext2 legacy driver reports too many errors currently). I had
btrfs and xfs working on that test too in an earlier incarnation of this
set, so I think we can fix this in them as well without too much
difficulty.

I'm happy to run other tests if someone wants to suggest them.

Now, all that said, I don't think this will make things any worse than
they are today as far as reporting errors properly to userland goes.
It's rather easy for an incidental synchronous writeback request from an
internal caller to clear the AS_* flags today. This will at least ensure
that we're reporting errors since a well-defined point in time when you
call fsync.
-- 
Jeff Layton
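
The difference between a test-and-clear flag and reporting "errors since a well-defined point in time" can be sketched in userspace as follows. This is only an illustration of the idea and makes no claim about how the patch set implements it; every type and function name here is hypothetical.

```c
#include <assert.h>
#include <errno.h>

/* Flag scheme: whoever checks first consumes the error. */
struct flag_mapping {
	int err;
};

static int flag_check_and_clear(struct flag_mapping *m)
{
	int e = m->err;

	m->err = 0;
	return e;
}

/* Counter scheme: the mapping keeps the last error plus a sequence number. */
struct seq_mapping {
	int err;
	unsigned int seq;
};

/* Each open handle remembers the sequence number it has already seen. */
struct handle {
	unsigned int seen;
};

static void record_error(struct seq_mapping *m, int err)
{
	m->err = err;
	m->seq++;
}

static void handle_open(const struct seq_mapping *m, struct handle *h)
{
	h->seen = m->seq;	/* the well-defined starting point */
}

/* Report an error once per handle if one was recorded since the last check. */
static int handle_check(const struct seq_mapping *m, struct handle *h)
{
	if (h->seen == m->seq)
		return 0;
	h->seen = m->seq;
	return m->err;
}
```

In the flag scheme an internal caller that checks first hides the error from a later fsync caller. In the counter scheme every handle that was open when the error was recorded gets to see it exactly once.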

[PATCH v2 2/4] blk-mq-debugfs: Show requeue list

2017-05-31 Thread Bart Van Assche

When verifying whether or not a blk-mq driver forgot to kick the
requeue list after having requeued a request it is important to
be able to verify the contents of the requeue list. Hence export
that list through debugfs.

Signed-off-by: Bart Van Assche 
Reviewed-by: Hannes Reinecke 
Cc: Christoph Hellwig 
Cc: Omar Sandoval 
Cc: Ming Lei 
---
 block/blk-mq-debugfs.c | 32 
 1 file changed, 32 insertions(+)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index d56ddd7a1285..8b06a12c1461 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -308,6 +308,37 @@ int blk_mq_debugfs_rq_show(struct seq_file *m, void *v)
 }
 EXPORT_SYMBOL_GPL(blk_mq_debugfs_rq_show);
 
+static void *queue_requeue_list_start(struct seq_file *m, loff_t *pos)
+   __acquires(&q->requeue_lock)
+{
+   struct request_queue *q = m->private;
+
+   spin_lock_irq(&q->requeue_lock);
+   return seq_list_start(&q->requeue_list, *pos);
+}
+
+static void *queue_requeue_list_next(struct seq_file *m, void *v, loff_t *pos)
+{
+   struct request_queue *q = m->private;
+
+   return seq_list_next(v, &q->requeue_list, pos);
+}
+
+static void queue_requeue_list_stop(struct seq_file *m, void *v)
+   __releases(&q->requeue_lock)
+{
+   struct request_queue *q = m->private;
+
+   spin_unlock_irq(&q->requeue_lock);
+}
+
+static const struct seq_operations queue_requeue_list_seq_ops = {
+   .start  = queue_requeue_list_start,
+   .next   = queue_requeue_list_next,
+   .stop   = queue_requeue_list_stop,
+   .show   = blk_mq_debugfs_rq_show,
+};
+
 static void *hctx_dispatch_start(struct seq_file *m, loff_t *pos)
__acquires(&hctx->lock)
 {
@@ -665,6 +696,7 @@ const struct file_operations blk_mq_debugfs_fops = {
 
 static const struct blk_mq_debugfs_attr blk_mq_debugfs_queue_attrs[] = {
{"poll_stat", 0400, queue_poll_stat_show},
+   {"requeue_list", 0400, .seq_ops = &queue_requeue_list_seq_ops},
{"state", 0600, queue_state_show, queue_state_write},
{},
 };
-- 
2.12.2
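
The start/next/stop/show split used by the patch above follows a common iterator shape: start takes the lock and seeks to a position, next advances, show formats one element, and stop releases the lock. A rough userspace sketch of that shape is given below. It uses a plain singly linked list and an integer in place of a spinlock, and none of the names are the kernel's seq_file API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct node {
	int tag;
	struct node *next;
};

struct list {
	struct node *head;
	int locked;	/* stands in for the spinlock */
};

/* start: take the lock, then seek to element number pos. */
static struct node *it_start(struct list *l, size_t pos)
{
	struct node *n;

	l->locked = 1;
	for (n = l->head; n && pos; n = n->next)
		pos--;
	return n;
}

/* next: advance the position and return the following element. */
static struct node *it_next(struct node *n, size_t *pos)
{
	(*pos)++;
	return n->next;
}

/* stop: release the lock, whether or not the walk reached the end. */
static void it_stop(struct list *l)
{
	l->locked = 0;
}

/* Walk the list and format each element; returns the number shown. */
static int dump(struct list *l, char *buf, size_t len)
{
	size_t pos = 0, used = 0;
	int shown = 0;
	struct node *n;

	if (len)
		buf[0] = '\0';
	for (n = it_start(l, pos); n; n = it_next(n, &pos)) {
		int w = snprintf(buf + used, len - used, "%d\n", n->tag);

		if (w < 0 || (size_t)w >= len - used)
			break;	/* output buffer is full */
		used += (size_t)w;
		shown++;
	}
	it_stop(l);
	return shown;
}
```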

[PATCH v2 4/4] blk-mq-debugfs: Add 'kick' operation

2017-05-31 Thread Bart Van Assche

Running a queue causes the block layer to examine the per-CPU and
hw queues but not the requeue list. Hence add a 'kick' operation
that also examines the requeue list.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Omar Sandoval 
Cc: Ming Lei 
---
 block/blk-mq-debugfs.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index fa0f624dfccd..962c8417809d 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -114,10 +114,12 @@ static ssize_t queue_state_write(void *data, const char 
__user *buf,
blk_mq_run_hw_queues(q, true);
} else if (strcmp(op, "start") == 0) {
blk_mq_start_stopped_hw_queues(q, true);
+   } else if (strcmp(op, "kick") == 0) {
+   blk_mq_kick_requeue_list(q);
} else {
pr_err("%s: unsupported operation '%s'\n", __func__, op);
 inval:
-   pr_err("%s: use either 'run' or 'start'\n", __func__);
+   pr_err("%s: use 'run', 'start' or 'kick'\n", __func__);
return -EINVAL;
}
return count;
-- 
2.12.2
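
The patch above extends an if/else chain of strcmp() calls with a third operation name. As an aside, the same dispatch can be written as a lookup table, which keeps the list of accepted names in one place. The sketch below is a userspace illustration of that alternative, not what the patch does; enum queue_op and parse_queue_op() are made-up names.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

enum queue_op { OP_RUN, OP_START, OP_KICK };

/* Map an operation name to its enum value; -EINVAL for unknown names. */
static int parse_queue_op(const char *op, enum queue_op *out)
{
	static const struct {
		const char *name;
		enum queue_op op;
	} table[] = {
		{ "run", OP_RUN },
		{ "start", OP_START },
		{ "kick", OP_KICK },
	};
	size_t i;

	for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
		if (strcmp(op, table[i].name) == 0) {
			*out = table[i].op;
			return 0;
		}
	}
	return -EINVAL;
}
```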

[PATCH v2 3/4] blk-mq-debugfs: Show busy requests

2017-05-31 Thread Bart Van Assche

Requests that got stuck in a block driver are neither on
blk_mq_ctx.rq_list nor on any hw dispatch queue. Make these
visible in debugfs through the "busy" attribute.

Signed-off-by: Bart Van Assche 
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Omar Sandoval 
Cc: Ming Lei 
---
 block/blk-mq-debugfs.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 8b06a12c1461..fa0f624dfccd 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -370,6 +370,31 @@ static const struct seq_operations hctx_dispatch_seq_ops = 
{
.show   = blk_mq_debugfs_rq_show,
 };
 
+struct show_busy_params {
+   struct seq_file *m;
+   struct blk_mq_hw_ctx*hctx;
+};
+
+static void hctx_show_busy(struct request *rq, void *data, bool reserved)
+{
+   const struct show_busy_params *params = data;
+
+   if (blk_mq_map_queue(rq->q, rq->mq_ctx->cpu) == params->hctx &&
+   test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
+   __blk_mq_debugfs_rq_show(params->m,
+list_entry_rq(&rq->queuelist));
+}
+
+static int hctx_busy_show(void *data, struct seq_file *m)
+{
+   struct blk_mq_hw_ctx *hctx = data;
+   struct show_busy_params params = { .m = m, .hctx = hctx };
+
+   blk_mq_tagset_busy_iter(hctx->queue->tag_set, hctx_show_busy, &params);
+
+   return 0;
+}
+
 static int hctx_ctx_map_show(void *data, struct seq_file *m)
 {
struct blk_mq_hw_ctx *hctx = data;
@@ -705,6 +730,7 @@ static const struct blk_mq_debugfs_attr 
blk_mq_debugfs_hctx_attrs[] = {
{"state", 0400, hctx_state_show},
{"flags", 0400, hctx_flags_show},
{"dispatch", 0400, .seq_ops = &hctx_dispatch_seq_ops},
+   {"busy", 0400, hctx_busy_show},
{"ctx_map", 0400, hctx_ctx_map_show},
{"tags", 0400, hctx_tags_show},
{"tags_bitmap", 0400, hctx_tags_bitmap_show},
-- 
2.12.2
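
The patch above pairs a small entry point with a callback because the tag iterator only accepts a function pointer plus one opaque data pointer, so the arguments are bundled into a parameter struct. This may also bear on the question raised in review about why there are two functions. A userspace sketch of the pattern follows, with a plain array in place of the tag set and a counter in place of the seq_file output; all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct req {
	int hw_queue;	/* hardware queue the request maps to */
	int started;	/* nonzero once handed to the driver */
	int tag;
};

typedef void (*busy_fn)(struct req *rq, void *data);

/* Generic iterator: knows how to walk requests, not what to do with them. */
static void for_each_req(struct req *reqs, size_t n, busy_fn fn, void *data)
{
	size_t i;

	for (i = 0; i < n; i++)
		fn(&reqs[i], data);
}

/* Everything the callback needs, passed through the opaque pointer. */
struct show_params {
	int hw_queue;
	int count;
};

/* Callback: only count started requests that belong to the wanted queue. */
static void show_busy(struct req *rq, void *data)
{
	struct show_params *p = data;

	if (rq->hw_queue == p->hw_queue && rq->started)
		p->count++;
}

/* Entry point: bundle the arguments and hand the callback to the iterator. */
static int busy_count(struct req *reqs, size_t n, int hw_queue)
{
	struct show_params p = { .hw_queue = hw_queue, .count = 0 };

	for_each_req(reqs, n, show_busy, &p);
	return p.count;
}
```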

Re: [PATCH v5 00/17] fs: introduce new writeback error reporting and convert ext2 and ext4 to use it

2017-05-31 Thread Andrew Morton

On Wed, 31 May 2017 08:45:23 -0400 Jeff Layton  wrote:

> This is v5 of the patchset to improve how we're tracking and reporting
> errors that occur during pagecache writeback.

I'm curious to know how you've been testing this?  Is that testing
strong enough for us to be confident that all nature of I/O errors
will be reported to userspace?

Re: [PATCH 08/24] uuid: rename uuid_to_bin to uuid_parse

2017-05-31 Thread Christoph Hellwig

On Wed, May 31, 2017 at 09:14:35PM +0300, Andy Shevchenko wrote:
> On Wed, 2017-05-31 at 18:18 +0200, Christoph Hellwig wrote:
> > This matches the userspace version of it, and describes the
> > functionality
> > much better.  Also do the same for the guid version.
> > 
> 
> No objections to the renaming, though I'm pretty sure it should be squashed
> into patch 6.

Fine with me, it just seemed like an unrelated enough change to justify
keeping it separate.

Re: [PATCH 15/24] block: remove blk_part_pack_uuid

2017-05-31 Thread Christoph Hellwig

On Wed, May 31, 2017 at 09:16:34PM +0300, Andy Shevchenko wrote:
> On Wed, 2017-05-31 at 18:18 +0200, Christoph Hellwig wrote:
> > This helper was only used by IMA of all things, which would get
> > spurious
> > errors if CONFIG_BLOCK is disabled.  Just opencode the call there.
> 
> > -   result = blk_part_pack_uuid(args[0].from,
> > -   entry->fsuuid);
> > +   result = uuid_to_bin(args[0].from, (uuid_t
> > *)&entry->fsuuid);
> > 
> 
> uuid_parse() ?

Yes.  I fixed this up, but it seems like I didn't squash it into
the right patch.

Re: [PATCH 01/24] uuid,afs: move struct uuid_v1 back into afs

2017-05-31 Thread David Howells

Christoph Hellwig  wrote:

> This is essentially a partial revert of commit ff548773
> ("afs: Move UUID struct to linux/uuid.h") and moves struct uuid_v1 back into
> fs/afs as struct afs_uuid.  It does, however, keep it as a big-endian structure
> so that we can use the normal uuid generation helpers when casting to/from
> struct afs_uuid.
> 
> The V1 uuid interpretation in struct form isn't really useful to the
> rest of the kernel, and not really compatible with it either, so move it
> back to AFS instead of polluting the global uuid.h.
> 
> Signed-off-by: Christoph Hellwig 

Acked-by: David Howells

Re: [PATCH 15/24] block: remove blk_part_pack_uuid

2017-05-31 Thread Andy Shevchenko

On Wed, 2017-05-31 at 18:18 +0200, Christoph Hellwig wrote:
> This helper was only used by IMA of all things, which would get
> spurious
> errors if CONFIG_BLOCK is disabled.  Just opencode the call there.

> - result = blk_part_pack_uuid(args[0].from,
> - entry->fsuuid);
> + result = uuid_to_bin(args[0].from, (uuid_t
> *)&entry->fsuuid);
> 

uuid_parse() ?

-- 
Andy Shevchenko 
Intel Finland Oy

Re: [PATCH 08/24] uuid: rename uuid_to_bin to uuid_parse

2017-05-31 Thread Andy Shevchenko

On Wed, 2017-05-31 at 18:18 +0200, Christoph Hellwig wrote:
> This matches the userspace version of it, and describes the
> functionality
> much better.  Also do the same for the guid version.
> 

No objections to the renaming, though I'm pretty sure it should be squashed
into patch 6.

> Signed-off-by: Christoph Hellwig 
> ---
>  include/linux/uuid.h |  8 
>  lib/test_uuid.c  |  8 
>  lib/uuid.c   | 14 +++---
>  3 files changed, 15 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/uuid.h b/include/linux/uuid.h
> index a9d0fdba5404..82b165b579f5 100644
> --- a/include/linux/uuid.h
> +++ b/include/linux/uuid.h
> @@ -45,8 +45,8 @@ bool __must_check uuid_is_valid(const char *uuid);
>  extern const u8 guid_index[16];
>  extern const u8 uuid_index[16];
>  
> -int guid_to_bin(const char *uuid, guid_t *u);
> -int uuid_to_bin(const char *uuid, uuid_t *u);
> +int guid_parse(const char *uuid, guid_t *u);
> +int uuid_parse(const char *uuid, uuid_t *u);
>  
>  /* backwards compatibility, don't use in new code */
>  typedef uuid_t uuid_be;
> @@ -58,8 +58,8 @@ typedef uuid_t uuid_be;
>  
>  #define uuid_le_gen(u)   guid_gen(u)
>  #define uuid_be_gen(u)   uuid_gen(u)
> -#define uuid_le_to_bin(guid, u)  guid_to_bin(guid, u)
> -#define uuid_be_to_bin(uuid, u)  uuid_to_bin(uuid, u)
> +#define uuid_le_to_bin(guid, u)  guid_parse(guid, u)
> +#define uuid_be_to_bin(uuid, u)  uuid_parse(uuid, u)
>  
>  static inline int uuid_le_cmp(const guid_t u1, const guid_t u2)
>  {
> diff --git a/lib/test_uuid.c b/lib/test_uuid.c
> index 9cad846fd805..edda536a7b45 100644
> --- a/lib/test_uuid.c
> +++ b/lib/test_uuid.c
> @@ -67,7 +67,7 @@ static void __init test_uuid_test(const struct
> test_uuid_data *data)
>  
>   /* LE */
>   total_tests++;
> - if (guid_to_bin(data->uuid, ))
> + if (guid_parse(data->uuid, ))
>   test_uuid_failed("conversion", false, false, data-
> >uuid, NULL);
>  
>   total_tests++;
> @@ -78,7 +78,7 @@ static void __init test_uuid_test(const struct
> test_uuid_data *data)
>  
>   /* BE */
>   total_tests++;
> - if (uuid_to_bin(data->uuid, ))
> + if (uuid_parse(data->uuid, ))
>   test_uuid_failed("conversion", false, true, data-
> >uuid, NULL);
>  
>   total_tests++;
> @@ -95,12 +95,12 @@ static void __init test_uuid_wrong(const char
> *data)
>  
>   /* LE */
>   total_tests++;
> - if (!guid_to_bin(data, ))
> + if (!guid_parse(data, ))
>   test_uuid_failed("negative", true, false, data,
> NULL);
>  
>   /* BE */
>   total_tests++;
> - if (!uuid_to_bin(data, ))
> + if (!uuid_parse(data, ))
>   test_uuid_failed("negative", true, true, data, NULL);
>  }
>  
> diff --git a/lib/uuid.c b/lib/uuid.c
> index f80dc63f6ca8..90bee73f7bd7 100644
> --- a/lib/uuid.c
> +++ b/lib/uuid.c
> @@ -97,7 +97,7 @@ bool uuid_is_valid(const char *uuid)
>  }
>  EXPORT_SYMBOL(uuid_is_valid);
>  
> -static int __uuid_to_bin(const char *uuid, __u8 b[16], const u8
> ei[16])
> +static int __uuid_parse(const char *uuid, __u8 b[16], const u8
> ei[16])
>  {
>   static const u8 si[16] =
> {0,2,4,6,9,11,14,16,19,21,24,26,28,30,32,34};
>   unsigned int i;
> @@ -115,14 +115,14 @@ static int __uuid_to_bin(const char *uuid, __u8
> b[16], const u8 ei[16])
>   return 0;
>  }
>  
> -int guid_to_bin(const char *uuid, guid_t *u)
> +int guid_parse(const char *uuid, guid_t *u)
>  {
> - return __uuid_to_bin(uuid, u->b, guid_index);
> + return __uuid_parse(uuid, u->b, guid_index);
>  }
> -EXPORT_SYMBOL(guid_to_bin);
> +EXPORT_SYMBOL(guid_parse);
>  
> -int uuid_to_bin(const char *uuid, uuid_t *u)
> +int uuid_parse(const char *uuid, uuid_t *u)
>  {
> - return __uuid_to_bin(uuid, u->b, uuid_index);
> + return __uuid_parse(uuid, u->b, uuid_index);
>  }
> -EXPORT_SYMBOL(uuid_to_bin);
> +EXPORT_SYMBOL(uuid_parse);

-- 
Andy Shevchenko 
Intel Finland Oy
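
For readers unfamiliar with the helper being renamed in the quoted patch, the parsing step can be sketched in userspace as below. The si table (string offsets of each byte's two hex digits) is copied from the quoted __uuid_parse(). The two output index tables are written from memory of lib/uuid.c and should be treated as assumptions, as should the validation loop, which only approximates uuid_is_valid(). parse_uuid_sketch(), hex_val(), be_index and le_index are illustrative names.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Output byte order for the big-endian (uuid) and mixed-endian (guid) forms. */
static const uint8_t be_index[16] = {0, 1, 2, 3, 4, 5, 6, 7,
				     8, 9, 10, 11, 12, 13, 14, 15};
static const uint8_t le_index[16] = {3, 2, 1, 0, 5, 4, 7, 6,
				     8, 9, 10, 11, 12, 13, 14, 15};

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_val(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/* Parse "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" into 16 bytes ordered by ei. */
static int parse_uuid_sketch(const char *s, uint8_t out[16],
			     const uint8_t ei[16])
{
	static const uint8_t si[16] = {0, 2, 4, 6, 9, 11, 14, 16,
				       19, 21, 24, 26, 28, 30, 32, 34};
	unsigned int i;

	/* Dashes at 8, 13, 18 and 23, hex digits everywhere else. */
	for (i = 0; i < 36; i++) {
		if (i == 8 || i == 13 || i == 18 || i == 23) {
			if (s[i] != '-')
				return -EINVAL;
		} else if (hex_val(s[i]) < 0) {
			return -EINVAL;
		}
	}

	for (i = 0; i < 16; i++)
		out[ei[i]] = (uint8_t)((hex_val(s[si[i]]) << 4) |
				       hex_val(s[si[i] + 1]));
	return 0;
}
```

Using one parser with two index tables is what lets the uuid and guid variants share a single implementation: only the destination byte order differs.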

[PATCH 01/24] uuid,afs: move struct uuid_v1 back into afs

2017-05-31 Thread Christoph Hellwig

This is essentially a partial revert of commit ff548773
("afs: Move UUID struct to linux/uuid.h") and moves struct uuid_v1 back into
fs/afs as struct afs_uuid.  It does, however, keep it as a big-endian structure
so that we can use the normal uuid generation helpers when casting to/from
struct afs_uuid.

The V1 uuid interpretation in struct form isn't really useful to the
rest of the kernel, and not really compatible with it either, so move it
back to AFS instead of polluting the global uuid.h.

Signed-off-by: Christoph Hellwig 
---
 fs/afs/cmservice.c   | 16 
 fs/afs/internal.h| 11 ++-
 fs/afs/main.c|  2 +-
 include/linux/uuid.h | 24 
 4 files changed, 19 insertions(+), 34 deletions(-)

diff --git a/fs/afs/cmservice.c b/fs/afs/cmservice.c
index 3062cceb5c2a..782d4d05a53b 100644
--- a/fs/afs/cmservice.c
+++ b/fs/afs/cmservice.c
@@ -350,7 +350,7 @@ static int afs_deliver_cb_init_call_back_state3(struct 
afs_call *call)
 {
struct sockaddr_rxrpc srx;
struct afs_server *server;
-   struct uuid_v1 *r;
+   struct afs_uuid *r;
unsigned loop;
__be32 *b;
int ret;
@@ -380,7 +380,7 @@ static int afs_deliver_cb_init_call_back_state3(struct 
afs_call *call)
}
 
_debug("unmarshall UUID");
-   call->request = kmalloc(sizeof(struct uuid_v1), GFP_KERNEL);
+   call->request = kmalloc(sizeof(struct afs_uuid), GFP_KERNEL);
if (!call->request)
return -ENOMEM;
 
@@ -453,7 +453,7 @@ static int afs_deliver_cb_probe(struct afs_call *call)
 static void SRXAFSCB_ProbeUuid(struct work_struct *work)
 {
struct afs_call *call = container_of(work, struct afs_call, work);
-   struct uuid_v1 *r = call->request;
+   struct afs_uuid *r = call->request;
 
struct {
__be32  match;
@@ -476,7 +476,7 @@ static void SRXAFSCB_ProbeUuid(struct work_struct *work)
  */
 static int afs_deliver_cb_probe_uuid(struct afs_call *call)
 {
-   struct uuid_v1 *r;
+   struct afs_uuid *r;
unsigned loop;
__be32 *b;
int ret;
@@ -502,15 +502,15 @@ static int afs_deliver_cb_probe_uuid(struct afs_call 
*call)
}
 
_debug("unmarshall UUID");
-   call->request = kmalloc(sizeof(struct uuid_v1), GFP_KERNEL);
+   call->request = kmalloc(sizeof(struct afs_uuid), GFP_KERNEL);
if (!call->request)
return -ENOMEM;
 
b = call->buffer;
r = call->request;
-   r->time_low = b[0];
-   r->time_mid = htons(ntohl(b[1]));
-   r->time_hi_and_version  = htons(ntohl(b[2]));
+   r->time_low = ntohl(b[0]);
+   r->time_mid = ntohl(b[1]);
+   r->time_hi_and_version  = ntohl(b[2]);
r->clock_seq_hi_and_reserved= ntohl(b[3]);
r->clock_seq_low= ntohl(b[4]);
 
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index 393672997cc2..4e2556606623 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -410,6 +410,15 @@ struct afs_interface {
unsignedmtu;/* MTU of interface */
 };
 
+struct afs_uuid {
+   __be32  time_low;   /* low part of 
timestamp */
+   __be16  time_mid;   /* mid part of 
timestamp */
+   __be16  time_hi_and_version;/* high part of 
timestamp and version  */
+   __u8clock_seq_hi_and_reserved;  /* clock seq hi and 
variant */
+   __u8clock_seq_low;  /* clock seq low */
+   __u8node[6];/* spatially unique 
node ID (MAC addr) */
+};
+
 /*/
 /*
  * cache.c
@@ -544,7 +553,7 @@ extern int afs_drop_inode(struct inode *);
  * main.c
  */
 extern struct workqueue_struct *afs_wq;
-extern struct uuid_v1 afs_uuid;
+extern struct afs_uuid afs_uuid;
 
 /*
  * misc.c
diff --git a/fs/afs/main.c b/fs/afs/main.c
index 51d7d17bca57..9944770849da 100644
--- a/fs/afs/main.c
+++ b/fs/afs/main.c
@@ -31,7 +31,7 @@ static char *rootcell;
 module_param(rootcell, charp, 0);
 MODULE_PARM_DESC(rootcell, "root AFS cell name and VL server IP addr list");
 
-struct uuid_v1 afs_uuid;
+struct afs_uuid afs_uuid;
 struct workqueue_struct *afs_wq;
 
 /*
diff --git a/include/linux/uuid.h b/include/linux/uuid.h
index 4dff73a89758..2d095fc60204 100644
--- a/include/linux/uuid.h
+++ b/include/linux/uuid.h
@@ -19,30 +19,6 @@
 #include 
 
 /*
- * V1 (time-based) UUID definition [RFC 4122].
- * - the timestamp is a 60-bit value, split 32/16/12, and goes in 100ns
- *   increments since midnight 15th October 1582
- *   - add

cleanup UUID types V6

2017-05-31 Thread Christoph Hellwig

Hi all,

this series, which is a combined effort from Amir, Andy and me, introduces
new uuid_t and guid_t type names that are less confusing than the existing
types, adds new helpers for them and starts switching the fs code over to
them.  Andy has additional patches on top to convert many of the users
that use char arrays for UUIDs and GUIDs to these (or rather a predecessor
for now until updated).

Changes since V5:
 - fix the AFS revert to respect endianness
 - rename uuid_to_bin to uuid_parse

Changes since V4:
 - removed the patch to remove uuid_be for now
 - move the md patch to the front of the queue
 - remove the union for V1 uuids and provide accessors instead
 - new patch to set s_uuid for tmpfs
 - revert the patch that moved struct uuid_v1 from afs to the core
 - implement the fsid generation differently in XFS due to the
   change above
 - add a MAINTAINERS entry

Changes since V3:
 - stop exposing uuid_be/uuid_t to userspace
 - remove uuid_be entirely

Changes since V2:
 - various cleanups

[PATCH 04/24] md: namespace private helper names

2017-05-31 Thread Christoph Hellwig

From: Amir Goldstein 

The md private helper uuid_equal() collides with a generic helper
of the same name.

Rename the md private helper to md_uuid_equal() and do the same for
md_sb_equal().

Signed-off-by: Amir Goldstein 
Signed-off-by: Christoph Hellwig 
Reviewed-by: Shaohua Li 
Reviewed-by: Andy Shevchenko 
---
 drivers/md/md.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 10367ffe92e3..b9ab268ba7f9 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -825,7 +825,7 @@ static int read_disk_sb(struct md_rdev *rdev, int size)
return -EINVAL;
 }
 
-static int uuid_equal(mdp_super_t *sb1, mdp_super_t *sb2)
+static int md_uuid_equal(mdp_super_t *sb1, mdp_super_t *sb2)
 {
return  sb1->set_uuid0 == sb2->set_uuid0 &&
sb1->set_uuid1 == sb2->set_uuid1 &&
@@ -833,7 +833,7 @@ static int uuid_equal(mdp_super_t *sb1, mdp_super_t *sb2)
sb1->set_uuid3 == sb2->set_uuid3;
 }
 
-static int sb_equal(mdp_super_t *sb1, mdp_super_t *sb2)
+static int md_sb_equal(mdp_super_t *sb1, mdp_super_t *sb2)
 {
int ret;
mdp_super_t *tmp1, *tmp2;
@@ -1025,12 +1025,12 @@ static int super_90_load(struct md_rdev *rdev, struct 
md_rdev *refdev, int minor
} else {
__u64 ev1, ev2;
mdp_super_t *refsb = page_address(refdev->sb_page);
-   if (!uuid_equal(refsb, sb)) {
+   if (!md_uuid_equal(refsb, sb)) {
pr_warn("md: %s has different UUID to %s\n",
b, bdevname(refdev->bdev,b2));
goto abort;
}
-   if (!sb_equal(refsb, sb)) {
+   if (!md_sb_equal(refsb, sb)) {
pr_warn("md: %s has same UUID but different superblock 
to %s\n",
b, bdevname(refdev->bdev, b2));
goto abort;
-- 
2.11.0

[PATCH 03/24] xfs: use uuid_be to implement the uuid_t type

2017-05-31 Thread Christoph Hellwig

Use the generic Linux definition to implement our UUID type; this will
allow using more generic infrastructure in the future.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Amir Goldstein 
Reviewed-by: Brian Foster 
Reviewed-by: Andy Shevchenko 
Reviewed-by: Darrick J. Wong 
---
 fs/xfs/uuid.h  | 4 
 fs/xfs/xfs_linux.h | 3 +++
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/uuid.h b/fs/xfs/uuid.h
index 104db0f3bed6..4f1441ba4fa5 100644
--- a/fs/xfs/uuid.h
+++ b/fs/xfs/uuid.h
@@ -18,10 +18,6 @@
 #ifndef __XFS_SUPPORT_UUID_H__
 #define __XFS_SUPPORT_UUID_H__
 
-typedef struct {
-   unsigned char   __u_bits[16];
-} uuid_t;
-
 extern int uuid_is_nil(uuid_t *uuid);
 extern int uuid_equal(uuid_t *uuid1, uuid_t *uuid2);
 extern void uuid_getnodeuniq(uuid_t *uuid, int fsid [2]);
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index 044fb0e15390..89ee5ec66837 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -19,6 +19,7 @@
 #define __XFS_LINUX__
 
 #include 
+#include 
 
 /*
  * Kernel specific type declarations for XFS
@@ -38,6 +39,8 @@ typedef __s64 xfs_daddr_t;/*  type */
 typedef __u32  xfs_dev_t;
 typedef __u32  xfs_nlink_t;
 
+typedef uuid_beuuid_t;
+
 #include "xfs_types.h"
 
 #include "kmem.h"
-- 
2.11.0

[PATCH 05/24] uuid: remove uuid_be definitions from the uapi header

2017-05-31 Thread Christoph Hellwig

We don't use uuid_be and the UUID_BE constants in any uapi headers, so make
them private to the kernel.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Amir Goldstein 
Reviewed-by: Andy Shevchenko 
---
 include/linux/uuid.h  | 15 +++
 include/uapi/linux/uuid.h | 16 
 2 files changed, 15 insertions(+), 16 deletions(-)

diff --git a/include/linux/uuid.h b/include/linux/uuid.h
index 2d095fc60204..30fb13018e29 100644
--- a/include/linux/uuid.h
+++ b/include/linux/uuid.h
@@ -18,6 +18,21 @@
 
 #include 
 
+typedef struct {
+   __u8 b[16];
+} uuid_be;
+
+#define UUID_BE(a, b, c, d0, d1, d2, d3, d4, d5, d6, d7)   \
+((uuid_be) \
+{{ ((a) >> 24) & 0xff, ((a) >> 16) & 0xff, ((a) >> 8) & 0xff, (a) & 0xff, \
+   ((b) >> 8) & 0xff, (b) & 0xff,  \
+   ((c) >> 8) & 0xff, (c) & 0xff,  \
+   (d0), (d1), (d2), (d3), (d4), (d5), (d6), (d7) }})
+
+#define NULL_UUID_BE   \
+   UUID_BE(0x00000000, 0x0000, 0x0000, 0x00, 0x00, 0x00, 0x00, \
+   0x00, 0x00, 0x00, 0x00)
+
 /*
  * The length of a UUID string ("----")
  * not including trailing NUL.
diff --git a/include/uapi/linux/uuid.h b/include/uapi/linux/uuid.h
index 3738e5fb6a4d..0099756c4bac 100644
--- a/include/uapi/linux/uuid.h
+++ b/include/uapi/linux/uuid.h
@@ -24,10 +24,6 @@ typedef struct {
__u8 b[16];
 } uuid_le;
 
-typedef struct {
-   __u8 b[16];
-} uuid_be;
-
 #define UUID_LE(a, b, c, d0, d1, d2, d3, d4, d5, d6, d7)   \
 ((uuid_le) \
 {{ (a) & 0xff, ((a) >> 8) & 0xff, ((a) >> 16) & 0xff, ((a) >> 24) & 0xff, \
@@ -35,20 +31,8 @@ typedef struct {
(c) & 0xff, ((c) >> 8) & 0xff,  \
(d0), (d1), (d2), (d3), (d4), (d5), (d6), (d7) }})
 
-#define UUID_BE(a, b, c, d0, d1, d2, d3, d4, d5, d6, d7)   \
-((uuid_be) \
-{{ ((a) >> 24) & 0xff, ((a) >> 16) & 0xff, ((a) >> 8) & 0xff, (a) & 0xff, \
-   ((b) >> 8) & 0xff, (b) & 0xff,  \
-   ((c) >> 8) & 0xff, (c) & 0xff,  \
-   (d0), (d1), (d2), (d3), (d4), (d5), (d6), (d7) }})
-
 #define NULL_UUID_LE   \
UUID_LE(0x00000000, 0x0000, 0x0000, 0x00, 0x00, 0x00, 0x00, \
0x00, 0x00, 0x00, 0x00)
 
-#define NULL_UUID_BE   \
-   UUID_BE(0x00000000, 0x0000, 0x0000, 0x00, 0x00, 0x00, 0x00, \
-   0x00, 0x00, 0x00, 0x00)
-
-
 #endif /* _UAPI_LINUX_UUID_H_ */
-- 
2.11.0
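
As an illustration of what the two initializer macros moved around by this patch encode, here is a minimal userspace sketch. The macro bodies follow the hunks quoted above; the `b` line of UUID_LE falls outside the quoted context and is assumed to mirror the `c` line. The uint8_t typedefs stand in for the kernel's __u8, and layouts_as_expected() and its sample value are made up for the demonstration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-ins for the kernel typedefs (assumption: __u8 == uint8_t). */
typedef struct { uint8_t b[16]; } uuid_le;
typedef struct { uint8_t b[16]; } uuid_be;

/* First three fields stored least significant byte first. */
#define UUID_LE(a, b, c, d0, d1, d2, d3, d4, d5, d6, d7)		\
((uuid_le)								\
{{ (a) & 0xff, ((a) >> 8) & 0xff, ((a) >> 16) & 0xff, ((a) >> 24) & 0xff, \
   (b) & 0xff, ((b) >> 8) & 0xff,					\
   (c) & 0xff, ((c) >> 8) & 0xff,					\
   (d0), (d1), (d2), (d3), (d4), (d5), (d6), (d7) }})

/* First three fields stored most significant byte first. */
#define UUID_BE(a, b, c, d0, d1, d2, d3, d4, d5, d6, d7)		\
((uuid_be)								\
{{ ((a) >> 24) & 0xff, ((a) >> 16) & 0xff, ((a) >> 8) & 0xff, (a) & 0xff, \
   ((b) >> 8) & 0xff, (b) & 0xff,					\
   ((c) >> 8) & 0xff, (c) & 0xff,					\
   (d0), (d1), (d2), (d3), (d4), (d5), (d6), (d7) }})

/* Hypothetical helper: returns 1 if both macros produce the expected bytes. */
static int layouts_as_expected(void)
{
	static const uint8_t want_be[16] = {
		0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08,
		0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10 };
	static const uint8_t want_le[16] = {
		0x04, 0x03, 0x02, 0x01, 0x06, 0x05, 0x08, 0x07,
		0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10 };
	const uuid_be be = UUID_BE(0x01020304, 0x0506, 0x0708,
				   0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10);
	const uuid_le le = UUID_LE(0x01020304, 0x0506, 0x0708,
				   0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10);

	return memcmp(be.b, want_be, 16) == 0 &&
	       memcmp(le.b, want_le, 16) == 0;
}
```

Only the first eight bytes differ between the two encodings; the trailing eight are stored in argument order in both.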

[PATCH 08/24] uuid: rename uuid_to_bin to uuid_parse

2017-05-31 Thread Christoph Hellwig

This matches the userspace version of it, and describes the functionality
much better.  Also do the same for the guid version.

Signed-off-by: Christoph Hellwig 
---
 include/linux/uuid.h |  8 
 lib/test_uuid.c  |  8 
 lib/uuid.c   | 14 +++---
 3 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/include/linux/uuid.h b/include/linux/uuid.h
index a9d0fdba5404..82b165b579f5 100644
--- a/include/linux/uuid.h
+++ b/include/linux/uuid.h
@@ -45,8 +45,8 @@ bool __must_check uuid_is_valid(const char *uuid);
 extern const u8 guid_index[16];
 extern const u8 uuid_index[16];
 
-int guid_to_bin(const char *uuid, guid_t *u);
-int uuid_to_bin(const char *uuid, uuid_t *u);
+int guid_parse(const char *uuid, guid_t *u);
+int uuid_parse(const char *uuid, uuid_t *u);
 
 /* backwards compatibility, don't use in new code */
 typedef uuid_t uuid_be;
@@ -58,8 +58,8 @@ typedef uuid_t uuid_be;
 
 #define uuid_le_gen(u) guid_gen(u)
 #define uuid_be_gen(u) uuid_gen(u)
-#define uuid_le_to_bin(guid, u) guid_to_bin(guid, u)
-#define uuid_be_to_bin(uuid, u) uuid_to_bin(uuid, u)
+#define uuid_le_to_bin(guid, u) guid_parse(guid, u)
+#define uuid_be_to_bin(uuid, u) uuid_parse(uuid, u)
 
 static inline int uuid_le_cmp(const guid_t u1, const guid_t u2)
 {
diff --git a/lib/test_uuid.c b/lib/test_uuid.c
index 9cad846fd805..edda536a7b45 100644
--- a/lib/test_uuid.c
+++ b/lib/test_uuid.c
@@ -67,7 +67,7 @@ static void __init test_uuid_test(const struct test_uuid_data 
*data)
 
/* LE */
total_tests++;
-   if (guid_to_bin(data->uuid, ))
+   if (guid_parse(data->uuid, ))
test_uuid_failed("conversion", false, false, data->uuid, NULL);
 
total_tests++;
@@ -78,7 +78,7 @@ static void __init test_uuid_test(const struct test_uuid_data 
*data)
 
/* BE */
total_tests++;
-   if (uuid_to_bin(data->uuid, ))
+   if (uuid_parse(data->uuid, ))
test_uuid_failed("conversion", false, true, data->uuid, NULL);
 
total_tests++;
@@ -95,12 +95,12 @@ static void __init test_uuid_wrong(const char *data)
 
/* LE */
total_tests++;
-   if (!guid_to_bin(data, ))
+   if (!guid_parse(data, ))
test_uuid_failed("negative", true, false, data, NULL);
 
/* BE */
total_tests++;
-   if (!uuid_to_bin(data, ))
+   if (!uuid_parse(data, ))
test_uuid_failed("negative", true, true, data, NULL);
 }
 
diff --git a/lib/uuid.c b/lib/uuid.c
index f80dc63f6ca8..90bee73f7bd7 100644
--- a/lib/uuid.c
+++ b/lib/uuid.c
@@ -97,7 +97,7 @@ bool uuid_is_valid(const char *uuid)
 }
 EXPORT_SYMBOL(uuid_is_valid);
 
-static int __uuid_to_bin(const char *uuid, __u8 b[16], const u8 ei[16])
+static int __uuid_parse(const char *uuid, __u8 b[16], const u8 ei[16])
 {
static const u8 si[16] = {0,2,4,6,9,11,14,16,19,21,24,26,28,30,32,34};
unsigned int i;
@@ -115,14 +115,14 @@ static int __uuid_to_bin(const char *uuid, __u8 b[16], 
const u8 ei[16])
return 0;
 }
 
-int guid_to_bin(const char *uuid, guid_t *u)
+int guid_parse(const char *uuid, guid_t *u)
 {
-   return __uuid_to_bin(uuid, u->b, guid_index);
+   return __uuid_parse(uuid, u->b, guid_index);
 }
-EXPORT_SYMBOL(guid_to_bin);
+EXPORT_SYMBOL(guid_parse);
 
-int uuid_to_bin(const char *uuid, uuid_t *u)
+int uuid_parse(const char *uuid, uuid_t *u)
 {
-   return __uuid_to_bin(uuid, u->b, uuid_index);
+   return __uuid_parse(uuid, u->b, uuid_index);
 }
-EXPORT_SYMBOL(uuid_to_bin);
+EXPORT_SYMBOL(uuid_parse);
-- 
2.11.0
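
For readers unfamiliar with the helper being renamed, the following is a rough userspace sketch of the table-driven parse. The si[] table and the two index tables are taken from the patches in this series; everything else (hex_val(), sketch_is_valid(), sketch_uuid_parse(), and the -1 error return in place of a kernel errno) is an assumption made for illustration and not the code in lib/uuid.c.

```c
#include <assert.h>
#include <stdint.h>

/* Output byte order for each parsed byte: GUID swaps the first three fields. */
static const uint8_t guid_index[16] = {3,2,1,0,5,4,7,6,8,9,10,11,12,13,14,15};
static const uint8_t uuid_index[16] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};

/* Hypothetical helper: value of one hex digit, or -1 if not a hex digit. */
static int hex_val(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/* Accept only 8-4-4-4-12 hex digits; stops at the first bad character. */
static int sketch_is_valid(const char *s)
{
	unsigned int i;

	for (i = 0; i < 36; i++) {
		if (i == 8 || i == 13 || i == 18 || i == 23) {
			if (s[i] != '-')
				return 0;
		} else if (hex_val(s[i]) < 0) {
			return 0;
		}
	}
	return 1;
}

/* Parse the string into b[], placing byte i at position ei[i]. */
static int sketch_uuid_parse(const char *s, uint8_t b[16], const uint8_t ei[16])
{
	/* String offset of the first hex digit of each of the 16 bytes. */
	static const uint8_t si[16] = {0,2,4,6,9,11,14,16,19,21,24,26,28,30,32,34};
	unsigned int i;

	if (!sketch_is_valid(s))
		return -1;

	for (i = 0; i < 16; i++) {
		int hi = hex_val(s[si[i]]);
		int lo = hex_val(s[si[i] + 1]);

		b[ei[i]] = (uint8_t)((hi << 4) | lo);
	}
	return 0;
}
```

Passing uuid_index gives the straight byte order that uuid_parse() is meant to produce, while guid_index gives the swapped layout of guid_parse(); the same string therefore yields two different 16-byte values.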

[PATCH 06/24] uuid: rename uuid types

2017-05-31 Thread Christoph Hellwig

Our "little endian" UUID really is a Wintel GUID, so rename it and its
helpers accordingly (guid_t).  The big endian UUID is the only true one, so
give it the name uuid_t.  The uuid_le and uuid_be names are retained for
now, but will hopefully go away soon.  The exception is the _cmp helpers,
which will be replaced by better primitives ASAP and thus don't get the
new names.

Also remove the existing typedef in XFS that has now been superseded by
the generic type name.

Signed-off-by: Christoph Hellwig 
[andy: also update the UUID_LE/UUID_BE macros including fallout]
Signed-off-by: Andy Shevchenko 
Reviewed-by: Amir Goldstein 
Reviewed-by: Darrick J. Wong 
Reviewed-by: Andy Shevchenko 
---
 fs/xfs/xfs_linux.h|  2 --
 include/linux/uuid.h  | 55 +++
 include/uapi/linux/uuid.h | 12 +++
 lib/test_uuid.c   | 32 +--
 lib/uuid.c| 28 
 lib/vsprintf.c|  4 ++--
 6 files changed, 72 insertions(+), 61 deletions(-)

diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index 89ee5ec66837..2c33d915e550 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -39,8 +39,6 @@ typedef __s64 xfs_daddr_t;/*  type */
 typedef __u32  xfs_dev_t;
 typedef __u32  xfs_nlink_t;
 
-typedef uuid_be uuid_t;
-
 #include "xfs_types.h"
 
 #include "kmem.h"
diff --git a/include/linux/uuid.h b/include/linux/uuid.h
index 30fb13018e29..a9d0fdba5404 100644
--- a/include/linux/uuid.h
+++ b/include/linux/uuid.h
@@ -20,46 +20,55 @@
 
 typedef struct {
__u8 b[16];
-} uuid_be;
+} uuid_t;
 
-#define UUID_BE(a, b, c, d0, d1, d2, d3, d4, d5, d6, d7)   \
-((uuid_be) \
+#define UUID(a, b, c, d0, d1, d2, d3, d4, d5, d6, d7)  \
+((uuid_t)  \
 {{ ((a) >> 24) & 0xff, ((a) >> 16) & 0xff, ((a) >> 8) & 0xff, (a) & 0xff, \
((b) >> 8) & 0xff, (b) & 0xff,  \
((c) >> 8) & 0xff, (c) & 0xff,  \
(d0), (d1), (d2), (d3), (d4), (d5), (d6), (d7) }})
 
-#define NULL_UUID_BE   \
-   UUID_BE(0x00000000, 0x0000, 0x0000, 0x00, 0x00, 0x00, 0x00, \
-   0x00, 0x00, 0x00, 0x00)
-
 /*
  * The length of a UUID string ("----")
  * not including trailing NUL.
  */
 #define UUID_STRING_LEN 36
 
-static inline int uuid_le_cmp(const uuid_le u1, const uuid_le u2)
-{
-   return memcmp(&u1, &u2, sizeof(uuid_le));
-}
-
-static inline int uuid_be_cmp(const uuid_be u1, const uuid_be u2)
-{
-   return memcmp(&u1, &u2, sizeof(uuid_be));
-}
-
 void generate_random_uuid(unsigned char uuid[16]);
 
-extern void uuid_le_gen(uuid_le *u);
-extern void uuid_be_gen(uuid_be *u);
+extern void guid_gen(guid_t *u);
+extern void uuid_gen(uuid_t *u);
 
 bool __must_check uuid_is_valid(const char *uuid);
 
-extern const u8 uuid_le_index[16];
-extern const u8 uuid_be_index[16];
+extern const u8 guid_index[16];
+extern const u8 uuid_index[16];
+
+int guid_to_bin(const char *uuid, guid_t *u);
+int uuid_to_bin(const char *uuid, uuid_t *u);
 
-int uuid_le_to_bin(const char *uuid, uuid_le *u);
-int uuid_be_to_bin(const char *uuid, uuid_be *u);
+/* backwards compatibility, don't use in new code */
+typedef uuid_t uuid_be;
+#define UUID_BE(a, _b, c, d0, d1, d2, d3, d4, d5, d6, d7) \
+   UUID(a, _b, c, d0, d1, d2, d3, d4, d5, d6, d7)
+#define NULL_UUID_BE   \
+   UUID_BE(0x00000000, 0x0000, 0x0000, 0x00, 0x00, 0x00, 0x00, \
+0x00, 0x00, 0x00, 0x00)
+
+#define uuid_le_gen(u) guid_gen(u)
+#define uuid_be_gen(u) uuid_gen(u)
+#define uuid_le_to_bin(guid, u) guid_to_bin(guid, u)
+#define uuid_be_to_bin(uuid, u) uuid_to_bin(uuid, u)
+
+static inline int uuid_le_cmp(const guid_t u1, const guid_t u2)
+{
+   return memcmp(&u1, &u2, sizeof(guid_t));
+}
+
+static inline int uuid_be_cmp(const uuid_t u1, const uuid_t u2)
+{
+   return memcmp(&u1, &u2, sizeof(uuid_t));
+}
 
 #endif
diff --git a/include/uapi/linux/uuid.h b/include/uapi/linux/uuid.h
index 0099756c4bac..1eeeca973315 100644
--- a/include/uapi/linux/uuid.h
+++ b/include/uapi/linux/uuid.h
@@ -22,17 +22,21 @@
 
 typedef struct {
__u8 b[16];
-} uuid_le;
+} guid_t;
 
-#define UUID_LE(a, b, c, d0, d1, d2, d3, d4, d5, d6, d7)   \
-((uuid_le) \
+#define GUID(a, b, c, d0, d1, d2, d3, d4, d5, d6, d7)  \
+((guid_t)  \
 {{ (a) & 0xff, ((a) >>

[PATCH 07/24] nfsd: namespace-prefix uuid_parse

2017-05-31 Thread Christoph Hellwig

Signed-off-by: Christoph Hellwig 
---
 fs/nfsd/export.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
index e71f11b1a180..3bc08c394a3f 100644
--- a/fs/nfsd/export.c
+++ b/fs/nfsd/export.c
@@ -486,7 +486,7 @@ secinfo_parse(char **mesg, char *buf, struct svc_export 
*exp) { return 0; }
 #endif
 
 static inline int
-uuid_parse(char **mesg, char *buf, unsigned char **puuid)
+nfsd_uuid_parse(char **mesg, char *buf, unsigned char **puuid)
 {
int len;
 
@@ -586,7 +586,7 @@ static int svc_export_parse(struct cache_detail *cd, char 
*mesg, int mlen)
if (strcmp(buf, "fsloc") == 0)
err = fsloc_parse(, buf, _fslocs);
else if (strcmp(buf, "uuid") == 0)
-   err = uuid_parse(, buf, _uuid);
+   err = nfsd_uuid_parse(, buf, _uuid);
else if (strcmp(buf, "secinfo") == 0)
err = secinfo_parse(, buf, );
else
-- 
2.11.0

[PATCH 09/24] uuid: don't export guid_index and uuid_index

2017-05-31 Thread Christoph Hellwig

These are only used in uuid.c and vsprintf.c and aren't something modules
should use directly.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Amir Goldstein 
Reviewed-by: Andy Shevchenko 
---
 lib/uuid.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/lib/uuid.c b/lib/uuid.c
index 90bee73f7bd7..f7116ed88e01 100644
--- a/lib/uuid.c
+++ b/lib/uuid.c
@@ -22,9 +22,7 @@
 #include 
 
 const u8 guid_index[16] = {3,2,1,0,5,4,7,6,8,9,10,11,12,13,14,15};
-EXPORT_SYMBOL(guid_index);
 const u8 uuid_index[16] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
-EXPORT_SYMBOL(uuid_index);
 
 /***
  * Random UUID interface
-- 
2.11.0

[PATCH 11/24] uuid: hoist uuid_is_null() helper from libnvdimm

2017-05-31 Thread Christoph Hellwig

Hoist the libnvdimm helper as an inline helper to linux/uuid.h
using an auxiliary const variable uuid_null in lib/uuid.c.

[hch: also add the guid variant.  Both do the same but I'd like
to keep casts to a minimum]

The common helper uses the new abstract type uuid_t * instead of
u8 *.

Suggested-by: Christoph Hellwig 
Signed-off-by: Amir Goldstein 
[hch: added guid_is_null]
Signed-off-by: Christoph Hellwig 
Acked-by: Dan Williams 
Reviewed-by: Andy Shevchenko 
---
 drivers/nvdimm/btt_devs.c |  9 +
 include/linux/uuid.h  | 13 +
 lib/uuid.c|  5 +
 3 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/drivers/nvdimm/btt_devs.c b/drivers/nvdimm/btt_devs.c
index ae00dc0d9791..4c989bb9a8a0 100644
--- a/drivers/nvdimm/btt_devs.c
+++ b/drivers/nvdimm/btt_devs.c
@@ -222,13 +222,6 @@ struct device *nd_btt_create(struct nd_region *nd_region)
return dev;
 }
 
-static bool uuid_is_null(u8 *uuid)
-{
-   static const u8 null_uuid[16];
-
-   return (memcmp(uuid, null_uuid, 16) == 0);
-}
-
 /**
  * nd_btt_arena_is_valid - check if the metadata layout is valid
  * @nd_btt:device with BTT geometry and backing device info
@@ -249,7 +242,7 @@ bool nd_btt_arena_is_valid(struct nd_btt *nd_btt, struct 
btt_sb *super)
if (memcmp(super->signature, BTT_SIG, BTT_SIG_LEN) != 0)
return false;
 
-   if (!uuid_is_null(super->parent_uuid))
+   if (!guid_is_null((guid_t *)>parent_uuid))
if (memcmp(super->parent_uuid, parent_uuid, 16) != 0)
return false;
 
diff --git a/include/linux/uuid.h b/include/linux/uuid.h
index 9f97fe5ad964..51602200b539 100644
--- a/include/linux/uuid.h
+++ b/include/linux/uuid.h
@@ -35,6 +35,9 @@ typedef struct {
  */
 #define UUID_STRING_LEN 36
 
+extern const guid_t guid_null;
+extern const uuid_t uuid_null;
+
 static inline bool guid_equal(const guid_t *u1, const guid_t *u2)
 {
return memcmp(u1, u2, sizeof(guid_t)) == 0;
@@ -45,6 +48,11 @@ static inline void guid_copy(guid_t *dst, const guid_t *src)
memcpy(dst, src, sizeof(guid_t));
 }
 
+static inline bool guid_is_null(guid_t *guid)
+{
+   return guid_equal(guid, &guid_null);
+}
+
 static inline bool uuid_equal(const uuid_t *u1, const uuid_t *u2)
 {
return memcmp(u1, u2, sizeof(uuid_t)) == 0;
@@ -55,6 +63,11 @@ static inline void uuid_copy(uuid_t *dst, const uuid_t *src)
memcpy(dst, src, sizeof(uuid_t));
 }
 
+static inline bool uuid_is_null(uuid_t *uuid)
+{
+   return uuid_equal(uuid, &uuid_null);
+}
+
 void generate_random_uuid(unsigned char uuid[16]);
 
 extern void guid_gen(guid_t *u);
diff --git a/lib/uuid.c b/lib/uuid.c
index f7116ed88e01..680b9fb9ba09 100644
--- a/lib/uuid.c
+++ b/lib/uuid.c
@@ -21,6 +21,11 @@
 #include 
 #include 
 
+const guid_t guid_null;
+EXPORT_SYMBOL(guid_null);
+const uuid_t uuid_null;
+EXPORT_SYMBOL(uuid_null);
+
 const u8 guid_index[16] = {3,2,1,0,5,4,7,6,8,9,10,11,12,13,14,15};
 const u8 uuid_index[16] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
 
-- 
2.11.0
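
The shape of the hoisted helper can be tried out in userspace with the sketch below. It is only an approximation under stated assumptions: the demo_ names are hypothetical stand-ins for uuid_t, uuid_null, uuid_equal() and uuid_is_null(), the null constant is file-local instead of an exported symbol, and the helper takes a const pointer, which the patch above does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the kernel's uuid_t: an opaque 16-byte value. */
typedef struct { uint8_t b[16]; } demo_uuid_t;

/* Shared all-zero constant, playing the role of uuid_null in lib/uuid.c. */
static const demo_uuid_t demo_uuid_null = { { 0 } };

static inline bool demo_uuid_equal(const demo_uuid_t *u1, const demo_uuid_t *u2)
{
	return memcmp(u1, u2, sizeof(demo_uuid_t)) == 0;
}

/* New style: a typed comparison against the shared null constant. */
static inline bool demo_uuid_is_null(const demo_uuid_t *uuid)
{
	return demo_uuid_equal(uuid, &demo_uuid_null);
}

/* Old style, modelled on the libnvdimm helper removed by the patch. */
static bool old_uuid_is_null(const uint8_t *uuid)
{
	static const uint8_t null_uuid[16];

	return memcmp(uuid, null_uuid, 16) == 0;
}
```

Both variants answer the same question; the typed one simply saves callers from passing raw u8 pointers and from keeping a private zero buffer.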

[PATCH 12/24] S390/sysinfo: use uuid_is_null instead of opencoding it

2017-05-31 Thread Christoph Hellwig

And switch to use uuid_t instead of the old uuid_be type.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Amir Goldstein 
Reviewed-by: Andy Shevchenko 
---
 arch/s390/include/asm/sysinfo.h | 4 ++--
 arch/s390/kernel/sysinfo.c  | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/s390/include/asm/sysinfo.h b/arch/s390/include/asm/sysinfo.h
index e784bed6ed7f..2b498e58b914 100644
--- a/arch/s390/include/asm/sysinfo.h
+++ b/arch/s390/include/asm/sysinfo.h
@@ -109,7 +109,7 @@ struct sysinfo_2_2_2 {
unsigned short cpus_shared;
char reserved_4[3];
unsigned char vsne;
-   uuid_be uuid;
+   uuid_t uuid;
char reserved_5[160];
char ext_name[256];
 };
@@ -134,7 +134,7 @@ struct sysinfo_3_2_2 {
char reserved_1[3];
unsigned char evmne;
unsigned int reserved_2;
-   uuid_be uuid;
+   uuid_t uuid;
} vm[8];
char reserved_3[1504];
char ext_names[8][256];
diff --git a/arch/s390/kernel/sysinfo.c b/arch/s390/kernel/sysinfo.c
index eefcb54872a5..fb869b103825 100644
--- a/arch/s390/kernel/sysinfo.c
+++ b/arch/s390/kernel/sysinfo.c
@@ -242,7 +242,7 @@ static void print_ext_name(struct seq_file *m, int lvl,
 
 static void print_uuid(struct seq_file *m, int i, struct sysinfo_3_2_2 *info)
 {
-   if (!memcmp(&info->vm[i].uuid, &NULL_UUID_BE, sizeof(uuid_be)))
+   if (uuid_is_null(&info->vm[i].uuid))
return;
seq_printf(m, "VM%02d UUID:%pUb\n", i, &info->vm[i].uuid);
 }
-- 
2.11.0

[PATCH 13/24] xfs: remove uuid_getnodeuniq and xfs_uu_t

2017-05-31 Thread Christoph Hellwig

Opencode uuid_getnodeuniq in the only caller, and directly decode
the uuid_t representation instead of using a structure cast for it.

Signed-off-by: Christoph Hellwig 
---
 fs/xfs/uuid.c  | 25 -
 fs/xfs/uuid.h  |  1 -
 fs/xfs/xfs_mount.c |  5 -
 3 files changed, 4 insertions(+), 27 deletions(-)

diff --git a/fs/xfs/uuid.c b/fs/xfs/uuid.c
index 29ed78c8637b..737c186ea98b 100644
--- a/fs/xfs/uuid.c
+++ b/fs/xfs/uuid.c
@@ -17,31 +17,6 @@
  */
 #include 
 
-/* IRIX interpretation of an uuid_t */
-typedef struct {
-   __be32  uu_timelow;
-   __be16  uu_timemid;
-   __be16  uu_timehi;
-   __be16  uu_clockseq;
-   __be16  uu_node[3];
-} xfs_uu_t;
-
-/*
- * uuid_getnodeuniq - obtain the node unique fields of a UUID.
- *
- * This is not in any way a standard or condoned UUID function;
- * it just something that's needed for user-level file handles.
- */
-void
-uuid_getnodeuniq(uuid_t *uuid, int fsid [2])
-{
-   xfs_uu_t *uup = (xfs_uu_t *)uuid;
-
-   fsid[0] = (be16_to_cpu(uup->uu_clockseq) << 16) |
-  be16_to_cpu(uup->uu_timemid);
-   fsid[1] = be32_to_cpu(uup->uu_timelow);
-}
-
 int
 uuid_is_nil(uuid_t *uuid)
 {
diff --git a/fs/xfs/uuid.h b/fs/xfs/uuid.h
index 86bbed071e79..5aea49bf0963 100644
--- a/fs/xfs/uuid.h
+++ b/fs/xfs/uuid.h
@@ -19,6 +19,5 @@
 #define __XFS_SUPPORT_UUID_H__
 
 extern int uuid_is_nil(uuid_t *uuid);
-extern void uuid_getnodeuniq(uuid_t *uuid, int fsid [2]);
 
 #endif /* __XFS_SUPPORT_UUID_H__ */
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 2eaf81859166..51f7d03ef86c 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -793,7 +793,10 @@ xfs_mountfs(
 *  Copies the low order bits of the timestamp and the randomly
 *  set "sequence" number out of a UUID.
 */
-   uuid_getnodeuniq(>sb_uuid, mp->m_fixedfsid);
+   mp->m_fixedfsid[0] =
+   (get_unaligned_be16(>sb_uuid.b[8]) << 16) |
+get_unaligned_be16(>sb_uuid.b[4]);
+   mp->m_fixedfsid[1] = get_unaligned_be32(>sb_uuid.b[0]);
 
mp->m_dmevmask = 0; /* not persistent; set after each mount */
 
-- 
2.11.0
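
To see why byte offsets 8, 4 and 0 replace the structure cast, here is a small userspace sketch of the same decoding. In the removed xfs_uu_t, uu_timelow sits at offset 0, uu_timemid at offset 4 and uu_clockseq at offset 8. The get_be16()/get_be32() functions are hypothetical stand-ins for the kernel's get_unaligned_be16()/get_unaligned_be32(), and the result uses uint32_t rather than the kernel's int array so the shifts stay well defined.

```c
#include <assert.h>
#include <stdint.h>

/* Byte-wise big-endian loads; safe for any alignment. */
static uint16_t get_be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical helper mirroring the open-coded lines in xfs_mountfs():
 * fsid[0] = clock sequence (bytes 8..9) in the high half and
 *           time-mid (bytes 4..5) in the low half,
 * fsid[1] = time-low (bytes 0..3).
 */
static void demo_fixed_fsid(const uint8_t b[16], uint32_t fsid[2])
{
	fsid[0] = ((uint32_t)get_be16(&b[8]) << 16) | get_be16(&b[4]);
	fsid[1] = get_be32(&b[0]);
}
```

Reading individual bytes avoids both the cast to a foreign structure type and any assumption about the alignment of the superblock UUID.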

[PATCH 17/24] fs: switch ->s_uuid to uuid_t

2017-05-31 Thread Christoph Hellwig

For some file systems we still memcpy into it, but in various places this
already allows us to use the proper uuid helpers.  More to come.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Amir Goldstein 
Acked-by: Mimi Zohar  (Changes to IMA/EVM)
Reviewed-by: Andy Shevchenko 
---
 drivers/xen/tmem.c  |  6 +++---
 fs/ext4/super.c |  2 +-
 fs/f2fs/super.c |  2 +-
 fs/gfs2/ops_fstype.c|  2 +-
 fs/gfs2/sys.c   | 22 +-
 fs/ocfs2/super.c|  2 +-
 fs/overlayfs/copy_up.c  |  5 ++---
 fs/overlayfs/namei.c|  2 +-
 fs/xfs/xfs_mount.c  |  3 +--
 include/linux/cleancache.h  |  2 +-
 include/linux/fs.h  |  5 +++--
 mm/cleancache.c |  2 +-
 security/integrity/evm/evm_crypto.c |  2 +-
 security/integrity/ima/ima_policy.c |  2 +-
 14 files changed, 23 insertions(+), 36 deletions(-)

diff --git a/drivers/xen/tmem.c b/drivers/xen/tmem.c
index 4ac2ca8a7656..bf13d1ec51f3 100644
--- a/drivers/xen/tmem.c
+++ b/drivers/xen/tmem.c
@@ -233,12 +233,12 @@ static int tmem_cleancache_init_fs(size_t pagesize)
return xen_tmem_new_pool(uuid_private, 0, pagesize);
 }
 
-static int tmem_cleancache_init_shared_fs(char *uuid, size_t pagesize)
+static int tmem_cleancache_init_shared_fs(uuid_t *uuid, size_t pagesize)
 {
struct tmem_pool_uuid shared_uuid;
 
-   shared_uuid.uuid_lo = *(u64 *)uuid;
-   shared_uuid.uuid_hi = *(u64 *)([8]);
+   shared_uuid.uuid_lo = *(u64 *)>b[0];
+   shared_uuid.uuid_hi = *(u64 *)>b[8];
return xen_tmem_new_pool(shared_uuid, TMEM_POOL_SHARED, pagesize);
 }
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 0b177da9ea82..6e3b4186a22f 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3952,7 +3952,7 @@ static int ext4_fill_super(struct super_block *sb, void 
*data, int silent)
sb->s_qcop = _qctl_operations;
sb->s_quota_types = QTYPE_MASK_USR | QTYPE_MASK_GRP | QTYPE_MASK_PRJ;
 #endif
-   memcpy(sb->s_uuid, es->s_uuid, sizeof(es->s_uuid));
+   memcpy(>s_uuid, es->s_uuid, sizeof(es->s_uuid));
 
INIT_LIST_HEAD(>s_orphan); /* unlinked but open files */
mutex_init(>s_orphan_lock);
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index 83355ec4a92c..0b89b0b7b9f7 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -1937,7 +1937,7 @@ static int f2fs_fill_super(struct super_block *sb, void 
*data, int silent)
sb->s_time_gran = 1;
sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
(test_opt(sbi, POSIX_ACL) ? MS_POSIXACL : 0);
-   memcpy(sb->s_uuid, raw_super->uuid, sizeof(raw_super->uuid));
+   memcpy(>s_uuid, raw_super->uuid, sizeof(raw_super->uuid));
 
/* init f2fs-specific super block info */
sbi->valid_super_block = valid_super_block;
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index ed67548b286c..b92135c202c2 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -203,7 +203,7 @@ static void gfs2_sb_in(struct gfs2_sbd *sdp, const void 
*buf)
 
memcpy(sb->sb_lockproto, str->sb_lockproto, GFS2_LOCKNAME_LEN);
memcpy(sb->sb_locktable, str->sb_locktable, GFS2_LOCKNAME_LEN);
-   memcpy(s->s_uuid, str->sb_uuid, 16);
+   memcpy(>s_uuid, str->sb_uuid, 16);
 }
 
 /**
diff --git a/fs/gfs2/sys.c b/fs/gfs2/sys.c
index 7a515345610c..e77bc52b468f 100644
--- a/fs/gfs2/sys.c
+++ b/fs/gfs2/sys.c
@@ -71,25 +71,14 @@ static ssize_t fsname_show(struct gfs2_sbd *sdp, char *buf)
return snprintf(buf, PAGE_SIZE, "%s\n", sdp->sd_fsname);
 }
 
-static int gfs2_uuid_valid(const u8 *uuid)
-{
-   int i;
-
-   for (i = 0; i < 16; i++) {
-   if (uuid[i])
-   return 1;
-   }
-   return 0;
-}
-
 static ssize_t uuid_show(struct gfs2_sbd *sdp, char *buf)
 {
struct super_block *s = sdp->sd_vfs;
-   const u8 *uuid = s->s_uuid;
+
buf[0] = '\0';
-   if (!gfs2_uuid_valid(uuid))
+   if (uuid_is_null(>s_uuid))
return 0;
-   return snprintf(buf, PAGE_SIZE, "%pUB\n", uuid);
+   return snprintf(buf, PAGE_SIZE, "%pUB\n", >s_uuid);
 }
 
 static ssize_t freeze_show(struct gfs2_sbd *sdp, char *buf)
@@ -712,14 +701,13 @@ static int gfs2_uevent(struct kset *kset, struct kobject 
*kobj,
 {
struct gfs2_sbd *sdp = container_of(kobj, struct gfs2_sbd, sd_kobj);
struct super_block *s = sdp->sd_vfs;
-   const u8 *uuid = s->s_uuid;
 
add_uevent_var(env, "LOCKTABLE=%s", sdp->sd_table_name);
add_uevent_var(env, "LOCKPROTO=%s", sdp->sd_proto_name);
if (!test_bit(SDF_NOJOURNALID, >sd_flags))
add_uevent_var(env, "JOURNALID=%d", sdp->sd_lockstruct.ls_jid);
-   if (gfs2_uuid_valid(uuid))
-   add_uevent_var(env,

[PATCH 15/24] block: remove blk_part_pack_uuid

2017-05-31 Thread Christoph Hellwig

This helper was only used by IMA of all things, which would get spurious
errors if CONFIG_BLOCK is disabled.  Just opencode the call there.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Amir Goldstein 
Acked-by: Mimi Zohar 
Reviewed-by: Andy Shevchenko 
---
 include/linux/genhd.h   | 11 ---
 security/integrity/ima/ima_policy.c |  3 +--
 2 files changed, 1 insertion(+), 13 deletions(-)

diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index acff9437e5c3..e619fae2f037 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -219,12 +219,6 @@ static inline struct gendisk *part_to_disk(struct 
hd_struct *part)
return NULL;
 }
 
-static inline int blk_part_pack_uuid(const u8 *uuid_str, u8 *to)
-{
-   uuid_be_to_bin(uuid_str, (uuid_be *)to);
-   return 0;
-}
-
 static inline int disk_max_parts(struct gendisk *disk)
 {
if (disk->flags & GENHD_FL_EXT_DEVT)
@@ -736,11 +730,6 @@ static inline dev_t blk_lookup_devt(const char *name, int 
partno)
dev_t devt = MKDEV(0, 0);
return devt;
 }
-
-static inline int blk_part_pack_uuid(const u8 *uuid_str, u8 *to)
-{
-   return -EINVAL;
-}
 #endif /* CONFIG_BLOCK */
 
 #endif /* _LINUX_GENHD_H */
diff --git a/security/integrity/ima/ima_policy.c 
b/security/integrity/ima/ima_policy.c
index 3ab1067db624..49fbc3e8f012 100644
--- a/security/integrity/ima/ima_policy.c
+++ b/security/integrity/ima/ima_policy.c
@@ -717,8 +717,7 @@ static int ima_parse_rule(char *rule, struct ima_rule_entry 
*entry)
break;
}
 
-   result = blk_part_pack_uuid(args[0].from,
-   entry->fsuuid);
+   result = uuid_to_bin(args[0].from, (uuid_t 
*)>fsuuid);
if (!result)
entry->flags |= IMA_FSUUID;
break;
-- 
2.11.0

[PATCH 16/24] ima/policy: switch to use uuid_t

2017-05-31 Thread Christoph Hellwig

Signed-off-by: Christoph Hellwig 
Reviewed-by: Amir Goldstein 
Acked-by: Mimi Zohar 
Reviewed-by: Andy Shevchenko 
---
 security/integrity/ima/ima_policy.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/security/integrity/ima/ima_policy.c 
b/security/integrity/ima/ima_policy.c
index 49fbc3e8f012..9a7c7cbdbe7c 100644
--- a/security/integrity/ima/ima_policy.c
+++ b/security/integrity/ima/ima_policy.c
@@ -61,7 +61,7 @@ struct ima_rule_entry {
enum ima_hooks func;
int mask;
unsigned long fsmagic;
-   u8 fsuuid[16];
+   uuid_t fsuuid;
kuid_t uid;
kuid_t fowner;
bool (*uid_op)(kuid_t, kuid_t);/* Handlers for operators   */
@@ -244,7 +244,7 @@ static bool ima_match_rules(struct ima_rule_entry *rule, 
struct inode *inode,
&& rule->fsmagic != inode->i_sb->s_magic)
return false;
if ((rule->flags & IMA_FSUUID) &&
-   memcmp(rule->fsuuid, inode->i_sb->s_uuid, sizeof(rule->fsuuid)))
+   memcmp(&rule->fsuuid, inode->i_sb->s_uuid, sizeof(rule->fsuuid)))
return false;
if ((rule->flags & IMA_UID) && !rule->uid_op(cred->uid, rule->uid))
return false;
@@ -711,13 +711,12 @@ static int ima_parse_rule(char *rule, struct 
ima_rule_entry *entry)
case Opt_fsuuid:
ima_log_string(ab, "fsuuid", args[0].from);
 
-   if (memchr_inv(entry->fsuuid, 0x00,
-  sizeof(entry->fsuuid))) {
+   if (!uuid_is_null(&entry->fsuuid)) {
result = -EINVAL;
break;
}
 
-   result = uuid_to_bin(args[0].from, (uuid_t 
*)>fsuuid);
+   result = uuid_parse(args[0].from, &entry->fsuuid);
if (!result)
entry->flags |= IMA_FSUUID;
break;
@@ -1086,7 +1085,7 @@ int ima_policy_show(struct seq_file *m, void *v)
}
 
if (entry->flags & IMA_FSUUID) {
-   seq_printf(m, "fsuuid=%pU", entry->fsuuid);
+   seq_printf(m, "fsuuid=%pU", &entry->fsuuid);
seq_puts(m, " ");
}
 
-- 
2.11.0
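
The Opt_fsuuid branch guards against a rule setting fsuuid twice, so the sense of the check matters. The userspace sketch below relates the two forms. memchr_inv() returns a pointer to the first byte that differs from the given value, or NULL if every byte matches, so the pre-patch test fires when the UUID has already been set; in terms of the null helper that corresponds to a negated check. All demo_ names and the two fsuuid_already_set_*() wrappers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[16]; } demo_uuid_t;

/* Stand-in for the kernel's memchr_inv(): first byte != c, else NULL. */
static const void *demo_memchr_inv(const void *start, int c, size_t bytes)
{
	const unsigned char *p = start;
	size_t i;

	for (i = 0; i < bytes; i++)
		if (p[i] != (unsigned char)c)
			return &p[i];
	return NULL;
}

static bool demo_uuid_is_null(const demo_uuid_t *uuid)
{
	static const demo_uuid_t null_uuid = { { 0 } };

	return memcmp(uuid, &null_uuid, sizeof(null_uuid)) == 0;
}

/* Pre-patch form: true if any byte of the raw array is non-zero. */
static bool fsuuid_already_set_old(const uint8_t fsuuid[16])
{
	return demo_memchr_inv(fsuuid, 0x00, 16) != NULL;
}

/* Helper-based form with the same truth table. */
static bool fsuuid_already_set_new(const demo_uuid_t *fsuuid)
{
	return !demo_uuid_is_null(fsuuid);
}
```

With this pairing a fresh, all-zero entry is accepted and a second fsuuid= option on the same rule is rejected, as before the conversion.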

[PATCH 21/24] nvme: switch to uuid_t

2017-05-31 Thread Christoph Hellwig

Signed-off-by: Christoph Hellwig 
Reviewed-by: Amir Goldstein 
Reviewed-by: Andy Shevchenko 
---
 drivers/nvme/host/fabrics.c | 8 
 drivers/nvme/host/fabrics.h | 2 +-
 drivers/nvme/host/fc.c  | 3 +--
 drivers/nvme/target/nvmet.h | 1 +
 include/linux/nvme-fc.h | 3 +--
 include/linux/nvme.h| 3 ++-
 6 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 990e6fb32a63..c190d7e36900 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -58,7 +58,7 @@ static struct nvmf_host *nvmf_host_add(const char *hostnqn)
 
kref_init(>ref);
memcpy(host->nqn, hostnqn, NVMF_NQN_SIZE);
-   uuid_be_gen(>id);
+   uuid_gen(>id);
 
list_add_tail(>list, _hosts);
 out_unlock:
@@ -75,7 +75,7 @@ static struct nvmf_host *nvmf_host_default(void)
return NULL;
 
kref_init(>ref);
-   uuid_be_gen(>id);
+   uuid_gen(>id);
snprintf(host->nqn, NVMF_NQN_SIZE,
"nqn.2014-08.org.nvmexpress:NVMf:uuid:%pUb", >id);
 
@@ -395,7 +395,7 @@ int nvmf_connect_admin_queue(struct nvme_ctrl *ctrl)
if (!data)
return -ENOMEM;
 
-   memcpy(>hostid, >opts->host->id, sizeof(uuid_be));
+   uuid_copy(>hostid, >opts->host->id);
data->cntlid = cpu_to_le16(0x);
strncpy(data->subsysnqn, ctrl->opts->subsysnqn, NVMF_NQN_SIZE);
strncpy(data->hostnqn, ctrl->opts->host->nqn, NVMF_NQN_SIZE);
@@ -454,7 +454,7 @@ int nvmf_connect_io_queue(struct nvme_ctrl *ctrl, u16 qid)
if (!data)
return -ENOMEM;
 
-   memcpy(>hostid, >opts->host->id, sizeof(uuid_be));
+   uuid_copy(>hostid, >opts->host->id);
data->cntlid = cpu_to_le16(ctrl->cntlid);
strncpy(data->subsysnqn, ctrl->opts->subsysnqn, NVMF_NQN_SIZE);
strncpy(data->hostnqn, ctrl->opts->host->nqn, NVMF_NQN_SIZE);
diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
index f5a9c1fb186f..29be7600689d 100644
--- a/drivers/nvme/host/fabrics.h
+++ b/drivers/nvme/host/fabrics.h
@@ -36,7 +36,7 @@ struct nvmf_host {
struct kref ref;
struct list_headlist;
charnqn[NVMF_NQN_SIZE];
-   uuid_be id;
+   uuid_t  id;
 };
 
 /**
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 5b14cbefb724..96b983bb44bd 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -878,8 +878,7 @@ nvme_fc_connect_admin_queue(struct nvme_fc_ctrl *ctrl,
assoc_rqst->assoc_cmd.sqsize = cpu_to_be16(qsize);
/* Linux supports only Dynamic controllers */
assoc_rqst->assoc_cmd.cntlid = cpu_to_be16(0x);
-   memcpy(_rqst->assoc_cmd.hostid, >ctrl.opts->host->id,
-   min_t(size_t, FCNVME_ASSOC_HOSTID_LEN, sizeof(uuid_be)));
+   uuid_copy(_rqst->assoc_cmd.hostid, >ctrl.opts->host->id);
strncpy(assoc_rqst->assoc_cmd.hostnqn, ctrl->ctrl.opts->host->nqn,
min(FCNVME_ASSOC_HOSTNQN_LEN, NVMF_NQN_SIZE));
strncpy(assoc_rqst->assoc_cmd.subnqn, ctrl->ctrl.opts->subsysnqn,
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index cfc5c7fb0ab7..8ff6e430b30a 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
diff --git a/include/linux/nvme-fc.h b/include/linux/nvme-fc.h
index e997c4a49a88..bc711a10be05 100644
--- a/include/linux/nvme-fc.h
+++ b/include/linux/nvme-fc.h
@@ -177,7 +177,6 @@ struct fcnvme_lsdesc_rjt {
 };
 
 
-#define FCNVME_ASSOC_HOSTID_LEN16
 #define FCNVME_ASSOC_HOSTNQN_LEN   256
 #define FCNVME_ASSOC_SUBNQN_LEN256
 
@@ -191,7 +190,7 @@ struct fcnvme_lsdesc_cr_assoc_cmd {
__be16  cntlid;
__be16  sqsize;
__be32  rsvd52;
-   u8  hostid[FCNVME_ASSOC_HOSTID_LEN];
+   uuid_t  hostid;
u8  hostnqn[FCNVME_ASSOC_HOSTNQN_LEN];
u8  subnqn[FCNVME_ASSOC_SUBNQN_LEN];
u8  rsvd632[384];
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index b625bacf37ef..e400a69fa1d3 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -16,6 +16,7 @@
 #define _LINUX_NVME_H
 
 #include 
+#include 
 
 /* NQN names in commands fields specified one size */
 #define NVMF_NQN_FIELD_LEN 256
@@ -843,7 +844,7 @@ struct nvmf_connect_command {
 };
 
 struct nvmf_connect_data {
-   __u8hostid[16];
+   uuid_t  hostid;
__le16  cntlid;
charresv4[238];
charsubsysnqn[NVMF_NQN_FIELD_LEN];
-- 
2.11.0
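
Replacing a raw 16-byte array with uuid_t in on-the-wire structures is only safe if the layout stays the same. The sketch below checks that at compile time for a trimmed, userspace copy of the first three fields of nvmf_connect_data. The demo_ and connect_data_* names are hypothetical, uint8_t/uint16_t stand in for __u8/__le16, and _Static_assert assumes a C11 compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for uuid_t: a struct wrapping exactly 16 bytes. */
typedef struct { uint8_t b[16]; } demo_uuid_t;

/* Leading fields of the connect data before the patch ... */
struct connect_data_old {
	uint8_t		hostid[16];
	uint16_t	cntlid;
	char		resv4[238];
};

/* ... and after it, with the typed host ID. */
struct connect_data_new {
	demo_uuid_t	hostid;
	uint16_t	cntlid;
	char		resv4[238];
};

_Static_assert(sizeof(demo_uuid_t) == 16, "wrapper must stay 16 bytes");
_Static_assert(offsetof(struct connect_data_old, cntlid) ==
	       offsetof(struct connect_data_new, cntlid),
	       "fields after hostid must not move");
_Static_assert(sizeof(struct connect_data_old) ==
	       sizeof(struct connect_data_new),
	       "overall size must not change");

/* Same shape as the uuid_copy() helper shown earlier in the series. */
static void demo_uuid_copy(demo_uuid_t *dst, const demo_uuid_t *src)
{
	memcpy(dst, src, sizeof(demo_uuid_t));
}
```

Because the wrapper has byte alignment and no padding, the typed copy moves exactly the 16 bytes that the old memcpy() with sizeof(uuid_be) did.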

[PATCH 18/24] overlayfs: use uuid_t instead of uuid_be

2017-05-31 Thread Christoph Hellwig

Signed-off-by: Christoph Hellwig 
Reviewed-by: Amir Goldstein 
Reviewed-by: Andy Shevchenko 
---
 fs/overlayfs/copy_up.c   | 2 +-
 fs/overlayfs/overlayfs.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/overlayfs/copy_up.c b/fs/overlayfs/copy_up.c
index 5b795873f7fa..2a67e8c13098 100644
--- a/fs/overlayfs/copy_up.c
+++ b/fs/overlayfs/copy_up.c
@@ -233,7 +233,7 @@ int ovl_set_attr(struct dentry *upperdentry, struct kstat 
*stat)
return err;
 }
 
-static struct ovl_fh *ovl_encode_fh(struct dentry *lower, uuid_be *uuid)
+static struct ovl_fh *ovl_encode_fh(struct dentry *lower, uuid_t *uuid)
 {
struct ovl_fh *fh;
int fh_type, fh_len, dwords;
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index caa36cb9c46d..cb0fc450419b 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -55,7 +55,7 @@ struct ovl_fh {
u8 len; /* size of this header + size of fid */
u8 flags;   /* OVL_FH_FLAG_* */
u8 type;/* fid_type of fid */
-   uuid_be uuid;   /* uuid of filesystem */
+   uuid_t uuid;/* uuid of filesystem */
u8 fid[0];  /* file identifier */
 } __packed;
 
-- 
2.11.0

[PATCH 19/24] partitions/ldm: switch to use uuid_t

2017-05-31 Thread Christoph Hellwig

And the uuid helpers.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Amir Goldstein 
Reviewed-by: Andy Shevchenko 
---
 block/partitions/ldm.c | 10 +-
 block/partitions/ldm.h |  6 ++
 2 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/block/partitions/ldm.c b/block/partitions/ldm.c
index edcea70674c9..2a365c756648 100644
--- a/block/partitions/ldm.c
+++ b/block/partitions/ldm.c
@@ -115,7 +115,7 @@ static bool ldm_parse_privhead(const u8 *data, struct 
privhead *ph)
ldm_error("PRIVHEAD disk size doesn't match real disk size");
return false;
}
-   if (uuid_be_to_bin(data + 0x0030, (uuid_be *)ph->disk_id)) {
+   if (uuid_parse(data + 0x0030, >disk_id)) {
ldm_error("PRIVHEAD contains an invalid GUID.");
return false;
}
@@ -234,7 +234,7 @@ static bool ldm_compare_privheads (const struct privhead 
*ph1,
(ph1->logical_disk_size  == ph2->logical_disk_size) &&
(ph1->config_start   == ph2->config_start)  &&
(ph1->config_size== ph2->config_size)   &&
-   !memcmp (ph1->disk_id, ph2->disk_id, GUID_SIZE));
+   uuid_equal(>disk_id, >disk_id));
 }
 
 /**
@@ -557,7 +557,7 @@ static struct vblk * ldm_get_disk_objid (const struct ldmdb 
*ldb)
 
list_for_each (item, >v_disk) {
struct vblk *v = list_entry (item, struct vblk, list);
-   if (!memcmp (v->vblk.disk.disk_id, ldb->ph.disk_id, GUID_SIZE))
+   if (uuid_equal(>vblk.disk.disk_id, >ph.disk_id))
return v;
}
 
@@ -892,7 +892,7 @@ static bool ldm_parse_dsk3 (const u8 *buffer, int buflen, 
struct vblk *vb)
disk = >vblk.disk;
ldm_get_vstr (buffer + 0x18 + r_diskid, disk->alt_name,
sizeof (disk->alt_name));
-   if (uuid_be_to_bin(buffer + 0x19 + r_name, (uuid_be *)disk->disk_id))
+   if (uuid_parse(buffer + 0x19 + r_name, >disk_id))
return false;
 
return true;
@@ -927,7 +927,7 @@ static bool ldm_parse_dsk4 (const u8 *buffer, int buflen, 
struct vblk *vb)
return false;
 
disk = >vblk.disk;
-   memcpy (disk->disk_id, buffer + 0x18 + r_name, GUID_SIZE);
+   uuid_copy(>disk_id, (uuid_t *)(buffer + 0x18 + r_name));
return true;
 }
 
diff --git a/block/partitions/ldm.h b/block/partitions/ldm.h
index 374242c0971a..f4c6055df956 100644
--- a/block/partitions/ldm.h
+++ b/block/partitions/ldm.h
@@ -112,8 +112,6 @@ struct frag {   /* VBLK 
Fragment handling */
 
 /* In memory LDM database structures. */
 
-#define GUID_SIZE  16
-
 struct privhead {  /* Offsets and sizes are in sectors. */
u16 ver_major;
u16 ver_minor;
@@ -121,7 +119,7 @@ struct privhead {   /* Offsets and sizes 
are in sectors. */
u64 logical_disk_size;
u64 config_start;
u64 config_size;
-   u8  disk_id[GUID_SIZE];
+   uuid_t  disk_id;
 };
 
 struct tocblock {  /* We have exactly two bitmaps. */
@@ -154,7 +152,7 @@ struct vblk_dgrp {  /* VBLK Disk Group */
 };
 
 struct vblk_disk { /* VBLK Disk */
-   u8  disk_id[GUID_SIZE];
+   uuid_t  disk_id;
u8  alt_name[128];
 };
 
-- 
2.11.0

[PATCH 20/24] sysctl: switch to use uuid_t

2017-05-31 Thread Christoph Hellwig

Signed-off-by: Christoph Hellwig 
Reviewed-by: Amir Goldstein 
Reviewed-by: Andy Shevchenko 
---
 kernel/sysctl_binary.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
index ece4b177052b..939a158eab11 100644
--- a/kernel/sysctl_binary.c
+++ b/kernel/sysctl_binary.c
@@ -1119,7 +1119,7 @@ static ssize_t bin_uuid(struct file *file,
/* Only supports reads */
if (oldval && oldlen) {
char buf[UUID_STRING_LEN + 1];
-   uuid_be uuid;
+   uuid_t uuid;
 
result = kernel_read(file, 0, buf, sizeof(buf) - 1);
if (result < 0)
@@ -1128,7 +1128,7 @@ static ssize_t bin_uuid(struct file *file,
buf[result] = '\0';
 
result = -EIO;
-   if (uuid_be_to_bin(buf, &uuid))
+   if (uuid_parse(buf, &uuid))
goto out;
 
if (oldlen > 16)
-- 
2.11.0

[PATCH 24/24] MAINTAINERS: add uuid entry

2017-05-31 Thread Christoph Hellwig

I'll keep maintaining whatever little changes we need here, with Andy as
my designated reviewer.

Signed-off-by: Christoph Hellwig 
---
 MAINTAINERS | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 053c3bdd1fe5..660c14729205 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13463,6 +13463,17 @@ W: http://en.wikipedia.org/wiki/Util-linux
 T: git git://git.kernel.org/pub/scm/utils/util-linux/util-linux.git
 S: Maintained
 
+UUID HELPERS
+M: Christoph Hellwig 
+R: Andy Shevchenko 
+L: linux-ker...@vger.kernel.org
+T: git git://git.infradead.org/users/hch/uuid.git
+F: lib/uuid.c
+F: lib/test_uuid.c
+F: include/linux/uuid.h
+F: include/uapi/linux/uuid.h
+S: Maintained
+
 UVESAFB DRIVER
 M: Michal Januszewski 
 L: linux-fb...@vger.kernel.org
-- 
2.11.0

[PATCH 23/24] tmpfs: generate random sb->s_uuid

2017-05-31 Thread Christoph Hellwig

From: Amir Goldstein 

This is used by overlayfs to encode intrasystem unique file handles.

Suggested-by: Miklos Szeredi 
Cc: Hugh Dickins 
Cc: Andrew Morton 
Signed-off-by: Amir Goldstein 
Signed-off-by: Christoph Hellwig 
---
 mm/shmem.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index e67d6ba4e98e..391f2dcca727 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -75,6 +75,7 @@ static struct vfsmount *shm_mnt;
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -3761,6 +3762,7 @@ int shmem_fill_super(struct super_block *sb, void *data, 
int silent)
 #ifdef CONFIG_TMPFS_POSIX_ACL
sb->s_flags |= MS_POSIXACL;
 #endif
+   uuid_gen(&sb->s_uuid);
 
inode = shmem_get_inode(sb, NULL, S_IFDIR | sbinfo->mode, 0, 
VM_NORESERVE);
if (!inode)
-- 
2.11.0

[PATCH 22/24] scsi_debug: switch to uuid_t

2017-05-31 Thread Christoph Hellwig

Signed-off-by: Christoph Hellwig 
Reviewed-by: Amir Goldstein 
Reviewed-by: Andy Shevchenko 
---
 drivers/scsi/scsi_debug.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 17249c3650fe..35ee09644cfb 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -245,7 +245,7 @@ struct sdebug_dev_info {
unsigned int channel;
unsigned int target;
u64 lun;
-   uuid_be lu_name;
+   uuid_t lu_name;
struct sdebug_host_info *sdbg_host;
unsigned long uas_bm[1];
atomic_t num_in_q;
@@ -965,7 +965,7 @@ static const u64 naa3_comp_c = 0x3110ULL;
 static int inquiry_vpd_83(unsigned char *arr, int port_group_id,
  int target_dev_id, int dev_id_num,
  const char *dev_id_str, int dev_id_str_len,
- const uuid_be *lu_name)
+ const uuid_t *lu_name)
 {
int num, port_a;
char b[32];
@@ -3568,7 +3568,7 @@ static void sdebug_q_cmd_wq_complete(struct work_struct 
*work)
 }
 
 static bool got_shared_uuid;
-static uuid_be shared_uuid;
+static uuid_t shared_uuid;
 
 static struct sdebug_dev_info *sdebug_device_create(
struct sdebug_host_info *sdbg_host, gfp_t flags)
@@ -3578,12 +3578,12 @@ static struct sdebug_dev_info *sdebug_device_create(
devip = kzalloc(sizeof(*devip), flags);
if (devip) {
if (sdebug_uuid_ctl == 1)
-   uuid_be_gen(&devip->lu_name);
+   uuid_gen(&devip->lu_name);
else if (sdebug_uuid_ctl == 2) {
if (got_shared_uuid)
devip->lu_name = shared_uuid;
else {
-   uuid_be_gen(&shared_uuid);
+   uuid_gen(&shared_uuid);
got_shared_uuid = true;
devip->lu_name = shared_uuid;
}
-- 
2.11.0

[PATCH 10/24] uuid: hoist helpers uuid_equal() and uuid_copy() from xfs

2017-05-31 Thread Christoph Hellwig

These helpers are used to compare and copy two uuid_t type objects.
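
A minimal userspace sketch of the two helper styles, assuming a 16-byte
stand-in type (sketch_uuid_t and the sketch_* functions are hypothetical
names, not the kernel API). It illustrates that a memcmp-style "cmp"
returns non-zero when the values differ, while an "equal" helper returns
true when they match, so a converted caller has to invert the test:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical 16-byte stand-in for the kernel's uuid_t. */
typedef struct {
	unsigned char b[16];
} sketch_uuid_t;

/* "equal" style: true when both values hold the same 16 bytes. */
static bool sketch_uuid_equal(const sketch_uuid_t *u1, const sketch_uuid_t *u2)
{
	return memcmp(u1, u2, sizeof(sketch_uuid_t)) == 0;
}

/* Copy all 16 bytes from src to dst. */
static void sketch_uuid_copy(sketch_uuid_t *dst, const sketch_uuid_t *src)
{
	memcpy(dst, src, sizeof(sketch_uuid_t));
}

/* Old "cmp" style: non-zero when the values differ. */
static int sketch_uuid_cmp(const sketch_uuid_t *u1, const sketch_uuid_t *u2)
{
	return memcmp(u1, u2, sizeof(sketch_uuid_t));
}
```

With these definitions, "if (sketch_uuid_cmp(a, b))" and
"if (!sketch_uuid_equal(a, b))" take the same branch.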

Signed-off-by: Amir Goldstein 
[hch: also provide the respective guid_ versions]
Signed-off-by: Christoph Hellwig 
Reviewed-by: Andy Shevchenko 
---
 fs/xfs/uuid.c|  6 --
 fs/xfs/uuid.h|  7 ---
 include/linux/uuid.h | 20 
 lib/test_uuid.c  |  4 ++--
 4 files changed, 22 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/uuid.c b/fs/xfs/uuid.c
index b83f76b6d410..29ed78c8637b 100644
--- a/fs/xfs/uuid.c
+++ b/fs/xfs/uuid.c
@@ -55,9 +55,3 @@ uuid_is_nil(uuid_t *uuid)
if (*cp++) return 0;/* not nil */
return 1;   /* is nil */
 }
-
-int
-uuid_equal(uuid_t *uuid1, uuid_t *uuid2)
-{
-   return memcmp(uuid1, uuid2, sizeof(uuid_t)) ? 0 : 1;
-}
diff --git a/fs/xfs/uuid.h b/fs/xfs/uuid.h
index 4f1441ba4fa5..86bbed071e79 100644
--- a/fs/xfs/uuid.h
+++ b/fs/xfs/uuid.h
@@ -19,13 +19,6 @@
 #define __XFS_SUPPORT_UUID_H__
 
 extern int uuid_is_nil(uuid_t *uuid);
-extern int uuid_equal(uuid_t *uuid1, uuid_t *uuid2);
 extern void uuid_getnodeuniq(uuid_t *uuid, int fsid [2]);
 
-static inline void
-uuid_copy(uuid_t *dst, uuid_t *src)
-{
-   memcpy(dst, src, sizeof(uuid_t));
-}
-
 #endif /* __XFS_SUPPORT_UUID_H__ */
diff --git a/include/linux/uuid.h b/include/linux/uuid.h
index 82b165b579f5..9f97fe5ad964 100644
--- a/include/linux/uuid.h
+++ b/include/linux/uuid.h
@@ -35,6 +35,26 @@ typedef struct {
  */
 #define UUID_STRING_LEN 36
 
+static inline bool guid_equal(const guid_t *u1, const guid_t *u2)
+{
+   return memcmp(u1, u2, sizeof(guid_t)) == 0;
+}
+
+static inline void guid_copy(guid_t *dst, const guid_t *src)
+{
+   memcpy(dst, src, sizeof(guid_t));
+}
+
+static inline bool uuid_equal(const uuid_t *u1, const uuid_t *u2)
+{
+   return memcmp(u1, u2, sizeof(uuid_t)) == 0;
+}
+
+static inline void uuid_copy(uuid_t *dst, const uuid_t *src)
+{
+   memcpy(dst, src, sizeof(uuid_t));
+}
+
 void generate_random_uuid(unsigned char uuid[16]);
 
 extern void guid_gen(guid_t *u);
diff --git a/lib/test_uuid.c b/lib/test_uuid.c
index edda536a7b45..516dc3e9362d 100644
--- a/lib/test_uuid.c
+++ b/lib/test_uuid.c
@@ -71,7 +71,7 @@ static void __init test_uuid_test(const struct test_uuid_data 
*data)
test_uuid_failed("conversion", false, false, data->uuid, NULL);
 
total_tests++;
-   if (uuid_le_cmp(data->le, le)) {
+   if (!guid_equal(&data->le, &le)) {
sprintf(buf, "%pUl", &le);
test_uuid_failed("cmp", false, false, data->uuid, buf);
}
@@ -82,7 +82,7 @@ static void __init test_uuid_test(const struct test_uuid_data 
*data)
test_uuid_failed("conversion", false, true, data->uuid, NULL);
 
total_tests++;
-   if (uuid_be_cmp(data->be, be)) {
+   if (!uuid_equal(&data->be, &be)) {
sprintf(buf, "%pUb", &be);
test_uuid_failed("cmp", false, true, data->uuid, buf);
}
-- 
2.11.0

Re: [PATCH v3 5/9] blk-mq: fix blk_mq_quiesce_queue

2017-05-31 Thread Bart Van Assche

On Wed, 2017-05-31 at 20:37 +0800, Ming Lei wrote:
> 
> + /* wait until queue is unquiesced */
> + wait_event_cmd(q->quiesce_wq, !blk_queue_quiesced(q),
> + may_sleep ?
> + srcu_read_unlock(>queue_rq_srcu, *srcu_idx) :
> + rcu_read_unlock(),
> + may_sleep ?
> + *srcu_idx = srcu_read_lock(>queue_rq_srcu) :
> + rcu_read_lock());
> +
>   if (q->elevator)
>   goto insert;

What I see is that in this patch a new waitqueue has been introduced
(quiesce_wq) and also that an explanation of why you think this new waitqueue
is needed is missing completely. Why is it that you think that the
synchronize_srcu() and synchronize_rcu() calls in blk_mq_quiesce_queue() are
not sufficient? If this new waitqueue is not needed, please remove that
waitqueue again.

Bart.

Re: [PATCH v2] cfq-iosched: fix the delay of cfq_group's vdisktime under iops mode

2017-05-31 Thread Jens Axboe

On 05/30/2017 09:09 PM, Hou Tao wrote:
> Hi Jens,
> 
> I didn't find the patch in your linux-block git tree or the vanilla git 
> tree.
> Maybe you forgot this CFQ fix?

Looks like that did get missed, sorry about that. I've queued it up now.

-- 
Jens Axboe

Re: [PATCH v3 3/9] blk-mq: use the introduced blk_mq_unquiesce_queue()

2017-05-31 Thread Bart Van Assche

On Wed, 2017-05-31 at 20:37 +0800, Ming Lei wrote:
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 99e16ac479e3..ffcf05765e2b 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -3031,7 +3031,10 @@ scsi_internal_device_unblock(struct scsi_device *sdev,
>   return -EINVAL;
>  
>   if (q->mq_ops) {
> - blk_mq_start_stopped_hw_queues(q, false);
> + if (blk_queue_quiesced(q))
> + blk_mq_unquiesce_queue(q);
> + else
> + blk_mq_start_stopped_hw_queues(q, false);
>   } else {
>   spin_lock_irqsave(q->queue_lock, flags);
>   blk_start_queue(q);

As I commented on v2, this change is really wrong. All that's needed here is
a call to blk_mq_unquiesce_queue() and nothing else. Adding a call to
blk_mq_start_stopped_hw_queues() is wrong because it makes it impossible to
use the STOPPED flag in the SCSI core to make the block layer core stop calling
.queue_rq() if a SCSI LLD returns "busy".

Bart.

[xfstests PATCH v3 1/5] generic: add a writeback error handling test

2017-05-31 Thread Jeff Layton

I'm working on a set of kernel patches to change how writeback errors
are handled and reported in the kernel. Instead of reporting a
writeback error to only the first fsync caller on the file, I aim
to make the kernel report them once on every file description.

This patch adds a test for the new behavior. Basically, open many fds
to the same file, turn on dm_error, write to each of the fds, and then
fsync them all to ensure that they all get an error back.

To do that, I'm adding a new tools/dmerror script that the C program
can use to load the error table. For now, that's all it can do, but
we can fill it out with other commands as necessary.
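
The sequence described above can be sketched in userspace C roughly as
follows. This is only an illustration under stated assumptions, not the
fsync-err.c from this patch: count_fsync_failures() and SKETCH_MAX_FDS are
hypothetical names, the buffer size is arbitrary, and the step that loads
the dm error table is left as a comment because it happens outside the
program.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define SKETCH_MAX_FDS 16	/* hypothetical upper bound for this sketch */

/*
 * Open nfds descriptors on path, write through each one, then fsync each
 * one. Returns how many fsync calls failed, or -1 on a setup problem.
 */
static int count_fsync_failures(const char *path, int nfds)
{
	int fd[SKETCH_MAX_FDS];
	char buf[4096];
	int i, opened = 0, failures = 0;

	if (nfds < 1 || nfds > SKETCH_MAX_FDS)
		return -1;
	memset(buf, 0x7c, sizeof(buf));

	for (i = 0; i < nfds; i++) {
		fd[i] = open(path, O_WRONLY | O_CREAT, 0644);
		if (fd[i] < 0)
			goto fail;
		opened++;
	}

	/* A real test would switch the device to the error table here. */

	for (i = 0; i < nfds; i++) {
		if (write(fd[i], buf, sizeof(buf)) != (ssize_t)sizeof(buf))
			goto fail;
	}

	/* With the new behavior, every fd should see the error once. */
	for (i = 0; i < nfds; i++) {
		if (fsync(fd[i]) < 0)
			failures++;
	}

	for (i = 0; i < opened; i++)
		close(fd[i]);
	return failures;

fail:
	for (i = 0; i < opened; i++)
		close(fd[i]);
	return -1;
}
```

On a healthy device the function should return 0; with failing writeback
and the new kernel behavior, the expectation is that it returns nfds.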

Signed-off-by: Jeff Layton 
---
 common/dmerror |  13 ++--
 doc/auxiliary-programs.txt |   8 +++
 src/Makefile   |   2 +-
 src/fsync-err.c| 161 +
 tests/generic/999  |  76 +
 tests/generic/999.out  |   3 +
 tests/generic/group|   1 +
 tools/dmerror  |  44 +
 8 files changed, 302 insertions(+), 6 deletions(-)
 create mode 100644 src/fsync-err.c
 create mode 100755 tests/generic/999
 create mode 100644 tests/generic/999.out
 create mode 100755 tools/dmerror

diff --git a/common/dmerror b/common/dmerror
index d46c5d0b7266..238baa213b1f 100644
--- a/common/dmerror
+++ b/common/dmerror
@@ -23,22 +23,25 @@ if [ $? -eq 0 ]; then
_notrun "Cannot run tests with DAX on dmerror devices"
 fi
 
-_dmerror_init()
+_dmerror_setup()
 {
local dm_backing_dev=$SCRATCH_DEV
 
-   $DMSETUP_PROG remove error-test > /dev/null 2>&1
-
local blk_dev_size=`blockdev --getsz $dm_backing_dev`
 
DMERROR_DEV='/dev/mapper/error-test'
 
DMLINEAR_TABLE="0 $blk_dev_size linear $dm_backing_dev 0"
 
+   DMERROR_TABLE="0 $blk_dev_size error $dm_backing_dev 0"
+}
+
+_dmerror_init()
+{
+   _dmerror_setup
+   $DMSETUP_PROG remove error-test > /dev/null 2>&1
$DMSETUP_PROG create error-test --table "$DMLINEAR_TABLE" || \
_fatal "failed to create dm linear device"
-
-   DMERROR_TABLE="0 $blk_dev_size error $dm_backing_dev 0"
 }
 
 _dmerror_mount()
diff --git a/doc/auxiliary-programs.txt b/doc/auxiliary-programs.txt
index 21ef118596b6..191ac0596511 100644
--- a/doc/auxiliary-programs.txt
+++ b/doc/auxiliary-programs.txt
@@ -16,6 +16,7 @@ note the dependency with:
 Contents:
 
  - af_unix -- Create an AF_UNIX socket
+ - fsync-err   -- tests fsync error reporting after failed writeback
  - open_by_handle  -- open_by_handle_at syscall exercise
  - stat_test   -- statx syscall exercise
  - t_dir_type  -- print directory entries and their file type
@@ -30,6 +31,13 @@ af_unix
 
The af_unix program creates an AF_UNIX socket at the given location.
 
+fsync-err
+   Specialized program for testing how the kernel reports errors that
+   occur during writeback. Works in conjunction with the dmerror script
+   in tools/ to write data to a device, and then force it to fail
+   writeback and test that errors are reported during fsync and cleared
+   afterward.
+
 open_by_handle
 
The open_by_handle program exercises the open_by_handle_at() system
diff --git a/src/Makefile b/src/Makefile
index 4ec01975f8f7..b79c4d84d31b 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -13,7 +13,7 @@ TARGETS = dirstress fill fill2 getpagesize holes lstat64 \
multi_open_unlink dmiperf unwritten_sync genhashnames t_holes \
t_mmap_writev t_truncate_cmtime dirhash_collide t_rename_overwrite \
holetest t_truncate_self t_mmap_dio af_unix t_mmap_stale_pmd \
-   t_mmap_cow_race
+   t_mmap_cow_race fsync-err
 
 LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
preallo_rw_pattern_writer ftrunc trunc fs_perms testx looptest \
diff --git a/src/fsync-err.c b/src/fsync-err.c
new file mode 100644
index ..cbeb37fb1790
--- /dev/null
+++ b/src/fsync-err.c
@@ -0,0 +1,161 @@
+/*
+ * fsync-err.c: test whether writeback errors are reported to all open fds
+ * and properly cleared as expected after being seen once on each
+ *
+ * Copyright (c) 2017: Jeff Layton 
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * btrfs has a fixed stripewidth of 64k, so we need to write enough data to
+ * ensure that we hit both stripes.
+ *
+ * FIXME: have the test script pass in the length?
+ */
+#define BUFSIZE (65 * 1024)
+
+/* FIXME: should this be tunable */
+#define NUM_FDS 10
+
+static void usage() {
+   fprintf(stderr, "Usage: fsync-err \n");
+}
+
+int main(int argc, char **argv)
+{
+   int fd[NUM_FDS], ret, i;
+   char *fname, *buf;
+
+   if (argc < 2) {
+   usage();
+   return 1;
+   }
+
+   /* First argument is filename */
+   fname =

[xfstests PATCH v3 0/5] add a test for reporting writeback errors across all fds on fsync

2017-05-31 Thread Jeff Layton

This patchset is a companion to the Linux kernel patch series I recently
posted with the cover letter:

[PATCH v5 00/17] fs: introduce new writeback error reporting and convert 
ext2 and ext4 to use it

That patchset adds a new userland-visible change to report errors on
all open file descriptions when there is an error on fsync, not just
the first one to race in.

Note that this set contains a patch to emulate $SCRATCH_LOGDEV on btrfs,
but the kernel patches for that are not quite ready yet. The test did
pass on btrfs in an earlier incarnation of the set, however.

Jeff Layton (5):
  generic: add a writeback error handling test
  ext4: allow ext4 to use $SCRATCH_LOGDEV
  generic: test writeback error handling on dmerror devices
  ext3: allow it to put journal on a separate device when doing
scratch_mkfs
  btrfs: allow it to use $SCRATCH_LOGDEV

 common/dmerror |  13 ++--
 common/rc  |  16 -
 doc/auxiliary-programs.txt |   8 +++
 src/Makefile   |   2 +-
 src/fsync-err.c| 161 +
 tests/generic/998  |  64 ++
 tests/generic/998.out  |   2 +
 tests/generic/999  |  76 +
 tests/generic/999.out  |   3 +
 tests/generic/group|   2 +
 tools/dmerror  |  44 +
 11 files changed, 384 insertions(+), 7 deletions(-)
 create mode 100644 src/fsync-err.c
 create mode 100755 tests/generic/998
 create mode 100644 tests/generic/998.out
 create mode 100755 tests/generic/999
 create mode 100644 tests/generic/999.out
 create mode 100755 tools/dmerror

-- 
2.9.4

[xfstests PATCH v3 2/5] ext4: allow ext4 to use $SCRATCH_LOGDEV

2017-05-31 Thread Jeff Layton

The writeback error handling test requires that you put the journal on a
separate device. This allows us to use dmerror to simulate data
writeback failure, without affecting the journal.

xfs already has infrastructure for this (à la $SCRATCH_LOGDEV), so wire
up the ext4 code so that it can do the same thing when _scratch_mkfs is
called.

Signed-off-by: Jeff Layton 
Reviewed-by: Darrick J. Wong 
---
 common/rc | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/common/rc b/common/rc
index 743df427c047..391d36f373cd 100644
--- a/common/rc
+++ b/common/rc
@@ -676,6 +676,9 @@ _scratch_mkfs_ext4()
local tmp=`mktemp`
local mkfs_status
 
+   [ "$USE_EXTERNAL" = yes -a ! -z "$SCRATCH_LOGDEV" ] && \
+   $mkfs_cmd -O journal_dev $SCRATCH_LOGDEV && \
+   mkfs_cmd="$mkfs_cmd -J device=$SCRATCH_LOGDEV"
 
_scratch_do_mkfs "$mkfs_cmd" "$mkfs_filter" $* 2>$tmp.mkfserr 
1>$tmp.mkfsstd
mkfs_status=$?
-- 
2.9.4

[xfstests PATCH v3 5/5] btrfs: allow it to use $SCRATCH_LOGDEV

2017-05-31 Thread Jeff Layton

With btrfs, we can't really put the log on a separate device. What we
can do however is mirror the metadata across two devices and make the
data striped across all devices. When we turn on dmerror then the
metadata can fall back to using the other mirror while the data errors
out.

Note that the current incarnation of btrfs has a fixed 64k stripe
width. If that ever changes or becomes settable, we may need to adjust
the amount of data that the test program writes.

Signed-off-by: Jeff Layton 
---
 common/rc | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/common/rc b/common/rc
index 83765aacfb06..078270451b53 100644
--- a/common/rc
+++ b/common/rc
@@ -830,6 +830,8 @@ _scratch_mkfs()
;;
btrfs)
mkfs_cmd="$MKFS_BTRFS_PROG"
+   [ "$USE_EXTERNAL" = yes -a ! -z "$SCRATCH_LOGDEV" ] && \
+   mkfs_cmd="$mkfs_cmd -d raid0 -m raid1 $SCRATCH_LOGDEV"
mkfs_filter="cat"
;;
ext3)
-- 
2.9.4

[xfstests PATCH v3 4/5] ext3: allow it to put journal on a separate device when doing scratch_mkfs

2017-05-31 Thread Jeff Layton

Signed-off-by: Jeff Layton 
---
 common/rc | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/common/rc b/common/rc
index 391d36f373cd..83765aacfb06 100644
--- a/common/rc
+++ b/common/rc
@@ -832,7 +832,16 @@ _scratch_mkfs()
mkfs_cmd="$MKFS_BTRFS_PROG"
mkfs_filter="cat"
;;
-   ext2|ext3)
+   ext3)
+   mkfs_cmd="$MKFS_PROG -t $FSTYP -- -F"
+   mkfs_filter="grep -v -e ^Warning: -e \"^mke2fs \""
+
+   # put journal on separate device?
+   [ "$USE_EXTERNAL" = yes -a ! -z "$SCRATCH_LOGDEV" ] && \
+   $mkfs_cmd -O journal_dev $SCRATCH_LOGDEV && \
+   mkfs_cmd="$mkfs_cmd -J device=$SCRATCH_LOGDEV"
+   ;;
+   ext2)
mkfs_cmd="$MKFS_PROG -t $FSTYP -- -F"
mkfs_filter="grep -v -e ^Warning: -e \"^mke2fs \""
;;
-- 
2.9.4

[PATCH v5 02/17] fs: new infrastructure for writeback error handling and reporting

2017-05-31 Thread Jeff Layton

Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.

The non-fsync callers are problematic. We should be reporting writeback
errors during fsync, but many places spread over the tree clear out
errors before they can be properly reported, or report errors at
nonsensical times.

If I get -EIO on a stat() call, there is no reason for me to assume that
it is because some previous writeback failed. The fact that it also
clears out the error such that a subsequent fsync returns 0 is a bug,
and a nasty one since that's potentially silent data corruption.

This patch adds a small bit of new infrastructure for setting and
reporting errors during address_space writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.

In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fds may not be associated with one another in any way. They could even
be in different containers, so ensuring coordination between all fsync
callers is not really an option.

One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.

This set adds an errseq_t to struct address_space, and a corresponding
one is added to struct file. Writeback errors are recorded in the
mapping's errseq_t, and the one in struct file is used as the "since"
value.

This changes the semantics of the Linux fsync implementation such that
applications can now use it to determine whether there were any
writeback errors since fsync(fd) was last called (or since the file was
opened in the case of fsync having never been called).

Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_error infrastructure.
This will at least prevent you from getting a false report of success.

The new behavior is still consistent with the POSIX spec, and is more
reliable for application developers. This patch just adds some basic
infrastructure for doing this, and ensures that the f_wb_err "cursor"
is properly set when a file is opened. Later patches will change the
existing code to use this new infrastructure.
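
To make the "since" cursor idea concrete, here is a deliberately simplified
userspace model. It is a sketch under assumptions, not the code in this
patch: all model_* names are hypothetical, a plain sequence counter stands
in for the errseq_t, and there is no locking or atomics.

```c
/* Shared per-inode state: the latest error plus a change counter. */
struct model_mapping {
	unsigned int seq;	/* bumped every time an error is recorded */
	int err;		/* most recent writeback error, 0 if none */
};

/* Per-open state: remembers which sequence value was last reported. */
struct model_file {
	struct model_mapping *mapping;
	unsigned int since;
};

/* Opening a file samples the current sequence as its starting cursor. */
static void model_open(struct model_file *f, struct model_mapping *m)
{
	f->mapping = m;
	f->since = m->seq;
}

/* Writeback failure: store the error and advance the sequence. */
static void model_writeback_error(struct model_mapping *m, int err)
{
	m->err = err;
	m->seq++;
}

/* Report the error once per file, then advance that file's cursor. */
static int model_fsync(struct model_file *f)
{
	if (f->since == f->mapping->seq)
		return 0;
	f->since = f->mapping->seq;
	return f->mapping->err;
}
```

In this model two files opened before a failure each get the error back
from their first model_fsync() call and 0 from the next one, while a file
opened after the failure starts with an up-to-date cursor and gets 0.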

Signed-off-by: Jeff Layton 
Reviewed-by: Jan Kara 
---
 drivers/dax/device.c |  1 +
 fs/block_dev.c   |  1 +
 fs/file_table.c  |  1 +
 fs/open.c|  3 +++
 include/linux/fs.h   | 53 
 mm/filemap.c | 38 +
 6 files changed, 97 insertions(+)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 006e657dfcb9..12943d19bfc4 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -499,6 +499,7 @@ static int dax_open(struct inode *inode, struct file *filp)
inode->i_mapping = __dax_inode->i_mapping;
inode->i_mapping->host = __dax_inode;
filp->f_mapping = inode->i_mapping;
+   filp->f_wb_err = filemap_sample_wb_err(filp->f_mapping);
filp->private_data = dev_dax;
inode->i_flags = S_DAX;
 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 51959936..4d62fe771587 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1743,6 +1743,7 @@ static int blkdev_open(struct inode * inode, struct file 
* filp)
return -ENOMEM;
 
filp->f_mapping = bdev->bd_inode->i_mapping;
+   filp->f_wb_err = filemap_sample_wb_err(filp->f_mapping);
 
return blkdev_get(bdev, filp->f_mode, filp);
 }
diff --git a/fs/file_table.c b/fs/file_table.c
index 954d510b765a..72e861a35a7f 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -168,6 +168,7 @@ struct file *alloc_file(const struct path *path, fmode_t 
mode,
file->f_path = *path;
file->f_inode = path->dentry->d_inode;
file->f_mapping = path->dentry->d_inode->i_mapping;
+   file->f_wb_err = filemap_sample_wb_err(file->f_mapping);
if ((mode & FMODE_READ) &&
 likely(fop->read || fop->read_iter))
mode |= FMODE_CAN_READ;
diff --git a/fs/open.c b/fs/open.c
index cd0c5be8d012..280d4a963791 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -707,6

[PATCH v5 01/17] lib: add errseq_t type and infrastructure for handling it

2017-05-31 Thread Jeff Layton

An errseq_t is a way of recording errors in one place, and allowing any
number of "subscribers" to tell whether an error has been set again
since a previous time.

It's implemented as an unsigned 32-bit value that is managed with atomic
operations. The low order bits are designated to hold an error code
(max size of MAX_ERRNO). The upper bits are used as a counter.

The API works with consumers sampling an errseq_t value at a particular
point in time. Later, that value can be used to tell whether new errors
have been set since that time.

Note that there is a 1 in 512k risk of collisions here if new errors
are being recorded frequently, since we have so few bits to use as a
counter. To mitigate this, one bit is used as a flag to tell whether the
value has been sampled since a new value was recorded. That allows
us to avoid bumping the counter if no one has sampled it since it
was last bumped.

Later patches will build on this infrastructure to change how writeback
errors are tracked in the kernel.
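
As a rough userspace illustration of the layout described above (not the
code in lib/errseq.c), the sketch below packs an error code, a "seen" flag
and a counter into 32 bits. It assumes MAX_ERRNO is 4095, which gives a
12-bit error field, one flag bit and 19 counter bits, i.e. 2^19 = 512k
counter values. All MODEL_* and model_* names are hypothetical, and plain
loads and stores stand in for the atomic operations.

```c
#include <stdint.h>

#define MODEL_MAX_ERRNO	4095u			/* assumed value of MAX_ERRNO */
#define MODEL_SHIFT	12u			/* ilog2(MODEL_MAX_ERRNO + 1) */
#define MODEL_SEEN	(1u << MODEL_SHIFT)	/* "value was sampled" flag */
#define MODEL_CTR_INC	(1u << (MODEL_SHIFT + 1)) /* lowest counter bit */
#define MODEL_CTR_BITS	(32u - MODEL_SHIFT - 1u)  /* bits left for counter */

typedef uint32_t model_errseq_t;

/* Record a negative errno; bump the counter only if someone sampled. */
static void model_errseq_set(model_errseq_t *eseq, int err)
{
	model_errseq_t old = *eseq;
	model_errseq_t next;

	/* Keep the counter bits, drop the old error and the seen flag. */
	next = (old & ~(MODEL_MAX_ERRNO | MODEL_SEEN)) | (uint32_t)(-err);
	if (old & MODEL_SEEN)
		next += MODEL_CTR_INC;
	*eseq = next;
}

/* Sample the current value, marking it as seen if an error is stored. */
static model_errseq_t model_errseq_sample(model_errseq_t *eseq)
{
	if (*eseq != 0)
		*eseq |= MODEL_SEEN;
	return *eseq;
}

/* Return the stored error if the value changed since the sample. */
static int model_errseq_check(model_errseq_t *eseq, model_errseq_t since)
{
	if (*eseq == since)
		return 0;
	return -(int)(*eseq & MODEL_MAX_ERRNO);
}
```

Because the counter only advances when the previous value was sampled,
repeated errors that nobody has looked at do not consume counter space.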

Signed-off-by: Jeff Layton 
Reviewed-by: NeilBrown 
---
 include/linux/errseq.h |  19 +
 lib/Makefile   |   2 +-
 lib/errseq.c   | 200 +
 3 files changed, 220 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/errseq.h
 create mode 100644 lib/errseq.c

diff --git a/include/linux/errseq.h b/include/linux/errseq.h
new file mode 100644
index ..0d2555f310cd
--- /dev/null
+++ b/include/linux/errseq.h
@@ -0,0 +1,19 @@
+#ifndef _LINUX_ERRSEQ_H
+#define _LINUX_ERRSEQ_H
+
+/* See lib/errseq.c for more info */
+
+typedef u32 errseq_t;
+
+void __errseq_set(errseq_t *eseq, int err);
+static inline void errseq_set(errseq_t *eseq, int err)
+{
+   /* Optimize for the common case of no error */
+   if (unlikely(err))
+   __errseq_set(eseq, err);
+}
+
+errseq_t errseq_sample(errseq_t *eseq);
+int errseq_check(errseq_t *eseq, errseq_t since);
+int errseq_check_and_advance(errseq_t *eseq, errseq_t *since);
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 0166fbc0fa81..519782d9ca3f 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -41,7 +41,7 @@ obj-y += bcd.o div64.o sort.o parser.o debug_locks.o 
random32.o \
 gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
 percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
-once.o refcount.o usercopy.o
+once.o refcount.o usercopy.o errseq.o
 obj-y += string_helpers.o
 obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
 obj-y += hexdump.o
diff --git a/lib/errseq.c b/lib/errseq.c
new file mode 100644
index ..d129c0611c1f
--- /dev/null
+++ b/lib/errseq.c
@@ -0,0 +1,200 @@
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * An errseq_t is a way of recording errors in one place, and allowing any
+ * number of "subscribers" to tell whether it has changed since a previous
+ * point where it was sampled.
+ *
+ * It's implemented as an unsigned 32-bit value. The low order bits are
+ * designated to hold an error code (between 0 and -MAX_ERRNO). The upper bits
+ * are used as a counter. This is done with atomics instead of locking so that
+ * these functions can be called from any context.
+ *
+ * The general idea is for consumers to sample an errseq_t value. That value
+ * can later be used to tell whether any new errors have occurred since that
+ * sampling was done.
+ *
+ * Note that there is a risk of collisions if new errors are being recorded
+ * frequently, since we have so few bits to use as a counter.
+ *
+ * To mitigate this, one bit is used as a flag to tell whether the value has
+ * been sampled since a new value was recorded. That allows us to avoid bumping
+ * the counter if no one has sampled it since the last time an error was
+ * recorded.
+ *
+ * A new errseq_t should always be zeroed out.  A errseq_t value of all zeroes
+ * is the special (but common) case where there has never been an error. An all
+ * zero value thus serves as the "epoch" if one wishes to know whether there
+ * has ever been an error set since it was first initialized.
+ */
+
+/* The low bits are designated for error code (max of MAX_ERRNO) */
+#define ERRSEQ_SHIFT   ilog2(MAX_ERRNO + 1)
+
+/* This bit is used as a flag to indicate whether the value has been seen */
+#define ERRSEQ_SEEN (1 << ERRSEQ_SHIFT)
+
+/* The lowest bit of the counter */
+#define ERRSEQ_CTR_INC (1 << (ERRSEQ_SHIFT + 1))
+
+/**
+ * __errseq_set - set a errseq_t for later reporting
+ * @eseq: errseq_t field that should be set
+ * @err: error to set
+ *
+ * This function sets the error in *eseq, and increments the sequence counter
+ * if the last sequence was sampled at some point in the past.
+ *
+ * Any error set will always overwrite an existing error.
+ *
+ * Most callers will want to use the errseq_set inline wrapper to

[PATCH v5 04/17] fs: add a new fstype flag to indicate how writeback errors are tracked

2017-05-31 Thread Jeff Layton

Now that we have new infrastructure for handling writeback errors using
errseq_t, we need to convert the existing code to use it. We could
attempt to retrofit the old interfaces on top of the new, but there is
a conceptual disconnect here in the case of internal callers that
invoke filemap_fdatawait and the like.

When reporting writeback errors, we will always report errors that have
occurred since a particular point in time. With the old writeback error
reporting, the time we used was "since it was last tested/cleared" which
is entirely arbitrary and potentially racy. Now, we can report the
latest error that has occurred since an arbitrary point in time
(represented as a sampled errseq_t value).

This means that we need to touch each filesystem that calls
filemap_check_errors in some fashion and ensure that we establish sane
"since" values for those callers. But...some code is shared between
filesystems and needs to be able to handle both error tracking schemes.

Add a new FS_WB_ERRSEQ flag to the fstype. When mapping_set_error is
called, set mapping->wb_err if it's set, along with setting the
"legacy" AS_EIO/AS_ENOSPC flags. When calling filemap_report_wb_err,
always clear the legacy flags out as well.

This should allow subsystems to use the new errseq_t based error
reporting while simultaneously allowing the traditional semantics of
AS_EIO/AS_ENOSPC flags.

Eventually, this flag should be removed once everything is converted
to errseq_t based error tracking.
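
The dual recording described above can be modeled in userspace as follows.
This is a sketch under assumptions, not the kernel code: the sketch_* names
and SKETCH_* bit values are hypothetical stand-ins, a plain int replaces
the errseq_t, and the report helper skips the per-file "since" comparison
to keep the focus on the legacy flags.

```c
#include <errno.h>

#define SKETCH_FS_WB_ERRSEQ	16u		/* stand-in for the fstype flag */
#define SKETCH_AS_EIO		(1u << 0)	/* stand-in legacy flag bits */
#define SKETCH_AS_ENOSPC	(1u << 1)

struct sketch_mapping {
	unsigned int fs_flags;	/* flags of the owning filesystem type */
	unsigned int flags;	/* legacy AS_* style error flags */
	int wb_err;		/* stand-in for the errseq_t in the mapping */
};

/* Record an error in the new field (if opted in) and in the legacy flags. */
static void sketch_mapping_set_error(struct sketch_mapping *m, int error)
{
	if (!error)
		return;

	if (m->fs_flags & SKETCH_FS_WB_ERRSEQ)
		m->wb_err = error;

	if (error == -ENOSPC)
		m->flags |= SKETCH_AS_ENOSPC;
	else
		m->flags |= SKETCH_AS_EIO;
}

/* Report the recorded error and always clear the legacy flags. */
static int sketch_report_wb_err(struct sketch_mapping *m)
{
	int err = m->wb_err;

	m->flags &= ~(SKETCH_AS_EIO | SKETCH_AS_ENOSPC);
	return err;
}
```

A filesystem type without the flag only ever sets the legacy bits in this
model, which corresponds to the traditional behavior.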

Signed-off-by: Jeff Layton 
---
 include/linux/fs.h  |  1 +
 include/linux/pagemap.h | 32 ++--
 mm/filemap.c|  7 +++
 3 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 293cbc7f3520..2f3bcf4eb73b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2021,6 +2021,7 @@ struct file_system_type {
 #define FS_BINARY_MOUNTDATA 2
 #define FS_HAS_SUBTYPE 4
 #define FS_USERNS_MOUNT 8   /* Can be mounted by userns 
root */
+#define FS_WB_ERRSEQ   16  /* errseq_t writeback err tracking */
 #define FS_RENAME_DOES_D_MOVE  32768   /* FS will handle d_move() during 
rename() internally. */
struct dentry *(*mount) (struct file_system_type *, int,
   const char *, void *);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 316a19f6b635..1dbc2dd6fdd2 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -28,14 +28,34 @@ enum mapping_flags {
AS_NO_WRITEBACK_TAGS = 5,
 };
 
+/**
+ * mapping_set_error - record a writeback error in the address_space
+ * @mapping - the mapping in which an error should be set
+ * @error - the error to set in the mapping
+ *
+ * When writeback fails in some way, we must record that error so that
+ * userspace can be informed when fsync and the like are called.  We endeavor
+ * to report errors on any file that was open at the time of the error.  Some
+ * internal callers also need to know when writeback errors have occurred.
+ *
+ * When a writeback error occurs, most filesystems will want to call
+ * mapping_set_error to record the error in the mapping so that it can be
+ * reported when the application calls fsync(2).
+ */
 static inline void mapping_set_error(struct address_space *mapping, int error)
 {
-   if (unlikely(error)) {
-   if (error == -ENOSPC)
-   set_bit(AS_ENOSPC, &mapping->flags);
-   else
-   set_bit(AS_EIO, &mapping->flags);
-   }
+   if (likely(!error))
+   return;
+
+   /* Record it in wb_err if fs is using errseq_t based error tracking */
+   if (mapping->host->i_sb->s_type->fs_flags & FS_WB_ERRSEQ)
+   filemap_set_wb_err(mapping, error);
+
+   /* Unconditionally record it in flags for now, for legacy callers */
+   if (error == -ENOSPC)
+   set_bit(AS_ENOSPC, &mapping->flags);
+   else
+   set_bit(AS_EIO, &mapping->flags);
 }
 
 static inline void mapping_set_unevictable(struct address_space *mapping)
diff --git a/mm/filemap.c b/mm/filemap.c
index c5e19ea0bf12..97dc28f853fc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -580,6 +580,13 @@ int filemap_report_wb_err(struct file *file)
trace_filemap_report_wb_err(file, old);
spin_unlock(&file->f_lock);
}
+
+   /* Now clear the AS_* flags if any are set */
+   if (test_bit(AS_ENOSPC, &mapping->flags))
+   clear_bit(AS_ENOSPC, &mapping->flags);
+   if (test_bit(AS_EIO, &mapping->flags))
+   clear_bit(AS_EIO, &mapping->flags);
+
return err;
 }
 EXPORT_SYMBOL(filemap_report_wb_err);
-- 
2.9.4
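
As a rough userspace illustration of the dual bookkeeping described in the FS_WB_ERRSEQ patch above, the sketch below models a mapping that records an error both in an errseq-style field and in the legacy flag bits. Every `model_*` and `MODEL_*` name is a hypothetical stand-in, and the packed errseq_t encoding is reduced to a plain error value plus a counter, so this is an assumption-laden sketch rather than the kernel implementation.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; not kernel definitions. */
#define MODEL_AS_EIO     (1u << 0)
#define MODEL_AS_ENOSPC  (1u << 1)

struct model_mapping {
    bool     uses_errseq; /* models FS_WB_ERRSEQ being set in fs_flags */
    unsigned flags;       /* models the legacy AS_EIO/AS_ENOSPC bits */
    int      wb_err;      /* models the error stored in mapping->wb_err */
    unsigned wb_seq;      /* simplified sequence counter (assumption) */
};

/* Mirrors the shape of the patched mapping_set_error(). */
static void model_mapping_set_error(struct model_mapping *m, int error)
{
    if (!error)
        return;

    /* Record in the errseq-style field only if the fs opted in. */
    if (m->uses_errseq) {
        m->wb_err = error;
        m->wb_seq++;
    }

    /* Always record in the legacy flags, for legacy callers. */
    if (error == -ENOSPC)
        m->flags |= MODEL_AS_ENOSPC;
    else
        m->flags |= MODEL_AS_EIO;
}

/* Helper: legacy flag bits after a single error is recorded. */
static unsigned model_flags_after(bool uses_errseq, int error)
{
    struct model_mapping m = { .uses_errseq = uses_errseq };

    model_mapping_set_error(&m, error);
    return m.flags;
}

/* Helper: errseq-style error after a single error is recorded. */
static int model_wb_err_after(bool uses_errseq, int error)
{
    struct model_mapping m = { .uses_errseq = uses_errseq };

    model_mapping_set_error(&m, error);
    return m.wb_err;
}
```

The two helpers only exist to make the behaviour easy to check: the legacy bits are set regardless of the flag, while the errseq-style field is touched only when the filesystem has opted in.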

[PATCH v5 05/17] Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors

2017-05-31 Thread Jeff Layton

I waxed a little loquacious here, but I figured that more detail was
better, and writeback error handling is so hard to get right.

Although I think we'll eventually remove it once the transition is
complete, I've gone ahead and documented the FS_WB_ERRSEQ flag as well.

Cc: Jan Kara 
Signed-off-by: Jeff Layton 
---
 Documentation/filesystems/vfs.txt | 50 ---
 1 file changed, 47 insertions(+), 3 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index f42b90687d40..c3efdd833a3d 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -576,7 +576,49 @@ should clear PG_Dirty and set PG_Writeback.  It can be actually
 written at any point after PG_Dirty is clear.  Once it is known to be
 safe, PG_Writeback is cleared.
 
-Writeback makes use of a writeback_control structure...
+Writeback makes use of a writeback_control structure to direct the
+operations.  This gives the writepage and writepages operations some
+information about the nature of and reason for the writeback request,
+and the constraints under which it is being done.  It is also used to
+return information back to the caller about the result of a writepage or
+writepages request.
+
+Handling errors during writeback
+
+Most applications that utilize the pagecache will periodically call
+fsync to ensure that data written has made it to the backing store.
+When there is an error during writeback, expect that error to be
+reported when fsync is called.  After an error has been reported to
+fsync, subsequent fsync calls on the same file descriptor should return
+0, unless further writeback errors have occurred since the previous
+fsync.
+
+Ideally, the kernel would report an error only on file descriptions on
+which writes were done that subsequently failed to be written back.  The
+generic pagecache infrastructure does not track the file descriptions
+that have dirtied each individual page however, so determining which
+file descriptors should get back an error is not possible.
+
+Instead, the generic writeback error tracking infrastructure in the
+kernel settles for reporting errors to fsync on all file descriptions
+that were open at the time that the error occurred.  In a situation with
+multiple writers, all of them will get back an error on a subsequent fsync,
+even if all of the writes done through that particular file descriptor
+succeeded (or even if there were no writes on that file descriptor at all).
+
+Filesystems that wish to use this infrastructure should call
+filemap_set_wb_err to record the error in the address_space when it
+occurs.  Then, at the end of their fsync operation, they should call
+filemap_report_wb_err to ensure that the struct file's error cursor
+has advanced to the correct point in the stream of errors emitted by
+the backing device(s).
+
+Older kernels used a different method for tracking errors, based on flags
+in the address_space. We're currently switching everything over to use
+the infrastructure based on errseq_t values. During the transition,
+filesystem authors will want to also ensure their file_system_type has
+FS_WB_ERRSEQ set in fs_flags to ensure that shared infrastructure is
+aware of the model in use.
 
 struct address_space_operations
 ---
@@ -804,7 +846,8 @@ struct address_space_operations {
 The File Object
 ===
 
-A file object represents a file opened by a process.
+A file object represents a file opened by a process. This is also known
+as an "open file description" in POSIX parlance.
 
 
 struct file_operations
@@ -887,7 +930,8 @@ otherwise noted.
 
   release: called when the last reference to an open file is closed
 
-  fsync: called by the fsync(2) system call
+  fsync: called by the fsync(2) system call. Also see the section above
+entitled "Handling errors during writeback".
 
   fasync: called by the fcntl(2) system call when asynchronous
(non-blocking) mode is enabled for a file
-- 
2.9.4
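
To make the reporting semantics from the vfs.txt text above a little more concrete, here is a minimal userspace model of the per-file error cursor. It assumes a much simpler representation than the real errseq_t (a plain error plus a counter), and all `model_*` names are hypothetical. It follows the wording of the documentation patch: files open when the error occurs see it once on fsync, and a later opener does not.

```c
#include <errno.h>

/* Hypothetical userspace model; the real errseq_t packs error and counter. */
struct model_mapping {
    int      err;  /* most recent writeback error */
    unsigned seq;  /* bumped on every recorded error */
};

struct model_file {
    struct model_mapping *mapping;
    unsigned              since;  /* cursor sampled at open time */
};

/* Sample the mapping's current position when the file is opened. */
static void model_open(struct model_file *f, struct model_mapping *m)
{
    f->mapping = m;
    f->since = m->seq;
}

/* Plays the role the text assigns to filemap_set_wb_err. */
static void model_set_wb_err(struct model_mapping *m, int err)
{
    m->err = err;
    m->seq++;
}

/* Plays the role of the report step at the end of fsync. */
static int model_fsync(struct model_file *f)
{
    if (f->since == f->mapping->seq)
        return 0;               /* nothing new since the last report */
    f->since = f->mapping->seq; /* advance the cursor */
    return f->mapping->err;
}

/*
 * Fixed scenario: files a and b are open when an error occurs, file c is
 * opened afterwards. Returns the result of the selected fsync call:
 *   0: first fsync on a    1: second fsync on a
 *   2: first fsync on b    3: first fsync on c
 */
static int model_scenario(int step)
{
    struct model_mapping m = { 0, 0 };
    struct model_file a, b, c;
    int res[4];

    model_open(&a, &m);
    model_open(&b, &m);
    model_set_wb_err(&m, -EIO);
    model_open(&c, &m);

    res[0] = model_fsync(&a);
    res[1] = model_fsync(&a);
    res[2] = model_fsync(&b);
    res[3] = model_fsync(&c);
    return res[step];
}
```

Note that file b gets the error even though this scenario never writes through it, which is the trade-off the documentation describes.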

[PATCH v5 14/17] ext4: convert to errseq_t based error tracking

2017-05-31 Thread Jeff Layton

Sample the block device inode's errseq_t when opening a file, so we can
catch metadata writeback errors at fsync time. Change ext4_sync_file to
check for data errors first, and then check the blockdev for metadata
errors afterward.

There are also several internal callers of filemap_write_and_wait_* that
check the error code afterward. Convert them to the "_since" variants,
using the file->f_wb_err value as the "since" value. This means passing
file pointers to several functions instead of inode pointers.

Note that because metadata writeback errors are only tracked on a
per-device level, this does mean that we'll end up reporting an error on
all open file descriptors when there is a metadata writeback failure.

Signed-off-by: Jeff Layton 
---
 fs/ext4/dir.c     |  8 ++--
 fs/ext4/ext4.h    |  8
 fs/ext4/extents.c | 24 ++--
 fs/ext4/file.c    |  5 -
 fs/ext4/fsync.c   | 23 ++-
 fs/ext4/inode.c   | 19 ---
 fs/ext4/ioctl.c   |  9 +
 fs/ext4/super.c   |  9 +
 8 files changed, 68 insertions(+), 37 deletions(-)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index e8b365000d73..6bbb19510f74 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -611,9 +611,13 @@ static int ext4_dx_readdir(struct file *file, struct dir_context *ctx)
 
 static int ext4_dir_open(struct inode * inode, struct file * filp)
 {
+   int ret = 0;
+
if (ext4_encrypted_inode(inode))
-   return fscrypt_get_encryption_info(inode) ? -EACCES : 0;
-   return 0;
+   ret = fscrypt_get_encryption_info(inode) ? -EACCES : 0;
+   if (!ret)
+   filp->f_md_wb_err = filemap_sample_wb_err(inode->i_sb->s_bdev->bd_inode->i_mapping);
+   return ret;
 }
 
 static int ext4_release_dir(struct inode *inode, struct file *filp)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8e8046104f4d..e3ab27db43d0 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2468,12 +2468,12 @@ extern void ext4_clear_inode(struct inode *);
 extern int  ext4_file_getattr(const struct path *, struct kstat *, u32, unsigned int);
 extern int  ext4_sync_inode(handle_t *, struct inode *);
 extern void ext4_dirty_inode(struct inode *, int);
-extern int ext4_change_inode_journal_flag(struct inode *, int);
+extern int ext4_change_inode_journal_flag(struct file *, int);
 extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *);
 extern int ext4_inode_attach_jinode(struct inode *inode);
 extern int ext4_can_truncate(struct inode *inode);
 extern int ext4_truncate(struct inode *);
-extern int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length);
+extern int ext4_punch_hole(struct file *file, loff_t offset, loff_t length);
 extern int ext4_truncate_restart_trans(handle_t *, struct inode *, int nblocks);
 extern void ext4_set_inode_flags(struct inode *);
 extern int ext4_alloc_da_blocks(struct inode *inode);
@@ -3143,8 +3143,8 @@ extern ext4_lblk_t ext4_ext_next_allocated_block(struct ext4_ext_path *path);
 extern int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
__u64 start, __u64 len);
 extern int ext4_ext_precache(struct inode *inode);
-extern int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len);
-extern int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len);
+extern int ext4_collapse_range(struct file *file, loff_t offset, loff_t len);
+extern int ext4_insert_range(struct file *file, loff_t offset, loff_t len);
 extern int ext4_swap_extents(handle_t *handle, struct inode *inode1,
struct inode *inode2, ext4_lblk_t lblk1,
 ext4_lblk_t lblk2,  ext4_lblk_t count,
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 2a97dff87b96..7e108fda9ae9 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4934,17 +4934,17 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
return -EOPNOTSUPP;
 
if (mode & FALLOC_FL_PUNCH_HOLE)
-   return ext4_punch_hole(inode, offset, len);
+   return ext4_punch_hole(file, offset, len);
 
ret = ext4_convert_inline_data(inode);
if (ret)
return ret;
 
if (mode & FALLOC_FL_COLLAPSE_RANGE)
-   return ext4_collapse_range(inode, offset, len);
+   return ext4_collapse_range(file, offset, len);
 
if (mode & FALLOC_FL_INSERT_RANGE)
-   return ext4_insert_range(inode, offset, len);
+   return ext4_insert_range(file, offset, len);
 
if (mode & FALLOC_FL_ZERO_RANGE)
return ext4_zero_range(file, offset, len, mode);
@@ -5444,14 +5444,16 @@ ext4_ext_shift_extents(struct inode *inode, handle_t *handle,
  * This implements the fallocate's collapse range functionality for ext4
  * Returns: 0 and non-zero on error.
  */
-int ext4_collapse_range(struct inode *inode,

[PATCH v5 12/17] fs: allow __generic_file_fsync to support both flavors of error reporting

2017-05-31 Thread Jeff Layton

For now, we add a FS_WB_ERRSEQ check to know how to handle it.

Signed-off-by: Jeff Layton 
---
 fs/libfs.c | 26 +++---
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 1dec90819366..2ae58a252718 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -971,10 +971,18 @@ int __generic_file_fsync(struct file *file, loff_t start, loff_t end,
 int datasync)
 {
struct inode *inode = file->f_mapping->host;
-   int err;
-   int ret;
-
-   err = filemap_write_and_wait_range(inode->i_mapping, start, end);
+   int err, ret;
+   bool use_errseq = inode->i_sb->s_type->fs_flags & FS_WB_ERRSEQ;
+   errseq_t since;
+
+   if (use_errseq) {
+   since = READ_ONCE(file->f_wb_err);
+   err = filemap_write_and_wait_range_since(inode->i_mapping,
+   start, end, since);
+   } else {
+   err = filemap_write_and_wait_range(inode->i_mapping,
+   start, end);
+   }
if (err)
return err;
 
@@ -988,11 +996,15 @@ int __generic_file_fsync(struct file *file, loff_t start, loff_t end,
err = sync_inode_metadata(inode, 1);
if (ret == 0)
ret = err;
-
 out:
inode_unlock(inode);
-   err = filemap_check_errors(inode->i_mapping);
-   return ret ? ret : err;
+   if (ret == 0) {
+   if (use_errseq)
+   ret = filemap_check_wb_err(inode->i_mapping, since);
+   else
+   ret = filemap_check_errors(inode->i_mapping);
+   }
+   return ret;
 }
 EXPORT_SYMBOL(__generic_file_fsync);
 
-- 
2.9.4

[PATCH v5 09/17] block: convert to errseq_t based writeback error tracking

2017-05-31 Thread Jeff Layton

Fairly straightforward conversion. In fsync, just use the file->f_wb_err
value as a "since" value. At the end, call filemap_report_wb_err to
advance the cursor in it.

Signed-off-by: Jeff Layton 
---
 fs/block_dev.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4d62fe771587..0d5f849e2a18 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -622,11 +622,13 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync)
 {
struct inode *bd_inode = bdev_file_inode(filp);
struct block_device *bdev = I_BDEV(bd_inode);
-   int error;
+   int error, wberr;
+   errseq_t since = READ_ONCE(filp->f_wb_err);

-   error = filemap_write_and_wait_range(filp->f_mapping, start, end);
+   error = filemap_write_and_wait_range_since(filp->f_mapping, start,
+   end, since);
if (error)
-   return error;
+   goto out;
 
/*
 * There is no need to serialise calls to blkdev_issue_flush with
@@ -637,6 +639,10 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync)
if (error == -EOPNOTSUPP)
error = 0;
 
+out:
+   wberr = filemap_report_wb_err(filp);
+   if (!error)
+   error = wberr;
return error;
 }
 EXPORT_SYMBOL(blkdev_fsync);
@@ -801,6 +807,7 @@ static struct file_system_type bd_type = {
.name   = "bdev",
.mount  = bd_mount,
.kill_sb= kill_anon_super,
+   .fs_flags   = FS_WB_ERRSEQ,
 };
 
 struct super_block *blockdev_superblock __read_mostly;
-- 
2.9.4
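
The error-precedence logic in the blkdev_fsync hunk above can be sketched in isolation. The function below is a hypothetical userspace model (the name `model_blkdev_fsync` and its three integer inputs are assumptions made for illustration): it takes the results of the write-and-wait step, the flush step, and the cursor-advancing report step, and combines them the way the patched function appears to.

```c
#include <errno.h>

/*
 * Hypothetical model of the control flow in the patched blkdev_fsync():
 *   write_err - result of the write-and-wait step
 *   flush_err - result of the device flush (only reached if write_err == 0)
 *   wb_err    - result of the report step that advances the file's cursor
 */
static int model_blkdev_fsync(int write_err, int flush_err, int wb_err)
{
    int error = write_err;

    if (!error) {
        error = flush_err;
        /* A device without flush support is not treated as a failure. */
        if (error == -EOPNOTSUPP)
            error = 0;
    }

    /*
     * Corresponds to the "out:" label: the report step always runs so the
     * cursor advances, but its result is used only if nothing failed earlier.
     */
    if (!error)
        error = wb_err;
    return error;
}
```

The design point is that the earliest failure wins, while the cursor is still advanced on every path so a stale error is not reported again on the next fsync.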

[PATCH v5 10/17] block: add sync_blockdev_since and sync_filesystem_since

2017-05-31 Thread Jeff Layton

New variants of sync_filesystem and sync_blockdev.

Signed-off-by: Jeff Layton 
---
 fs/block_dev.c     | 15 +++
 fs/internal.h      |  8
 fs/sync.c          | 45 +
 include/linux/fs.h | 13 -
 4 files changed, 80 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 0d5f849e2a18..9da613ec1665 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -452,6 +452,15 @@ int __sync_blockdev(struct block_device *bdev, int wait)
return filemap_write_and_wait(bdev->bd_inode->i_mapping);
 }
 
+int __sync_blockdev_since(struct block_device *bdev, int wait, errseq_t since)
+{
+   if (!bdev)
+   return 0;
+   if (!wait)
+   return filemap_flush(bdev->bd_inode->i_mapping);
+   return filemap_write_and_wait_since(bdev->bd_inode->i_mapping, since);
+}
+
 /*
  * Write out and wait upon all the dirty data associated with a block
  * device via its mapping.  Does not take the superblock lock.
@@ -462,6 +471,12 @@ int sync_blockdev(struct block_device *bdev)
 }
 EXPORT_SYMBOL(sync_blockdev);
 
+int sync_blockdev_since(struct block_device *bdev, errseq_t since)
+{
+   return __sync_blockdev_since(bdev, 1, since);
+}
+EXPORT_SYMBOL(sync_blockdev_since);
+
 /*
  * Write out and wait upon all dirty data associated with this
  * device.   Filesystem data as well as the underlying block
diff --git a/fs/internal.h b/fs/internal.h
index 9676fe11c093..234343ba8af7 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -25,6 +25,8 @@ struct shrink_control;
 extern void __init bdev_cache_init(void);
 
 extern int __sync_blockdev(struct block_device *bdev, int wait);
+extern int __sync_blockdev_since(struct block_device *bdev, int wait,
+   errseq_t since);
 
 #else
 static inline void bdev_cache_init(void)
@@ -35,6 +37,12 @@ static inline int __sync_blockdev(struct block_device *bdev, int wait)
 {
return 0;
 }
+
+static inline int __sync_blockdev_since(struct block_device *bdev, int wait,
+   errseq_t since)
+{
+   return 0;
+}
 #endif
 
 /*
diff --git a/fs/sync.c b/fs/sync.c
index 819a81526714..2a8202f9eb21 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -71,6 +71,51 @@ int sync_filesystem(struct super_block *sb)
 }
 EXPORT_SYMBOL(sync_filesystem);
 
+static int __sync_filesystem_since(struct super_block *sb, int wait,
+   errseq_t since)
+{
+   int fs_ret = 0, bd_ret;
+
+   if (wait)
+   sync_inodes_sb(sb);
+   else
+   writeback_inodes_sb(sb, WB_REASON_SYNC);
+
+   if (sb->s_op->sync_fs)
+   fs_ret = sb->s_op->sync_fs(sb, wait);
+   bd_ret = __sync_blockdev_since(sb->s_bdev, wait, since);
+
+   return fs_ret ? fs_ret : bd_ret;
+}
+
+/*
+ * Write out and wait upon all dirty data associated with this
+ * superblock.  Filesystem data as well as the underlying block
+ * device.  Takes the superblock lock.
+ */
+int sync_filesystem_since(struct super_block *sb, errseq_t since)
+{
+   int ret;
+
+   /*
+* We need to be protected against the filesystem going from
+* r/o to r/w or vice versa.
+*/
+   WARN_ON(!rwsem_is_locked(&sb->s_umount));
+
+   /*
+* No point in syncing out anything if the filesystem is read-only.
+*/
+   if (sb->s_flags & MS_RDONLY)
+   return 0;
+
+   ret = __sync_filesystem_since(sb, 0, since);
+   if (ret < 0)
+   return ret;
+   return __sync_filesystem_since(sb, 1, since);
+}
+EXPORT_SYMBOL(sync_filesystem_since);
+
 static void sync_inodes_one_sb(struct super_block *sb, void *arg)
 {
if (!(sb->s_flags & MS_RDONLY))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7d1bd3163d99..f483c23866c4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2376,6 +2376,7 @@ extern void bdput(struct block_device *);
 extern void invalidate_bdev(struct block_device *);
 extern void iterate_bdevs(void (*)(struct block_device *, void *), void *);
 extern int sync_blockdev(struct block_device *bdev);
+extern int sync_blockdev_since(struct block_device *bdev, errseq_t since);
 extern void kill_bdev(struct block_device *);
 extern struct super_block *freeze_bdev(struct block_device *);
 extern void emergency_thaw_all(void);
@@ -2390,7 +2391,16 @@ static inline bool sb_is_blkdev_sb(struct super_block *sb)
 }
 #else
 static inline void bd_forget(struct inode *inode) {}
-static inline int sync_blockdev(struct block_device *bdev) { return 0; }
+static inline int sync_blockdev(struct block_device *bdev)
+{
+   return 0;
+}
+
+static inline int sync_blockdev_since(struct block_device *bdev,
+   errseq_t since)
+{
+   return 0;
+}
 static inline void kill_bdev(struct block_device *bdev) {}
 static inline void invalidate_bdev(struct

[PATCH v5 15/17] fs: add a write_one_page_since

2017-05-31 Thread Jeff Layton

Allow filesystems to pass in an errseq_t for a since value.

Signed-off-by: Jeff Layton 
---
 include/linux/mm.h  |  2 ++
 mm/page-writeback.c | 53 +
 2 files changed, 43 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ca9c8b27cecb..c901d7313374 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct mempolicy;
 struct anon_vma;
@@ -2200,6 +2201,7 @@ extern int filemap_page_mkwrite(struct vm_fault *vmf);
 
 /* mm/page-writeback.c */
 int __must_check write_one_page(struct page *page);
+int __must_check write_one_page_since(struct page *page, errseq_t since);
 void task_dirty_inc(struct task_struct *tsk);
 
 /* readahead.c */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index e369e8ea2a29..63058e35c60d 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2365,19 +2365,10 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
return ret;
 }
 
-/**
- * write_one_page - write out a single page and wait on I/O
- * @page: the page to write
- *
- * The page must be locked by the caller and will be unlocked upon return.
- *
- * Note that the mapping's AS_EIO/AS_ENOSPC flags will be cleared when this
- * function returns.
- */
-int write_one_page(struct page *page)
+static int __write_one_page(struct page *page)
 {
struct address_space *mapping = page->mapping;
-   int ret = 0, ret2;
+   int ret;
struct writeback_control wbc = {
.sync_mode = WB_SYNC_ALL,
.nr_to_write = 1,
@@ -2394,16 +2385,54 @@ int write_one_page(struct page *page)
wait_on_page_writeback(page);
put_page(page);
} else {
+   ret = 0;
unlock_page(page);
}
+   return ret;
+}
 
+/**
+ * write_one_page - write out a single page and wait on I/O
+ * @page: the page to write
+ *
+ * The page must be locked by the caller and will be unlocked upon return.
+ *
+ * Note that the mapping's AS_EIO/AS_ENOSPC flags will be cleared when this
+ * function returns.
+ */
+int write_one_page(struct page *page)
+{
+   int ret;
+
+   ret = __write_one_page(page);
if (!ret)
-   ret = filemap_check_errors(mapping);
+   ret = filemap_check_errors(page->mapping);
return ret;
 }
 EXPORT_SYMBOL(write_one_page);
 
 /*
+ * write_one_page_since - write out a single page and wait on I/O
+ * @page: the page to write
+ * @since: previously sampled errseq_t
+ *
+ * The page must be locked by the caller and will be unlocked upon return.
+ *
+ * The caller should pass in a previously-sampled errseq_t. The mapping will
+ * be checked for errors since that point.
+ */
+int write_one_page_since(struct page *page, errseq_t since)
+{
+   int ret;
+
+   ret = __write_one_page(page);
+   if (!ret)
+   ret = filemap_check_wb_err(page->mapping, since);
+   return ret;
+}
+EXPORT_SYMBOL(write_one_page_since);
+
+/*
  * For address_spaces which do not use buffers nor write back.
  */
 int __set_page_dirty_no_writeback(struct page *page)
-- 
2.9.4

[PATCH v5 13/17] jbd2: conditionally handle errors using errseq_t based on FS_WB_ERRSEQ flag

2017-05-31 Thread Jeff Layton

Grab the current mapping->wb_err when linking a transaction to the list
and stash it in the journal inode. Then we can use that as a "since"
value when committing it to ensure that there were no writeback errors
since the transaction was started.

We do still need to perform old-style error handling too for now in
journal_finish_inode_data_buffers. jbd2 is shared infrastructure between
several filesystems. Eventually we should be able to remove the flag check
and simplify this function again.

For journal recovery, sample the wb_err early on and then pass that as
the since value to sync_blockdev_since.

Signed-off-by: Jeff Layton 
---
 fs/jbd2/commit.c      | 29 +++--
 fs/jbd2/recovery.c    |  5 +++--
 fs/jbd2/transaction.c |  1 +
 include/linux/jbd2.h  |  3 +++
 4 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index b6b194ec1b4f..aea71e4bc9be 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -259,21 +259,30 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
/* For locking, see the comment in journal_submit_data_buffers() */
spin_lock(&journal->j_list_lock);
list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
+   struct inode *inode = jinode->i_vfs_inode;
+
if (!(jinode->i_flags & JI_WAIT_DATA))
continue;
jinode->i_flags |= JI_COMMIT_RUNNING;
spin_unlock(&journal->j_list_lock);
-   err = filemap_fdatawait(jinode->i_vfs_inode->i_mapping);
-   if (err) {
-   /*
-* Because AS_EIO is cleared by
-* filemap_fdatawait_range(), set it again so
-* that user process can get -EIO from fsync().
-*/
-   mapping_set_error(jinode->i_vfs_inode->i_mapping, -EIO);
-
-   if (!ret)
+   if (inode->i_sb->s_type->fs_flags & FS_WB_ERRSEQ) {
+   err = filemap_fdatawait_since(inode->i_mapping,
+   jinode->i_since);
+   if (err && !ret)
ret = err;
+   } else {
+   err = filemap_fdatawait(inode->i_mapping);
+   if (err) {
+   /*
+* Because AS_EIO is cleared by
+* filemap_fdatawait_range(), we must set it again so
+* that user process can get -EIO from fsync() if
+* non-errseq_t based error tracking is in play.
+*/
+   mapping_set_error(inode->i_mapping, -EIO);
+   if (!ret)
+   ret = err;
+   }
}
spin_lock(&journal->j_list_lock);
jinode->i_flags &= ~JI_COMMIT_RUNNING;
diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index 02dd3360cb20..06a8ee71848c 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -248,11 +248,12 @@ int jbd2_journal_recover(journal_t *journal)
 {
int err, err2;
journal_superblock_t *  sb;
-
struct recovery_info    info;
+   errseq_t                since;
 
memset(&info, 0, sizeof(info));
sb = journal->j_superblock;
+   since = filemap_sample_wb_err(journal->j_fs_dev->bd_inode->i_mapping);
 
/*
 * The journal superblock's s_start field (the current log head)
@@ -284,7 +285,7 @@ int jbd2_journal_recover(journal_t *journal)
journal->j_transaction_sequence = ++info.end_transaction;
 
jbd2_journal_clear_revoke(journal);
-   err2 = sync_blockdev(journal->j_fs_dev);
+   err2 = sync_blockdev_since(journal->j_fs_dev, since);
if (!err)
err = err2;
/* Make sure all replayed data is on permanent storage */
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 9ee4832b6f8b..e9e6af20a087 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -2535,6 +2535,7 @@ static int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *jinode,
/* Not on any transaction list... */
J_ASSERT(!jinode->i_next_transaction);
jinode->i_transaction = transaction;
+   jinode->i_since = filemap_sample_wb_err(jinode->i_vfs_inode->i_mapping);
list_add(&jinode->i_list, &transaction->t_inode_list);
 done:
spin_unlock(&journal->j_list_lock);
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 606b6bce3a5b..b6901eac2d8e 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -439,6 +439,9 @@ struct jbd2_inode {
 
/* Flags of inode [j_list_lock] */
unsigned long i_flags;
+
+   /* Sampled writeback error at the time of transaction

[PATCH v5 11/17] fs: add f_md_wb_err field to struct file for tracking metadata errors

2017-05-31 Thread Jeff Layton

Some filesystems (particularly local ones) keep a different mapping for
metadata writeback. Add a second errseq_t to struct file for tracking
metadata writeback errors. Also add a new function for checking a
mapping of the caller's choosing vs. the f_md_wb_err value.

Signed-off-by: Jeff Layton 
---
 include/linux/fs.h             |  3 +++
 include/trace/events/filemap.h | 23 ++-
 mm/filemap.c                   | 40 +++-
 3 files changed, 44 insertions(+), 22 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index f483c23866c4..df1d68e3605a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -871,6 +871,7 @@ struct file {
struct list_head        f_tfile_llink;
 #endif /* #ifdef CONFIG_EPOLL */
struct address_space    *f_mapping;
+   errseq_t                f_md_wb_err; /* optional metadata wb error tracking */
 } __attribute__((aligned(4))); /* lest something weird decides that 2 is OK */
 
 struct file_handle {
@@ -2549,6 +2550,8 @@ extern int filemap_fdatawrite_range(struct address_space *mapping,
 extern int filemap_check_errors(struct address_space *mapping);
 
 extern int __must_check filemap_report_wb_err(struct file *file);
+extern int __must_check filemap_report_md_wb_err(struct file *file,
+   struct address_space *mapping);
 extern void __filemap_set_wb_err(struct address_space *mapping, int err);
 
 /**
diff --git a/include/trace/events/filemap.h b/include/trace/events/filemap.h
index 2af66920f267..6e0d78c01a2e 100644
--- a/include/trace/events/filemap.h
+++ b/include/trace/events/filemap.h
@@ -79,12 +79,11 @@ TRACE_EVENT(filemap_set_wb_err,
 );
 
 TRACE_EVENT(filemap_report_wb_err,
-   TP_PROTO(struct file *file, errseq_t old),
+   TP_PROTO(struct address_space *mapping, errseq_t old, errseq_t new),
 
-   TP_ARGS(file, old),
+   TP_ARGS(mapping, old, new),
 
TP_STRUCT__entry(
-   __field(struct file *, file);
__field(unsigned long, i_ino)
__field(dev_t, s_dev)
__field(errseq_t, old)
@@ -92,20 +91,18 @@ TRACE_EVENT(filemap_report_wb_err,
),
 
TP_fast_assign(
-   __entry->file = file;
-   __entry->i_ino = file->f_mapping->host->i_ino;
-   if (file->f_mapping->host->i_sb)
-   __entry->s_dev = file->f_mapping->host->i_sb->s_dev;
+   __entry->i_ino = mapping->host->i_ino;
+   if (mapping->host->i_sb)
+   __entry->s_dev = mapping->host->i_sb->s_dev;
else
-   __entry->s_dev = file->f_mapping->host->i_rdev;
+   __entry->s_dev = mapping->host->i_rdev;
__entry->old = old;
-   __entry->new = file->f_wb_err;
+   __entry->new = new;
),
 
-   TP_printk("file=%p dev=%d:%d ino=0x%lx old=0x%x new=0x%x",
-   __entry->file, MAJOR(__entry->s_dev),
-   MINOR(__entry->s_dev), __entry->i_ino, __entry->old,
-   __entry->new)
+   TP_printk("dev=%d:%d ino=0x%lx old=0x%x new=0x%x",
+   MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
+   __entry->i_ino, __entry->old, __entry->new)
 );
 #endif /* _TRACE_FILEMAP_H */
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 38a14dc825ad..0edf0234973e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -631,21 +631,20 @@ EXPORT_SYMBOL(__filemap_set_wb_err);
  * value is protected by the f_lock since we must ensure that it reflects
  * the latest value swapped in for this file descriptor.
  */
-int filemap_report_wb_err(struct file *file)
+static int __filemap_report_wb_err(errseq_t *cursor, spinlock_t *lock,
+   struct address_space *mapping)
 {
int err = 0;
-   errseq_t old = READ_ONCE(file->f_wb_err);
-   struct address_space *mapping = file->f_mapping;
+   errseq_t old = READ_ONCE(*cursor);
 
/* Locklessly handle the common case where nothing has changed */
if (errseq_check(&mapping->wb_err, old)) {
/* Something changed, must use slow path */
-   spin_lock(&file->f_lock);
-   old = file->f_wb_err;
-   err = errseq_check_and_advance(&mapping->wb_err,
-   &file->f_wb_err);
-   trace_filemap_report_wb_err(file, old);
-   spin_unlock(&file->f_lock);
+   spin_lock(lock);
+   old = *cursor;
+   err = errseq_check_and_advance(&mapping->wb_err, cursor);
+   trace_filemap_report_wb_err(mapping, old, *cursor);
+   trace_filemap_report_wb_err(mapping, old, *cursor);
+

[PATCH v5 16/17] ext2: convert to errseq_t based writeback error tracking

2017-05-31 Thread Jeff Layton

Set the flag to indicate that we want new-style data writeback error
handling.

This means that we need to override the open routines for files and
directories so that we can sample the bdev wb_err at open.

XXX: doesn't quite pass the xfstest for this currently, as ext2_error
 resets the error on the device inode on every call.

Signed-off-by: Jeff Layton 
---
 fs/ext2/dir.c   |  8 
 fs/ext2/file.c  | 29 +++--
 fs/ext2/super.c |  2 +-
 3 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index e2709695b177..6e476c9929f8 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -713,6 +713,13 @@ int ext2_empty_dir (struct inode * inode)
return 0;
 }
 
+static int ext2_dir_open(struct inode *inode, struct file *file)
+{
+   /* Sample blockdev mapping errseq_t for metadata writeback */
+   file->f_md_wb_err = filemap_sample_wb_err(inode->i_sb->s_bdev->bd_inode->i_mapping);
+   return 0;
+}
+
 const struct file_operations ext2_dir_operations = {
.llseek = generic_file_llseek,
.read   = generic_read_dir,
@@ -721,5 +728,6 @@ const struct file_operations ext2_dir_operations = {
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ext2_compat_ioctl,
 #endif
+   .open   = ext2_dir_open,
.fsync  = ext2_fsync,
 };
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index ed00e7ae0ef3..6f3cd7bc3fb3 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -172,16 +172,23 @@ static int ext2_release_file (struct inode * inode, struct file * filp)
 
 int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 {
-   int ret;
+   int ret, ret2;
struct super_block *sb = file->f_mapping->host->i_sb;
struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
 
ret = generic_file_fsync(file, start, end, datasync);
-   if (ret == -EIO) {
-   /* We don't really know where the IO error happened... */
-   ext2_error(sb, __func__,
+
+   ret2 = filemap_report_wb_err(file);
+   if (ret == 0)
+   ret = ret2;
+
+   ret2 = filemap_report_md_wb_err(file, mapping);
+   if (ret2) {
+   if (ret == 0)
+   ret = ret2;
+   if (ret == -EIO)
+   ext2_error(sb, __func__,
   "detected IO error when writing metadata buffers");
-   ret = -EIO;
}
return ret;
 }
@@ -204,6 +211,16 @@ static ssize_t ext2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
return generic_file_write_iter(iocb, from);
 }
 
+static int ext2_file_open(struct inode *inode, struct file *file)
+{
+   int ret;
+
+   ret = dquot_file_open(inode, file);
+   if (likely(ret == 0))
+   file->f_md_wb_err = filemap_sample_wb_err(inode->i_sb->s_bdev->bd_inode->i_mapping);
+   return ret;
+}
+
 const struct file_operations ext2_file_operations = {
.llseek = generic_file_llseek,
.read_iter  = ext2_file_read_iter,
@@ -213,7 +230,7 @@ const struct file_operations ext2_file_operations = {
.compat_ioctl   = ext2_compat_ioctl,
 #endif
.mmap   = ext2_file_mmap,
-   .open   = dquot_file_open,
+   .open   = ext2_file_open,
.release= ext2_release_file,
.fsync  = ext2_fsync,
.get_unmapped_area = thp_get_unmapped_area,
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 9c2028b50e5c..dd37d7f955bf 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -1629,7 +1629,7 @@ static struct file_system_type ext2_fs_type = {
.name   = "ext2",
.mount  = ext2_mount,
.kill_sb= kill_block_super,
-   .fs_flags   = FS_REQUIRES_DEV,
+   .fs_flags   = FS_REQUIRES_DEV|FS_WB_ERRSEQ,
 };
 MODULE_ALIAS_FS("ext2");
 
-- 
2.9.4

[PATCH v5 07/17] mm: add filemap_fdatawait_range_since and filemap_write_and_wait_range_since

2017-05-31 Thread Jeff Layton

Add new filemap_*wait* variants that take a "since" value and return an
error if one occurred since that sample point.

Signed-off-by: Jeff Layton 
---
 include/linux/fs.h |  9 
 mm/filemap.c   | 67 ++
 2 files changed, 76 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2f3bcf4eb73b..7d1bd3163d99 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2516,12 +2516,21 @@ extern int write_inode_now(struct inode *, int);
 extern int filemap_fdatawrite(struct address_space *);
 extern int filemap_flush(struct address_space *);
 extern int filemap_fdatawait(struct address_space *);
+extern int filemap_fdatawait_since(struct address_space *, errseq_t);
 extern void filemap_fdatawait_keep_errors(struct address_space *);
 extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
   loff_t lend);
+extern int filemap_fdatawait_range_since(struct address_space *mapping,
+  loff_t start_byte, loff_t end_byte,
+  errseq_t since);
 extern int filemap_write_and_wait(struct address_space *mapping);
+extern int filemap_write_and_wait_since(struct address_space *mapping,
+   errseq_t since);
 extern int filemap_write_and_wait_range(struct address_space *mapping,
loff_t lstart, loff_t lend);
+extern int filemap_write_and_wait_range_since(struct address_space *mapping,
+  loff_t start_byte, loff_t end_byte,
+  errseq_t since);
 extern int __filemap_fdatawrite_range(struct address_space *mapping,
loff_t start, loff_t end, int sync_mode);
 extern int filemap_fdatawrite_range(struct address_space *mapping,
diff --git a/mm/filemap.c b/mm/filemap.c
index 97dc28f853fc..38a14dc825ad 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -431,6 +431,14 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
 }
 EXPORT_SYMBOL(filemap_fdatawait_range);
 
+int filemap_fdatawait_range_since(struct address_space *mapping, loff_t start_byte,
+ loff_t end_byte, errseq_t since)
+{
+   __filemap_fdatawait_range(mapping, start_byte, end_byte);
+   return filemap_check_wb_err(mapping, since);
+}
+EXPORT_SYMBOL(filemap_fdatawait_range_since);
+
 /**
  * filemap_fdatawait_keep_errors - wait for writeback without clearing errors
  * @mapping: address space structure to wait for
@@ -476,6 +484,17 @@ int filemap_fdatawait(struct address_space *mapping)
 }
 EXPORT_SYMBOL(filemap_fdatawait);
 
+int filemap_fdatawait_since(struct address_space *mapping, errseq_t since)
+{
+   loff_t i_size = i_size_read(mapping->host);
+
+   if (i_size == 0)
+   return 0;
+
+   return filemap_fdatawait_range_since(mapping, 0, i_size - 1, since);
+}
+EXPORT_SYMBOL(filemap_fdatawait_since);
+
 int filemap_write_and_wait(struct address_space *mapping)
 {
int err = 0;
@@ -501,6 +520,31 @@ int filemap_write_and_wait(struct address_space *mapping)
 }
 EXPORT_SYMBOL(filemap_write_and_wait);
 
+int filemap_write_and_wait_since(struct address_space *mapping, errseq_t since)
+{
+   int err = 0;
+
+   if ((!dax_mapping(mapping) && mapping->nrpages) ||
+   (dax_mapping(mapping) && mapping->nrexceptional)) {
+   err = filemap_fdatawrite(mapping);
+   /*
+* Even if the above returned error, the pages may be
+* written partially (e.g. -ENOSPC), so we wait for it.
+* But the -EIO is special case, it may indicate the worst
+* thing (e.g. bug) happened, so we avoid waiting for it.
+*/
+   if (err != -EIO) {
+   int err2 = filemap_fdatawait_since(mapping, since);
+   if (!err)
+   err = err2;
+   }
+   } else {
+   err = filemap_check_wb_err(mapping, since);
+   }
+   return err;
+}
+EXPORT_SYMBOL(filemap_write_and_wait_since);
+
 /**
  * filemap_write_and_wait_range - write out & wait on a file range
  * @mapping:   the address_space for the pages
@@ -535,6 +579,29 @@ int filemap_write_and_wait_range(struct address_space *mapping,
 }
 EXPORT_SYMBOL(filemap_write_and_wait_range);
 
+int filemap_write_and_wait_range_since(struct address_space *mapping,
+loff_t lstart, loff_t lend, errseq_t since)
+{
+   int err = 0;
+
+   if ((!dax_mapping(mapping) && mapping->nrpages) ||
+   (dax_mapping(mapping) && mapping->nrexceptional)) {
+   err = __filemap_fdatawrite_range(mapping, lstart, lend,
+WB_SYNC_ALL);
+   /* See comment of filemap_write_and_wait() */
+

[PATCH v5 17/17] fs: convert ext2 to use write_one_page_since

2017-05-31 Thread Jeff Layton

Sample the wb_err before changing the directory, so that we can catch
errors that occur since that point.

Signed-off-by: Jeff Layton 
---
 fs/ext2/dir.c | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 6e476c9929f8..073f096ac5e6 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -85,7 +85,8 @@ ext2_last_byte(struct inode *inode, unsigned long page_nr)
return last_byte;
 }
 
-static int ext2_commit_chunk(struct page *page, loff_t pos, unsigned len)
+static int ext2_commit_chunk(struct page *page, loff_t pos, unsigned len,
+   errseq_t since)
 {
struct address_space *mapping = page->mapping;
struct inode *dir = mapping->host;
@@ -100,7 +101,7 @@ static int ext2_commit_chunk(struct page *page, loff_t pos, unsigned len)
}
 
if (IS_DIRSYNC(dir)) {
-   err = write_one_page(page);
+   err = write_one_page_since(page, since);
if (!err)
err = sync_inode_metadata(dir, 1);
} else {
@@ -462,13 +463,14 @@ void ext2_set_link(struct inode *dir, struct ext2_dir_entry_2 *de,
(char *) de - (char *) page_address(page);
unsigned len = ext2_rec_len_from_disk(de->rec_len);
int err;
+   errseq_t since = filemap_sample_wb_err(dir->i_mapping);
 
lock_page(page);
err = ext2_prepare_chunk(page, pos, len);
BUG_ON(err);
de->inode = cpu_to_le32(inode->i_ino);
ext2_set_de_type(de, inode);
-   err = ext2_commit_chunk(page, pos, len);
+   err = ext2_commit_chunk(page, pos, len, since);
ext2_put_page(page);
if (update_times)
dir->i_mtime = dir->i_ctime = current_time(dir);
@@ -494,6 +496,7 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
char *kaddr;
loff_t pos;
int err;
+   errseq_t since = filemap_sample_wb_err(dir->i_mapping);
 
/*
 * We take care of directory expansion in the same loop.
@@ -560,7 +563,7 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
memcpy(de->name, name, namelen);
de->inode = cpu_to_le32(inode->i_ino);
ext2_set_de_type (de, inode);
-   err = ext2_commit_chunk(page, pos, rec_len);
+   err = ext2_commit_chunk(page, pos, rec_len, since);
dir->i_mtime = dir->i_ctime = current_time(dir);
EXT2_I(dir)->i_flags &= ~EXT2_BTREE_FL;
mark_inode_dirty(dir);
@@ -589,6 +592,7 @@ int ext2_delete_entry (struct ext2_dir_entry_2 * dir, struct page * page )
ext2_dirent * pde = NULL;
ext2_dirent * de = (ext2_dirent *) (kaddr + from);
int err;
+   errseq_t since = filemap_sample_wb_err(inode->i_mapping);
 
while ((char*)de < (char*)dir) {
if (de->rec_len == 0) {
@@ -609,7 +613,7 @@ int ext2_delete_entry (struct ext2_dir_entry_2 * dir, struct page * page )
if (pde)
pde->rec_len = ext2_rec_len_to_disk(to - from);
dir->inode = 0;
-   err = ext2_commit_chunk(page, pos, to - from);
+   err = ext2_commit_chunk(page, pos, to - from, since);
inode->i_ctime = inode->i_mtime = current_time(inode);
EXT2_I(inode)->i_flags &= ~EXT2_BTREE_FL;
mark_inode_dirty(inode);
@@ -628,6 +632,7 @@ int ext2_make_empty(struct inode *inode, struct inode *parent)
struct ext2_dir_entry_2 * de;
int err;
void *kaddr;
+   errseq_t since = filemap_sample_wb_err(inode->i_mapping);
 
if (!page)
return -ENOMEM;
@@ -653,7 +658,7 @@ int ext2_make_empty(struct inode *inode, struct inode *parent)
memcpy (de->name, "..\0", 4);
ext2_set_de_type (de, inode);
kunmap_atomic(kaddr);
-   err = ext2_commit_chunk(page, 0, chunk_size);
+   err = ext2_commit_chunk(page, 0, chunk_size, since);
 fail:
put_page(page);
return err;
-- 
2.9.4

[PATCH v5 00/17] fs: introduce new writeback error reporting and convert ext2 and ext4 to use it

2017-05-31 Thread Jeff Layton

v5: don't retrofit old API over the new infrastructure
add fstype flag to indicate how wb errors are tracked within that fs
add more function variants that take an errseq_t "since" value
add second errseq_t to struct file to track metadata wb errors
convert ext4 and ext2 to use the new APIs

v4: several more cleanup patches
documentation and kerneldoc comment updates
fix bugs in gfs2 patches
make sync_file_range use same error reporting semantics
bugfixes in buffer.c
convert nfs to new scheme (maybe bogus, can be dropped)

v3: wb_err_t -> errseq_t conversion
clean up places that re-set errors after calling filemap_* functions

v2: introduce wb_err_t, use atomics

This is v5 of the patchset to improve how we're tracking and reporting
errors that occur during pagecache writeback. The main difference in
this set from the last one is that I've stopped trying to retrofit the
old error tracking API on top of the new one. This is more work since
we'll have to touch each fs individually, but should be safer as the
"since" values used for checking errors will be more deliberate.

There are several situations where the kernel can "lose" errors that
occur during writeback, such that fsync will return success even
though it failed to write back some data previously. The basic idea
here is to have the kernel be more deliberate about the point from
which errors are checked to ensure that that doesn't happen.

An additional aim of this set is to change the behavior of fsync in
Linux to report writeback errors on all fds instead of just the first
one. This allows writers to reliably tell whether their data made it to
the backing device without having to coordinate fsync calls with other
writers.

To do this, we add a new typedef: errseq_t. This is a 32-bit value
that can store an error code, and a sequence number so we can tell
whether it has changed since we last sampled it. This allows us to
record errors in the address_space and then report those errors only
once per file description.
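
As a rough userspace illustration of the idea described above (not the code
in lib/errseq.c), the sketch below packs an error code, a "seen" flag and a
change counter into one 32-bit value. The bit layout and the model_* names
are assumptions made for this example, and the atomic operations a real
implementation would need are omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Userspace model only. Layout (an assumption for this sketch):
 * low 12 bits = errno, bit 12 = "seen" flag, upper bits = counter. */
typedef uint32_t errseq_model_t;

#define ERR_BITS   12u
#define ERR_MASK   ((1u << ERR_BITS) - 1u)
#define SEEN_FLAG  (1u << ERR_BITS)
#define SEQ_STEP   (SEEN_FLAG << 1)

/* Record a negative errno. The counter is bumped only when the previous
 * value had been observed; otherwise the new error simply replaces it. */
static void model_set(errseq_model_t *eseq, int err)
{
    errseq_model_t old = *eseq;
    errseq_model_t next;

    if (err == 0)
        return;                         /* common case: nothing to record */
    assert(err < 0 && -err <= (int)ERR_MASK);

    next = (old & ~(ERR_MASK | SEEN_FLAG)) | (uint32_t)(-err);
    if (old & SEEN_FLAG)
        next += SEQ_STEP;
    *eseq = next;
}

/* Take a "since" cursor and mark the current value as observed. */
static errseq_model_t model_sample(errseq_model_t *eseq)
{
    *eseq |= SEEN_FLAG;
    return *eseq;
}

/* Return the recorded error if the value changed since the cursor was
 * taken, then advance the cursor so the same error is not reported twice. */
static int model_check_and_advance(errseq_model_t *eseq, errseq_model_t *since)
{
    int err = 0;

    if (*eseq != *since) {
        *eseq |= SEEN_FLAG;
        *since = *eseq;
        err = -(int)(*eseq & ERR_MASK);
    }
    return err;
}
```

With two cursors sampled before a failure, each cursor reports the error
once through model_check_and_advance() and returns 0 afterwards, which is the
once-per-file-description behavior this set is aiming for.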

This set just alters block device files, ext4 and the legacy ext2
driver. If this general approach seems acceptable, then I'll start
converting other filesystems in follow-on patchsets. I'd also like
to get this into linux-next as soon as possible to ensure that we're
banging out any bugs that might be lurking here.

I also have a couple of xfstests for this that I'll re-post
soon.

Jeff Layton (17):
  lib: add errseq_t type and infrastructure for handling it
  fs: new infrastructure for writeback error handling and reporting
  mm: tracepoints for writeback error events
  fs: add a new fstype flag to indicate how writeback errors are tracked
  Documentation: flesh out the section in vfs.txt on storing and
reporting writeback errors
  fs: adapt sync_file_range to new reporting infrastructure
  mm: add filemap_fdatawait_range_since and
filemap_write_and_wait_range_since
  dax: set errors in mapping when writeback fails
  block: convert to errseq_t based writeback error tracking
  block: add sync_blockdev_since and sync_filesystem_since
  fs: add f_md_wb_err field to struct file for tracking metadata errors
  fs: allow __generic_file_fsync to support both flavors of error
reporting
  jbd2: conditionally handle errors using errseq_t based on FS_WB_ERRSEQ
flag
  ext4: convert to errseq_t based error tracking
  fs: add a write_one_page_since
  ext2: convert to errseq_t based writeback error tracking
  fs: convert ext2 to use write_one_page_since

 Documentation/filesystems/vfs.txt |  50 -
 drivers/dax/device.c  |   1 +
 fs/block_dev.c|  29 +-
 fs/dax.c  |  18 +++-
 fs/ext2/dir.c |  25 +++--
 fs/ext2/file.c|  29 --
 fs/ext2/super.c   |   2 +-
 fs/ext4/dir.c |   8 +-
 fs/ext4/ext4.h|   8 +-
 fs/ext4/extents.c |  24 +++--
 fs/ext4/file.c|   5 +-
 fs/ext4/fsync.c   |  23 -
 fs/ext4/inode.c   |  19 ++--
 fs/ext4/ioctl.c   |   9 +-
 fs/ext4/super.c   |   9 +-
 fs/file_table.c   |   1 +
 fs/internal.h |   8 ++
 fs/jbd2/commit.c  |  29 --
 fs/jbd2/recovery.c|   5 +-
 fs/jbd2/transaction.c |   1 +
 fs/libfs.c|  26 +++--
 fs/open.c |   3 +
 fs/sync.c |  62 +++-
 include/linux/errseq.h|  19 
 include/linux/fs.h|  82 ++-
 include/linux/jbd2.h  |   3 +
 include/linux/mm.h|   2 +
 include/linux/pagemap.h   |  32 --
 include/trace/events/filemap.h|  52 ++
 lib/Makefile  |   2 +-
 lib/errseq.c  | 208

[PATCH v5 03/17] mm: tracepoints for writeback error events

2017-05-31 Thread Jeff Layton

To enable that, make __errseq_set return the value that it was set to
when we exit the loop. Take heed that that value is not suitable as a later
"since" value, as it will not have been marked seen.

Signed-off-by: Jeff Layton 
---
 include/linux/errseq.h |  2 +-
 include/linux/fs.h |  5 +++-
 include/trace/events/filemap.h | 55 ++
 lib/errseq.c   | 20 ++-
 mm/filemap.c   | 13 +-
 5 files changed, 86 insertions(+), 9 deletions(-)

diff --git a/include/linux/errseq.h b/include/linux/errseq.h
index 0d2555f310cd..9e0d444ac88d 100644
--- a/include/linux/errseq.h
+++ b/include/linux/errseq.h
@@ -5,7 +5,7 @@
 
 typedef u32 errseq_t;
 
-void __errseq_set(errseq_t *eseq, int err);
+errseq_t __errseq_set(errseq_t *eseq, int err);
 static inline void errseq_set(errseq_t *eseq, int err)
 {
/* Optimize for the common case of no error */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 24178107379d..293cbc7f3520 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2528,6 +2528,7 @@ extern int filemap_fdatawrite_range(struct address_space *mapping,
 extern int filemap_check_errors(struct address_space *mapping);
 
 extern int __must_check filemap_report_wb_err(struct file *file);
+extern void __filemap_set_wb_err(struct address_space *mapping, int err);
 
 /**
  * filemap_set_wb_err - set a writeback error on an address_space
@@ -2547,7 +2548,9 @@ extern int __must_check filemap_report_wb_err(struct file *file);
  */
 static inline void filemap_set_wb_err(struct address_space *mapping, int err)
 {
-   errseq_set(&mapping->wb_err, err);
+   /* Fastpath for common case of no error */
+   if (unlikely(err))
+   __filemap_set_wb_err(mapping, err);
 }
 
 /**
diff --git a/include/trace/events/filemap.h b/include/trace/events/filemap.h
index 42febb6bc1d5..2af66920f267 100644
--- a/include/trace/events/filemap.h
+++ b/include/trace/events/filemap.h
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 
 DECLARE_EVENT_CLASS(mm_filemap_op_page_cache,
 
@@ -52,6 +53,60 @@ DEFINE_EVENT(mm_filemap_op_page_cache, mm_filemap_add_to_page_cache,
TP_ARGS(page)
);
 
+TRACE_EVENT(filemap_set_wb_err,
+   TP_PROTO(struct address_space *mapping, errseq_t eseq),
+
+   TP_ARGS(mapping, eseq),
+
+   TP_STRUCT__entry(
+   __field(unsigned long, i_ino)
+   __field(dev_t, s_dev)
+   __field(errseq_t, errseq)
+   ),
+
+   TP_fast_assign(
+   __entry->i_ino = mapping->host->i_ino;
+   __entry->errseq = eseq;
+   if (mapping->host->i_sb)
+   __entry->s_dev = mapping->host->i_sb->s_dev;
+   else
+   __entry->s_dev = mapping->host->i_rdev;
+   ),
+
+   TP_printk("dev=%d:%d ino=0x%lx errseq=0x%x",
+   MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
+   __entry->i_ino, __entry->errseq)
+);
+
+TRACE_EVENT(filemap_report_wb_err,
+   TP_PROTO(struct file *file, errseq_t old),
+
+   TP_ARGS(file, old),
+
+   TP_STRUCT__entry(
+   __field(struct file *, file);
+   __field(unsigned long, i_ino)
+   __field(dev_t, s_dev)
+   __field(errseq_t, old)
+   __field(errseq_t, new)
+   ),
+
+   TP_fast_assign(
+   __entry->file = file;
+   __entry->i_ino = file->f_mapping->host->i_ino;
+   if (file->f_mapping->host->i_sb)
+   __entry->s_dev = file->f_mapping->host->i_sb->s_dev;
+   else
+   __entry->s_dev = file->f_mapping->host->i_rdev;
+   __entry->old = old;
+   __entry->new = file->f_wb_err;
+   ),
+
+   TP_printk("file=%p dev=%d:%d ino=0x%lx old=0x%x new=0x%x",
+   __entry->file, MAJOR(__entry->s_dev),
+   MINOR(__entry->s_dev), __entry->i_ino, __entry->old,
+   __entry->new)
+);
 #endif /* _TRACE_FILEMAP_H */
 
 /* This part must be outside protection */
diff --git a/lib/errseq.c b/lib/errseq.c
index d129c0611c1f..009972d3000c 100644
--- a/lib/errseq.c
+++ b/lib/errseq.c
@@ -52,10 +52,14 @@
  *
  * Most callers will want to use the errseq_set inline wrapper to efficiently
  * handle the common case where err is 0.
+ *
+ * We do return an errseq_t here, primarily for debugging purposes. The return
+ * value should not be used as a previously sampled value in later calls as it
+ * will not have the SEEN flag set.
  */
-void __errseq_set(errseq_t

[PATCH v3 9/9] Revert "blk-mq: don't use sync workqueue flushing from drivers"

2017-05-31 Thread Ming Lei

This patch reverts commit 2719aa217e0d02 ("blk-mq: don't use
sync workqueue flushing from drivers") because only
blk_mq_quiesce_queue() needed the sync flush, and now that
it no longer needs to stop the queue, the commit can be reverted.

Also change blk_mq_stop_hw_queue() to use cancel_delayed_work().

Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 30 --
 1 file changed, 8 insertions(+), 22 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 352aaaf0bcf9..6be10102b343 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -42,7 +42,6 @@ static LIST_HEAD(all_q_list);
 
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
-static void __blk_mq_stop_hw_queues(struct request_queue *q, bool sync);
 
 static int blk_mq_poll_stats_bkt(const struct request *rq)
 {
@@ -1177,16 +1176,6 @@ bool blk_mq_queue_stopped(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_mq_queue_stopped);
 
-static void __blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx, bool sync)
-{
-   if (sync)
-   cancel_delayed_work_sync(&hctx->run_work);
-   else
-   cancel_delayed_work(&hctx->run_work);
-
-   set_bit(BLK_MQ_S_STOPPED, &hctx->state);
-}
-
 /*
  * We do not guarantee that dispatch can be drained or blocked
  * after blk_mq_stop_hw_queue() returns. Please use
@@ -1194,18 +1183,11 @@ static void __blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx, bool sync)
  */
 void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx)
 {
-   __blk_mq_stop_hw_queue(hctx, false);
-}
-EXPORT_SYMBOL(blk_mq_stop_hw_queue);
+   cancel_delayed_work(&hctx->run_work);
 
-static void __blk_mq_stop_hw_queues(struct request_queue *q, bool sync)
-{
-   struct blk_mq_hw_ctx *hctx;
-   int i;
-
-   queue_for_each_hw_ctx(q, hctx, i)
-   __blk_mq_stop_hw_queue(hctx, sync);
+   set_bit(BLK_MQ_S_STOPPED, &hctx->state);
 }
+EXPORT_SYMBOL(blk_mq_stop_hw_queue);
 
 /*
  * We do not guarantee that dispatch can be drained or blocked
@@ -1214,7 +1196,11 @@ static void __blk_mq_stop_hw_queues(struct request_queue *q, bool sync)
  */
 void blk_mq_stop_hw_queues(struct request_queue *q)
 {
-   __blk_mq_stop_hw_queues(q, false);
+   struct blk_mq_hw_ctx *hctx;
+   int i;
+
+   queue_for_each_hw_ctx(q, hctx, i)
+   blk_mq_stop_hw_queue(hctx);
 }
 EXPORT_SYMBOL(blk_mq_stop_hw_queues);
 
-- 
2.9.4

[PATCH v3 8/9] blk-mq: clarify dispatch may not be drained/blocked by stopping queue

2017-05-31 Thread Ming Lei

BLK_MQ_S_STOPPED may not be observed in other concurrent I/O paths, so
we can't guarantee that dispatching won't happen after the
queue-stopping APIs return.

Clarify this in the comments to avoid potential misuse.

Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index bcff1b184bbb..352aaaf0bcf9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1187,6 +1187,11 @@ static void __blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx, bool sync)
set_bit(BLK_MQ_S_STOPPED, &hctx->state);
 }
 
+/*
+ * We do not guarantee that dispatch can be drained or blocked
+ * after blk_mq_stop_hw_queue() returns. Please use
+ * blk_mq_quiesce_queue() for that requirement.
+ */
 void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx)
 {
__blk_mq_stop_hw_queue(hctx, false);
@@ -1202,6 +1207,11 @@ static void __blk_mq_stop_hw_queues(struct request_queue *q, bool sync)
__blk_mq_stop_hw_queue(hctx, sync);
 }
 
+/*
+ * We do not guarantee that dispatch can be drained or blocked
+ * after blk_mq_stop_hw_queues() returns. Please use
+ * blk_mq_quiesce_queue() for that requirement.
+ */
 void blk_mq_stop_hw_queues(struct request_queue *q)
 {
__blk_mq_stop_hw_queues(q, false);
-- 
2.9.4

[PATCH v3 7/9] blk-mq: don't stop queue for quiescing

2017-05-31 Thread Ming Lei

A queue can be started by other blk-mq APIs and in many different
situations. If blk_mq_quiesce_queue() is based on stopping the queue,
this limits its uses and makes it very difficult to use: callers
have to use the stop-queue APIs carefully to avoid breaking
blk_mq_quiesce_queue().

We now use the QUIESCED flag for draining and blocking
dispatch, so it is no longer necessary to stop the queue.

With queue stopping removed, blk_mq_quiesce_queue() can
be used safely and easily, and users no longer have to worry
about the queue being restarted during quiescing.

Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1faddaf005e2..bcff1b184bbb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -169,8 +169,6 @@ void blk_mq_quiesce_queue(struct request_queue *q)
unsigned int i;
bool rcu = false;
 
-   __blk_mq_stop_hw_queues(q, true);
-
spin_lock_irq(q->queue_lock);
queue_flag_set(QUEUE_FLAG_QUIESCED, q);
spin_unlock_irq(q->queue_lock);
@@ -199,8 +197,6 @@ void blk_mq_unquiesce_queue(struct request_queue *q)
queue_flag_clear(QUEUE_FLAG_QUIESCED, q);
spin_unlock_irq(q->queue_lock);
 
-   blk_mq_start_stopped_hw_queues(q, true);
-
wake_up_all(&q->quiesce_wq);
 }
 EXPORT_SYMBOL_GPL(blk_mq_unquiesce_queue);
-- 
2.9.4

[PATCH v3 4/9] nvme: host: unquiesce queue in nvme_kill_queues()

2017-05-31 Thread Ming Lei

When nvme_kill_queues() is run, queues may be in the
quiesced state, so forcibly unquiesce them to avoid
blocking dispatch; this avoids an I/O hang in the
remove path.

Previously we used blk_mq_start_stopped_hw_queues() as the
counterpart of blk_mq_quiesce_queue(). Now that
blk_mq_unquiesce_queue() has been introduced, use it explicitly.

Cc: linux-n...@lists.infradead.org
Signed-off-by: Ming Lei 
---
 drivers/nvme/host/core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c3f189e54d10..e44326d5cf19 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2448,6 +2448,9 @@ void nvme_kill_queues(struct nvme_ctrl *ctrl)
revalidate_disk(ns->disk);
blk_set_queue_dying(ns->queue);
 
+   /* Forcibly unquiesce queues to avoid blocking dispatch */
+   blk_mq_unquiesce_queue(ns->queue);
+
/*
 * Forcibly start all queues to avoid having stuck requests.
 * Note that we must ensure the queues are not stopped
-- 
2.9.4

[PATCH v3 2/9] block: introduce flag of QUEUE_FLAG_QUIESCED

2017-05-31 Thread Ming Lei

This flag is introduced for improving the quiescing code.

Signed-off-by: Ming Lei 
---
 include/linux/blkdev.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 41291be82ac4..60967797f4f6 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -618,6 +618,7 @@ struct request_queue {
 #define QUEUE_FLAG_STATS   27  /* track rq completion times */
 #define QUEUE_FLAG_POLL_STATS  28  /* collecting stats for hybrid polling */
 #define QUEUE_FLAG_REGISTERED  29  /* queue has been registered to a disk */
+#define QUEUE_FLAG_QUIESCED    30  /* queue has been quiesced */
 
 #define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) |\
 (1 << QUEUE_FLAG_STACKABLE)|   \
@@ -712,6 +713,7 @@ static inline void queue_flag_clear(unsigned int flag, struct request_queue *q)
 #define blk_noretry_request(rq) \
((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
 REQ_FAILFAST_DRIVER))
+#define blk_queue_quiesced(q)  test_bit(QUEUE_FLAG_QUIESCED, &(q)->queue_flags)
 
 static inline bool blk_account_rq(struct request *rq)
 {
-- 
2.9.4

[PATCH v3 0/8] blk-mq: fix & improve queue quiescing

2017-05-31 Thread Ming Lei

There is one big issue in the current blk_mq_quiesce_queue():

- in case of direct issue or BLK_MQ_S_START_ON_RUN, dispatch is not
prevented after blk_mq_quiesce_queue() returns.

It has been observed that request double-free/use-after-free
can be triggered easily when canceling NVMe requests via
blk_mq_tagset_busy_iter(...nvme_cancel_request) in nvme_dev_disable().
The reason is that blk_mq_quiesce_queue() can't prevent
dispatching from running during that period.

In fact, the queue has to be quiesced before canceling dispatched
requests via blk_mq_tagset_busy_iter(); otherwise use-after-free
is easy to trigger. This way of canceling dispatched requests
is used in several drivers, but only NVMe uses blk_mq_quiesce_queue()
to avoid the issue, so the others need to be fixed too. It
should be the common way of handling a dead controller.

blk_mq_quiesce_queue() is implemented by stopping the queue, which
limits its uses and easily causes races, because any queue restart in
another path may break blk_mq_quiesce_queue(). For example, we sometimes
stop the queue when the hw can't handle more ongoing requests and restart
it after requests are completed. Meanwhile, when we want to cancel
requests because the hardware is dead or a suspend request was received,
quiescing has to be run first, and a restart in the completion path can
then break the quiescing. This patchset improves the interface by no longer
stopping the queue, which makes it easier to use.
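
The race described above can be modeled in a few lines of userspace C. This
is a hedged sketch, not blk-mq code: struct queue_model and the model_*
helpers are hypothetical, and the waiting for in-flight dispatch that a real
quiesce has to perform is not modeled. It only shows why a quiesce built on
the stopped bit is undone by a restart from the completion path, while an
independent quiesced bit is not.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a queue with two independent conditions. */
struct queue_model {
    bool stopped;   /* flow control: hardware is temporarily busy */
    bool quiesced;  /* cancel/teardown path: dispatch must not run */
};

static void model_stop(struct queue_model *q)      { q->stopped = true; }
static void model_start(struct queue_model *q)     { q->stopped = false; }
static void model_quiesce(struct queue_model *q)   { q->quiesced = true; }
static void model_unquiesce(struct queue_model *q) { q->quiesced = false; }

/* Dispatch is allowed only when neither condition holds. */
static bool model_may_dispatch(const struct queue_model *q)
{
    return !q->stopped && !q->quiesced;
}

/* Old scheme: "quiesce" is just a stop, so any restart undoes it. */
static bool old_scheme_dispatches_after_restart(void)
{
    struct queue_model q = { false, false };

    model_stop(&q);     /* cancel path "quiesces" by stopping */
    model_start(&q);    /* completion path restarts the queue */
    return model_may_dispatch(&q);
}

/* New scheme: the quiesced bit is unaffected by stop/start. */
static bool new_scheme_dispatches_after_restart(void)
{
    struct queue_model q = { false, false };

    model_quiesce(&q);
    model_stop(&q);     /* unrelated flow-control stop */
    model_start(&q);    /* restart from the completion path */
    return model_may_dispatch(&q);
}

/* Small self-check tying the pieces together. */
static void model_selfcheck(void)
{
    struct queue_model q = { false, false };

    assert(old_scheme_dispatches_after_restart());
    assert(!new_scheme_dispatches_after_restart());

    model_quiesce(&q);
    model_unquiesce(&q);
    assert(model_may_dispatch(&q));  /* unquiesce re-enables dispatch */
}
```

In the model, only an explicit model_unquiesce() re-enables dispatch, so
callers of stop/start do not need to know whether a quiesce is in progress.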

V3:
- wait until queue becomes unquiesced in direct issue path, so
we can avoid queueing the current req into the sw queue or scheduler
queue, and the STOPPED state needn't be touched
- move checking of !blk_queue_quiesced() into blk_mq_sched_dispatch_requests()
as suggested by Bart
- NVMe: unquiesce queue in nvme_kill_queues()
- misc changes (fix grammar issues in commit logs and comments, ...)

V2:
- split patch "blk-mq: fix blk_mq_quiesce_queue" into two and
fix one build issue when only applying the 1st two patches.
- add kernel oops and hang log into commit log
- add 'Revert "blk-mq: don't use sync workqueue flushing from drivers"'

Ming Lei (9):
  blk-mq: introduce blk_mq_unquiesce_queue
  block: introduce flag of QUEUE_FLAG_QUIESCED
  blk-mq: use the introduced blk_mq_unquiesce_queue()
  nvme: host: unquiesce queue in nvme_kill_queues()
  blk-mq: fix blk_mq_quiesce_queue
  blk-mq: update comments on blk_mq_quiesce_queue()
  blk-mq: don't stop queue for quiescing
  blk-mq: clarify dispatch may not be drained/blocked by stopping queue
  Revert "blk-mq: don't use sync workqueue flushing from drivers"

 block/blk-mq-sched.c |  3 +-
 block/blk-mq.c   | 78 
 drivers/md/dm-rq.c   |  2 +-
 drivers/nvme/host/core.c |  5 +++-
 drivers/scsi/scsi_lib.c  |  5 +++-
 include/linux/blkdev.h   |  6 
 6 files changed, 70 insertions(+), 29 deletions(-)

-- 
2.9.4

Re: [PATCH v2 5/8] blk-mq: update comments on blk_mq_quiesce_queue()

2017-05-31 Thread Ming Lei

On Tue, May 30, 2017 at 05:14:44PM +, Bart Van Assche wrote:
> On Sat, 2017-05-27 at 22:21 +0800, Ming Lei wrote:
> >  /**
> > - * blk_mq_quiesce_queue() - wait until all ongoing queue_rq calls have finished
> > + * blk_mq_quiesce_queue() - wait until all ongoing dispatching have finished
> >   * @q: request queue.
> >   *
> 
> Hello Ming,
> 
> The concept of dispatching does not have a meaning to block driver authors
> who are not familiar with the block layer internals. However, every author
> of a blk-mq driver knows what the .queue_rq() function is.

Unfortunately it isn't enough to just block .queue_rq(). Did you read the
commit log?

> Additionally, the new comment is grammatically incorrect.
> So the above change looks like a step in the wrong direction to me.

Sorry, I simply don't agree, and we have to make it explicit.

-- 
Ming

Re: [PATCH 01/22] Revert "afs: Move UUID struct to linux/uuid.h"

2017-05-31 Thread Christoph Hellwig

On Tue, May 30, 2017 at 11:00:04AM +0100, David Howells wrote:
> This isn't going to work.  You've effectively changed the types of the fields
> in the UUID struct from BE to CPU-endian, but you're still calling
> generate_random_uuid(), which produces a BE UUID.  You need to leave the
> struct members as __beXX or stop using the core UUID routines.
> 
> Just move the struct uuid_v1 as-is to the afs headers and rename it to struct
> afs_uuid.  You can then leave the (un)marshalling code alone.

That's one option.  The other option would be to revert
"afs: Use core kernel UUID generation", as that also changed the
v1 UUID to a v4 UUID.  Does the afs protocol require a v1 UUID
or does it just use the format on the wire?

Re: [PATCH 09/10] xfs: nowait aio support

2017-05-31 Thread Jan Kara

On Tue 30-05-17 11:13:29, Goldwyn Rodrigues wrote:
> > Btw, can you write a small blurb up for the man page to document these
> > semantics in man-page like language?
> > 
> 
> Yes, but which man page would it belong to?
> Should it be a subsection of errors in io_getevents/io_submit. We don't
> want to add ERRORS to io_getevents() because it would be the return
> value of the io_getevents call, and not the ones in the iocb structure.
> Should it be a new man page, say for iocb(7/8)?

I think you should extend the manpage for io_submit(2). There you can add
a definition of struct iocb in the 'DESCRIPTION' section, explaining at least
the most common fields. You can also explain there which flags can be passed
and what they are intended to do.

You can also expand the EAGAIN error description to specifically mention that
for NOWAIT aio, EAGAIN can be returned if IO submission would block.

Honza
-- 
Jan Kara 
SUSE Labs, CR
