Re: [Nbd] [PATCH] nbd: improve request timeouts handling

Markus Pargmann Wed, 04 Mar 2015 23:18:47 -0800

Hi Michal,

On Tue, Mar 03, 2015 at 11:26:39PM +0100, Michal Belczyk wrote:
> Hi Markus,
> 
> sorry for the late reply, please find my comments inline...


not late at all.

> 
> On Sun, Mar 01, 2015 at 09:39:19PM +0100, Markus Pargmann wrote:
> > Hi Michal,
> > 
> > First of all, thanks for resending this. It works for my test setup but
> > I have some comments inline below.
> > 
> > On Thu, Feb 26, 2015 at 05:37:12PM +0100, Michal Belczyk wrote:
> > > Signed-off-by: Michal Belczyk <belc...@bsd.krakow.pl>
> > > ---
> > >  drivers/block/nbd.c | 150 
> > > +++++++++++++++++++++++++++++++++++++++++-----------
> > >  include/linux/nbd.h |   6 +++
> > >  2 files changed, 125 insertions(+), 31 deletions(-)
> > > 
> > > diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
> > > index 4bc2a5c..cc4a98a 100644
> > > --- a/drivers/block/nbd.c
> > > +++ b/drivers/block/nbd.c
> > > @@ -140,11 +140,24 @@ static void sock_shutdown(struct nbd_device *nbd, 
> > > int lock)
> > >  
> > >  static void nbd_xmit_timeout(unsigned long arg)
> > >  {
> > > - struct task_struct *task = (struct task_struct *)arg;
> > > + struct nbd_device *nbd = (struct nbd_device *)arg;
> > > + struct task_struct *task_ary[2];
> > > + unsigned long flags;
> > > + int i;
> > >  
> > > - printk(KERN_WARNING "nbd: killing hung xmit (%s, pid: %d)\n",
> > > -         task->comm, task->pid);
> > > - force_sig(SIGKILL, task);
> > > + spin_lock_irqsave(&nbd->timer_lock, flags);
> > > + nbd->timedout = 1;
> > > + task_ary[0] = nbd->sender;
> > > + task_ary[1] = nbd->receiver;
> > > + for (i = 0; i < 2; i++) {
> > > +         if (task_ary[i] == NULL)
> > > +                 continue;
> > > +         printk(KERN_WARNING "nbd: killing hung xmit (%s, pid: %d)\n",
> > > +                 task_ary[i]->comm, task_ary[i]->pid);
> > > +         force_sig(SIGKILL, task_ary[i]);
> > > +         break;
> > > + }
> > > + spin_unlock_irqrestore(&nbd->timer_lock, flags);
> > >  }
> > >  
> > >  /*
> > > @@ -158,7 +171,7 @@ static int sock_xmit(struct nbd_device *nbd, int 
> > > send, void *buf, int size,
> > >   struct msghdr msg;
> > >   struct kvec iov;
> > >   sigset_t blocked, oldset;
> > > - unsigned long pflags = current->flags;
> > > + unsigned long flags, pflags = current->flags;
> > 
> > I would prefer variable declerations in seperate lines, especially if
> > there is an assignment involved, for readability.
> 
> Feel free to change it to whatever style you prefer.
> 
> 
> > >   if (unlikely(!sock)) {
> > >           dev_err(disk_to_dev(nbd->disk),
> > > @@ -183,23 +196,39 @@ static int sock_xmit(struct nbd_device *nbd, int 
> > > send, void *buf, int size,
> > >           msg.msg_controllen = 0;
> > >           msg.msg_flags = msg_flags | MSG_NOSIGNAL;
> > >  
> > > -         if (send) {
> > > -                 struct timer_list ti;
> > > -
> > > -                 if (nbd->xmit_timeout) {
> > > -                         init_timer(&ti);
> > > -                         ti.function = nbd_xmit_timeout;
> > > -                         ti.data = (unsigned long)current;
> > > -                         ti.expires = jiffies + nbd->xmit_timeout;
> > > -                         add_timer(&ti);
> > > +         if (nbd->xmit_timeout) {
> > > +                 spin_lock_irqsave(&nbd->timer_lock, flags);
> > > +                 if (nbd->timedout) {
> > > +                         spin_unlock_irqrestore(&nbd->timer_lock, flags);
> > > +                         printk(KERN_WARNING
> > > +                                 "nbd (pid %d: %s) timed out\n",
> > > +                                 task_pid_nr(current), current->comm);
> > > +                         result = -EINTR;
> > > +                         sock_shutdown(nbd, !send);
> > > +                         break;
> > 
> > I am not a big fan of having this block with a sock_shutdown at two
> > points in sock_xmit, here as reaction on nbd->timedout and below as
> > handler of the signal received.
> 
> Me neither, feel free to change it to whatever you prefer as long as it
> is functionally the same thing.
> The whole patch was not meant to be pushed as-is but rather as an
> initial sketch of code to discuss, so far ignored for over a year...

Yes, sorry for that.

> 
> 
> > >                   }
> > > +                 if (send)
> > > +                         nbd->sender = current;
> > > +                 else
> > > +                         nbd->receiver = current;
> > > +                 spin_unlock_irqrestore(&nbd->timer_lock, flags);
> > > +         }
> > > +
> > 
> > What happens if the timer interrupt is happens right at this point?
> > The other process may just fall out of the kernel_send/recvmsg() call
> > through the signal and call into sock_shutdown(). There it would wait
> > for the tx_lock (assuming it is receiving). At the same time this
> > process would continue into kernel_sendmsg. There is no active timer
> > anymore and the other process is waiting in the lock.
> > Is there something I am missing or may this be a problem?
> 
> That bit of the code path is somewhat tricky.
> Most writes to the socket will be non-blocking as long as the sockbuf
> permits.  If I understand correctly you are most concerned of the path
> where the SIGKILL(timeout) to the sender thread is queued prior to entering
> kernel_sendmsg() which would block it indefinitely and also cause the
> receiver process to wait for the tx_lock in its shutdown path, right?
> When that happens, the semantics should be similar to a regular write()
> on the blocking socket -- return -1 and set errno to EINTR unless
> sigaction(SA_RESTART) foo was done, etc.
> This actually works almost the same way in our case -- the pending
> signal will cause kernel_sendmsg() to eventually return -ERESTARTSYS.
> You may actually verify this theory by adding:
> 
>     if (send) {
>         force_sig(SIGKILL, current);
>       result = kernel_sendmsg(sock, &msg, &iov, 1, size);
>       printk(KERN_WARNING "result=%d\n", result);
>       if (result >= 0)
>           flush_signals(current);
>     }
> 
> Then if you reduce the default sockbuf size (net.ipv4.tcp_wmem) to, say,
> 4KB for min/initial/max, a simple dd if=/dev/zero of=/dev/nbdX bs=512K
> count=1 oflag=direct should do the trick -- dmesg will reveal what
> happened...
> 
> Please correct me if I am wrong.

Thanks for the explanation, I forgot that there is a signal pending for
the second process as well. I will test this to be sure but it should work
then.

> 
> 
> > > +         if (send)
> > >                   result = kernel_sendmsg(sock, &msg, &iov, 1, size);
> > > -                 if (nbd->xmit_timeout)
> > > -                         del_timer_sync(&ti);
> > > -         } else
> > > +         else
> > >                   result = kernel_recvmsg(sock, &msg, &iov, 1, size,
> > >                                           msg.msg_flags);
> > >  
> > > +         if (nbd->xmit_timeout) {
> > > +                 spin_lock_irqsave(&nbd->timer_lock, flags);
> > > +                 if (send)
> > > +                         nbd->sender = NULL;
> > > +                 else
> > > +                         nbd->receiver = NULL;
> > > +                 spin_unlock_irqrestore(&nbd->timer_lock, flags);
> > > +         }
> > > +
> > >           if (signal_pending(current)) {
> > >                   siginfo_t info;
> > >                   printk(KERN_WARNING "nbd (pid %d: %s) got signal %d\n",
> > > @@ -226,12 +255,12 @@ static int sock_xmit(struct nbd_device *nbd, int 
> > > send, void *buf, int size,
> > >  }
> > >  
> > >  static inline int sock_send_bvec(struct nbd_device *nbd, struct bio_vec 
> > > *bvec,
> > > -         int flags)
> > > +         int msg_flags)
> > >  {
> > >   int result;
> > >   void *kaddr = kmap(bvec->bv_page);
> > >   result = sock_xmit(nbd, 1, kaddr + bvec->bv_offset,
> > > -                    bvec->bv_len, flags);
> > > +                    bvec->bv_len, msg_flags);
> > 
> > This renaming from flags to msg_flags seems unrelated. I also don't see
> > why this is necessary. Please split this into a separate patch if it is
> > necessary.
> 
> This is unrelated, you may well remove it from the patch.
> 
> 
> > >   kunmap(bvec->bv_page);
> > >   return result;
> > >  }
> > > @@ -239,9 +268,9 @@ static inline int sock_send_bvec(struct nbd_device 
> > > *nbd, struct bio_vec *bvec,
> > >  /* always call with the tx_lock held */
> > >  static int nbd_send_req(struct nbd_device *nbd, struct request *req)
> > >  {
> > > - int result, flags;
> > > + int result, msg_flags;
> > 
> > Same as above, please rename msg_flags in a separate patch.
> > Perhaps 'irq_flags' for the new 'flags' variable would be better?
> 
> Whatever you prefer.
> 
> 
> > >   struct nbd_request request;
> > > - unsigned long size = blk_rq_bytes(req);
> > > + unsigned long flags, size = blk_rq_bytes(req);
> > >  
> > >   memset(&request, 0, sizeof(request));
> > >   request.magic = htonl(NBD_REQUEST_MAGIC);
> > > @@ -253,6 +282,19 @@ static int nbd_send_req(struct nbd_device *nbd, 
> > > struct request *req)
> > >   }
> > >   memcpy(request.handle, &req, sizeof(req));
> > >  
> > > + if (nbd->xmit_timeout) {
> > > +         spin_lock_irqsave(&nbd->timer_lock, flags);
> > > +         if (!nbd->inflight) {
> > > +                 nbd->req_timer.function = nbd_xmit_timeout;
> > > +                 nbd->req_timer.data = (unsigned long)nbd;
> > 
> > Isn't it possible to assign the above two at initialization?
> 
> I think the first two should be ok to move to the init code.
> 
> 
> > > +                 nbd->req_timer.expires = jiffies + nbd->xmit_timeout;
> > > +                 add_timer(&nbd->req_timer);
> > > +         }
> > > +         nbd->inflight++;
> > > +         BUG_ON(nbd->inflight <= 0);
> > > +         spin_unlock_irqrestore(&nbd->timer_lock, flags);
> > > + }
> > > +
> > >   dprintk(DBG_TX, "%s: request %p: sending control (%s@%llu,%uB)\n",
> > >                   nbd->disk->disk_name, req,
> > >                   nbdcmd_to_ascii(nbd_cmd(req)),
> > > @@ -274,12 +316,12 @@ static int nbd_send_req(struct nbd_device *nbd, 
> > > struct request *req)
> > >            * whether to set MSG_MORE or not...
> > >            */
> > >           rq_for_each_segment(bvec, req, iter) {
> > > -                 flags = 0;
> > > +                 msg_flags = 0;
> > >                   if (!rq_iter_last(bvec, iter))
> > > -                         flags = MSG_MORE;
> > > +                         msg_flags = MSG_MORE;
> > >                   dprintk(DBG_TX, "%s: request %p: sending %d bytes 
> > > data\n",
> > >                                   nbd->disk->disk_name, req, bvec.bv_len);
> > > -                 result = sock_send_bvec(nbd, &bvec, flags);
> > > +                 result = sock_send_bvec(nbd, &bvec, msg_flags);
> > >                   if (result <= 0) {
> > >                           dev_err(disk_to_dev(nbd->disk),
> > >                                   "Send data failed (result %d)\n",
> > > @@ -291,6 +333,14 @@ static int nbd_send_req(struct nbd_device *nbd, 
> > > struct request *req)
> > >   return 0;
> > >  
> > >  error_out:
> > > + if (nbd->xmit_timeout) {
> > > +         spin_lock_irqsave(&nbd->timer_lock, flags);
> > > +         nbd->inflight--;
> > > +         BUG_ON(nbd->inflight < 0);
> > 
> > nbd->inflight was checked above already with BUG_ON().
> 
> The BUG_ON()'s may well be removed -- this is a leftover from my initial
> testing of this patch to make sure the counters do not overflow or go
> negative -- do whatever you want with it (I'd leave them as they are,
> but it's up to you).
> 
> 
> > > +         if (!nbd->inflight)
> > > +                 del_timer_sync(&nbd->req_timer);
> > > +         spin_unlock_irqrestore(&nbd->timer_lock, flags);
> > > + }
> > >   return -EIO;
> > >  }
> > >  
> > > @@ -412,24 +462,41 @@ static struct device_attribute pid_attr = {
> > >  static int nbd_do_it(struct nbd_device *nbd)
> > >  {
> > >   struct request *req;
> > > + unsigned long flags;
> > >   int ret;
> > >  
> > >   BUG_ON(nbd->magic != NBD_MAGIC);
> > >  
> > >   sk_set_memalloc(nbd->sock->sk);
> > > - nbd->pid = task_pid_nr(current);
> > >   ret = device_create_file(disk_to_dev(nbd->disk), &pid_attr);
> > >   if (ret) {
> > >           dev_err(disk_to_dev(nbd->disk), "device_create_file failed!\n");
> > > -         nbd->pid = 0;
> > >           return ret;
> > >   }
> > >  
> > > - while ((req = nbd_read_stat(nbd)) != NULL)
> > > -         nbd_end_request(req);
> > > + for (;;) {
> > > +         req = nbd_read_stat(nbd);
> > > +         if (nbd->xmit_timeout) {
> > > +                 spin_lock_irqsave(&nbd->timer_lock, flags);
> > > +                 if (req != NULL) {
> > > +                         nbd->inflight--;
> > > +                         BUG_ON(nbd->inflight < 0);
> > > +                 }
> > > +                 if (req != NULL && nbd->inflight)
> > > +                         mod_timer(&nbd->req_timer,
> > > +                                   jiffies + nbd->xmit_timeout);
> > > +                 else
> > > +                         del_timer_sync(&nbd->req_timer);
> > > +                 spin_unlock_irqrestore(&nbd->timer_lock, flags);
> > > +         }
> > > +         if (req != NULL) {
> > > +                 nbd_end_request(req);
> > > +                 continue;
> > > +         }
> > > +         break;
> > > + }
> > 
> > Please make this loop a bit more readable. Perhaps something like this?
> > 
> >     while ((req = nbd_read_stat(nbd)) != NULL) {
> >             nbd_end_request(req);
> >             if (nbd->xmit_timeout) {
> >                     ...
> >             }
> >     }
> >     if (nbd->xmit_timeout) {
> >             lock();
> >             del_timer_sync();
> >             unlock();
> >     }
> 
> I'm fine with either -- a loop is a loop.  Rework it any way which makes
> you more comfortable to work with this code later.
> 
> 
> > >   device_remove_file(disk_to_dev(nbd->disk), &pid_attr);
> > > - nbd->pid = 0;
> > 
> > Why is this removed?
> 
> Not sure over a year after the initial patch submission but looking at
> my commit message it must have something to do with its last bit -- the
> non-locked access to nbd->pid?
> 
> 
> > >   return 0;
> > >  }
> > >  
> > > @@ -669,9 +736,20 @@ static int __nbd_ioctl(struct block_device *bdev, 
> > > struct nbd_device *nbd,
> > >           set_capacity(nbd->disk, nbd->bytesize >> 9);
> > >           return 0;
> > >  
> > > - case NBD_SET_TIMEOUT:
> > > -         nbd->xmit_timeout = arg * HZ;
> > > + case NBD_SET_TIMEOUT: {
> > > +         int xt;
> > > +
> > > +         xt = arg * HZ;
> > > +         if (xt < 0)
> > > +                 return -EINVAL;
> > > +         if (nbd->pid &&
> > 
> > Could you use nbd->sock here please? This doesn't really depend on a pid
> > does it?
> 
> I remember that when I started looking at this code initially I found
> the whole nbd->pid / nbd->sock kind of redundant... I stuck with using
> nbd->pid but maybe it was the wrong way...

That's fine. I also think they are kind of redundant. I will probably
replace the usage of nbd->pid with nbd->sock so that nbd->pid is just
exported as read-only file to userspace as debugging help.

> 
> 
> > > +             ((!nbd->xmit_timeout && xt) || (nbd->xmit_timeout && !xt)))
> > > +                 return -EBUSY;
> > > +         dev_info(disk_to_dev(nbd->disk), "NBD_SET_TIMEOUT: %d -> %d\n",
> > > +                 nbd->xmit_timeout / HZ, xt / HZ);
> > 
> > Please use dev_dbg here.
> 
> Feel free to change it to what now is preferred.
> 
> 
> > > +         nbd->xmit_timeout = xt;
> > >           return 0;
> > > + }
> > >  
> > >   case NBD_SET_FLAGS:
> > >           nbd->flags = arg;
> > > @@ -694,6 +772,11 @@ static int __nbd_ioctl(struct block_device *bdev, 
> > > struct nbd_device *nbd,
> > >           if (!nbd->sock)
> > >                   return -EINVAL;
> > >  
> > > +         nbd->pid = task_pid_nr(current);
> > > +         nbd->inflight = 0;
> > > +         nbd->timedout = 0;
> > > +         nbd->sender = NULL;
> > > +         nbd->receiver = NULL;
> > >           mutex_unlock(&nbd->tx_lock);
> > >  
> > >           if (nbd->flags & NBD_FLAG_READ_ONLY)
> > > @@ -710,6 +793,7 @@ static int __nbd_ioctl(struct block_device *bdev, 
> > > struct nbd_device *nbd,
> > >                                   nbd->disk->disk_name);
> > >           if (IS_ERR(thread)) {
> > >                   mutex_lock(&nbd->tx_lock);
> > > +                 nbd->pid = 0;
> > 
> > I can't really see why it is necessary to get nbd->pid locked? And if
> > you need this, please split it into a separate patch.
> 
> I think it has to be locked for multiple concurrent nbd-client
> invocations, doesn't it?
> For already established /dev/nbdX you may run things like nbd-client -l
> which end up running exactly the same kernel code and using the same nbd
> structure in the kernel, right?  Or am I being paranoid here?

Yes they are using nbd->pid as well. But there shouldn't be any
concurrent users of this that highly depend on this. But I see why it
may be better to lock this.

Thanks,

Markus Pargmann

> 
> 
> > >                   return PTR_ERR(thread);
> > >           }
> > >           wake_up_process(thread);
> > > @@ -717,6 +801,7 @@ static int __nbd_ioctl(struct block_device *bdev, 
> > > struct nbd_device *nbd,
> > >           kthread_stop(thread);
> > >  
> > >           mutex_lock(&nbd->tx_lock);
> > > +         nbd->pid = 0;
> > >           if (error)
> > >                   return error;
> > >           sock_shutdown(nbd, 0);
> > > @@ -731,6 +816,7 @@ static int __nbd_ioctl(struct block_device *bdev, 
> > > struct nbd_device *nbd,
> > >                   sockfd_put(sock);
> > >           nbd->flags = 0;
> > >           nbd->bytesize = 0;
> > > +         nbd->xmit_timeout = 0;
> > >           bdev->bd_inode->i_size = 0;
> > >           set_capacity(nbd->disk, 0);
> > >           if (max_part > 0)
> > > @@ -874,6 +960,8 @@ static int __init nbd_init(void)
> > >           init_waitqueue_head(&nbd_dev[i].waiting_wq);
> > >           nbd_dev[i].blksize = 1024;
> > >           nbd_dev[i].bytesize = 0;
> > > +         spin_lock_init(&nbd_dev[i].timer_lock);
> > > +         init_timer(&nbd_dev[i].req_timer);
> > >           disk->major = NBD_MAJOR;
> > >           disk->first_minor = i << part_shift;
> > >           disk->fops = &nbd_fops;
> > > diff --git a/include/linux/nbd.h b/include/linux/nbd.h
> > > index f62f78a..c1280ca 100644
> > > --- a/include/linux/nbd.h
> > > +++ b/include/linux/nbd.h
> > > @@ -41,6 +41,12 @@ struct nbd_device {
> > >   pid_t pid; /* pid of nbd-client, if attached */
> > >   int xmit_timeout;
> > >   int disconnect; /* a disconnect has been requested by user */
> > > + spinlock_t timer_lock;
> > > + struct timer_list req_timer;
> > > + int inflight;
> > > + int timedout;
> > 
> > Do we really need 'timedout'? or is it possible to use the pending
> > signal to immediately shutdown the nbd?
> 
> Hmm... possibly.
> 
> Kind regards,
> 
> -- 
> Michal Belczyk Sr.
> 

-- 
Pengutronix e.K.                           |                             |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |

signature.asc
Description: Digital signature

------------------------------------------------------------------------------
Dive into the World of Parallel Programming The Go Parallel Website, sponsored
by Intel and developed in partnership with Slashdot Media, is your hub for all
things parallel software development, from weekly thought leadership blogs to
news, videos, case studies, tutorials and more. Take a look and join the 
conversation now. http://goparallel.sourceforge.net/

_______________________________________________
Nbd-general mailing list
Nbd-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nbd-general

Re: [Nbd] [PATCH] nbd: improve request timeouts handling

Reply via email to