Hi Michal,

First of all, thanks for resending this. It works for my test setup but
I have some comments inline below.

On Thu, Feb 26, 2015 at 05:37:12PM +0100, Michal Belczyk wrote:
> The main idea behind it is to be able to quickly detect broken replica
> and switch over to another when used with any sort of mirror type device
> built on top of any number of nbd devices.
> 
> Before this change a request would time out causing the socket to be
> shut down and the device to fail in case of a dead server or removed
> network cable only if:
> 
>   a) either the timer around kernel_sendmsg() kicked in
>   b) or the TCP failures on retransmission finally caused an error
>      on the socket, likely blocked on kernel_recvmsg() at this time,
>      waiting for replies from the server
> 
> Case a) depends mostly on the size of requests issued and on the maximum
> size of the socket buffer -- a lot of read request headers or small
> write requests could be "sent" without triggering the requested timeout
> 
> Case b) timeout is independent of nbd-client -t <timeout> option
> as there is no TCP_USER_TIMEOUT set on the client socket by default.
> And even if such timeout was set it would not solve the problem of
> an nbd-client hung on receiving replies for much longer time without
> setting TCP keep-alives... and that would be the third, independent
> timeout setting required to make it work "almost" as expected...
> 
> So, instead, take the big hammer approach and:
> 
>   *) trace the number of outstanding requests sent to the server
>      (nbd->inflight)
>   *) enable the timer (nbd->req_timer) before the first request
>      is submitted and leave it enabled
>   *) when sending next request do not touch the timer (it is up
>      to the receiving side to control it at this point)
>   *) on receive side update the timer every time a response
>      is collected but there are more to read from the server
>   *) disable the timer whenever the inflight counter drops
>      to zero or an error (leading to the socket shutdown)
>      is returned
> 
> This patch does NOT prevent the server to process a request for longer
> than the timeout specified if only it replies to any other request
> submitted within the timeout (the server still may reply to a batch
> of requests in any order).
> 
> Only the nbd->xmit_timeout != 0 code path is changed so the patch should
> not affect nbd connections running without an explicit timeout set
> on the nbd-client command line.
> 
> There is also no way to enable or disable the timeout on an active
> (nbd->pid != 0) nbd device, it is however possible to change its value.
> Otherwise the inflight request counter would have to affect the nbd
> devices enabled without nbd-client -t <timeout>.
> 
> Also move nbd->pid modifications behind nbd->tx_lock wherever possible
> to avoid races between the concurrent nbd-client invocations.

This commit message is really good.

> 
> Signed-off-by: Michal Belczyk <belc...@bsd.krakow.pl>
> ---
>  drivers/block/nbd.c | 150 
> +++++++++++++++++++++++++++++++++++++++++-----------
>  include/linux/nbd.h |   6 +++
>  2 files changed, 125 insertions(+), 31 deletions(-)
> 
> diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
> index 4bc2a5c..cc4a98a 100644
> --- a/drivers/block/nbd.c
> +++ b/drivers/block/nbd.c
> @@ -140,11 +140,24 @@ static void sock_shutdown(struct nbd_device *nbd, int 
> lock)
>  
>  static void nbd_xmit_timeout(unsigned long arg)
>  {
> -     struct task_struct *task = (struct task_struct *)arg;
> +     struct nbd_device *nbd = (struct nbd_device *)arg;
> +     struct task_struct *task_ary[2];
> +     unsigned long flags;
> +     int i;
>  
> -     printk(KERN_WARNING "nbd: killing hung xmit (%s, pid: %d)\n",
> -             task->comm, task->pid);
> -     force_sig(SIGKILL, task);
> +     spin_lock_irqsave(&nbd->timer_lock, flags);
> +     nbd->timedout = 1;
> +     task_ary[0] = nbd->sender;
> +     task_ary[1] = nbd->receiver;
> +     for (i = 0; i < 2; i++) {
> +             if (task_ary[i] == NULL)
> +                     continue;
> +             printk(KERN_WARNING "nbd: killing hung xmit (%s, pid: %d)\n",
> +                     task_ary[i]->comm, task_ary[i]->pid);
> +             force_sig(SIGKILL, task_ary[i]);
> +             break;
> +     }
> +     spin_unlock_irqrestore(&nbd->timer_lock, flags);
>  }
>  
>  /*
> @@ -158,7 +171,7 @@ static int sock_xmit(struct nbd_device *nbd, int send, 
> void *buf, int size,
>       struct msghdr msg;
>       struct kvec iov;
>       sigset_t blocked, oldset;
> -     unsigned long pflags = current->flags;
> +     unsigned long flags, pflags = current->flags;

I would prefer variable declerations in seperate lines, especially if
there is an assignment involved, for readability.

>  
>       if (unlikely(!sock)) {
>               dev_err(disk_to_dev(nbd->disk),
> @@ -183,23 +196,39 @@ static int sock_xmit(struct nbd_device *nbd, int send, 
> void *buf, int size,
>               msg.msg_controllen = 0;
>               msg.msg_flags = msg_flags | MSG_NOSIGNAL;
>  
> -             if (send) {
> -                     struct timer_list ti;
> -
> -                     if (nbd->xmit_timeout) {
> -                             init_timer(&ti);
> -                             ti.function = nbd_xmit_timeout;
> -                             ti.data = (unsigned long)current;
> -                             ti.expires = jiffies + nbd->xmit_timeout;
> -                             add_timer(&ti);
> +             if (nbd->xmit_timeout) {
> +                     spin_lock_irqsave(&nbd->timer_lock, flags);
> +                     if (nbd->timedout) {
> +                             spin_unlock_irqrestore(&nbd->timer_lock, flags);
> +                             printk(KERN_WARNING
> +                                     "nbd (pid %d: %s) timed out\n",
> +                                     task_pid_nr(current), current->comm);
> +                             result = -EINTR;
> +                             sock_shutdown(nbd, !send);
> +                             break;

I am not a big fan of having this block with a sock_shutdown at two
points in sock_xmit, here as reaction on nbd->timedout and below as
handler of the signal received.

>                       }
> +                     if (send)
> +                             nbd->sender = current;
> +                     else
> +                             nbd->receiver = current;
> +                     spin_unlock_irqrestore(&nbd->timer_lock, flags);
> +             }
> +

What happens if the timer interrupt is happens right at this point?
The other process may just fall out of the kernel_send/recvmsg() call
through the signal and call into sock_shutdown(). There it would wait
for the tx_lock (assuming it is receiving). At the same time this
process would continue into kernel_sendmsg. There is no active timer
anymore and the other process is waiting in the lock.
Is there something I am missing or may this be a problem?

> +             if (send)
>                       result = kernel_sendmsg(sock, &msg, &iov, 1, size);
> -                     if (nbd->xmit_timeout)
> -                             del_timer_sync(&ti);
> -             } else
> +             else
>                       result = kernel_recvmsg(sock, &msg, &iov, 1, size,
>                                               msg.msg_flags);
>  
> +             if (nbd->xmit_timeout) {
> +                     spin_lock_irqsave(&nbd->timer_lock, flags);
> +                     if (send)
> +                             nbd->sender = NULL;
> +                     else
> +                             nbd->receiver = NULL;
> +                     spin_unlock_irqrestore(&nbd->timer_lock, flags);
> +             }
> +
>               if (signal_pending(current)) {
>                       siginfo_t info;
>                       printk(KERN_WARNING "nbd (pid %d: %s) got signal %d\n",
> @@ -226,12 +255,12 @@ static int sock_xmit(struct nbd_device *nbd, int send, 
> void *buf, int size,
>  }
>  
>  static inline int sock_send_bvec(struct nbd_device *nbd, struct bio_vec 
> *bvec,
> -             int flags)
> +             int msg_flags)
>  {
>       int result;
>       void *kaddr = kmap(bvec->bv_page);
>       result = sock_xmit(nbd, 1, kaddr + bvec->bv_offset,
> -                        bvec->bv_len, flags);
> +                        bvec->bv_len, msg_flags);

This renaming from flags to msg_flags seems unrelated. I also don't see
why this is necessary. Please split this into a separate patch if it is
necessary.

>       kunmap(bvec->bv_page);
>       return result;
>  }
> @@ -239,9 +268,9 @@ static inline int sock_send_bvec(struct nbd_device *nbd, 
> struct bio_vec *bvec,
>  /* always call with the tx_lock held */
>  static int nbd_send_req(struct nbd_device *nbd, struct request *req)
>  {
> -     int result, flags;
> +     int result, msg_flags;

Same as above, please rename msg_flags in a separate patch.
Perhaps 'irq_flags' for the new 'flags' variable would be better?

>       struct nbd_request request;
> -     unsigned long size = blk_rq_bytes(req);
> +     unsigned long flags, size = blk_rq_bytes(req);
>  
>       memset(&request, 0, sizeof(request));
>       request.magic = htonl(NBD_REQUEST_MAGIC);
> @@ -253,6 +282,19 @@ static int nbd_send_req(struct nbd_device *nbd, struct 
> request *req)
>       }
>       memcpy(request.handle, &req, sizeof(req));
>  
> +     if (nbd->xmit_timeout) {
> +             spin_lock_irqsave(&nbd->timer_lock, flags);
> +             if (!nbd->inflight) {
> +                     nbd->req_timer.function = nbd_xmit_timeout;
> +                     nbd->req_timer.data = (unsigned long)nbd;

Isn't it possible to assign the above two at initialization?

> +                     nbd->req_timer.expires = jiffies + nbd->xmit_timeout;
> +                     add_timer(&nbd->req_timer);
> +             }
> +             nbd->inflight++;
> +             BUG_ON(nbd->inflight <= 0);
> +             spin_unlock_irqrestore(&nbd->timer_lock, flags);
> +     }
> +
>       dprintk(DBG_TX, "%s: request %p: sending control (%s@%llu,%uB)\n",
>                       nbd->disk->disk_name, req,
>                       nbdcmd_to_ascii(nbd_cmd(req)),
> @@ -274,12 +316,12 @@ static int nbd_send_req(struct nbd_device *nbd, struct 
> request *req)
>                * whether to set MSG_MORE or not...
>                */
>               rq_for_each_segment(bvec, req, iter) {
> -                     flags = 0;
> +                     msg_flags = 0;
>                       if (!rq_iter_last(bvec, iter))
> -                             flags = MSG_MORE;
> +                             msg_flags = MSG_MORE;
>                       dprintk(DBG_TX, "%s: request %p: sending %d bytes 
> data\n",
>                                       nbd->disk->disk_name, req, bvec.bv_len);
> -                     result = sock_send_bvec(nbd, &bvec, flags);
> +                     result = sock_send_bvec(nbd, &bvec, msg_flags);
>                       if (result <= 0) {
>                               dev_err(disk_to_dev(nbd->disk),
>                                       "Send data failed (result %d)\n",
> @@ -291,6 +333,14 @@ static int nbd_send_req(struct nbd_device *nbd, struct 
> request *req)
>       return 0;
>  
>  error_out:
> +     if (nbd->xmit_timeout) {
> +             spin_lock_irqsave(&nbd->timer_lock, flags);
> +             nbd->inflight--;
> +             BUG_ON(nbd->inflight < 0);

nbd->inflight was checked above already with BUG_ON().

> +             if (!nbd->inflight)
> +                     del_timer_sync(&nbd->req_timer);
> +             spin_unlock_irqrestore(&nbd->timer_lock, flags);
> +     }
>       return -EIO;
>  }
>  
> @@ -412,24 +462,41 @@ static struct device_attribute pid_attr = {
>  static int nbd_do_it(struct nbd_device *nbd)
>  {
>       struct request *req;
> +     unsigned long flags;
>       int ret;
>  
>       BUG_ON(nbd->magic != NBD_MAGIC);
>  
>       sk_set_memalloc(nbd->sock->sk);
> -     nbd->pid = task_pid_nr(current);
>       ret = device_create_file(disk_to_dev(nbd->disk), &pid_attr);
>       if (ret) {
>               dev_err(disk_to_dev(nbd->disk), "device_create_file failed!\n");
> -             nbd->pid = 0;
>               return ret;
>       }
>  
> -     while ((req = nbd_read_stat(nbd)) != NULL)
> -             nbd_end_request(req);
> +     for (;;) {
> +             req = nbd_read_stat(nbd);
> +             if (nbd->xmit_timeout) {
> +                     spin_lock_irqsave(&nbd->timer_lock, flags);
> +                     if (req != NULL) {
> +                             nbd->inflight--;
> +                             BUG_ON(nbd->inflight < 0);
> +                     }
> +                     if (req != NULL && nbd->inflight)
> +                             mod_timer(&nbd->req_timer,
> +                                       jiffies + nbd->xmit_timeout);
> +                     else
> +                             del_timer_sync(&nbd->req_timer);
> +                     spin_unlock_irqrestore(&nbd->timer_lock, flags);
> +             }
> +             if (req != NULL) {
> +                     nbd_end_request(req);
> +                     continue;
> +             }
> +             break;
> +     }

Please make this loop a bit more readable. Perhaps something like this?

        while ((req = nbd_read_stat(nbd)) != NULL) {
                nbd_end_request(req);
                if (nbd->xmit_timeout) {
                        ...
                }
        }
        if (nbd->xmit_timeout) {
                lock();
                del_timer_sync();
                unlock();
        }

>  
>       device_remove_file(disk_to_dev(nbd->disk), &pid_attr);
> -     nbd->pid = 0;

Why is this removed?

>       return 0;
>  }
>  
> @@ -669,9 +736,20 @@ static int __nbd_ioctl(struct block_device *bdev, struct 
> nbd_device *nbd,
>               set_capacity(nbd->disk, nbd->bytesize >> 9);
>               return 0;
>  
> -     case NBD_SET_TIMEOUT:
> -             nbd->xmit_timeout = arg * HZ;
> +     case NBD_SET_TIMEOUT: {
> +             int xt;
> +
> +             xt = arg * HZ;
> +             if (xt < 0)
> +                     return -EINVAL;
> +             if (nbd->pid &&

Could you use nbd->sock here please? This doesn't really depend on a pid
does it?

> +                 ((!nbd->xmit_timeout && xt) || (nbd->xmit_timeout && !xt)))
> +                     return -EBUSY;
> +             dev_info(disk_to_dev(nbd->disk), "NBD_SET_TIMEOUT: %d -> %d\n",
> +                     nbd->xmit_timeout / HZ, xt / HZ);

Please use dev_dbg here.

> +             nbd->xmit_timeout = xt;
>               return 0;
> +     }
>  
>       case NBD_SET_FLAGS:
>               nbd->flags = arg;
> @@ -694,6 +772,11 @@ static int __nbd_ioctl(struct block_device *bdev, struct 
> nbd_device *nbd,
>               if (!nbd->sock)
>                       return -EINVAL;
>  
> +             nbd->pid = task_pid_nr(current);
> +             nbd->inflight = 0;
> +             nbd->timedout = 0;
> +             nbd->sender = NULL;
> +             nbd->receiver = NULL;
>               mutex_unlock(&nbd->tx_lock);
>  
>               if (nbd->flags & NBD_FLAG_READ_ONLY)
> @@ -710,6 +793,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct 
> nbd_device *nbd,
>                                       nbd->disk->disk_name);
>               if (IS_ERR(thread)) {
>                       mutex_lock(&nbd->tx_lock);
> +                     nbd->pid = 0;

I can't really see why it is necessary to get nbd->pid locked? And if
you need this, please split it into a separate patch.

>                       return PTR_ERR(thread);
>               }
>               wake_up_process(thread);
> @@ -717,6 +801,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct 
> nbd_device *nbd,
>               kthread_stop(thread);
>  
>               mutex_lock(&nbd->tx_lock);
> +             nbd->pid = 0;
>               if (error)
>                       return error;
>               sock_shutdown(nbd, 0);
> @@ -731,6 +816,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct 
> nbd_device *nbd,
>                       sockfd_put(sock);
>               nbd->flags = 0;
>               nbd->bytesize = 0;
> +             nbd->xmit_timeout = 0;
>               bdev->bd_inode->i_size = 0;
>               set_capacity(nbd->disk, 0);
>               if (max_part > 0)
> @@ -874,6 +960,8 @@ static int __init nbd_init(void)
>               init_waitqueue_head(&nbd_dev[i].waiting_wq);
>               nbd_dev[i].blksize = 1024;
>               nbd_dev[i].bytesize = 0;
> +             spin_lock_init(&nbd_dev[i].timer_lock);
> +             init_timer(&nbd_dev[i].req_timer);
>               disk->major = NBD_MAJOR;
>               disk->first_minor = i << part_shift;
>               disk->fops = &nbd_fops;
> diff --git a/include/linux/nbd.h b/include/linux/nbd.h
> index f62f78a..c1280ca 100644
> --- a/include/linux/nbd.h
> +++ b/include/linux/nbd.h
> @@ -41,6 +41,12 @@ struct nbd_device {
>       pid_t pid; /* pid of nbd-client, if attached */
>       int xmit_timeout;
>       int disconnect; /* a disconnect has been requested by user */
> +     spinlock_t timer_lock;
> +     struct timer_list req_timer;
> +     int inflight;
> +     int timedout;

Do we really need 'timedout'? or is it possible to use the pending
signal to immediately shutdown the nbd?

Best Regards,

Markus

-- 
Pengutronix e.K.                           |                             |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |

Attachment: signature.asc
Description: Digital signature

------------------------------------------------------------------------------
Dive into the World of Parallel Programming The Go Parallel Website, sponsored
by Intel and developed in partnership with Slashdot Media, is your hub for all
things parallel software development, from weekly thought leadership blogs to
news, videos, case studies, tutorials and more. Take a look and join the 
conversation now. http://goparallel.sourceforge.net/
_______________________________________________
Nbd-general mailing list
Nbd-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nbd-general

Reply via email to