On 9/18/19 7:59 AM, Richard W.M. Jones wrote:
> We have a running problem with the nbdkit VDDK plugin where the VDDK
> side apparently disconnects or the network connection is interrupted.
> During a virt-v2v conversion this causes the entire operation to fail,
> and since v2v conversions take many hours that's not a happy outcome.
>
> (Aside: I should say that we see many cases where it's claimed that
> the connection was dropped, but often when we examine them in detail
> the cause is something else.  But it seems like this disconnection
> thing does happen sometimes.)
nbdkit is not alone - qemu is currently trying to add patches for nbd
reconnect:
https://lists.gnu.org/archive/html/qemu-devel/2019-09/msg03621.html

> To put this in concrete terms which don't involve v2v, let's say
> you were doing something like:
>
>   nbdkit ssh host=remote /var/tmp/test.iso \
>     --run 'qemu-img convert -p -f raw $nbd -O qcow2 test.qcow2'
>
> which copies a file over ssh to local.  If /var/tmp/test.iso is very
> large and/or the connection is very slow, and the network connection
> is interrupted, then the whole operation fails.  If nbdkit could
> retry/reconnect on failure then the operation might succeed.
>
> There are lots of parameters associated with retrying, eg:
>
> - how often should you retry before giving up?
>
> - how long should you wait between retries?
>
> - which errors should cause a retry, and which are a hard failure?

- do you want TCP keepalive active during the session?

> So I had an idea we could implement this as a generic "retry" filter,
> like:
>
>   nbdkit ssh ... --filter=retry retries=5 retry-delay=5 retry-exponential=yes

Interesting idea.

> This cannot be implemented with the current design of filters because
> a filter would have to call the plugin .close and .open methods, but
> filters don't have access to those from regular data functions, and in
> any case this would cause a new plugin handle to be allocated.

Our .open handling is already odd: we document but do not enforce that a
filter must call next_open on success, but it does not necessarily do so
on failure.  Depending on where things fail, it may be possible that we
have a memory leak and/or end up calling .close without a matching
.open; I'm trying to come up with a definitive test case demonstrating
whether that is a problem.  I noticed this while trying to make nbdkit
return NBD_REP_ERR_XXX when .open fails, rather than dropping the
connection altogether (since that's a case where a single TCP connection
would need to result in multiple .open/.close pairings).
> We could probably do it if we added a special .reopen method to
> plugins.  We could either require plugins which support the concept of
> retrying to implement this, or we could have a generic implementation
> in server/backend.c which would call .close, .open and cope with the
> new handle.

It sounds like something that only needs to be exposed for filters to
use; I'm having a hard time seeing how a plugin would do it, so keeping
the magic in server/backend.c sounds reasonable.

> Another way to do this would be to modify each plugin to add the
> feature.  nbdkit-nbd-plugin has this for a very limited case, but no
> others do, and it's quite complex to implement in plugins.  As far as
> I can see it involves checking the return value of any data call that
> the plugin makes and performing the reconnection logic, while not
> changing the handle (so just calling self->close, self->open isn't
> going to work).
>
> If anyone has any thoughts about this I'd be happy to hear them.
>
> Rich.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org
_______________________________________________
Libguestfs mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/libguestfs
