On March 11, 2020 11:44 am, Mira Limbeck wrote: > The NBD socket on the client side is kept open until the VM stops. In > the case of TCP this was not a problem, because we did not rely on it > closing. But with unix socket forwarding via SSH the SSH connection > can't close if the socket is still open. > > If it does not close by itself, it will be terminated after 30 seconds.
in which combinations did you test this? it prints pretty ugly warnings here for a VM with a single, local disk: new to old (TCP, but with stop before tunnel close): 2020-03-11 12:52:21 migration status: completed drive-scsi0: transferred: 17179869184 bytes remaining: 0 bytes total: 17179869184 bytes progression: 100.00 % busy: 0 ready: 1 all mirroring jobs are ready drive-scsi0: Completing block job... drive-scsi0: Completed successfully. drive-scsi0 : finished VM quit/powerdown failed - terminating now with SIGTERM VM still running - terminating now with SIGKILL 2020-03-11 12:52:38 migration finished successfully (duration 00:00:46) I haven't checked/verified, but I think the problem is that the source VM still has the NBD connection open, and when you call 'quit', it will attempt to flush the NBD connection and block on that (I ran into a similar problem with my websocket tunnel). new to new (UNIX, stop before tunnel close): 2020-03-11 12:51:25 migration status: completed drive-scsi0: transferred: 17179869184 bytes remaining: 0 bytes total: 17179869184 bytes progression: 100.00 % busy: 0 ready: 1 all mirroring jobs are ready drive-scsi0: Completing block job... drive-scsi0: Completed successfully. drive-scsi0 : finished channel 4: open failed: connect failed: open failed channel 3: open failed: connect failed: open failed channel 3: open failed: connect failed: open failed channel 3: open failed: connect failed: open failed channel 3: open failed: connect failed: open failed 2020-03-11 12:51:43 migration finished successfully (duration 00:01:22) if I interpret the strace correctly, here's what happens: 1. nbd connection gets closed from remote end (not directly visible in strace, since that happens with another ssh call) 2. local VM connects again to the UNIX socket (it should be paused at this moment?) 3. ssh accept()s connection and attempts to forward it 4. other end does not connect 5. error is returned to source node 3 is the SSH control channel, 4 and 5 are the UNIX sockets on the source node (nbd and memory), 6,7,8 are STDIN, STDOUT, STDERR, 9 and 10 are the accept-ed() client connections for the UNIX sockets (nbd and memory). note the shutdown and close for 10 and 9, followed by another accept on 4 with failure to connect, followed by the resume command via the tunnel and its reply. select(11, [3 4 5 6 9 10], [3], NULL, {tv_sec=300, tv_usec=0}) = 2 (in [10], out [3], left {tv_sec=299, tv_usec=999998}) read(10, "ffer\", \"size\" : 21 }, { \"name\" :"..., 16384) = 3100 write(3, "\0\0@\20\337\271Q\240\6DI\267]u\301\330\345\2632\341j\240\223\341Z\202\350r\220\261Q\304"..., 16412) = 16412 getpid() = 1540056 select(11, [3 4 5 6 9 10], [3], NULL, {tv_sec=300, tv_usec=0}) = 2 (in [10], out [3], left {tv_sec=299, tv_usec=999998}) read(10, "", 16384) = 0 shutdown(10, SHUT_RD) = 0 write(3, "\0\0\f0]\335J\351\327A\343\215\f\244x\35\227\362!\253u\225\222xSQ\312%f\202G\270"..., 3132) = 3132 getpid() = 1540056 select(11, [3 4 5 6 9], [3], NULL, {tv_sec=300, tv_usec=0}) = 1 (out [3], left {tv_sec=299, tv_usec=999998}) write(3, "\0\0\0\20O\32\347\372\227'\177\6\340\353\314\317X\10=\245\323\266nC]\341\272\376", 28) = 28 select(11, [3 4 5 6 9], [], NULL, {tv_sec=300, tv_usec=0}) = 1 (in [3], left {tv_sec=299, tv_usec=999404}) read(3, "\0\0\0\20\346Ra\334\n\354\252\304\230\340\3\270\177\260;\275\373\353\221\"\7\234\270\221", 8192) = 28 select(11, [3 4 5 6 9], [], NULL, {tv_sec=300, tv_usec=0}) = 1 (in [3], left {tv_sec=299, tv_usec=997529}) read(3, "\0\0\0\20\370u<I\215\222`U\235F\342\271SZbH\364\245\177\351\252\334e\333\0\0\0\20"..., 8192) = 56 shutdown(10, SHUT_WR) = 0 getpid() = 1540056 close(10) = 0 select(11, [3 4 5 6 9], [3], NULL, {tv_sec=300, tv_usec=0}) = 1 (out [3], left {tv_sec=299, tv_usec=999997}) write(3, "\0\0\0\20\37\213\3445(<d\vc\241\350Y\206Zn\220\261l\320\246\362\256\fn", 28) = 28 select(11, [3 4 5 6 9], [], NULL, {tv_sec=300, tv_usec=0}) = 1 (in [9], left {tv_sec=299, tv_usec=963873}) read(9, "%`\225\23\0\0\0\3\0\0\177&\203)_\0\0\0\0\0\0\0\0\0\0\0\0\0", 16384) = 28 getpid() = 1540056 select(11, [3 4 5 6 9], [3], NULL, {tv_sec=300, tv_usec=0}) = 1 (out [3], left {tv_sec=299, tv_usec=999998}) write(3, "\0\0\0000\225\275c\363\257\336\356E\310h6\335\307\363&\257\272\2377\5\35&O\264\332]\317\343"..., 60) = 60 select(11, [3 4 5 6 9], [], NULL, {tv_sec=300, tv_usec=0}) = 1 (in [3], left {tv_sec=299, tv_usec=999731}) read(3, "\0\0\0 \300\35\241Vbyzb\274\317y\1\237\f\vam\314\232\233\360\17x\364\354\331TO"..., 8192) = 44 select(11, [3 4 5 6 9], [9], NULL, {tv_sec=300, tv_usec=0}) = 1 (out [9], left {tv_sec=299, tv_usec=999998}) write(9, "gDf\230\0\0\0\0\0\0\177&\203)_\0", 16) = 16 select(11, [3 4 5 6 9], [], NULL, {tv_sec=300, tv_usec=0}) = 1 (in [9], left {tv_sec=299, tv_usec=999508}) read(9, "%`\225\23\0\0\0\3\0\0\177&\203)_\0\0\0\0\0\0\0\0\0\0\0\0\0", 16384) = 28 getpid() = 1540056 select(11, [3 4 5 6 9], [3], NULL, {tv_sec=300, tv_usec=0}) = 1 (out [3], left {tv_sec=299, tv_usec=999998}) write(3, "\0\0\0000\316\366\232\227\225\374\336I\226l\257P\317\257D\230\0\\\266\201\232+\201\20u\0215V"..., 60) = 60 select(11, [3 4 5 6 9], [], NULL, {tv_sec=300, tv_usec=0}) = 1 (in [3], left {tv_sec=299, tv_usec=999855}) read(3, "\0\0\0 \204J\273p\231\17\331\252\312\366^\207\243\252\246\236\304\300N\376\1\314\265\27\231\35iH"..., 8192) = 44 select(11, [3 4 5 6 9], [9], NULL, {tv_sec=300, tv_usec=0}) = 1 (out [9], left {tv_sec=299, tv_usec=999998}) write(9, "gDf\230\0\0\0\0\0\0\177&\203)_\0", 16) = 16 select(11, [3 4 5 6 9], [], NULL, {tv_sec=300, tv_usec=0}) = 1 (in [3], left {tv_sec=298, tv_usec=495094}) read(3, "\0\0\0\20\365\264\371\351\311.\367\242,\357\241\265\376\376\227\rj+\312\222\223\344\316\252", 8192) = 28 shutdown(9, SHUT_WR) = 0 select(11, [3 4 5 6 9], [], NULL, {tv_sec=300, tv_usec=0}) = 1 (in [9], left {tv_sec=299, tv_usec=999963}) read(9, "", 16384) = 0 shutdown(9, SHUT_RD) = 0 getpid() = 1540056 getpid() = 1540056 select(11, [3 4 5 6], [3], NULL, {tv_sec=300, tv_usec=0}) = 2 (in [4], out [3], left {tv_sec=299, tv_usec=999998}) accept(4, {sa_family=AF_UNIX}, [128->2]) = 10 fcntl(10, F_SETFD, FD_CLOEXEC) = 0 fcntl(10, F_GETFL) = 0x2 (flags O_RDWR) fcntl(10, F_SETFL, O_RDWR|O_NONBLOCK) = 0 fcntl(10, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK) getsockname(10, {sa_family=AF_UNIX, sun_path="/run/qemu-server/12341234_nbd.migrate"}, [128->40]) = 0 getsockname(10, {sa_family=AF_UNIX, sun_path="/run/qemu-server/12341234_nbd.migrate"}, [128->40]) = 0 getpeername(10, {sa_family=AF_UNIX}, [128->2]) = 0 getpeername(10, {sa_family=AF_UNIX}, [128->2]) = 0 getpid() = 1540056 write(3, "\0\0\0\20\371\31\34\304\377\210\22\213\356~\323h\367\217\3250ZK\354\31\21@o\257\0\0\0\20"..., 180) = 180 select(11, [3 4 5 6], [], NULL, {tv_sec=300, tv_usec=0}) = 1 (in [3], left {tv_sec=299, tv_usec=999600}) read(3, "\0\0\0000\226:m8\370!\251\32\224\274\v\375\267\221-\226`\277\271\206\350\313\324\312\252?\260\364"..., 8192) = 88 write(2, "channel 4: open failed: connect "..., 53) = 53 close(9) = 0 munmap(0x7f06cdafc000, 1101824) = 0 close(10) = 0 select(11, [3 4 5 6], [], NULL, {tv_sec=300, tv_usec=0}) = 1 (in [6], left {tv_sec=299, tv_usec=991723}) read(6, "resume 12341234\n", 16384) = 16 getpid() = 1540056 select(11, [3 4 5 6], [3], NULL, {tv_sec=300, tv_usec=0}) = 1 (out [3], left {tv_sec=299, tv_usec=999997}) write(3, "\0\0\0 \f\221\305P\266\317j\2454\377#\266\32$\212\234/\244\267a\366\262\336\366\250\221\354B"..., 44) = 44 select(11, [3 4 5 6], [], NULL, {tv_sec=300, tv_usec=0}) = 1 (in [3], left {tv_sec=299, tv_usec=990758}) read(3, "\0\0\0 D\7\336\260 \177C#\320\230p\327\235a\244\302z\31\354i\270\314\357e{\304~S"..., 8192) = 44 select(11, [3 4 5 6], [7], NULL, {tv_sec=300, tv_usec=0}) = 1 (out [7], left {tv_sec=299, tv_usec=999997}) write(7, "OK\n", 3) = 3 select(11, [3 4 5 6], [], NULL, {tv_sec=300, tv_usec=0}) = 1 (in [4], left {tv_sec=299, tv_usec=996422}) old to new works fine ;) > > Signed-off-by: Mira Limbeck <m.limb...@proxmox.com> > --- > PVE/QemuMigrate.pm | 18 +++++++++--------- > 1 file changed, 9 insertions(+), 9 deletions(-) > > diff --git a/PVE/QemuMigrate.pm b/PVE/QemuMigrate.pm > index 44e4c57..e569a2c 100644 > --- a/PVE/QemuMigrate.pm > +++ b/PVE/QemuMigrate.pm > @@ -1035,15 +1035,6 @@ sub phase3_cleanup { > } > } > > - # close tunnel on successful migration, on error phase2_cleanup closed it > - if ($tunnel) { > - eval { finish_tunnel($self, $tunnel); }; > - if (my $err = $@) { > - $self->log('err', $err); > - $self->{errors} = 1; > - } > - } > - > eval { > my $timer = 0; > if (PVE::QemuServer::vga_conf_has_spice($conf->{vga}) && > $self->{running}) { > @@ -1065,6 +1056,15 @@ sub phase3_cleanup { > $self->{errors} = 1; > } > > + # close tunnel on successful migration, on error phase2_cleanup closed it > + if ($tunnel) { > + eval { finish_tunnel($self, $tunnel); }; > + if (my $err = $@) { > + $self->log('err', $err); > + $self->{errors} = 1; > + } > + } > + > # always deactivate volumes - avoid lvm LVs to be active on several nodes > eval { > my $vollist = PVE::QemuServer::get_vm_volumes($conf); > -- > 2.20.1 > > > _______________________________________________ > pve-devel mailing list > pve-devel@pve.proxmox.com > https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel > > _______________________________________________ pve-devel mailing list pve-devel@pve.proxmox.com https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel