On Fri, Dec 23, 2016 at 10:20:03AM +0100, Alexandre DERUMIER wrote:
> Hi Wolfgang,
>
> I have done more tests with drive mirror and an nbd target.
>
> It seems that the hang only occurs if the target ip is unreachable (no
> network response):
>
> # drive_mirror -n drive-scsi0 nbd://66.66.66.66/target
> ERROR: VM 183 qmp command 'human-monitor-command' failed - got timeout
>
> If the ip address exists and is up:
> # drive_mirror -n drive-scsi0 nbd://10.3.94.89:666/target
> Failed to connect socket: Connection refused
>
> I'm not sure, but maybe it can hang too if pve-firewall does a drop
> instead of a reject on the target port.
>
> I think this comes from qemu net/socket.c, where we have an infinite
> loop. I'm not sure how to add a timeout here, help is welcome :)
Doesn't the job show up in query-block-jobs before the connection finishes?
If so, we could just assume that if it's still at 0% after a timeout, the
connection doesn't work. The question then is: what is a good timeout?

Or we find (or add) a way to query the connection status, and poll that
before we start polling the block-job status. (We already poll for stuff,
so it's not worse than before ;p)

Then again, an actual timeout on qemu's side might even make it
upstream...? Maybe? (If it doesn't, I'd prefer polling as described above
over maintaining yet another qemu patch ;-) )

But it won't be a simple select/poll timeout, as they use non-blocking
sockets and register the fd with a callback for connection completion
later (net_socket_fd_init() calls qemu_set_fd_handler(), where it ends up
controlled by the aio handlers). It might be possible to add a timer
(qemu_timer_new() & friends, perhaps?) to NetSocketState (the
NetSocketState cleanup would have to remove it, and if it gets triggered
it cancels the connection instead).
> static int net_socket_connect_init(NetClientState *peer,
>                                    const char *model,
>                                    const char *name,
>                                    const char *host_str)
> {
>     NetSocketState *s;
>     int fd, connected, ret;
>     struct sockaddr_in saddr;
>
>     if (parse_host_port(&saddr, host_str) < 0)
>         return -1;
>
>     fd = qemu_socket(PF_INET, SOCK_STREAM, 0);
>     if (fd < 0) {
>         perror("socket");
>         return -1;
>     }
>     qemu_set_nonblock(fd);
>
>     connected = 0;
>     for(;;) {
>         ret = connect(fd, (struct sockaddr *)&saddr, sizeof(saddr));
>         if (ret < 0) {
>             if (errno == EINTR || errno == EWOULDBLOCK) {
>                 /* continue */
>             } else if (errno == EINPROGRESS ||
>                        errno == EALREADY ||
>                        errno == EINVAL) {
>                 break;
>             } else {
>                 perror("connect");
>                 closesocket(fd);
>                 return -1;
>             }
>         } else {
>             connected = 1;
>             break;
>         }
>     }
>     s = net_socket_fd_init(peer, model, name, fd, connected);
>     if (!s)
>         return -1;
>     snprintf(s->nc.info_str, sizeof(s->nc.info_str),
>              "socket: connect to %s:%d",
>              inet_ntoa(saddr.sin_addr), ntohs(saddr.sin_port));
>     return 0;
> }
>
>
> ----- Original Message -----
> From: "aderumier" <aderum...@odiso.com>
> To: "Wolfgang Bumiller" <w.bumil...@proxmox.com>
> Cc: "pve-devel" <pve-devel@pve.proxmox.com>
> Sent: Wednesday, December 21, 2016 13:57:10
> Subject: Re: [pve-devel] [PATCH 2/6] qemu_drive_mirror : handle multiple jobs
>
> >> Then it can still hang if the destination disappears between tcp_ping()
> >> and the `drive-mirror` command, so I'd rather get better behavior on
> >> qemu's side. It needs a time-out or a way to cancel it or something.
> Yes, sure!
>
> I'm currently looking at the qemu code to see how the nbd client works.
>
> ----- Original Message -----
> From: "Wolfgang Bumiller" <w.bumil...@proxmox.com>
> To: "aderumier" <aderum...@odiso.com>
> Cc: "pve-devel" <pve-devel@pve.proxmox.com>, "dietmar" <diet...@proxmox.com>
> Sent: Wednesday, December 21, 2016 12:20:28
> Subject: Re: [pve-devel] [PATCH 2/6] qemu_drive_mirror : handle multiple jobs
>
> > On December 21, 2016 at 10:51 AM Alexandre DERUMIER <aderum...@odiso.com>
> > wrote:
> >
> > >> IIRC that was the only blocker.
> > >>
> > >> Basically the patchset has to work *without* tcp_ping() since it is an
> > >> unreliable check, and then we still have to catch failing connections
> > >> _correctly_. (There's no point in knowing that "some time in the past
> > >> you were able to connect to something which may or may not have been a
> > >> qemu nbd server"; we need to know whether the drive-mirror job itself
> > >> was able to connect.)
> >
> > For me, the mirror job auto-aborts if the connection fails during the
> > migration. Do you see another behaviour?
>
> That covers one problem. IIRC the disk-deletion problem was that, due to
> wrong [] usage around an ipv6 address, it could not connect in the first
> place and didn't error as I would have hoped.
>
> > The tcp_ping was just before launching the drive-mirror command, because
> > it was hanging in this case.
>
> Then it can still hang if the destination disappears between tcp_ping()
> and the `drive-mirror` command, so I'd rather get better behavior on
> qemu's side. It needs a time-out or a way to cancel it or something.

_______________________________________________
pve-devel mailing list
pve-devel@pve.proxmox.com
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel