Hi, * Hrvoje Ribicic <[email protected]> [2014-01-22 11:47:03 +0100]:
> On Mon, Jan 20, 2014 at 4:47 PM, Dimitris Aragiorgis <[email protected]>wrote: > > > Hi all! > > > > > > Recently we observed some strange TypeErrors in node-daemon.log related to > > QMP message handling. Specifically the exceptions were inside > > GetInstanceInfo(). > > > > Please do the following test: > > > > 1) Client A connects to qmp socket with socat > > 2) Client A gets greeting message {"QMP": {"version": ..} > > 3) Client A waits (select on the socket's fd) > > 4) Client B tries to connect to the *same* qmp socket with socat > > 5) Client B does *NOT* get any greating message > > 6) Client B waits (select on the socket's fd) > > 7) Client B closes connection (kill socat) > > 8) Client A quits too > > 9) Client C connects to qmp socket > > 10) Client C gets *two* greeting messages!!! > > > > This case is reproducable in Ganeti: > > 1) Allocator runs and invokes GetAllInstancesInfo() > > 2) GetInstanceInfo() creates a QMPConnection() > > 3) Meanwhile a client wants to get instance runtime info > > 4) GetInstanceInfo() creates a new QMPConnection() but _Recv() blocks when > > trying to read from socket and a timeout exception is raised. This > > is catched and raised as HypervisorError which is *silently* catched inside > > GetInstanceInfo() > > 5) First connection closes > > 6) When a third connection will arive it will get *two* greetings! > > 7) The first greeting will be consumed inside connect() > > 8) The second will be consumed as response to qmp_capabilities execute > > command > > 9) info cpus will get {'return': {}}!! which will result to TypeError. > > > > After digging qemu source we found that monitor's output buffer is > > shared between clients. Only one client can interact with the monitor. > > The others are considered as blocked. Still their greeting message is > > written to the output buffer and a callback is registered to flush the > > message to the socket when possible. This means that if a client closes > > the connection before getting the greeting, these data remain in the > > buffer and will be flushed upon a future connection. > > > > To fix this on the Ganeti side, first of all, a finally: section that > > closes > > the socket must be added for each QMPConnection(), that practically leaves > > smaller timeslot for more than one concurrent connections. Additionally we > > should parse as many greetings available when starting the connection, > > and finally parse the output of qmp_capabilities command. > > > > Do you agree with all the above? > > > > Should I proceed with a patch? Against stable 2.9? > > > > I believe that the bug should be reported to the QEMU community to hear > their response first. As you have more details on how to do this and a > reproduction case easily available, I hope you might be able to do it? > If not, I can do the deed as well. > Sorry for the delay. We were doing some digging in the QEMU source in order to find out why this is happening. Yes, we were already planning to report this misbehavior on qemu-devel today, so we will have news soon. > But after this is very likely registered as a bug, a patch would be > welcome. If this happens early enough, even stable-2.8 could be a good > target for the patch - I'll try and reproduce the bug there. > > Primarilly I thought 2.9 (and not 2.8) because this is the debian wheezy version where qemu 1.7 is available too. I was wondering if a quick and dirty workaround is the better way to go. I mean the interaction with qmp inside GetInstanceInfo() is kinda "optional" because cpu/memory values are also statically available. So if we catch this TypeError there (and add some long comment) it wouldn't make so much difference. But there is also the QmpConnection() related to spice password.. A proper fix with the current code is more difficult since we have to parse unknown number of greeting messages upon qmp connect() until we receive a socket.timeout, meaning that there is no other data available (since socket is blocking). I think the cost will be unacceptable! What we could do is flush the entire buffer after parsing the first greeting message to flush. And then execute "qmp_capabilities". > > > > PS: All this have been tested with QEMU 1.7 > > As Apollon said this is NOT reproducable with QEMU 1.1. Which means this is certainly a bug introduced in later versions. Let's hope it gets fixed upstream soon enough.. Since the code inside QEMU is shared, this misbehavior can be observed with QEMU monitor as well (but has no real effect on Ganeti, since we do not parse any greeting messages there). > > Thanks for the detailed analysis! > :) So what do you think? Cheers, dimara PS: Note that in 2.10, along with the hotplug implementation, QMP related code has been refactored since MonitorSocket class has been introduced. So this will need a bit of effort on your side to forward port it.
signature.asc
Description: Digital signature
