FYI, I figured this out in the end. It turns out that writing OOPSes is deferred to a thread pool. However, the AMQP port was not open on the firewall, so once 10 OOPSes had queued up, Poppy hung.
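For anyone who wants to see the failure mode in isolation, here is a minimal sketch using only the Python standard library, not Poppy's or python-oops' actual code. It assumes the pool is capped at 10 worker threads (which would match the 10 queued OOPSes) and that each OOPS write blocks for as long as the firewall drops the connection. The names publish_oops, other_work and firewall_open are made up for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

POOL_SIZE = 10  # assumption: a 10-thread pool, matching the 10 queued OOPSes

# Stands in for the firewall: while the event is unset, every "publish" blocks.
firewall_open = threading.Event()


def publish_oops(oops_id):
    """Hypothetical stand-in for sending one OOPS report over AMQP."""
    firewall_open.wait()  # blocks until the "firewall" is opened
    return oops_id


def other_work():
    """Hypothetical stand-in for unrelated work that shares the same pool."""
    return "done"


def demo():
    pool = ThreadPoolExecutor(max_workers=POOL_SIZE)
    # Ten blocked publishes occupy every worker thread.
    oops_futures = [pool.submit(publish_oops, i) for i in range(POOL_SIZE)]
    # The next job is queued, but no worker is free to run it.
    starved = pool.submit(other_work)
    try:
        starved.result(timeout=0.5)
        stalled = False
    except FutureTimeout:
        stalled = True
    # Opening the "firewall" lets the publishes finish and frees the workers.
    firewall_open.set()
    result = starved.result(timeout=5)
    pool.shutdown()
    return stalled, result, [f.result() for f in oops_futures]


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is only that a bounded pool whose workers are all stuck on a blocked connection will starve anything else queued behind them, and that everything recovers once the blocked calls return.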
D'oh ...

On Tuesday 20 December 2011 15:30:28 Julian Edwards wrote:
> Hi folks
>
> As you may know, Poppy is the Twisted-based FTP/SFTP server for uploading
> packages to Soyuz. I recently landed a change to fix its logging (along with
> a few other Twisted-based services such as the librarian, branch-puller
> etc) so that it uses the python-oops stuff correctly.
>
> It was released last Friday and within 10 minutes the instance on the PPA
> machine (germanium) went into a weird state where it was unable to contact
> the xmlrpc-private auth service running on the appservers, and hence all
> SFTP requests fail.
>
> Here is an example from the log of an unsuccessful XMLRPC request:
>
> 2011-12-16 10:27:38+0000 [SSHService ssh-userauth on
> KeepAliveSettingSSHServerTransport (TimeoutProtocol)] Starting factory
> <twisted.web.xmlrpc._QueryFactory instance at 0x7b53248>
> ... wait ...
> 2011-12-16 10:28:08+0000 [-] [Failure instance: Traceback (failure with no
> frames): <class 'twisted.internet.error.TimeoutError'>: User timeout caused
> connection failure.
> ]
> 2011-12-16 10:28:08+0000 [-] udienz failed auth publickey
> 2011-12-16 10:28:08+0000 [-] unauthorized login: unable to get avatar id
> 2011-12-16 10:28:08+0000 [-] Stopping factory
> <twisted.web.xmlrpc._QueryFactory instance at 0x7b53248>
>
> And here is one that works:
>
> 2011-12-16 10:15:11+0000 [SSHService ssh-userauth on
> KeepAliveSettingSSHServerTransport (TimeoutProtocol)] Starting factory
> <twisted.web.xmlrpc._QueryFactory instance at 0x6737128>
> 2011-12-16 10:15:11+0000 [QueryProtocol,client] Stopping factory
> <twisted.web.xmlrpc._QueryFactory instance at 0x6737128>
>
> Because it's a 30 second timeout, this timeout error message is indicative
> in my experience of the TCP SYN package not being ACKed (timeouts for open
> connections are much, much longer). However, restarting the Poppy instance
> will make things work again, so I'm not sure whether it's a code problem or
> an infrastructure problem.
>
> We are currently running a very old revision of code on germanium so it's
> blocking further rollouts on there. Oddly, this only affects the PPA
> machine, not the Poppy on cocoplum (the Ubuntu machine). I've also blasted
> hundreds of connections at the dogfood box to try and make it fail, and it
> doesn't. It's also worth noting that the instance on germanium also
> occasionally gets problems contacting the keyserver when it's trying to
> verify GPG signatures, which requires a restart to fix.
>
> Since I'm at a total loss as to what to do next, I am going to put the
> latest code back on germanium tomorrow and run it in production again so I
> can gather more data when it goes wrong.
>
> In the meantime, if anyone can come up with any ideas on how to figure out
> what's going on here I'd really appreciate it!
>
> Cheers
> J
>
> _______________________________________________
> Mailing list: https://launchpad.net/~launchpad-dev
> Post to : launchpad-dev@lists.launchpad.net
> Unsubscribe : https://launchpad.net/~launchpad-dev
> More help : https://help.launchpad.net/ListHelp