Hi folks

As you may know, Poppy is the Twisted-based FTP/SFTP server for uploading packages to Soyuz. I recently landed a change to fix its logging (along with a few other Twisted-based services such as the librarian, branch-puller, etc.) so that it uses the python-oops stuff correctly.

It was released last Friday, and within 10 minutes the instance on the PPA machine (germanium) went into a weird state where it was unable to contact the xmlrpc-private auth service running on the appservers, so all SFTP requests failed.

Here is an example from the log of an unsuccessful XMLRPC request:

2011-12-16 10:27:38+0000 [SSHService ssh-userauth on KeepAliveSettingSSHServerTransport (TimeoutProtocol)] Starting factory <twisted.web.xmlrpc._QueryFactory instance at 0x7b53248>
... wait ...
2011-12-16 10:28:08+0000 [-] [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.TimeoutError'>: User timeout caused connection failure. ]
2011-12-16 10:28:08+0000 [-] udienz failed auth publickey
2011-12-16 10:28:08+0000 [-] unauthorized login: unable to get avatar id
2011-12-16 10:28:08+0000 [-] Stopping factory <twisted.web.xmlrpc._QueryFactory instance at 0x7b53248>

And here is one that works:

2011-12-16 10:15:11+0000 [SSHService ssh-userauth on KeepAliveSettingSSHServerTransport (TimeoutProtocol)] Starting factory <twisted.web.xmlrpc._QueryFactory instance at 0x6737128>
2011-12-16 10:15:11+0000 [QueryProtocol,client] Stopping factory <twisted.web.xmlrpc._QueryFactory instance at 0x6737128>

Because it's a 30 second timeout, this error message in my experience indicates that the TCP SYN packet is not being ACKed (timeouts for open connections are much, much longer). However, restarting the Poppy instance makes things work again, so I'm not sure whether it's a code problem or an infrastructure problem.

We are currently running a very old revision of the code on germanium, so this is blocking further rollouts there.

Oddly, this only affects the PPA machine, not the Poppy on cocoplum (the Ubuntu machine). I've also blasted hundreds of connections at the dogfood box to try to make it fail, and it doesn't.
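
On the 30 second timeout reasoning above: one way to test the SYN theory independently of Poppy might be a bare TCP connect from germanium to the auth service while Poppy is in the bad state. The following is only a minimal sketch using the Python standard library, not Launchpad or Twisted code; the function name `probe_tcp_connect` and the host and port passed to it are hypothetical.

```python
import errno
import socket
import time


def probe_tcp_connect(host, port, timeout=30.0):
    """Attempt a bare TCP connect and classify the outcome.

    Returns (outcome, elapsed_seconds), where outcome is one of
    "connected", "refused", "timeout" or "error:<errno name>".
    The 30 second default mirrors the timeout seen in the log above.
    """
    start = time.monotonic()
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.timeout:
        # No reply to the connection attempt within `timeout` seconds.
        outcome = "timeout"
    except ConnectionRefusedError:
        # The peer answered with a reset, so packets are getting through.
        outcome = "refused"
    except OSError as exc:
        # Anything else (unreachable network, DNS failure, ...).
        outcome = "error:%s" % errno.errorcode.get(exc.errno, exc.errno)
    else:
        sock.close()
        outcome = "connected"
    return outcome, time.monotonic() - start
```

If a call such as `probe_tcp_connect(auth_host, auth_port)` (with whatever host and port the xmlrpc-private service really uses) returns "timeout" after the full wait while Poppy is failing, that would point at the network path rather than Poppy's code. If it returns "connected" straight away, the problem is more likely inside the Poppy process itself.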

It's also worth noting that the instance on germanium occasionally has problems contacting the keyserver when it's trying to verify GPG signatures, which requires a restart to fix.

Since I'm at a total loss as to what to do next, I am going to put the latest code back on germanium tomorrow and run it in production again so I can gather more data when it goes wrong.

In the meantime, if anyone can come up with any ideas on how to figure out what's going on here, I'd really appreciate it!

Cheers
J