On Fri, 19 Nov 2010 15:17:15 +1300, Michael Hudson <[email protected]> wrote:
> On Fri, 19 Nov 2010 01:16:26 +0000, Max Bowsher <[email protected]> wrote:
> > It would appear that, as far as I can see, no code imports have
> > successfully completed since some time on 2010-11-15.
> >
> > The code import machines are all full of running jobs, all of which
> > appear to sit around doing nothing much, until they get canceled an hour
> > after they start.
> >
> > Because every dispatched importd job is now taking 60 minutes, which is
> > much longer than the average when things work properly, there are now
> > *very many* imports which are queuing for an importd execution, hence
> > the web UI is displaying "The next import is scheduled to run as soon as
> > possible." for imports which won't actually be attempted for hours?
> > days? (and will then fail.)
>
> I've got a few more bits of information to add to this. The logs look
> like this:
>
> 2010-11-18 22:35:30 INFO [chan bzr SocketAsChannelAdapter] Opened sftp connection (server version 3)
> 2010-11-18 22:35:32 INFO [chan bzr SocketAsChannelAdapter] Opened sftp connection (server version 3)
> Exception KeyboardInterrupt: KeyboardInterrupt() in <function terminate at 0x9ba86bc> ignored
> 2010-11-18 23:35:35 INFO [chan bzr SocketAsChannelAdapter] Opened sftp connection (server version 3)
> 2010-11-18 23:35:35 INFO [chan bzr SocketAsChannelAdapter] Opened sftp connection (server version 3)
> Import failed:
> Traceback (most recent call last):
> Failure: twisted.internet.error.TimeoutError: User timeout caused connection failure.
>
> The "Opened sftp connection" log lines are new-ish, but they have been
> present for at least a few weeks, so they are not that closely related
> to the issue.
>
> The issue appears on staging too, which suggests to me that it is more
> likely to be a code change than an environmental one.
>
> The importd user on the importd slaves can still sftp to the central
> store, at least in a trivial way.
>
> Although the problem appeared soon after the dustup with the XML-RPC
> service over the weekend, it doesn't actually seem to be related: there
> were some successful imports after all that drama.
>
> There is no LOSA around today, which makes finding more information
> hard.
>
> Now some guesswork.
>
> There was a nodowntime rollout on the 15th. I bet it introduced the
> problem.
>
> My utter WAG is that it was the upgrade to bzr 2.2.1 that caused the
> problem.

My guesses were correct. Thanks to a friendly sysadmin, we're now
running 2.2.0 on the code import slaves again, and the backlog is being
churned through.

https://bugs.launchpad.net/bzr/+bug/677305 seems to have been the
underlying problem.

Cheers,
mwh

_______________________________________________
Mailing list: https://launchpad.net/~launchpad-dev
Post to : [email protected]
Unsubscribe : https://launchpad.net/~launchpad-dev
More help : https://help.launchpad.net/ListHelp

