I suspect some of my recent emails have seemed to jump all over the place - and on the surface that is so. However, they are all tied together at a lower layer. I don't want to cause confusion, so it's time to tie them together and share the patterns I'm seeing as I review our architecture during this early bootstrap period. I'm largely seeing things on demand as issues crop up, but nevertheless I'm getting (I think) decent coverage.
The brilliant:
- many folk have come up to me and said words roughly equivalent to 'wow, I thought I was alone in caring about performance/downtime/etc'. With the number of folk who want to really have us shine in this area, I have /no/ doubt that we'll achieve it. Launchpad is no worse off than bzr was back before performance was made into a key development metric, and like bzr, I expect rapid improvement in Launchpad as we start to assess things more critically.
- Launchpad /is/ very functional and does many things its users want. So much so that users want to add more and more things to Launchpad :) - this was a common theme at the Epic. I look forward to having our system so good that we can rapidly serve these user requests.

The good:
- we have some very powerful diagnostic tools, and they are improving.
- much of our system has a solid scaling and availability story: we have only ~5 action items to get no-downtime upgrades, and only one of them needs non-trivial development.
- our code base is really very approachable; for all that it's of a fairly decent size, the chains to find causes of issues are pretty shallow.

The bad:
- we have immensely strong coupling occurring in the system. Recently observed pain points:
  * The DB uses triggers, which makes the ORM <-> DB layer more fragile and less direct (10 hours of testfix due to a storm bug only possible with triggers).
  * It is non-trivial to do out-of-transaction events: actions are very tightly coupled to their context, either in the DB or in the webapp. The jobs systems have high enough friction that they aren't the first tool developers reach for, and so they aren't immediately useful.
  * Actioning a configuration change takes approximately 2 hours, unless the stock process is bypassed, in which case it takes only 15 minutes!
- related to the coupling story, we are missing fairly standard infrastructure for an internet-scale system: a queuing system (Jeroen is a great person to talk to about rabbit, with his MQSeries experience); high-relevance searching; a system status dashboard; automated rollouts; write scalability [e.g. sharding/partitioning]; callbacks to user code. Many of these are coming, or are having their requirements assessed at the moment.

The ugly:
- we have really high friction around making changes, which leads both to not doing small tweaks and to big changes which are high risk, which leads to... more friction. The new merge and deployment stories will help a lot, but I also think we need to really just make it easy to improve things. Curtis gave a great lightning talk at the Epic covering how small changes led to him doing the most bug fixes per month-long cycle: we should all do more of that.
- we have interlinked performance problems; the DB is a choke point for writes, and we write a lot - enough that when a backup goes wrong, we have a timeout spike on lpnet and edge, because we have little headroom. Queries that take 6000ms on staging (when in cache) take 14000ms on prod slaves, and 24000ms or more on prod main: we're running into contention - we have so much load that things are slower just because of the load. And we have operations that take seconds to complete, which adds to the load. Further, because things are slow, it's very hard to spot new slowdowns, because the normal situation is slow.
- we have pages for which the minimum time to complete is more than 5 seconds on the server. Server render time is not a great surrogate for user experience - there are many things which can go wrong when delivering content to users; however, great server render times are a necessary condition for a great user experience.
- we have baked-in scalability issues in some areas, which will take time to track down and fix.
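To make the coupling point concrete, here's a minimal sketch (hypothetical names; this is not our actual jobs API) of what a low-friction out-of-transaction event could look like: the request handler only enqueues a small, self-describing job, and a worker performs the slow work later, outside the request's transaction.

```python
import json
import queue

# Illustrative stand-in for a persistent broker such as rabbit.
job_queue = queue.Queue()

def handle_request(bug_id, subscribers):
    """Inside the request: do the cheap transactional work, then defer."""
    # ... update the bug row in the DB here ...
    # Deferring the expensive fan-out is one cheap call:
    job_queue.put(json.dumps(
        {"action": "notify", "bug_id": bug_id, "recipients": subscribers}))
    return "202 Accepted"

def run_worker():
    """Outside the request: drain jobs in a separate process/transaction."""
    actions = []
    while not job_queue.empty():
        job = json.loads(job_queue.get())
        # ... send notification email / update the search index here ...
        actions.append(job["action"])
    return actions
```

In production the in-memory queue would be a real broker so jobs survive restarts; the point of the sketch is that enqueueing has to be a single cheap call, or developers won't reach for it.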
I'm going to put the design guidelines I proposed at the Epic online next week, and after that start working on scaling/performance guidelines as a specific subtopic. I hope the above all ties together well; the emails about different bits of the system I've been sending out have been largely driven by specific scaling issues I've uncovered as I dig into the search performance / relevance story. The specific things I'm suggesting changes to are things where Launchpad is slow *because* of how we've solved engineering / design challenges, rather than because of the sheer number of users we have.

As I said at the Epic, if we don't focus, we'll churn and have a hard time doing anything; however, when there are multiple interlocking causes preventing a problem from being solved, we will need to spread out and solve them all: like stop-the-line in Lean, the first *really fast, scalable* thing in a system is the hardest. (excluding +opstats, ok?)

Right now, my personal focus is on three things, with no well-defined priority between them:
- lowering the hard timeout [ensuring we don't have requests hogging resources, failing faster when we fail, giving us a back-stop to prevent creeping slowness]
- search performance [one of the key pages that fails a lot and is blocking hard-timeout lowering is searching]
- our development story [the slower we iterate, the slower we improve. This includes the RFWTAD QA/deployment story, the new landing system Gary proposed, test suite overhead, etc.]

Of course, I have a fourth thing, which is more important than those three: helping you guys solve problems in design or implementation; I've done a bit of this so far, and I'm keen to do more.

Cheers,
Rob
_______________________________________________
Mailing list: https://launchpad.net/~launchpad-dev
Post to : [email protected]
Unsubscribe : https://launchpad.net/~launchpad-dev
More help : https://help.launchpad.net/ListHelp

