On Thu, May 10, 2012 at 1:23 AM, Gavin Panella <[email protected]> wrote:
>>> - zero downtime: rolling upgrades.
>>
>> This isn't the same - a stateful pserv will have short downtime per
>> pserv; stateless won't.
>
> I meant zero downtime across the cluster as a whole. Individual parts
> may blip but the cluster as a whole stays available. Even large schema
> changes cause only a degradation of service in one partition at a
> time.

Strictly speaking, this is downtime, and users will perceive it as
such. It's not complete downtime - Launchpad calls this 'partial
downtime', and yes it's better than complete downtime.

So a single bit store will have the schema changes requiring something
like FDT to be short-and-sweet. OTOH the database is going to be tiny:
100K nodes is 100K node rows + (say) 400K MAC rows. If every node were
to have 1K of data, both node row + MAC rows, we'd have a 100MB DB -
that's /tiny/. Add in fill factor of 25% on heap pages, call it 125MB -
still extremely small. 1M nodes -> 1.25GB DB (time to put some
dedicated RAM in the DB server).

>>> - a very high degree of scalability.
>>
>> Seems the same to me, except that we don't need to write a stateless
>> API proxy - so fewer things to create.
>>
>>> My dodgy diagram, attached, and which probably employs zero
>>> pre-existing iconographies, tries to convey some of this.
>>
>> Perhaps I'm missing something, but I don't see pserv on that diagram?
>
> Yeah, sigh, I f**ked up. The big box named MAAS with the cloud haircut
> was meant to be (web API + metadata + cobbler-assimilated).

kk

>> I don't see any particular a-priori reason to avoid having N
>> state-maintaining services cooperating to provide MAAS as a whole -
>> that's very much what I advocate - an SOA approach; but OTOH when you
>> have a state-maintaining service, that service needs an HA story, it
>> needs failure-mode management in its clients, it needs a
>> dealing-with-absent-services story, and it needs a backup story.
>> I don't think the MAAS dataset is large enough or complex enough for
>> these things to be a good tradeoff vs maintaining all your state in a
>> HA core service, with horizontally scaling helper services
>> interrogating it as you scale.
>
> Okay, that's fair. I think it will be a problem eventually. Servers
> are inexorably getting smaller.

I agree that it could be a problem eventually.

AIUI we have roughly 3 goals for development of MAAS today:
 - optimise market adoption: MAAS is an enabler for Juju, and as such
   the wider adoption MAAS gets, the wider adoption Juju can see for
   bare-metal workloads
 - deliver a system capable of robustly handling very large new clouds,
   with the next size goal being 100K nodes
 - deliver the next iteration -reliably- in 4-5 months (we need time
   for the dust to settle at the end of the cycle, last minute stuff is
   not good)

Aiming for 100K nodes supported means, to me, that we need to design
for 1M nodes supported.

A 1-2GB DB could be served, with the entire thing hot in RAM, from
extremely modest hardware. Distribute out the provisioning agents in
batches of (say) 20K nodes, and you'll have 50 provisioning agents + a
rabbit getting ~ 60 messages/second. We know rabbit scales to 10K+
messages/second.

I assume that folk won't be super-stingy for the core infrastructure
nodes - we can't expect them to buy super-big machines for things doing
this overhead, but conversely, we can expect them to be buying modern
machines and dedicating them to the task.

http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server says
"You may be limited to approximately 100 transaction commits per second
per client in situations where you don't have such a durable write
cache (and perhaps only 500/second even with lots of clients)." That is
when talking about speeds *without* a RAID cache battery, e.g.
non-hardware-RAID. PostgreSQL 9.2 has changes that aim to improve this,
with benchmarks showing 12K commits/second on the same environment.
(See the 'group commit' feature).

So, modelled like this, do you see a scaling issue with a single
control node? I don't, but if I'm missing something, I'd sure like to
know!

I totally grant that there is increased dependency on the uptime of
the core server with a centralised model. So, what could we do to
mitigate that? Well, we could combine both proposals and say:
 - cobbler dies
 - dhcp/tftp/dnsmasq will be managed via celery - a 'provider'
 - these can be run in active-active HA mode (uptime sensitive
   installs)
 - we can run HA rabbit, or two non-HA rabbits (uptime sensitive
   installs)
 - A MAAS can run many providers
 - We will provide an API proxy to talk to multiple MAAS for folk that
   want to partition their environment at the MAAS level rather than
   the provider level.

The only icky point there, then, is that authentication would be
replicated out to each MAAS provider and we'd have to do glue to do
that (and default-settings for things and so forth).

Another way to address this is to take the last bullet point there and
do what Amazon does, say something like:
 - "For HA run services in multiple regions, each region is totally
   independent."
 - A single MAAS install is a 'region'; it's moderately HA itself, it
   won't go down spuriously or casually
 - install two MAASes, and your API clients (like Juju, and yes, our
   web UI) can be told of both, and configure what they want
   appropriately.

I think doing this, and not providing a single proxy, is actually
better, for a few reasons:
 - it's a pattern cloud-api consumers are used to (see AWS :P)
 - the MAAS clusters will be truly independent, so a failure on one
   cannot cascade (e.g. via bad state updates) to any other one
 - we have less work to do.
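
To make the 'client is told of both regions' idea concrete, here is a
minimal sketch of what the client side might look like. Everything in
it is hypothetical: the Region class, pick_region(), the example URLs
and the health check are illustrative names, not part of any MAAS or
Juju API. It only shows that, with fully independent regions, the
client needs nothing more than an ordered list and a way to tell
whether a region is up.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Region:
    """One independent MAAS install (hypothetical model, not a MAAS API)."""
    name: str
    api_url: str


def pick_region(
    regions: Iterable[Region],
    is_healthy: Callable[[Region], bool],
) -> Optional[Region]:
    """Return the first region whose health check passes, else None.

    Regions share no state, so the client just moves on to the next
    one; nothing is synchronised between them.
    """
    for region in regions:
        try:
            if is_healthy(region):
                return region
        except Exception:
            # A health check that raises is treated as "region unavailable".
            continue
    return None


if __name__ == "__main__":
    regions = [
        Region("region-a", "http://maas-a.example.com/api"),
        Region("region-b", "http://maas-b.example.com/api"),
    ]
    # Stand-in health check for the demo: pretend region-a is down.
    chosen = pick_region(regions, lambda r: r.name != "region-a")
    print(chosen.name if chosen else "no region available")
```

The health check is passed in as a function so that the selection logic
stays free of any assumption about how a real client would probe a
region.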
>> I guess the key thing you allude to, is that you could in principle
>> permit provisioning to happen when the main MAAS server is AWOL, but
>> that implies some significant complexity around authentication - and a
>> state synchronisation mechanism for when MAAS itself comes back.
>
> I don't think any state synchronisation would be necessary. Well... in
> one direction only: whatever global state is needed should be pushed
> out and/or pulled by the (API+...) services. It should never move the
> other way.

There are two sets of data - the ick I refer to above:
 - usercodes, default settings, cluster wide /anything/ is one set
 - node specific data, which scale as you add nodes

If we, for instance, were to have a way of saying 'these nodes are in
group 'blue'', then that is something which has to synchronise across
all the state stores in the system, or be centralised. If it's
centralised, it needs to know when nodes are deleted; if it's not
centralised, then clients need to handle a particular sub-node being
AWOL so that they can update it when it comes back. (One simple way of
handling it is to say to the user 'try again later', but that then gets
back to 'will users be blocked when a single provisioning agent is
down?').

> Coming out of Oakland seems to be the message that MAAS should have a
> simpler - than now, even - user management story, which reduces this
> problem further.
>
> Overall, I'm suggesting not putting the important parts all in one
> place, and instead putting a unified API front (which would be the
> stupid stateless bit) on a bunch of (API+) services.

>> If we come back to the core of MAAS - a single tenant API provider for
>> provisioning hardware like a cloud, this doesn't seem justified to me:
>> even a very large environment (say 100K nodes) won't have a high
>> frequency of machine role turnover (100's of machines/minute):
>> machines will be brought up and put into openstack or hadoop, and
>> within that environment get lots of use; periodically maintenance will
>> happen, gracefully, but that's still going to be something where the
>> impact of a short outage at the MAAS controller has minimal impact.
>>
>> (Sketch numbers for my model: each piece of hardware gets deployed for
>> a month or more at a time, except for staging/test environments which
>> are a) relatively small and b) torn down and replaced a lot)
>> 100K machines
>> 100K * (at most) 12 -- <= 1.2M allocations a year
>> <= 1.2M deallocations a year
>> 525600 minutes/year
>> -> about 3 allocation-or-deallocation operations per minute, on average.
>>
>> A 10 minute outage, is about 30 queued operations. (Or 300 for a 1M
>> node provider)
>
> An imagined MAAS reseller, and its reputation, would probably want
> better. Also, it's a cloud-like environment; if a machine can be
> deployed in a few minutes then they'll be used like people use
> instances in AWS, i.e. a lot more provisioning operations that you've
> guessed at.

The simpler story we're to focus on is MAAS environment where every
user is an admin: that implies no multi-tenant environments, and resale
of a single tenant MAAS story becomes limited to the size of a single
tenant, where the risk to the reseller of a widespread outage is
limited to one client at a time.

We need to understand what users want so that we can make good
decisions about building it for them - What feedback have we had since
MAAS was announced? What sort of things are people trying to do? At
what scale do they say 'right, I'll use MAAS to bring up openstack, and
fiddle on top of *that*'.
In the absence of that data, we're reduced to putting forward what we
think users want, which is always a bit risky :)

We have though, 2 primary use cases we want to enable for Juju (here,
Juju's needs are our proxy for user needs):
 * Run up openstack on metal
 * Run up hadoop on metal

The former needs low machine turnover (basically install then forget
until BIOS upgrades are needed, and they would be rolling by nature, +
a small set of test machines for, well, openstack admin testing).

The latter also needs low machine turnover, for the same reason.

Yes, it is entirely possible there are users out there with 100K or 1M
nodes, that want to use MAAS multi-tenant, or use MAAS with large
clusters *and* high node use change rates. I propose that for the
former we advise them to bring up openstack on top of MAAS: that gives
them robust and reliable user management, quotas etc. For the latter,
let's wait and see. The foundations I'm proposing should (with hardware
RAID in the MAAS box) trivially handle 5K transactions/second through
postgresql itself, we can horizontally scale the HTTP interface, rabbit
will handle another order of magnitude messages on top of that. The
average I estimated for 1-month allocations on 1M nodes was 30
transactions/minute; that's 0.5/second - we have 5 orders of magnitude
headroom, or say lease times of an hour.

-Rob

-- 
Mailing list: https://launchpad.net/~maas-devel
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~maas-devel
More help   : https://help.launchpad.net/ListHelp
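
As a rough check on the back-of-envelope figures in this thread, the
sizing and turnover arithmetic can be written down as a small script.
This is only a sketch under the thread's stated assumptions (1K of data
per node, 25% heap overhead, at most 12 allocations per node per year,
every allocation eventually released); the function names are made up
for illustration. Counting allocations and deallocations together gives
somewhat higher per-minute rates than the rounded numbers quoted above,
but they stay in the same order of magnitude, so the conclusion about
headroom is not affected.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525600, as in the quoted sketch


def db_size_mb(nodes, kb_per_node=1.0, heap_overhead=0.25):
    """Rough database size in MB (decimal units: 1 MB = 1000 KB)."""
    return nodes * kb_per_node / 1000 * (1 + heap_overhead)


def ops_per_minute(nodes, allocations_per_node_per_year=12):
    """Average allocate-plus-deallocate operations per minute."""
    allocations = nodes * allocations_per_node_per_year
    deallocations = allocations  # assume every allocation is later released
    return (allocations + deallocations) / MINUTES_PER_YEAR


def queued_during_outage(nodes, outage_minutes):
    """Operations that would pile up while the controller is down."""
    return ops_per_minute(nodes) * outage_minutes


if __name__ == "__main__":
    for nodes in (100_000, 1_000_000):
        print(nodes, "nodes:",
              round(db_size_mb(nodes)), "MB DB,",
              round(ops_per_minute(nodes), 1), "ops/min,",
              round(queued_during_outage(nodes, 10)), "queued in 10 min")
```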

