Lars Marowsky-Bree wrote:
On 2006-05-19T13:28:08, Alan Robertson <[EMAIL PROTECTED]> wrote:

Just a quick note because I don't want to appear unresponsive, but I don't think I'll have much time for this thread within the next 1-2 weeks at least, or none which can do the discussion justice, because of our release schedules.

(Which also causes my gut to feel that we should still be mostly in a consolidation phase, polishing/finishing 2.0.x and cleaning up the code base instead of already starting to develop major new features - but that's just me, and may be caused by release stress.)
Probably. But, we have development schedules to meet. We can't sit on our hands until SUSE gets their act together ;-)
With that disclaimer, just some real quick notes:

You mean a cluster of clusters, I think? My first guess is that this will likely have to wait until release 3.x. One could design a new, separate system for this on top of our current code if one had the resources. We don't. I believe it is more appropriate to treat these clusters as stacked/layered clusters.

I'm not entirely sure I agree here, for two reasons. The first is that I don't think transforming our stack to support layered clusters will take 5 staff years. The initial phases don't require much more than that the top-level cluster treat each site or virtualized cluster member as a resource, and that the lower-level cluster have some way of tying into the top level for fencing and some membership computation. Actually, we already provide some of these mechanisms (i.e., for a top-level cluster to treat us as a resource, say using CIM to control us).
OK. I _know_ it will cost about 3 staff months just to update the core heartbeat code. I've looked at it several times in the past. Similarly, I'd expect a comparable cost for each of our dozen or so other components - for the same reason. Now we're up to roughly 3 staff years just to get started.
And probably another 2 staff months to make all the pieces fit back together again, and a couple more to fix the problems that show up in testing.
Now you get to start writing the cluster-of-clusters code itself, and 40 staff months have already been spent. And this is a HUGE instability risk.
(The one thing which is somewhat more difficult right now is if we want to run instances from several layers within the same OS instance, because our various communication channels/FIFOs etc. don't support that. But that doesn't occur for Xen/VMware-style layered clusters, and could also be addressed quick-and-dirty for DR clusters by having the "top level" run on different nodes. But that's mostly an implementation detail and not a major design consideration.)
AFAIK, this wouldn't work. It's my understanding that the top layer has to be a member of the lower layer(s). In fact, it probably has to be a "resource" of the lower layer...
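To make the layering relationship concrete, here is a minimal sketch of the model being debated: the top-level cluster treats each lower-level cluster (a site, or a virtualized cluster member) as a single resource, and "fencing" at the top layer means stopping that whole lower cluster. All class and method names here are invented for illustration - this is not Heartbeat's actual API.

```python
# Hypothetical sketch (not Heartbeat's real API): a top-level cluster treating
# each lower-level cluster as one resource, as discussed above.

class LowerCluster:
    """A whole lower-level cluster, seen by the top layer as one resource."""
    def __init__(self, name, nodes):
        self.name = name
        self.nodes = list(nodes)
        self.running = False

    # The top layer drives it like any other resource: start/stop/monitor.
    def start(self):
        self.running = True

    def stop(self):
        # Stopping the "resource" == taking down the whole site.
        self.running = False

    def monitor(self):
        return "running" if self.running else "stopped"


class TopCluster:
    """Top-level cluster whose 'members' are entire lower-level clusters."""
    def __init__(self, members):
        self.members = {m.name: m for m in members}

    def fence(self, name):
        # Fencing escalates: the top layer stops the entire lower cluster
        # rather than STONITHing individual nodes inside it.
        self.members[name].stop()

    def membership(self):
        return [n for n, m in self.members.items() if m.monitor() == "running"]


site_a = LowerCluster("site-a", ["a1", "a2"])
site_b = LowerCluster("site-b", ["b1", "b2"])
top = TopCluster([site_a, site_b])
site_a.start(); site_b.start()
top.fence("site-b")           # site-b is fenced as a unit
print(top.membership())       # -> ['site-a']
```

The point of the sketch is the one made in the thread: the lower cluster appears in the top layer's model as a resource, which is why the top layer arguably has to be tied into (or be a member of) the layers below it.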
The second argument is more one of priorities: if we don't have the resources to do it properly, we might declare it "out of scope" for the time being (say, for 2.0.x) and instead focus our resources on furthering what we're really good at, namely being a robust cluster for a single site. (But, say, encourage others to continue to develop software to integrate with us as such.) whose name I forgot.

Stretch clusters are, to the best of my knowledge, somewhat limited because they essentially pretend that the structure is flat, but it isn't in practice...

What kinds of limitations did you have in mind here?

Basically, most of them stem from treating a discontinuous stretched cluster as a single one - sort of like treating an NC-NUMA architecture as CC-SMP. It begins with the overhead of heartbeating over the WAN link and its different latencies compared to the LAN heartbeat, continues with the differences in STONITH, the fact that sites may differ in their configuration (rarely are they 100% identical - though we can provide for this better than other stretched clusters via rule-affected instance attributes) and fault isolation between the sites, and ends somewhere with the fact that you want them to be independent for anything local.
STONITH doesn't work in a stretch cluster. Or at least not reliably. Other techniques have to be used. It is worth noting that having a STONITH failure be considered a non-blocker for going on is still a good option to consider for local clusters (it has been requested before).
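One hedged way to model the "STONITH failure as a non-blocker" option mentioned above: attempt fencing, and when it fails (e.g. the remote site's STONITH device is unreachable over the WAN), fall back to a configured policy instead of blocking recovery forever. The function names and policy strings below are invented for illustration and are not Heartbeat's real configuration interface.

```python
# Hypothetical sketch of "STONITH failure as a non-blocker": try to fence,
# and if fencing fails, consult a configured fallback policy rather than
# blocking recovery indefinitely. Names/policies are illustrative only.

def stonith(node, device_reachable):
    """Pretend fencing primitive: succeeds only if the device is reachable."""
    return device_reachable

def recover(node, device_reachable, on_fence_failure="block"):
    if stonith(node, device_reachable):
        return "fenced; safe to take over resources"
    if on_fence_failure == "proceed":    # the "non-blocker" option
        return "fencing failed; proceeding anyway (accepting the risk)"
    if on_fence_failure == "manual":     # human-override variant
        return "fencing failed; waiting for operator confirmation"
    return "fencing failed; recovery blocked"   # conservative default

# In a stretch cluster the remote STONITH device is often unreachable:
print(recover("node2", device_reachable=False, on_fence_failure="proceed"))
```

The "manual" branch corresponds to the human-override idea that comes up later in the thread; the trade-off is that "proceed" accepts a split-brain risk that a local cluster with working STONITH never has to take.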
Actually, your initial steps plus some of the optional ones (human override, among others) I consider useful regardless of their application to stretched clusters.
Me too :-) Thanks!
[Resource-level quorums (quora?) would be an even nicer extension; that would be awesome for stretch clusters, but I'm not yet ready for that (nor ready to ask Andrew for it) either.]

Resource-level quorum - or better put, a quorum hierarchy affecting different parts of the system - actually _is_ a sort of layered cluster, just all globbed into a single flat domain (for everything but quorum) ;-) This is already quite nice for a single site, though - where you don't want the fact that some nodes went down to affect resources which wouldn't have touched those nodes anyway. So yes, I guess that's actually a quite reasonable thing to ask.

If it really takes 5-10 staff years to implement clusters of clusters, achieving the same functionality by extending the various parts so that a stretched cluster works better will likely not be all that much cheaper.
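The resource-level quorum idea above can be sketched in a few lines: compute quorum not over the whole cluster membership, but per resource, over the set of nodes that resource is allowed to run on. Then a partition that only loses nodes irrelevant to a resource does not take that resource down. This is a toy illustration of the concept, not an existing Heartbeat feature; all names are made up.

```python
# Hypothetical sketch of per-resource quorum: a resource has quorum iff a
# strict majority of the nodes it is *allowed* to run on are in the current
# partition - so losing nodes it would never touch doesn't stop it.

def has_quorum(allowed_nodes, partition):
    alive = len(set(allowed_nodes) & set(partition))
    return alive > len(allowed_nodes) / 2   # majority of *its* nodes, not all

partition = {"n1", "n2"}          # n3 and n4 are unreachable
web = ["n1", "n2", "n3"]          # web may run on n1-n3
db  = ["n3", "n4"]                # db is pinned to n3/n4

print(has_quorum(web, partition))  # True:  2 of web's 3 nodes are alive
print(has_quorum(db, partition))   # False: 0 of db's 2 nodes are alive
```

This is exactly the "quorum hierarchy" framing: the flat cluster keeps one membership, but quorum decisions are scoped to the sub-domain each resource actually lives in.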
It is cheaper. It is _much_ cheaper. IBM does a lot of stretch clusters for customers. I've asked some of the deployment experts I know of (and they're REALLY good at this) to look over the proposal.
And, pointing to prior art - not just in how other products implement this (and I know "Blueprints for High Availability" has something very useful to say on that topic too, but I have given away my copy to a colleague right now :-/), but in how other areas have implemented this (think Internet routing architecture, grid architecture, or CPU schedulers/memory allocators for NC-NUMA) - you'll find that _very_ few scenarios have opted to pretend that a discontinuous/hierarchical structure is flat.
I'll look for this in "Blueprints" - I still have my copy. That's a great suggestion. Thanks!
(Even the catholic church eventually conceded that the earth is round ;-)
s/catholic/Roman Catholic/ ;-)

<pedantic-note>catholic (lower case) just means "universal". Roman Catholic is the proper name for your intent.</pedantic-note>
But, of course, I'm fine with these enhancements being done anyway, as long as they don't negatively impact what I still consider our primary focus (local-area clusters), neither in code nor in the resource staffing. As I said, I don't really have time for this discussion in the depth it deserves right now, and for that I apologize. But, I _do_ have opinions I'd like to bring in ;-) I hope it is not too urgent.
This is a topic I've brought up in the past. It has been on our radar screen for a year or more. You think the things we're going to do will be useful anyway. I agree.
I'm going to have stretch cluster experts evaluate our proposal anyway - because they speak from extensive personal experience. Their opinion is worth much more than mine - and probably even worth a little more than yours (if you'll forgive me for saying so).
Let's see what they have to say.
--
Alan Robertson <[EMAIL PROTECTED]>
"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William
Wilberforce
_______________________________________________________
Linux-HA-Dev: [email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
