On Wed, 2020-02-12 at 15:39 +0100, Jehan-Guillaume de Rorthais wrote:
> Hi,
>
> As the PAF RA maintainer, I would like to discuss (sorry, again)
> something really painful: master scores and infinity.
>
> PAF is an RA for PostgreSQL. The best-known value to pick a master is
> PostgreSQL's LSN (Log Sequence Number), which is a 64-bit incremental
> counter. The LSN reflects the volume of data written to the databases
> since the instance creation.
>
> Each instance in the cluster (promoted or standby) reports its own
> LSN:
> * the promoted instance reports its last written LSN
> * standbys report the last LSN they received
>
> That's why the LSN is the natural "master score" when there is no
> promoted clone around. The lag of a standby is therefore measured in
> bytes, based on this LSN.
>
> Pacemaker master scores must fit between -1000000 and 1000000.
> Mapping the LSN onto this range is impossible. Even if we could
> gather the LSN diff between standbys (which would require a shared
> variable somewhere), the range would be too small: 1000000 is only
> 1MB worth of lag. If we consider the minimal size of records in this
> log sequence, we could stretch this to 24MB, but it's still way too
> small compared to, e.g., a network-bound workload where a standby can
> lag by much more than a few MB.
>
> Because of this, we use (and abuse for other purposes) notifications
> to elect the best standby:
>
> 0.   Pacemaker decides to promote one clone
> 1.1. during pre-promote, every clone sets its LSN as a private
>      attribute
> 1.2. the clone-to-promote tracks which clones take part in the
>      election in a private attribute
> 2.   during the promotion, the clone-to-promote compares its LSN with
>      the LSN set in 1.1 for each clone tracked in 1.2
> 3.   if one clone's LSN is greater than the local LSN:
> 3.1. it sets a greater master score for the best candidate
> 3.2. it returns an error
> 3.3. Pacemaker loops back to 0
>
> Higher bounds for ±INF would help a lot to make this simpler. After
> the primary is confirmed dead, every standby could just report how
> far it is from the latest checkpoint published by the master a few
> seconds or minutes ago.
>
> INT_MAX would set the working interval to ±2GB. Producing 2GB worth
> of data in a few seconds/minutes is possible, but considering the
> minimal XLOG record size, this would push it to 48GB. Good enough, I
> suppose.
>
> INT64_MAX would set the working interval to... ±8EB. Here, no matter
> what you choose as a master score, you still have some safety :)
>
> So do you think this is something worth working on? Are there any
> traps along the way that would forbid using INT_MAX or INT64_MAX?
> Should I try to build a PoC to discuss it?

I do think it makes sense to expand the range, but there are some fuzzy
concerns. One reason the current range is so small is that it allows
summing a large number of scores without any possibility of integer
overflow. I'm not sure how important that is to the current code, but
verifying it would take a lot of tedious inspection of the scheduler
code.

In principle a 64-bit range makes sense to me. I think "INFINITY"
should be slightly less than half the maximum, so that at least two
scores can be added without concern, and then we could try to ensure
that we never add more than two scores at a time (using a function
that checks for infinity). Alternatively, if we come up with a code
object for "score" that has a 64-bit int and a separate bit flag for
infinity, we could use the full range.
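To make that second idea a bit more concrete, here is a rough sketch
of what such a score object might look like. This is purely
illustrative: none of these names exist in the code today, and it
glosses over how ±INFINITY would be parsed, serialized, and displayed.

/* Hypothetical "score" object: a 64-bit value plus an explicit
 * infinity flag, with an addition helper that saturates rather than
 * overflowing. Sketch only; not existing Pacemaker code.
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    int64_t value;     /* meaningful only when infinite == 0 */
    int     infinite;  /* -1 = -INFINITY, 0 = finite, +1 = +INFINITY */
} score_t;

static score_t
score_add(score_t a, score_t b)
{
    score_t sum = { 0, 0 };

    if ((a.infinite < 0) || (b.infinite < 0)) {
        sum.infinite = -1;  /* -INFINITY wins, as in current score math */

    } else if ((a.infinite > 0) || (b.infinite > 0)) {
        sum.infinite = 1;

    } else if ((b.value > 0) && (a.value > INT64_MAX - b.value)) {
        sum.infinite = 1;   /* finite overflow saturates to +INFINITY */

    } else if ((b.value < 0) && (a.value < INT64_MIN - b.value)) {
        sum.infinite = -1;  /* ... and underflow to -INFINITY */

    } else {
        sum.value = a.value + b.value;
    }
    return sum;
}

int
main(void)
{
    score_t a = { INT64_MAX - 10, 0 };
    score_t b = { 100, 0 };
    score_t sum = score_add(a, b);

    /* Prints "infinite=1 value=0": the would-be overflow became
     * +INFINITY instead. */
    printf("infinite=%d value=%" PRId64 "\n", sum.infinite, sum.value);
    return 0;
}

The first approach (keeping INFINITY slightly below half of the 64-bit
maximum) avoids the extra flag, but relies on auditing the code to
ensure no more than two scores are ever summed at once.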
Unfortunately, any change to the score will break backward
compatibility in the public C API, so it will have to be done when we
are ready to release a batch of such changes. That would likely be a
"2.1.0" release, and probably not until 1-2 years from now. At least
that gives us time to investigate and come up with a design.

> Besides this master score limit, we suffer from these other
> constraints:
>
> * attrd_updater is highly asynchronous:
>   * values are not yet available locally when the command exits
>   * nor are they available from remote nodes
>   * we had to wrap it in a loop that waits for the change to become
>     available locally

There's an RFE to offer a synchronous option to attrd_updater -- which
you knew already, since you submitted it :) but I'll mention it in
case anyone else wants to follow it:

https://bugs.clusterlabs.org/show_bug.cgi?id=5347

It is definitely a goal; the question is always just developer time.
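In the meantime, the workaround is indeed the kind of poll-and-wait
loop you describe. For anyone following along, here is a minimal
sketch of that pattern in C. It is not PAF's code (PAF is written in
Perl), the attrd_updater options shown can vary between Pacemaker
versions, and the match against the query output is deliberately loose
because the output format is not a stable interface.

/* Hypothetical helper: update a node attribute, then poll until the
 * local attrd reports the new value. Sketch of the pattern only; PAF
 * itself also scopes the attribute (private, reboot lifetime), and
 * exact options differ between Pacemaker versions.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static int
set_attr_and_wait(const char *name, const char *value, int timeout_secs)
{
    char cmd[512];
    char line[1024];

    /* Fire-and-forget update: attrd_updater returns before the value
     * is actually visible, which is the whole problem. */
    snprintf(cmd, sizeof(cmd),
             "attrd_updater --name '%s' --update '%s'", name, value);
    if (system(cmd) != 0) {
        return -1;
    }

    /* Poll the local value until it matches, or give up. */
    snprintf(cmd, sizeof(cmd),
             "attrd_updater --name '%s' --query", name);

    for (int waited = 0; waited < timeout_secs; waited++) {
        FILE *out = popen(cmd, "r");
        int found = 0;

        if (out != NULL) {
            while (fgets(line, sizeof(line), out) != NULL) {
                /* Loose check: just look for the value in the output. */
                if (strstr(line, value) != NULL) {
                    found = 1;
                }
            }
            pclose(out);
        }
        if (found) {
            return 0;
        }
        sleep(1);
    }
    return -1;  /* the update never became visible locally */
}

int
main(void)
{
    /* Attribute name and value are made up for the example. */
    return (set_attr_and_wait("lsn_location", "0/403E2E8", 30) == 0)
           ? EXIT_SUCCESS : EXIT_FAILURE;
}

A synchronous option in attrd_updater itself would let agents drop
this kind of loop entirely.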
> * notification actions' return codes are ignored

It might be useful to support "on-fail" for the notify operation,
defaulting to "ignore" to preserve current behavior. However, the
notify action is unique in that it is always associated with some
other action. Would a single "on-fail" for all notifications be
enough, or would we need some way to set different "on-fail" values
for pre/post and start/stop notifications?

If we did support on-fail for notify, a default on-fail in op_defaults
would begin to apply to notify as well. That might be unexpected,
especially for configurations that have long worked as-is but might
start causing problems in this case. I would want to wait at least
until a minor version bump (2.1.0), or maybe even a major bump
(3.0.0), though we could potentially make it available as a
compile-time option in the meantime. Feel free to open an RFE.

> * OCF_RESKEY_CRM_meta_notify_* are available (officially) only
>   during notification actions

That's a good question: should start/stop be guaranteed to have them
as well? One question would be whether to use the pre- or post-
values.

Not directly related, but in the same vein, Andrew Beekhof proposed a
new promotable clone type, where promotion scores are discovered ahead
of time rather than after starting instances in slave mode. The idea
would be to have a new "discover" action in resource agents that would
output the master score (and would be called before starting any
instances). Then, on the one instance selected for promotion, another
new action (something like "bootstrap") would be called to do the
initial start-up that all instances need, before the cluster started
all the other instances normally (whether start or start+promote for
multi-master).

That would be a large effort -- and note that Beekhof was not
volunteering to do it. :)

> These three points should probably be discussed in dedicated
> threads, though.
>
> Regards,

-- 
Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/developers

ClusterLabs home: https://www.clusterlabs.org/