Heyho, I ran into the same issue before and I think our scheduling code should be an Actor. We could microbenchmark it to ensure it can happily schedule a large number of actions per second and not become a bottleneck.
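The microbenchmark idea can be sketched quickly. This is an illustrative Python sketch (not OpenWhisk code): a single-threaded scheduling loop, as an actor mailbox would give us, and a measurement of how many scheduling decisions per second it sustains. The `schedule` function and its least-loaded-first policy are assumptions for illustration only.

```python
import time

def schedule(counters, busy_threshold):
    """Pick the first invoker whose in-flight count is under the threshold."""
    for invoker, count in enumerate(counters):
        if count < busy_threshold:
            return invoker
    return None  # every invoker is saturated

def decisions_per_second(n_invokers=100, busy_threshold=16, decisions=50_000):
    """Measure raw throughput of the single-threaded scheduling loop."""
    counters = [0] * n_invokers
    start = time.perf_counter()
    for _ in range(decisions):
        chosen = schedule(counters, busy_threshold)
        if chosen is None:
            counters = [0] * n_invokers  # pretend all activations completed
        else:
            counters[chosen] += 1
    return decisions / (time.perf_counter() - start)
```

Even this naive linear scan comfortably makes thousands of decisions per second, which suggests a serialized (actor-based) scheduler is unlikely to be the bottleneck.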
+1 for actorizing the LB

Cheers,
Markus

Sent from my iPhone

> On Oct 10, 2017, at 13:28, Tyson Norris <[email protected]> wrote:
>
> Hi -
> Following up on this, I’ve been working on a PR.
>
> One issue I’ve run into (which may be problematic in other scheduling
> scenarios) is that the scheduling in LoadBalancerService doesn’t respect the
> new async nature of activation counting in LoadBalancerData. At least I think
> this is a good description.
>
> Specifically, I am creating a test that submits activations via
> LoadBalancer.publish, and I end up with 10 activations scheduled on invoker0,
> even though I use an invokerBusyThreshold of 1.
> It would only occur when concurrent requests (or requests with very little
> time between them?) arrive at the same controller, I think. (Otherwise the
> counts can sync up quickly enough.)
> I’ll work more on testing it.
>
> Assuming this (dealing with async counters) is the problem, I’m not exactly
> sure how to deal with it. Some options may include:
> - change LoadBalancer to an actor, so that local counter state can be more
>   easily managed (these counters would still need to replicate, but at least
>   locally it would do the right thing)
> - coordinate the schedule + setupActivation calls to also rely on some local
>   state for activations that should be counted but have not yet been
>   processed within LoadBalancerData
>
> Any suggestions in this area would be great.
>
> Thanks
> Tyson
>
>
>> On Oct 6, 2017, at 11:04 AM, Tyson Norris <[email protected]> wrote:
>>
>> With many invokers, there is less data exposed to rebalancing operations,
>> since the invoker topics will only ever receive enough activations that can
>> be processed “immediately”, currently set to 16. The single backlog topic
>> would only be consumed by the controller (not any invoker), and the invokers
>> would only consume their respective “process immediately” topic - which
>> effectively has no, or very little, backlog - 16 max.
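The race Tyson describes, and the first option for fixing it, can be sketched as follows. This is an illustrative Python sketch, not OpenWhisk's actual Scala code; the `LoadBalancerActor` class name is hypothetical. The point is that when the schedule decision and the counter update happen in one serialized step (as in an actor's mailbox), a burst of publishes cannot all observe a stale count of zero and pile onto invoker0.

```python
from collections import defaultdict

class LoadBalancerActor:
    """Single-threaded stand-in for an actor mailbox: scheduling and the
    counter update happen in one step, so concurrent publishes cannot all
    observe a stale in-flight count."""

    def __init__(self, num_invokers, busy_threshold):
        self.num_invokers = num_invokers
        self.busy_threshold = busy_threshold
        self.inflight = defaultdict(int)  # locally authoritative counts

    def publish(self, activation_id):
        # Messages are processed one at a time, so the count read here
        # already reflects every previously scheduled activation.
        for invoker in range(self.num_invokers):
            if self.inflight[invoker] < self.busy_threshold:
                self.inflight[invoker] += 1
                return invoker
        return None  # no capacity: would go to a backlog/overflow topic

    def complete(self, invoker):
        self.inflight[invoker] -= 1
```

With `busy_threshold=1`, ten back-to-back publishes now land on ten distinct invokers instead of all on invoker0, because each publish sees the increments made by the ones before it.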
>> My suggestion is that
>> having multiple backlogs is an unnecessary problem, regardless of how many
>> invokers there are.
>>
>> It is worth noting the case of multiple controllers as well, where multiple
>> controllers may be processing the same backlog topic. I don’t think this
>> should cause any more trouble than the distributed activation counting that
>> should be enabled via controller clustering, but it may mean that if one
>> controller enters overflow state, it should signal that ALL controllers are
>> now in overflow state, etc.
>>
>> Regarding “timeout”, I would plan to use the existing timeout mechanism,
>> where an ActivationEntry is created immediately, regardless of whether the
>> activation is going to get processed or added to the backlog. At the time
>> the backlog message is processed, if the entry has timed out, throw it away.
>> (The entry map may need to be shared in the case where multiple controllers
>> all consume from the same topic; alternatively, we can partition the topic
>> so that entries are only processed by the controller that backlogged them.)
>>
>> Yes, once invokers are saturated and backlogging begins, I think all
>> incoming activations should be sent straight to the backlog (we already know
>> that no invokers are available). This should not hurt overall performance
>> any more than it currently does, and should be better (since the first
>> invoker available can start taking work, instead of waiting on a specific
>> invoker to become available).
>>
>> I’m working on a PR; I think many of these details will come out there, but
>> in the meantime, let me know if any of this doesn’t make sense.
>>
>> Thanks
>> Tyson
>>
>>
>> On Oct 5, 2017, at 2:49 PM, David P Grove <[email protected]> wrote:
>>
>>
>> I can see the value in delaying the binding of activations to invokers when
>> the system is loaded (can't execute "immediately" on its target invoker).
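The timeout handling described above (create the ActivationEntry at enqueue time, discard it if it has expired by the time the backlog message is processed) can be sketched like this. This is an illustrative Python sketch under assumed names (`BacklogConsumer`, `next_runnable` are hypothetical), with a deque standing in for the overflow topic:

```python
import time
from collections import deque

class BacklogConsumer:
    """Entries are created at enqueue time; when the backlog message is
    finally processed, an entry that has already timed out is discarded."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.entries = {}       # activation id -> enqueue timestamp
        self.backlog = deque()  # stand-in for the single overflow topic

    def enqueue(self, activation_id, now=None):
        self.entries[activation_id] = time.monotonic() if now is None else now
        self.backlog.append(activation_id)

    def next_runnable(self, now=None):
        """Pop backlog messages until one is still within its timeout."""
        now = time.monotonic() if now is None else now
        while self.backlog:
            aid = self.backlog.popleft()
            if now - self.entries.pop(aid) <= self.timeout:
                return aid  # still fresh: hand it to the scheduler
            # otherwise: it expired while backlogged, throw it away
        return None
```

Note the entry map here is local; as the message above says, it would need to be shared (or the topic partitioned per controller) when more than one controller consumes the backlog.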
>>
>> Perhaps in ignorance, I am a little worried about the scalability of a
>> single backlog topic. With a few hundred invokers, it seems like we'd be
>> exposed to frequent and expensive partition rebalancing operations as
>> invokers crash/restart. Maybe if we have N = K*M invokers, we can get away
>> with M backlog topics, each being read by K invokers. We could still get
>> imbalance across the different backlog topics, but it might be good enough.
>>
>> I think we'd also need to do some thinking about how to ensure that work
>> put in a backlog topic doesn't languish there for a really long time. Once
>> we start having work in the backlog, do we need to stop putting work in the
>> "immediately" topics? If we do, that could hurt overall performance. If we
>> don't, how will the backlog topic ever get drained if most invokers are
>> kept busy servicing their "immediately" topics?
>>
>> --dave
>>
>> From: Tyson Norris <[email protected]>
>> To: "[email protected]" <[email protected]>
>> Date: 10/04/2017 07:45 PM
>> Subject: Invoker activation queueing proposal
>>
>> ________________________________
>>
>> Hi -
>>
>> I’ve been discussing a bit with a few others about optimizing the queueing
>> that goes on ahead of the invokers, so that things behave more simply and
>> predictably.
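Dave's N = K*M partitioning suggestion is straightforward to sketch. An illustrative Python sketch (the function name and `backlog-<i>` topic names are hypothetical): invokers are divided round-robin across M backlog topics, so a crash/restart only triggers a rebalance among the K readers of one topic rather than all N.

```python
from collections import defaultdict

def assign_backlog_topics(n_invokers, m_topics):
    """Partition N = K*M invokers so each of M backlog topics is read by
    K invokers, shrinking the blast radius of a consumer-group rebalance."""
    assignment = defaultdict(list)
    for invoker in range(n_invokers):
        assignment["backlog-%d" % (invoker % m_topics)].append(invoker)
    return dict(assignment)
```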
>>
>> In short: Instead of scheduling activations to an invoker on receipt, do
>> the following:
>>
>> - execute the activation "immediately" if capacity is available
>> - provide a single overflow topic for activations that cannot execute
>>   "immediately"
>> - schedule from the overflow topic when capacity is available
>>
>> (BTW "immediately" means: still queued via the existing invoker topics, but
>> ONLY queued there in the case that the invoker is not fully loaded, and
>> therefore should execute it "very soon")
>>
>> Later: it would also be good to provide more container state data from
>> invoker to controller, to get better scheduling options - e.g. if some
>> invokers can handle running more containers than other invokers, that info
>> can be used to avoid over/under-loading the invokers (currently we assume
>> each invoker can handle 16 activations, I think).
>>
>> I put a wiki page proposal here:
>> https://cwiki.apache.org/confluence/display/OPENWHISK/Invoker+Activation+Queueing+Change
>>
>> WDYT?
>>
>> Thanks
>> Tyson
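The three-step flow in the proposal can be sketched end to end. An illustrative Python sketch, not OpenWhisK's actual code; the function names are hypothetical and deques stand in for Kafka topics: publish routes to an invoker topic only when capacity exists, otherwise to the single overflow topic, and a freed slot drains the overflow first, so the first available invoker picks up backlogged work.

```python
from collections import deque

def publish(activation, counters, busy_threshold, invoker_topics, overflow):
    """Send to an invoker topic only if that invoker has capacity right now;
    otherwise park the activation on the single overflow topic."""
    for invoker, count in enumerate(counters):
        if count < busy_threshold:
            counters[invoker] += 1
            invoker_topics[invoker].append(activation)
            return invoker
    overflow.append(activation)  # no capacity anywhere: backlog it
    return None

def on_completion(invoker, counters, busy_threshold, invoker_topics, overflow):
    """When an invoker finishes work, the freed slot is offered to the
    overflow topic first, instead of waiting for a specific invoker."""
    counters[invoker] -= 1
    if overflow and counters[invoker] < busy_threshold:
        counters[invoker] += 1
        invoker_topics[invoker].append(overflow.popleft())
```

With two invokers and a threshold of 1, a third publish lands on the overflow topic, and the first completion immediately pulls it onto the freed invoker's topic.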
