2020-11-03 00:50:51 UTC - Rodric Rabbah: The max duration is configurable https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604364651218400?thread_ts=1604283859.214600&cid=C3TPCAQG1 ---- 2020-11-03 19:46:41 UTC - Brendan Doyle: Do offline invokers get included in the scheduling algorithm for selecting home invokers? It looks like the state is just sent from the invoker pool and since offline invokers are included in `/invokers` api which calls a function in the load balancer so it seems like they are included in the total invokers for the hashing algorithm. I'm digging through the code and don't see anything to suggest otherwise. https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604432801220900?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 19:47:24 UTC - Brendan Doyle: We have a few old invokers that no longer exist (though the kafka topics still exist) so I'm wondering if having around 20% of our invoker pool be offline is affecting our scheduling distribution. https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604432844221000?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 19:53:29 UTC - Dominic Kim: I suppose no. IIRC, offline invokers are automatically generated by the max invoker ID. For example, if you have two online invokers, invoker0 and invoker10, all invoker1~9 are automatically generated but offline. https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604433209221300?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 19:53:58 UTC - Dominic Kim: And only online invokers are involved in the scheduling. https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604433238221500?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 19:57:02 UTC - Brendan Doyle: yea so we have invokers0-33. invokers0-5 are offline. The load balancer is going to use an invoker pool size of 34 to determine the home invoker so the home invoker may hit 0-5. 
It will just spill over to the next available invoker if it does land on 0-5 based on the step size when actually scheduling the activation and that will effectively act as the home invoker, but I'm wondering if this is impacting our uniform distribution. Or is my reading of the code there incorrect? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604433422221700?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 20:19:23 UTC - Brendan Doyle: I think this is because we never bring our controller cluster down and only perform rolling restarts so it seems like that cluster state is shared between controllers so the offline invokers will never go away unless we re-bootstrap our cluster https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604434763222000?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 20:43:36 UTC - Rodric Rabbah: they should not factor into the scheduling (they’ll appear offline in the invoker map) https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604436216222200?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 20:43:52 UTC - Rodric Rabbah: you’re right once in the map the lb doesn’t forget them https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604436232222400?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 20:44:20 UTC - Rodric Rabbah: i suppose there could be a periodic purge to remove anything offline from the invoker map https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604436260222600?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 20:47:02 UTC - Brendan Doyle: could you sanity check me then? We call `/invokers` api. It returns 0-33 and says 0-5 are offline. `/invokers` calls `invokerHealth()` in the load balancer and `invokerHealth()` just returns `_invokers` which is taken from the cluster state. 
So that implies to me that the offline count is used towards the scheduling hashing algorithm since `updateInvokers()` in the cluster management just does a take on `_invokers` +1 : Dominic Kim https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604436422222800?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:05:50 UTC - Rodric Rabbah: i concur - the step size is computed when the cluster size changes, and that does not exclude offline invokers https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604437550223100?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:06:30 UTC - Rodric Rabbah: arguably this is a bug in your case, it could lead to unnecessary collisions since the step size may include an increasing number of unusable invokers https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604437590223300?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:21:14 UTC - Brendan Doyle: it's not just step size right, the hash for home invokers could land on one of the offline invokers and then uses the steps to land on the next available invoker which then acts as the home invoker? 
I think that may have pretty big impact on the distribution and number of collisions https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604438474223500?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:21:52 UTC - Brendan Doyle: or am I misinterpreting that part https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604438512223700?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:22:29 UTC - Rodric Rabbah: you’re correct - the step size could lead to a bad pathology though, i think https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604438549223900?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:22:44 UTC - Rodric Rabbah: it’s not so much the home invoker that’s the issue https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604438564224100?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:23:15 UTC - Rodric Rabbah: it’s the sequence - the step size creates a sequence of invokers to check: 1, 5, 7, … so if you land on 1 and it’s unusable then it checks 5 then 7 https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604438595224300?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:23:39 UTC - Rodric Rabbah: but those are all offline, you’re spending more time searching, and worse, the collision may increase https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604438619224500?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:25:31 UTC - Rodric Rabbah: the heuristic doesn’t expect the invoker to stay offline indefinitely
i can think of several ways to address this - like purging the list periodically, adding a ttl on unusable invokers, or adding an admin api to re-compute the sequence as some examples https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604438731224700?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:32:15 UTC - Brendan Doyle: yea I think we will take this on to fix asap. I like the periodic purge or ttl on unusable invokers. If it does programmatically get cleaned up and then gets brought back up it just should get readded no problem right? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604439135224900?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:33:28 UTC - Rodric Rabbah: right - auto discovery is already handled https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604439208225100?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:34:52 UTC - Rodric Rabbah: i think this is good to have - you could do it fairly easily (:sweat_smile: ) by adding a time stamp to each invoker when it goes offline and then check the time difference compared to “now” --- a question: would you do a ttl on all unusable invokers or just offline https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604439292225300?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:34:55 UTC - Rodric Rabbah: prob should do both https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604439295225500?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:36:35 UTC - Brendan Doyle: I'm trying to figure out how I can get more insight into how severely this may be impacting our uniform distribution because it is 20% of our invoker fleet that is considered `offline`. One question would be the kafka topic? That doesn't get removed right. 
When it gets re-added, will it use the same invoker number to match the same kafka topic? We don't want to end up creating infinite kafka topics https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604439395225700?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:38:58 UTC - Rodric Rabbah: if the queue is empty, i wouldn’t worry about it, there’s not much state associated with empty topics to become an issue. i don’t recall if there’s an expiration on messages in the queue though, so if it’s not empty those messages persist until expired https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604439538225900?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:41:25 UTC - Dominic Kim: > could you sanity check me then? We call `/invokers` api. It returns 0-33 and says 0-5 are offline. `/invokers` calls `invokerHealth()` in the load balancer and `invokerHealth()` just returns `_invokers` which is taken from the cluster state. So that implies to me that the offline count is used towards the scheduling hashing algorithm since `updateInvokers()` in the cluster management just does a take on `_invokers` my bad. https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604439685226200?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:42:25 UTC - Dominic Kim: After reading the code, it seems it would also affect the number of managed/blackbox invokers. https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604439745226600?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:43:42 UTC - Brendan Doyle: ^ that is true. We don't use blackbox so that doesn't affect us but it would affect those fractions white_check_mark : Dominic Kim https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604439822226900?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:45:27 UTC - Rodric Rabbah: @Dominic Kim curious in your pull model/new scheduler what would happen? 
it’s a no-op, right, since it just pulls from invokers that are available? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604439927227200?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:46:57 UTC - Dominic Kim: In the pull model, the health status of invokers is managed by ETCD with Leases. Each invoker periodically sends a keepalive for the lease. If no keepalive is received for a certain time, for example, 10s, then the health data is removed. https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604440017227500?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:47:16 UTC - Dominic Kim: Schedulers will only schedule container creation requests to healthy invokers. +1 : Rodric Rabbah https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604440036227700?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:47:56 UTC - Dominic Kim: Invokers are supposed to respond to the container creation request, and if no response is received for some time, schedulers retry sending messages to other invokers. https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604440076228000?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:49:16 UTC - Dominic Kim: Seems finally we can release the core 1.0.0. partyparrot : Rodric Rabbah https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604440156228300?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:49:31 UTC - Dominic Kim: I would continue working on scheduler contribution. 
https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604440171228500?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:51:22 UTC - Rodric Rabbah: once you do that we can break everything :smile: i have lots of stuff to start adding sassyparrot : Dominic Kim https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604440282228800?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:52:01 UTC - Brendan Doyle: `i have lots of stuff to start adding` - like what :eyes: https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604440321229100?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:52:40 UTC - Rodric Rabbah: :sweat_smile: https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604440360229300?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:53:17 UTC - Rodric Rabbah: stateful function support, functions in isolates, support for Jamstack (serve static content) +1 : Dominic Kim https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604440397229500?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:53:48 UTC - Brendan Doyle: `stateful function support` - :exploding_head: https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604440428229700?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:54:09 UTC - Brendan Doyle: whats functions in isolates? 
https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604440449229900?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:59:19 UTC - Rodric Rabbah: isolates -> not containers https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604440759230100?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:59:30 UTC - Rodric Rabbah: uses v8, similar to cloudflare workers https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604440770230300?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:59:37 UTC - Rodric Rabbah: better compute density https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604440777230500?thread_ts=1604432801.220900&cid=C3TPCAQG1 ---- 2020-11-03 21:59:48 UTC - Rodric Rabbah: this is work we prototyped with Adobe, a bit overdue to upstream https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1604440788230700?thread_ts=1604432801.220900&cid=C3TPCAQG1 ----
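Editor's note on the scheduling thread above: the effect Brendan and Rodric describe (the home invoker is hashed over the full pool size, then a fixed step is followed until a usable invoker is found) can be explored with a small simulation. This is only a simplified sketch, not the OpenWhisk load balancer code. The function names `coprime_steps`, `pick_invoker` and `load_per_invoker` are hypothetical, integer keys stand in for the real action hash, capacity checks are ignored, and the step sizes are assumed to be the numbers coprime to the pool size.

```python
from collections import Counter
from math import gcd


def coprime_steps(n):
    """Step sizes coprime to n, so a probe sequence visits every slot (assumed model)."""
    return [s for s in range(1, n) if gcd(s, n) == 1] or [1]


def pick_invoker(key_hash, n, online):
    """Start at home = hash mod n, then walk by a fixed step until an online slot is hit.

    Returns (chosen_index, probes_wasted); chosen_index is None if nothing is online.
    """
    steps = coprime_steps(n)
    index = key_hash % n
    step = steps[key_hash % len(steps)]
    for probes in range(n):
        if index in online:
            return index, probes
        index = (index + step) % n
    return None, n


def load_per_invoker(n, online, keys):
    """Count how many keys end up on each invoker under the model above."""
    counts = Counter()
    for key in keys:
        chosen, _ = pick_invoker(key, n, online)
        counts[chosen] += 1
    return counts


if __name__ == "__main__":
    keys = range(9520)
    # Pool of 34 where invokers 0-5 are offline, as in the thread.
    sparse = load_per_invoker(34, set(range(6, 34)), keys)
    # Hypothetical comparison: the same 28 live invokers renumbered into a pool of 28.
    compact = load_per_invoker(28, set(range(28)), keys)
    print("with offline slots: min", min(sparse.values()), "max", max(sparse.values()))
    print("compact pool:       min", min(compact.values()), "max", max(compact.values()))
```

In this toy model the compact pool spreads the keys evenly, while the pool with offline slots 0-5 does not, because every key whose home falls on an offline slot spills onto whichever live invoker its step sequence reaches first. How closely this matches a production deployment depends on the real hash, the real step-size table and per-invoker capacity, none of which are modelled here.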