2020-01-11 01:37:41 UTC - Ali Tariq: In order to further investigate, I plotted the ids of successful requests (every request has a unique id, assigned linearly) - as you can see, there is a big continuous chunk that is dropped altogether, which signifies that the whole queue was dropped. Could it be an implementation bug? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578706661059000?thread_ts=1578504210.020500&cid=C3TPCAQG1 ---- 2020-01-11 01:40:26 UTC - Rodric Rabbah: whoa these are neat graphs - give me a sec to process https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578706826059400?thread_ts=1578504210.020500&cid=C3TPCAQG1 ---- 2020-01-11 01:42:19 UTC - Rodric Rabbah: what's the x-axis? time? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578706939059600?thread_ts=1578504210.020500&cid=C3TPCAQG1 ---- 2020-01-11 01:44:27 UTC - Rodric Rabbah: I don't immediately see why the requests would get dropped unless they were 429 and never accepted in the first place - are you checking the status codes of the invokes?
also, you can check the kafka queue length for correlation as it should report the invoker queue lengths https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578707067059900?thread_ts=1578504210.020500&cid=C3TPCAQG1 ---- 2020-01-11 01:52:00 UTC - Ali Tariq: In the first graph ... the x-axis is time; in the second graph we don't really need an x-axis (although it's also a time axis, just not marked), because the requests were assigned numbers as they were launched (starting from 0 till the last request, 100k) https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578707520060100?thread_ts=1578504210.020500&cid=C3TPCAQG1 ---- 2020-01-11 01:53:13 UTC - Ali Tariq: Yeah ... there are a few 429s as well, but most of the ones that get queued up don't return 429 and ultimately time out (apiGateway timeout). https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578707593060300?thread_ts=1578504210.020500&cid=C3TPCAQG1 ---- 2020-01-11 01:53:51 UTC - Ali Tariq: How do I check the kafka queue length? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578707631060500?thread_ts=1578504210.020500&cid=C3TPCAQG1 ---- 2020-01-11 03:23:06 UTC - Rodric Rabbah: You have to enable Kamon I think @chetanm ? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578712986067800?thread_ts=1578504210.020500&cid=C3TPCAQG1 ---- 2020-01-11 03:23:40 UTC - Rodric Rabbah: Timing out at the api gw is ok in that the request is still in the system. https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578713020068500?thread_ts=1578504210.020500&cid=C3TPCAQG1 ---- 2020-01-11 03:25:12 UTC - Rodric Rabbah: I don’t know how to explain that dip in the first graph. Assuming your load generator is not the issue, if you’re sustaining load the curve should stay at the 1K max https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578713112070400?thread_ts=1578504210.020500&cid=C3TPCAQG1 ---- 2020-01-11 03:25:34 UTC - Rodric Rabbah: Until you stop, at which point it drains the queue until it’s empty https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578713134071100?thread_ts=1578504210.020500&cid=C3TPCAQG1 ---- 2020-01-11 03:25:39 UTC - Ali Tariq: yes ... but the big missing chunk in the middle shows that the whole queue (& all queued requests) got dropped somehow - possible crash maybe https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578713139071400?thread_ts=1578504210.020500&cid=C3TPCAQG1 ---- 2020-01-11 03:25:42 UTC - Rodric Rabbah: Did anything die? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578713142071700?thread_ts=1578504210.020500&cid=C3TPCAQG1 ---- 2020-01-11 03:25:47 UTC - Rodric Rabbah: Invoker? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578713147072000?thread_ts=1578504210.020500&cid=C3TPCAQG1 ---- 2020-01-11 03:26:24 UTC - Ali Tariq: I am positive the issue is not in the workload generator because I have tested the same workload on various serverless platforms +1 : Rodric Rabbah https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578713184072900?thread_ts=1578504210.020500&cid=C3TPCAQG1 ---- 2020-01-11 03:26:37 UTC - Rodric Rabbah: Once a request is in kafka it’s persisted https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578713197073300?thread_ts=1578504210.020500&cid=C3TPCAQG1 ---- 2020-01-11 03:27:14 UTC - Rodric Rabbah: And each invoker will only buffer a limited look-ahead, so the loss on an invoker going down is limited. Wouldn’t explain 18k https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578713234074600?thread_ts=1578504210.020500&cid=C3TPCAQG1 ---- 2020-01-11 03:27:35 UTC - Ali Tariq: that's true! https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578713255074800?thread_ts=1578504210.020500&cid=C3TPCAQG1 ----
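A quick way to check the invoker queue lengths mentioned above, without wiring up Kamon, is Kafka's own consumer-group tooling: the lag on each per-invoker topic is roughly the number of activations still queued for that invoker. A minimal sketch, assuming a kube deployment with an `ow-kafka-0` pod in the `openwhisk` namespace and the default `invoker0`, `invoker1`, ... topic/consumer-group naming (adjust the names, and the script path if it is not on the PATH inside your Kafka image):

```bash
# List the consumer groups Kafka knows about (should include one per invoker).
kubectl exec -n openwhisk ow-kafka-0 -- \
  kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list

# Describe one invoker's group; the LAG column is produced-but-unconsumed
# messages, i.e. activations sitting in that invoker's queue.
kubectl exec -n openwhisk ow-kafka-0 -- \
  kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group invoker0
```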
2020-01-11 03:28:38 UTC - Rodric Rabbah: Something is off - we run 10k/s regularly as part of a perf sniff test and haven’t seen something like this. Thinking... https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578713318076100?thread_ts=1578504210.020500&cid=C3TPCAQG1 ---- 2020-01-11 03:28:59 UTC - Ali Tariq: I want to add ... this is not an anomalous result, I have run this test 4 times, and it occurred every single time https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578713339076300?thread_ts=1578504210.020500&cid=C3TPCAQG1 ---- 2020-01-11 03:29:38 UTC - Rodric Rabbah: How are you getting the data for the first graph? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578713378076900?thread_ts=1578504210.020500&cid=C3TPCAQG1 ---- 2020-01-11 03:30:07 UTC - Ali Tariq: It must have something to do with the `actionInvokesConcurrent` (set at 11000) and `containerMemoryPool` (set at 1000 containers). https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578713407077400?thread_ts=1578504210.020500&cid=C3TPCAQG1 ---- 2020-01-11 03:30:43 UTC - Rodric Rabbah: Container memory pool I believe limits the number of containers per invoker. https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578713443078400?thread_ts=1578504210.020500&cid=C3TPCAQG1 ---- 2020-01-11 03:31:00 UTC - Ali Tariq: I am running my custom logging server that collects the data from inside the running functions (http request packets) https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578713460079100?thread_ts=1578504210.020500&cid=C3TPCAQG1 ---- 2020-01-11 03:31:18 UTC - Rodric Rabbah: Concurrent Invokers is how much capacity a single namespace can allocate. The ratio is how many invokes a single namespace monopolizes https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578713478079800?thread_ts=1578504210.020500&cid=C3TPCAQG1 ---- 2020-01-11 03:32:05 UTC - Ali Tariq: I have set 50 containers (`containerMemoryPool`) per invoker and I have 20 invokers in the deployment https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578713525080900?thread_ts=1578504210.020500&cid=C3TPCAQG1 ---- 2020-01-11 03:32:06 UTC - Rodric Rabbah: Have you looked at how many activation records are in the db to see that nothing was dropped, or how many are dropped? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578713526081100?thread_ts=1578504210.020500&cid=C3TPCAQG1 ---- 2020-01-11 03:36:10 UTC - Ali Tariq: I did not for the last run; I will in the next run. But if a request didn't run, there wouldn't be any activation record for that request in the db! (for example 429s ... or the above dip for that matter) https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578713770081800?thread_ts=1578504210.020500&cid=C3TPCAQG1 ---- 2020-01-11 04:21:14 UTC - Ali Tariq: In Openwhisk (consider we only have a single namespace, default), we have `actionInvokesConcurrent`, which limits how many active requests can be in the system (`actionInvokesConcurrent` = running + queuedUp). Then we have the actual `containerMemoryPool`, which sets the limit on actions running in parallel. What are the best practices for these configurations? I know `actionInvokesConcurrent` >= `containerMemoryPool` should hold, otherwise none of the requests will ever get queued up (to absorb burst workloads) and everything above `actionInvokesConcurrent` would get statusCode `429`, but what should the optimal ratio/values be? Also, doesn't the difference between `actionInvokesConcurrent` and `containerMemoryPool` kind of decide the queue length for the system (the only namespace)? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578716474083100 ----
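On the earlier question about activation records in the db: with a kube deployment the count can be read straight out of CouchDB, since every completed activation produces one document. A rough sketch; the service name `ow-couchdb`, the database name `test_activations` (it depends on the configured db prefix; it may be `whisk_local_activations` or similar) and the credential variables are all assumptions to adjust for the actual deployment:

```bash
# Forward CouchDB locally, list the databases, then read the document count of
# the activations database (the count includes a handful of design docs).
kubectl port-forward -n openwhisk svc/ow-couchdb 5984:5984 &
curl -s -u "$COUCH_USER:$COUCH_PASS" http://localhost:5984/_all_dbs
curl -s -u "$COUCH_USER:$COUCH_PASS" http://localhost:5984/test_activations | jq .doc_count
```

Comparing `doc_count` against the 100k requests launched (minus the 429s, which never produce a record) shows how many activations actually made it into the system.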
2020-01-11 04:28:02 UTC - Ali Tariq: okay, thanks to our discussion - I went over the request return codes this time! It turns out that out of the ~21k drops, about 9k are due to 429s and 12k are due to 503s: the `system is overloaded or down for maintenance` error. Currently trying to debug what caused the invokers to become unhealthy (I have 20 invokers and for some reason all became unhealthy!) But this surely explains the dip. https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578716882083300?thread_ts=1578504210.020500&cid=C3TPCAQG1 ---- 2020-01-11 05:18:57 UTC - Ali Tariq: Looking at the logs from the `controller` ... all 20 `invokers` were healthy, after which, one by one, all became unresponsive - from ```[2020-01-11T03:44:31.058Z] [INFO] [#tid_sid_invokerHealth] [InvokerPool] invoker status changed to 0 -> Healthy, 1 -> Healthy, 2 -> Healthy, 3 -> Healthy, 4 -> Healthy, 5 -> Healthy, 6 -> Healthy, 7 -> Healthy, 8 -> Healthy, 9 -> Healthy, 10 -> Healthy, 11 -> Healthy, 12 -> Healthy, 13 -> Healthy, 14 -> Healthy, 15 -> Healthy, 16 -> Healthy, 17 -> Healthy, 18 -> Healthy, 19 -> Healthy``` to messages like these ```[2020-01-11T03:52:37.630Z] [INFO] [#tid_sid_invokerHealth] [InvokerActor] invoker18 is unresponsive [marker:loadbalancer_invokerState.unresponsive_counter:512082]``` Looking further in the controller, the main issue seems to be ```[2020-01-11T03:56:38.627Z] [ERROR] [#tid_8Imd7UIW3OoaJhdAt6OMJO7IuIUPB2No] [ShardingContainerPoolBalancer] failed to schedule activation 26ebd0b5cbca44cfabd0b5cbca64cf5f, action 'guest/lambda1@0.0.1' (blackbox), ns 'guest' - invokers to use: Map(Unresponsive -> 20)``` Looking for the reason for the above issue in the `invoker` logs at about the same timestamp, I found ```[2020-01-11T03:56:38.187Z] [ERROR] [#tid_eIPUPBXTR3TN0joUTclIpPt5NF6rOpIx] [ContainerPool] Rescheduling Run message, too many message in the pool, freePoolSize: 0 containers and 0 MB, busyPoolSize: 50 containers and 12800 MB, maxContainersMemory 12800 MB, userNamespace: guest, action: ExecutableWhiskAction/guest/lambda1@0.0.1, needed memory: 256 MB, waiting messages: 49``` I understand the error: it wants to create new containers for incoming requests but there is no space in the `containerMemoryPool`. In such cases, shouldn't it simply queue the request? Why would it cause invokers to become unresponsive! https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578719937083500?thread_ts=1578504210.020500&cid=C3TPCAQG1 ---- 2020-01-11 05:52:01 UTC - Ali Tariq: So in short, if you follow the first graph, it started serving the requests at time 0. At the 10s mark, serving hits the concurrency limit of `1000`; at the 280s mark, we start receiving `429`s (queue length at this point should be approximately 9-10k). By the 617s mark, we had received 9k `429`s, after which we start getting `503`s instead of `429`s. After some time, the invokers come back up again and everything remaining finishes fine. The only thing left to explain is why the invokers became unhealthy. Could it be that, because they had been sending too many `429`s, the invokers got overloaded (but if I am not wrong, the controller sends the `429`s)? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578721921084100?thread_ts=1578504210.020500&cid=C3TPCAQG1 ----
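The state transitions and rejections described above can be pulled straight from the logs. A small sketch, with pod names (`ow-controller-0`, `ow-invoker-0`, namespace `openwhisk`) borrowed from the kube deployment discussed later in this log and grep patterns taken from the log lines quoted above; adjust the names for the actual cluster:

```bash
# When did each invoker change state, and how often did the balancer fail to
# schedule an activation?
kubectl logs ow-controller-0 -n openwhisk \
  | grep -E 'invoker status changed|is unresponsive|failed to schedule activation'

# How often did this invoker's ContainerPool push work back because the pool
# was full (the "Rescheduling Run message" error above)?
kubectl logs ow-invoker-0 -n openwhisk | grep -c 'Rescheduling Run message'
```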
2020-01-11 05:52:25 UTC - Ali Tariq: Also, what are `busyPoolSize` & `waiting messages`? (from the invoker log message) - I found the code in `ContainerPool.scala` but due to the lack of comments, couldn't understand their purpose https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578721945084300?thread_ts=1578504210.020500&cid=C3TPCAQG1 ---- 2020-01-11 09:15:42 UTC - Michele Sciabarra: @chetanm @Dave Grove @Rodric Rabbah thanks for approving the openwhisk standalone docker image, however to merge it looks like another approval is needed. I would like to ensure the standalone image is built, however I removed this change that publishes the image <https://github.com/apache/openwhisk/pull/4782/commits/0e5e9cb855980bcbbc8e8af45ba4ebca0d3cf0f7> because it breaks the build - should we manually create a docker image to be able to upload it? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578734142087900 ---- 2020-01-11 09:16:28 UTC - Michele Sciabarra: @chetanm do you have any idea how I could check for readiness of openwhisk before starting the playground? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578734188088700 ---- 2020-01-11 11:00:41 UTC - giusdp: @Rodric Rabbah Hey, I was checking out the Invoker class and saw it has a `main` method; the args that are passed to it are used to build the InstanceId. I thought about adding a custom argument to the CLI line to launch an invoker, since that gets picked up by the `main` method and gets added to the InstanceId. Do you know what runs the `main`/where can I find the command line? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578740441088800?thread_ts=1578572550.021500&cid=C3TPCAQG1 ---- 2020-01-11 11:58:24 UTC - giusdp: Ok I found the ansible file that launches that line. Now the problem is to let my own deployment of openwhisk use that ansible file https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578743904089000?thread_ts=1578572550.021500&cid=C3TPCAQG1 ---- 2020-01-11 16:31:23 UTC - giusdp: Hello, how can I access the logs of the controller of openwhisk deployed on a kubernetes cluster? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578760283089700?thread_ts=1578760283.089700&cid=C3TPCAQG1 ---- 2020-01-11 16:43:43 UTC - Rodric Rabbah: I think you just use kubectl https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578761023090100?thread_ts=1578760283.089700&cid=C3TPCAQG1 ---- 2020-01-11 16:54:45 UTC - giusdp: You mean kubectl logs ow-controller-0? I tried that but it doesn't give me anything https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578761685090300?thread_ts=1578760283.089700&cid=C3TPCAQG1 ---- 2020-01-11 17:00:15 UTC - Ali Tariq: you also need to specify the namespace, `kubectl logs ow-controller-0 -n openwhisk`, if that's the namespace you created with helm https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578762015090500?thread_ts=1578760283.089700&cid=C3TPCAQG1 ---- 2020-01-11 17:04:15 UTC - giusdp: Yes I meant that I did that. It doesn't show any logs unfortunately. After a while it even gives a timeout error https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578762255090800?thread_ts=1578760283.089700&cid=C3TPCAQG1 ---- 2020-01-11 17:13:21 UTC - Ali Tariq: could you share the output of `kubectl get pods -n openwhisk`?
https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578762801091000?thread_ts=1578760283.089700&cid=C3TPCAQG1 ---- 2020-01-11 17:22:39 UTC - giusdp: ```
NAME                                                       READY   STATUS      RESTARTS   AGE
ow-alarmprovider-ccc874444-c99wm                           1/1     Running     0          5m57s
ow-apigateway-6b9b9d974f-2dw9v                             1/1     Running     0          5m56s
ow-cloudantprovider-758d8bb68b-frdhb                       1/1     Running     0          5m57s
ow-controller-0                                            1/1     Running     0          5m56s
ow-couchdb-5cd8484499-vgbmm                                1/1     Running     0          5m57s
ow-gen-certs-hx965                                         0/1     Completed   0          5m56s
ow-init-couchdb-gszkq                                      0/1     Completed   0          5m56s
ow-install-packages-75zs9                                  0/1     Error       0          5m56s
ow-install-packages-vjccd                                  1/1     Running     0          24s
ow-invoker-0                                               1/1     Running     0          5m56s
ow-kafka-0                                                 1/1     Running     0          5m56s
ow-kafkaprovider-69544b9fc9-sppr5                          1/1     Running     0          5m57s
ow-nginx-7ff4cb6ff9-mkbw4                                  1/1     Running     0          5m56s
ow-redis-bf84f7756-drnbb                                   1/1     Running     0          5m57s
ow-wskadmin                                                1/1     Running     0          5m57s
ow-zookeeper-0                                             1/1     Running     0          5m56s
wskow-invoker-00-1-prewarm-nodejs10                        1/1     Running     0          2m29s
wskow-invoker-00-2-prewarm-nodejs10                        1/1     Running     0          2m30s
wskow-invoker-00-3-whisksystem-invokerhealthtestaction0    1/1     Running     0          2m28s
wskow-invoker-00-4-prewarm-nodejs10                        1/1     Running     0          34s
``` https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578763359091200?thread_ts=1578760283.089700&cid=C3TPCAQG1 ---- 2020-01-11 17:22:57 UTC - giusdp: The install packages pod gave an error and restarted, but I can already create and invoke actions https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578763377091400?thread_ts=1578760283.089700&cid=C3TPCAQG1 ---- 2020-01-11 17:24:54 UTC - Ali Tariq: does `ow-invoker-0` also not show any logs? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578763494091600?thread_ts=1578760283.089700&cid=C3TPCAQG1 ---- 2020-01-11 17:26:06 UTC - giusdp: No, just ran `kubectl logs ow-invoker-0 -n openwhisk` and again no output till the timeout https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578763566091800?thread_ts=1578760283.089700&cid=C3TPCAQG1 ---- 2020-01-11 17:27:41 UTC - Ali Tariq: that is really strange, normally that's the way you access the logs. It could have something to do with the failing pod - do you mind sharing the `mycluster.yaml` you used to create the deployment? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578763661092100?thread_ts=1578760283.089700&cid=C3TPCAQG1 ---- 2020-01-11 17:35:36 UTC - giusdp: ```
whisk:
  ingress:
    type: NodePort
    apiHostName: ip
    apiHostPort: 31001
nginx:
  httpsNodePort: 31001
k8s:
  persistence:
    hasDefaultStorageClass: "false"
    explicitStorageClass: "openwhisk-nfs"
providers:
  alarm:
    enabled: false
  kafka:
    enabled: false
  cloudant:
    enabled: false
controller:
  imageName: "giusdp/controller"
  imageTag: "latest"
``` https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578764136092300?thread_ts=1578760283.089700&cid=C3TPCAQG1 ---- 2020-01-11 17:36:25 UTC - giusdp: This is the mycluster file, I was trying to use a modified controller, but I can't access any log :cry: https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578764185092500?thread_ts=1578760283.089700&cid=C3TPCAQG1 ---- 2020-01-11 17:37:45 UTC - Ali Tariq: did you first test accessing logs with the default controller image? might help narrow down the issue! https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578764265092800?thread_ts=1578760283.089700&cid=C3TPCAQG1 ----
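For the `ow-install-packages-75zs9` pod showing `Error` above: `kubectl describe` and the namespace event stream are served by the API server rather than by the kubelet's log endpoint, so they usually still work even when `kubectl logs` times out. A small sketch:

```bash
# Inspect the failed pod and recent events without relying on the kubelet
# log endpoint that "kubectl logs" needs.
kubectl describe pod ow-install-packages-75zs9 -n openwhisk
kubectl get events -n openwhisk --sort-by=.metadata.creationTimestamp
```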
2020-01-11 17:40:48 UTC - giusdp: I'll try now https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578764448093000?thread_ts=1578760283.089700&cid=C3TPCAQG1 ---- 2020-01-11 17:51:41 UTC - giusdp: Unfortunately nothing again :sweat_smile: https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578765101093200?thread_ts=1578760283.089700&cid=C3TPCAQG1 ---- 2020-01-11 17:53:12 UTC - giusdp: I don't know if it could be a problem from the master node, for example it can't establish a connection to the worker node to retrieve the logs? But if this was the cause openwhisk shouldn't work at all... https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578765192093400?thread_ts=1578760283.089700&cid=C3TPCAQG1 ---- 2020-01-11 17:57:24 UTC - giusdp: Just to try, I ran kubectl logs on the apiserver pod that is in the kube-system namespace on the master node, and it worked https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578765444093600?thread_ts=1578760283.089700&cid=C3TPCAQG1 ---- 2020-01-11 18:02:29 UTC - Ali Tariq: Wow, that's a new one for me! ```
whisk:
  ingress:
    type: NodePort
    apiHostName: ip
    apiHostPort: 31001
invoker:
  containerFactory:
    impl: "kubernetes"
nginx:
  httpsNodePort: 31001
k8s:
  persistence:
    enabled: false
``` I normally use this configuration, never really changed the `providers` defaults & never had any issue! - I doubt that could be the issue. https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578765749093800?thread_ts=1578760283.089700&cid=C3TPCAQG1 ---- 2020-01-11 18:05:29 UTC - Ali Tariq: could you try `k8s:persistence:enabled:false`? maybe your persistent volume isn't letting pods write stuff down to storage. https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578765929094000?thread_ts=1578760283.089700&cid=C3TPCAQG1 ---- 2020-01-11 18:12:55 UTC - giusdp: I just deployed openwhisk with your mycluster file but it gave me the same problem. ```Error from server: Get <https://192.168.1.140:10250/containerLogs/openwhisk/ow-controller-0/controller>: dial tcp 192.168.1.140:10250: i/o timeout``` https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578766375094300?thread_ts=1578760283.089700&cid=C3TPCAQG1 ---- 2020-01-11 18:13:23 UTC - Ali Tariq: is the same pod still failing? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578766403094500?thread_ts=1578760283.089700&cid=C3TPCAQG1 ---- 2020-01-11 18:47:12 UTC - giusdp: Yes, no logs from any pod in the worker node are received. I wonder if it's a problem with the connection, like a port problem https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578768432094700?thread_ts=1578760283.089700&cid=C3TPCAQG1 ---- 2020-01-11 19:07:41 UTC - Ali Tariq: to me ... it seems more like a k8s cluster setup issue, maybe someone else could provide better insights. https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578769661096200?thread_ts=1578760283.089700&cid=C3TPCAQG1 ---- 2020-01-11 19:12:53 UTC - Ali Tariq: @Rodric Rabbah apologies for disturbing you, but could you provide any insights on this case? I am trying to push for a submission for which I ran some workload benchmarks on Openwhisk. Just want to make sure the above-mentioned behavior is not because of some misconfiguration! https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578769973096400?thread_ts=1578647722.045200&cid=C3TPCAQG1 ----
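The `dial tcp 192.168.1.140:10250: i/o timeout` above points at the kubelet: `kubectl logs` works by having the API server fetch container logs from the kubelet on the worker node (port 10250 by default), so a firewall or routing problem on that port breaks log access while leaving the rest of OpenWhisk running. A quick check from the master node, assuming `nc` is available; the IP is taken from the error message:

```bash
# Confirm the address the node reports, then probe the kubelet port directly.
kubectl get nodes -o wide
nc -vz 192.168.1.140 10250   # a timeout here confirms the kubelet port is unreachable
```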
2020-01-11 19:21:41 UTC - Rodric Rabbah: The scheduler tries to reuse containers, but favors autoscaling over reuse (because it doesn't track the hold time for a resource). How much time did you wait between the two 400-invoke batches? If they’re sufficiently close in time I would have expected greater reuse. https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578770501096600?thread_ts=1578647722.045200&cid=C3TPCAQG1 ---- 2020-01-11 19:25:57 UTC - Ali Tariq: right away! (10-20 seconds) - but in the `actionA` & `actionB` case, why does it start destroying `actionA` when it still has space in `containerMemoryPool`? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578770757096800?thread_ts=1578647722.045200&cid=C3TPCAQG1 ---- 2020-01-11 19:26:26 UTC - giusdp: Thanks for the help anyway! I'll look more into it https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578770786097000?thread_ts=1578760283.089700&cid=C3TPCAQG1 ---- 2020-01-11 19:28:14 UTC - Rodric Rabbah: the invokers don’t overcommit memory - so my explanation is that the max number of containers on an invoker was exceeded, so the garbage collector reclaimed the oldest containers https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578770894097400?thread_ts=1578647722.045200&cid=C3TPCAQG1 ---- 2020-01-11 19:28:27 UTC - Rodric Rabbah: a container while paused doesn't consume cpu but it will consume memory https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578770907097600?thread_ts=1578647722.045200&cid=C3TPCAQG1 ---- 2020-01-11 19:28:54 UTC - Rodric Rabbah: cpu is overcommitted but not memory https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578770934097800?thread_ts=1578647722.045200&cid=C3TPCAQG1 ---- 2020-01-11 19:29:42 UTC - Rodric Rabbah: it’s also possible in the actionA batches that the 600 pods aren’t all for actionA? the stem cell pool is replenished as the stem cells are used - are these actions of the stem cell kind? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578770982098000?thread_ts=1578647722.045200&cid=C3TPCAQG1 ---- 2020-01-11 19:30:38 UTC - Rodric Rabbah: happy to read the paper draft to understand the methodology better, then I may be able to offer more insights https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578771038098200?thread_ts=1578647722.045200&cid=C3TPCAQG1 ---- 2020-01-11 19:30:42 UTC - Ali Tariq: I created simple `blackbox` kind actions ... I don't know about `stem cells` https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578771042098400?thread_ts=1578647722.045200&cid=C3TPCAQG1 ---- 2020-01-11 19:30:55 UTC - Rodric Rabbah: so these are “docker” actions? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578771055098700?thread_ts=1578647722.045200&cid=C3TPCAQG1 ---- 2020-01-11 19:31:00 UTC - Ali Tariq: yes https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578771060098900?thread_ts=1578647722.045200&cid=C3TPCAQG1 ---- 2020-01-11 19:31:10 UTC - Rodric Rabbah: wow ok - did you pre-pull the images? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578771070099100?thread_ts=1578647722.045200&cid=C3TPCAQG1 ----
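On pre-pulling: one way to warm every node is to pull the blackbox image on each worker up front, so the first invocation per invoker doesn't pay the image-pull cost. A sketch assuming SSH access from the master to the worker nodes; the image name is a hypothetical placeholder for whatever the blackbox action really uses:

```bash
# Pull the blackbox action's image on every node ahead of the benchmark run.
IMAGE="your-registry/lambda1:latest"   # hypothetical - substitute the real image
for node in $(kubectl get nodes -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}'); do
  ssh "$node" docker pull "$IMAGE"
done
```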
2020-01-11 19:31:53 UTC - Rodric Rabbah: stem cells are containers that can run actions of a specific kind (node, python, etc) and are a system optimization to avoid creating new containers for a new action; they can save you 500ms or so by not running docker run https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578771113099300?thread_ts=1578647722.045200&cid=C3TPCAQG1 ---- 2020-01-11 19:32:09 UTC - Ali Tariq: I didn't - I thought it pulls on the first invocation and tries to reuse the pull on all consecutive invocations https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578771129099500?thread_ts=1578647722.045200&cid=C3TPCAQG1 ---- 2020-01-11 19:32:30 UTC - Rodric Rabbah: yes, per invoker https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578771150099800?thread_ts=1578647722.045200&cid=C3TPCAQG1 ---- 2020-01-11 19:34:36 UTC - Ali Tariq: so, I should pre-pull on all invokers? that's 20! also, after I ran the workload once ... every invoker has the pulled images. I waited for some time so that idle containers could be deallocated (`10 minutes`) - still the same behavior https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578771276100000?thread_ts=1578647722.045200&cid=C3TPCAQG1 ---- 2020-01-11 19:36:22 UTC - Rodric Rabbah: if you're repeating the experiments on the same nodes, then the images should be there - so disregard the first run, for example, and all the images will be resident https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578771382100200?thread_ts=1578647722.045200&cid=C3TPCAQG1 ---- 2020-01-11 19:36:30 UTC - Ali Tariq: ```my explanation is that the max number of containers on an invoker was exceeded so the garbage collector reclaimed the oldest containers``` on this point ... I thought actions are assigned in a round-robin fashion - how could one invoker get exceeded? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578771390100400?thread_ts=1578647722.045200&cid=C3TPCAQG1 ---- 2020-01-11 19:38:15 UTC - Ali Tariq: let me re-iterate, this only happens on kube-deploy ... the docker-compose deployment showed absolute bin-filling - and looking at the code in `ContainerPool.scala`, it's also written like absolute bin-filling https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578771495100600?thread_ts=1578647722.045200&cid=C3TPCAQG1 ---- 2020-01-11 19:38:21 UTC - Rodric Rabbah: actions are assigned to invokers based on indexing of the invoker pool using a co-prime hash algorithm (to reduce collisions/noisy neighbors), so they’re round robin from a sub-sequence of all invokers; when the high-water mark on each invoker is reached, requests are queued on a designated home invoker for that user https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578771501100800?thread_ts=1578647722.045200&cid=C3TPCAQG1 ---- 2020-01-11 19:39:32 UTC - Rodric Rabbah: I’m not sure what kube does differently - I was assuming you're not using the kube container factory https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578771572101000?thread_ts=1578647722.045200&cid=C3TPCAQG1 ---- 2020-01-11 19:39:37 UTC - Ali Tariq: but then again `docker-compose` was only a single invoker, so it might not be a fair comparison https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578771577101200?thread_ts=1578647722.045200&cid=C3TPCAQG1 ---- 2020-01-11 19:40:20 UTC - Rodric Rabbah: did you remove the blackbox container segregation?
by default blackbox actions are allocated only 10% of the invoker pool https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578771620101400?thread_ts=1578647722.045200&cid=C3TPCAQG1 ---- 2020-01-11 19:40:43 UTC - Ali Tariq: no, I'm using the `kubernetesContainerFactory`, although for the sake of completeness I did test with the `dockerContainerFactory` as well https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578771643101600?thread_ts=1578647722.045200&cid=C3TPCAQG1 ---- 2020-01-11 19:41:04 UTC - Ali Tariq: yes I did - changed to `100%` for blackbox +1 : Rodric Rabbah https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578771664101800?thread_ts=1578647722.045200&cid=C3TPCAQG1 ---- 2020-01-11 19:42:03 UTC - Ali Tariq: when I say `docker-compose` it's the `apache/openwhisk-devtools/docker-compose` repository https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578771723102100?thread_ts=1578647722.045200&cid=C3TPCAQG1 ---- 2020-01-11 19:42:28 UTC - Rodric Rabbah: right that’s just one invoker iirc https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578771748102400?thread_ts=1578647722.045200&cid=C3TPCAQG1 ---- 2020-01-11 19:42:39 UTC - Rodric Rabbah: so none of the scheduling heuristics kick in/apply https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578771759102600?thread_ts=1578647722.045200&cid=C3TPCAQG1 ---- 2020-01-11 19:43:20 UTC - Ali Tariq: okay ... that helps! https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578771800102800?thread_ts=1578647722.045200&cid=C3TPCAQG1 ---- 2020-01-11 19:43:35 UTC - Rodric Rabbah: where are you submitting? https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578771815103000?thread_ts=1578647722.045200&cid=C3TPCAQG1 ---- 2020-01-11 19:43:40 UTC - Ali Tariq: atc +1 : Rodric Rabbah https://openwhisk-team.slack.com/archives/C3TPCAQG1/p1578771820103200?thread_ts=1578647722.045200&cid=C3TPCAQG1 ----
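Since the KubernetesContainerFactory names action pods after the invoker that owns them (the `wskow-invoker-00-...` pods in the listing earlier in this log; the `wskow-` prefix comes from that deployment's release name), the spread of action containers across invokers - and therefore how round-robin versus bin-filling the placement really is - can be checked directly:

```bash
# Count action containers per invoker index; an even spread matches the
# co-prime round-robin placement described above, a heavy skew looks like
# bin-filling. Adjust the "wskow" prefix to the actual release name.
kubectl get pods -n openwhisk -o name \
  | grep wskow-invoker \
  | grep -oE 'invoker-[0-9]+' \
  | sort | uniq -c | sort -rn
```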