Won't this only benefit invocations that are mostly sleepy, i.e. I/O-bound? If an action uses CPU flat-out, then there is no throughput win to be had by increasing the parallelism of CPU-bound processes, given the small CPU sliver that each container gets -- unless there is a concomitant increase in the container's CPU slice?

If so, then my gut tells me that there are more general solutions to this (i.e. more efficient packing of I/O-bound processes).
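A back-of-the-envelope illustration of that point, using figures consistent with the loadtest runs quoted further down (roughly 250 ms per activation and 10 client connections are assumptions for illustration):

    I/O-bound, ~250 ms mostly spent waiting:
      serial (1 request per container)  ~  4 req/s
      10 overlapped in one container    ~ 40 req/s
    CPU-bound, ~250 ms of actual CPU per request:
      ~  4 req/s however many requests are in flight,
      unless the container's CPU share grows as well

So the concurrency win shows up precisely when an action spends most of its latency off-CPU.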
On Mon, May 1, 2017 at 5:36 PM, Tyson Norris <tnor...@adobe.com> wrote:

> Thanks Markus.
>
> Can you direct me to the travis job where I can see the 40+ RPS? I agree
> that is a big gap and would like to take a look - I didn't see anything in
> https://travis-ci.org/openwhisk/openwhisk/builds/226918375; maybe I'm
> looking in the wrong place.
>
> I will work on putting together a PR to discuss.
>
> Thanks
> Tyson
>
>
> On May 1, 2017, at 2:22 PM, Markus Thömmes <markusthoem...@me.com> wrote:
>
> Hi Tyson,
>
> Sounds like you did a lot of investigation here, thanks a lot for that :)
>
> Seeing the numbers, 4 RPS in the "off" case seems very odd. The Travis
> build that runs the current system as-is also reaches 40+ RPS, so we'd
> need to look at the mismatch here.
>
> Other than that, I'd indeed expect a great improvement in throughput from
> your work!
>
> Implementation-wise I don't have a strong opinion, but it might be worth
> discussing the details first and landing your implementation once all my
> staging is done (the open PRs). That'd ease git operations. If you want
> to discuss your implementation now, I suggest you send a PR to my
> new-containerpool branch and share the diff here for discussion.
>
> Cheers,
> Markus
>
> Sent from my iPhone
>
> On 01.05.2017 at 23:16, Tyson Norris <tnor...@adobe.com> wrote:
>
> Hi Michael -
> Concurrent requests would only reuse a running/warm container for
> same-action requests. So if an action has bad/rogue behavior, it will
> limit its own throughput only, not the throughput of other actions.
>
> This is ignoring the current implementation of the activation feed, which
> I guess is susceptible to a flood of slow-running activations. If those
> activations are for the same action, running them concurrently should be
> enough to keep the system from starving, so that activations for other
> (faster) actions can still be processed. In case they are all different
> actions, or not allowed to execute concurrently, then in the name of
> quality of service it may also be desirable to reserve some resources
> (i.e. separate activation feeds) for known-to-be-faster actions, so that
> fast-running actions are not penalized for existing alongside
> slow-running ones. This would require a more complicated throughput test
> to demonstrate.
>
> Thanks
> Tyson
>
> On May 1, 2017, at 1:13 PM, Michael Marth <mma...@adobe.com> wrote:
>
> Hi Tyson,
>
> 10x more throughput, i.e. being able to run OW at 1/10 of the cost -
> definitely worth looking into :)
>
> Like Rodric mentioned before, I figured some features might become more
> complex to implement, like billing, log collection, etc. But given such a
> huge advancement in throughput, that would be worth it IMHO.
> One thing I wonder about, though, is resilience against rogue actions. If
> an action is blocking (in the Node sense, not the OW sense), would that
> not block Node's event loop and thus block other actions in that
> container? One could argue, though, that this rogue action would only
> block other executions of itself, not affect other actions or customers.
> WDYT?
>
> Michael
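Michael's event-loop concern can be made concrete outside Node. A minimal Scala analogy, with a single-threaded executor standing in for the event loop (illustration only, not OpenWhisk code):

    import java.util.concurrent.Executors
    import scala.concurrent.{ExecutionContext, Future}

    object EventLoopAnalogy extends App {
      // A single-threaded executor behaves like Node's event loop: a task
      // that never yields starves everything queued behind it.
      implicit val loop: ExecutionContext =
        ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor())

      Future { while (true) {} }          // rogue activation: never yields
      Future { println("never printed") } // second activation, now starved

      Thread.sleep(1000)
      // Only work sharing this executor is starved; the main thread (a
      // stand-in for other actions' containers) is unaffected.
      println("other containers keep running")
      sys.exit(0)
    }

This matches the "only blocks other executions of itself" framing: the starvation is confined to whatever shares the rogue action's container.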
> On 01/05/17 17:54, "Tyson Norris" <tnor...@adobe.com> wrote:
>
> Hi All -
> I created this issue some time ago to discuss concurrent requests on
> actions: [1] Some people mentioned discussing it on the mailing list, so
> I wanted to start that discussion.
>
> I've been doing some testing against this branch with Markus's work on
> the new container pool: [2]
> I believe there are a few open PRs upstream related to this work, but
> this seemed like a reasonable place to test against a variety of the
> reactive invoker and pool changes - I'd be interested to hear if anyone
> disagrees.
>
> Recently I ran some tests:
> - with "throughput.sh" in [3] using a concurrency of 10 (it will also be
> interesting to test with the --rps option in loadtest...)
> - using a change that checks actions for a "max-concurrent" annotation
> (in case there is some reason actions want to enforce the current
> behavior of strictly serial invocation per container?)
> - when scheduling an action against the pool, if there is a currently
> "busy" container with this action, AND the annotation is present for
> this action, AND concurrent requests < max-concurrent, then this
> container is used to invoke the action (a rough sketch of this check
> follows below)
>
> Below is a summary (approx. 10x throughput with concurrent requests) and
> I would like to get some feedback on:
> - what are the cases for having actions that require container isolation
> per request? Node is a good example that should NOT need this, but maybe
> there are cases where it is more important, e.g. where stateful actions
> are used?
> - log collection approach: I have not attempted to resolve log
> collection issues. I would expect that revising the log sentinel marker
> to include the activation ID would help; logs stored with an activation
> would include interleaved activations in some cases (which should be
> expected with concurrent request processing?), and some different logic
> would be needed to process logs after an activation completes (e.g. logs
> emitted at the start of an activation may already have been collected as
> part of another activation's log collection, etc.)
> - advice on creating a PR to discuss this in more detail - should I wait
> for more of the container pooling changes to get to master, or submit a
> PR to Markus's new-containerpool branch?
>
> Thanks
> Tyson
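For concreteness, a rough sketch of the scheduling check described in the first list above; the names (Action, BusyContainer, reusableContainer) are hypothetical stand-ins, not the actual ContainerPool API:

    object ConcurrentScheduling {
      // Hypothetical shapes; the real pool tracks much more state.
      case class Action(name: String, annotations: Map[String, String])
      case class BusyContainer(actionName: String, activeRequests: Int)

      // The opt-in limit, read from the action's "max-concurrent" annotation.
      def maxConcurrent(action: Action): Option[Int] =
        action.annotations.get("max-concurrent").map(_.toInt)

      // Reuse a busy container only when the action opted in via the
      // annotation and that container is still below its declared limit;
      // otherwise fall back to the existing one-request-per-container path.
      def reusableContainer(action: Action,
                            busy: List[BusyContainer]): Option[BusyContainer] =
        maxConcurrent(action) match {
          case Some(limit) =>
            busy.find(c => c.actionName == action.name && c.activeRequests < limit)
          case None => None
        }
    }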
> Summary of loadtest report with max-concurrent ENABLED (I used 10000,
> but this limit wasn't reached):
> [Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Target URL: https://192.168.99.100/api/v1/namespaces/_/actions/noopThroughputConcurrent?blocking=true
> [Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Max requests:        10000
> [Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Concurrency level:   10
> [Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Agent:               keepalive
> [Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO
> [Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Completed requests:  10000
> [Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Total errors:        0
> [Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Total time:          241.900480915 s
> [Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Requests per second: 41
> [Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Mean latency:        241.7 ms
>
> Summary of loadtest report with max-concurrent DISABLED:
> [Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Target URL: https://192.168.99.100/api/v1/namespaces/_/actions/noopThroughput?blocking=true
> [Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Max requests:        10000
> [Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Concurrency level:   10
> [Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Agent:               keepalive
> [Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO
> [Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Completed requests:  10000
> [Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Total errors:        19
> [Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Total time:          2770.658048791 s
> [Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Requests per second: 4
> [Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Mean latency:        2767.3 ms
>
> [1] https://github.com/openwhisk/openwhisk/issues/2026
> [2] https://github.com/markusthoemmes/openwhisk/tree/new-containerpool
> [3] https://github.com/markusthoemmes/openwhisk-performance
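Side by side, this is the roughly 10x gap discussed above:

                            max-concurrent ON    max-concurrent OFF
    Completed requests      10000                10000
    Total errors            0                    19
    Total time              241.9 s              2770.7 s
    Requests per second     41                   4
    Mean latency            241.7 ms             2767.3 ms

(41 vs 4 in RPS; about 11.5x in total time.)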