Why not consider giving developers an option to control the level of concurrency, instead of deciding on their behalf? I think cases such as the ones Tyson mentions make sense; unless we build something that automatically estimates the resources an action needs, letting the developer specify them might be a means of "supervised learning" that the system can later use to make decisions at runtime.
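For concreteness: the prototype discussed further down this thread already reads a "max-concurrent" annotation on the action, so the developer-facing control could be as simple as setting that annotation at deploy time, e.g. wsk action update myProxy proxy.js -a max-concurrent 10 (myProxy and proxy.js are placeholder names; -a is the standard wsk flag for attaching annotations). The declared value could then serve both as a hard cap and as the kind of training signal suggested above.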
Dragos

On Mon, May 1, 2017 at 4:46 PM Tyson Norris <tnor...@adobe.com> wrote:

> Sure, many of our use cases are mostly stitching together API calls, as opposed to being CPU bound - consider a simple javascript action that wraps a downstream http API (or many APIs; a minimal sketch of such a proxy action appears below).
>
> What do you mean by "more efficient packing of I/O-bound processes"? For example, in the case of actions that wrap an API call, the action developer is typically NOT the owner of the API, so it is not clear how to handle this more efficiently than by creating a nodejs action that proxies (multiple concurrent) network requests around while doing little actual computing besides perhaps some minor request/response parsing. In our case we are much more likely to run into bottlenecks from concurrent users without any concurrent container usage support, unless we greatly over-provision clusters, which drastically reduces efficiency. It is much simpler to provision for anticipated or immediate load changes when each new container can support multiple concurrent requests, instead of each new container supporting a single request.
>
> More tests demonstrating these cases (e.g. API wrappers, and compute-centric actions) will help this discussion - I'll work on providing those.
>
> Thanks
> Tyson
>
> > On May 1, 2017, at 3:24 PM, Nick Mitchell <moose...@gmail.com> wrote:
> >
> > Won't this only be of benefit for invocations that are mostly sleepy, e.g. I/O-bound? Because if an action uses CPU flat-out, then there is no throughput win to be had by increasing the parallelism of CPU-bound processes, given the small CPU sliver that each container gets - unless there is a concomitant increase in each container's CPU slice?
> >
> > If so, then my gut tells me that there are more general solutions to this (i.e. more efficient packing of I/O-bound processes).
> >
> > On Mon, May 1, 2017 at 5:36 PM, Tyson Norris <tnor...@adobe.com> wrote:
> >
> >> Thanks Markus.
> >>
> >> Can you direct me to the travis job where I can see the 40+ RPS? I agree that is a big gap and would like to take a look - I didn't see anything in https://travis-ci.org/openwhisk/openwhisk/builds/226918375 ; maybe I'm looking in the wrong place.
> >>
> >> I will work on putting together a PR to discuss.
> >>
> >> Thanks
> >> Tyson
> >>
> >> On May 1, 2017, at 2:22 PM, Markus Thömmes <markusthoem...@me.com> wrote:
> >>
> >> Hi Tyson,
> >>
> >> Sounds like you did a lot of investigation here, thanks a lot for that :)
> >>
> >> Seeing the numbers, 4 RPS in the "off" case seems very odd. The Travis build that runs the current system as-is also reaches 40+ RPS, so we'd need to look at a mismatch here.
> >>
> >> Other than that, I'd indeed suspect a great improvement in throughput from your work!
> >>
> >> Implementation-wise I don't have a strong opinion, but it might be worth discussing the details first and landing your impl. once all my staging is done (the open PRs). That'd ease git operations. If you want to discuss your impl. now, I suggest you send a PR to my new-containerpool branch and share the diff here for discussion.
> >>
> >> Cheers,
> >> Markus
> >>
> >> Sent from my iPhone
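To ground Tyson's API-wrapper case, here is a minimal sketch of such a proxy action, using only Node's core https module (the "url" parameter is illustrative, not something defined in this thread):

  function main(params) {
    // Fetch one downstream API; the invocation spends essentially all of
    // its time waiting on the network rather than using the CPU.
    const https = require('https');
    return new Promise((resolve, reject) => {
      https.get(params.url, (res) => {
        let body = '';
        res.on('data', (chunk) => { body += chunk; });
        res.on('end', () => resolve({ status: res.statusCode, body: body }));
      }).on('error', (err) => reject({ error: err.message }));
    });
  }

While one invocation of this action is parked waiting for its response, the same Node process could serve other invocations - which is exactly the intra-container concurrency being proposed.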
> >> On 01.05.2017 at 23:16, Tyson Norris <tnor...@adobe.com> wrote:
> >>
> >> Hi Michael -
> >> Concurrent requests would only reuse a running/warm container for same-action requests. So if an action has bad/rogue behavior, it will limit only its own throughput, not the throughput of other actions.
> >>
> >> This ignores the current implementation of the activation feed, which I guess is susceptible to a flood of slow-running activations. If those activations are all for the same action, running them concurrently should be enough to keep them from starving the system, so that other activations (for faster actions) can still be processed. If they are all different actions, OR not allowed to execute concurrently, then in the name of quality of service it may also be desirable to reserve some resources (i.e. separate activation feeds) for known-to-be-faster actions, so that fast-running actions are not penalized for existing alongside slow-running ones. This would require a more complicated throughput test to demonstrate.
> >>
> >> Thanks
> >> Tyson
> >>
> >> On May 1, 2017, at 1:13 PM, Michael Marth <mma...@adobe.com> wrote:
> >>
> >> Hi Tyson,
> >>
> >> 10x more throughput, i.e. being able to run OW at 1/10 of the cost - definitely worth looking into :)
> >>
> >> Like Rodric mentioned before, I figured some features might become more complex to implement, like billing, log collection, etc. But given such a huge advancement in throughput, that would be worth it IMHO.
> >> One thing I wonder about, though, is resilience against rogue actions. If an action is blocking (in the Node sense, not the OW sense), would that not block Node's event loop and thus block other actions in that container? One could argue, though, that this rogue action would only block other executions of itself, not affect other actions or customers. WDYT?
> >>
> >> Michael
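Michael's "blocking in the Node sense" concern is easy to picture with a toy rogue action (the "ms" parameter is illustrative): a synchronous loop never yields to the event loop, so no other invocation multiplexed onto the same container can make progress until it returns:

  function main(params) {
    // Spin synchronously; nothing else in this Node process runs
    // until the loop exits.
    const end = Date.now() + (params.ms || 5000);
    while (Date.now() < end) { /* burn CPU */ }
    return { done: true };
  }

As Tyson notes, with same-action-only container reuse the damage is confined to other executions of this same action.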
> >> On 01/05/17 17:54, "Tyson Norris" <tnor...@adobe.com> wrote:
> >>
> >> Hi All -
> >> I created this issue some time ago to discuss concurrent requests on actions: [1] Some people mentioned discussing it on the mailing list, so I wanted to start that discussion.
> >>
> >> I've been doing some testing against this branch with Markus's work on the new container pool: [2]
> >> I believe there are a few open PRs upstream related to this work, but this seemed like a reasonable place to test against a variety of the reactive invoker and pool changes - I'd be interested to hear if anyone disagrees.
> >>
> >> Recently I ran some tests:
> >> - with "throughput.sh" in [3] using concurrency of 10 (it will also be interesting to test with the --rps option in loadtest...)
> >> - using a change that checks actions for an annotation "max-concurrent" (in case there is some reason actions want to enforce the current behavior of strictly serial invocation per container?)
> >> - when scheduling an action against the pool, if there is a currently "busy" container with this action, AND the annotation is present for this action, AND concurrent requests < max-concurrent, then this container is used to invoke the action (a sketch of this check follows below)
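That reuse check might look roughly like this (a sketch only - the real pool is Scala, and the names canReuse, actionName, and activeRequests are hypothetical):

  // Reuse a warm-but-busy container only for the same action, only when
  // the action opts in via the max-concurrent annotation, and only while
  // it still has headroom under that limit.
  function canReuse(container, action) {
    const limit = action.annotations['max-concurrent'];
    return container.actionName === action.name &&
           limit !== undefined &&
           container.activeRequests < limit;
  }

Containers that fail this check keep today's behavior of one activation at a time.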
> >> Below is a summary (approx 10x throughput with concurrent requests), and I would like to get some feedback on:
> >> - what are the cases for having actions that require container isolation per request? node is a good example that should NOT need this, but maybe there are cases where it is more important, e.g. where stateful actions are used?
> >> - log collection approach: I have not attempted to resolve log collection issues. I would expect that revising the log sentinel marker to include the activation ID would help; logs stored with an activation would include interleaved activations in some cases (which should be expected with concurrent request processing?), and some different logic would be needed to process logs after an activation completes (e.g. logs emitted at the start of an activation may already have been collected as part of another activation's log collection, etc.)
> >> - advice on creating a PR to discuss this in more detail - should I wait for more of the container pooling changes to get to master, or submit a PR to Markus's new-containerpool branch?
> >>
> >> Thanks
> >> Tyson
> >>
> >> Summary of loadtest report with max-concurrent ENABLED (I used 10000, but this limit wasn't reached):
> >> [Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Target URL:          https://192.168.99.100/api/v1/namespaces/_/actions/noopThroughputConcurrent?blocking=true
> >> [Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Max requests:        10000
> >> [Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Concurrency level:   10
> >> [Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Agent:               keepalive
> >> [Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO
> >> [Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Completed requests:  10000
> >> [Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Total errors:        0
> >> [Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Total time:          241.900480915 s
> >> [Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Requests per second: 41
> >> [Sat Apr 29 2017 16:32:37 GMT+0000 (UTC)] INFO Mean latency:        241.7 ms
> >>
> >> Summary of loadtest report with max-concurrent DISABLED:
> >> [Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Target URL:          https://192.168.99.100/api/v1/namespaces/_/actions/noopThroughput?blocking=true
> >> [Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Max requests:        10000
> >> [Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Concurrency level:   10
> >> [Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Agent:               keepalive
> >> [Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO
> >> [Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Completed requests:  10000
> >> [Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Total errors:        19
> >> [Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Total time:          2770.658048791 s
> >> [Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Requests per second: 4
> >> [Sat Apr 29 2017 19:21:51 GMT+0000 (UTC)] INFO Mean latency:        2767.3 ms
> >>
> >> [1] https://github.com/openwhisk/openwhisk/issues/2026
> >> [2] https://github.com/markusthoemmes/openwhisk/tree/new-containerpool
> >> [3] https://github.com/markusthoemmes/openwhisk-performance
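Reading the two summaries together, the "approx 10x" claim holds up: 10000 requests / 241.9 s ≈ 41 RPS with the annotation enabled versus 10000 / 2770.7 s ≈ 3.6 RPS without, roughly an 11x difference. The mean latencies tell the same story: with a fixed concurrency of 10, Little's law gives RPS ≈ 10 / mean latency, i.e. 10 / 0.2417 s ≈ 41 and 10 / 2.7673 s ≈ 3.6.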