Michael, +1 to how you summarized the problem.

> I’d suggest that the first step is to support “multiple heterogeneous
resource pools”

I'd like to reinforce Stephen's idea of "multiple resource pools". We've
already used this idea successfully in production in other setups with
Mesos, isolating stateful Spark workloads from stateless workloads and
from GPU workloads. The idea would be a perfect fit for OpenWhisk. It can
also be extended beyond the invokers, to other cluster managers like Mesos
and Kube.
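
To make the idea concrete, here is a rough sketch (in Scala, since that is
what OpenWhisk is written in) of what a pluggable pool abstraction could
look like. All of these names (Flavor, ResourcePool, PoolRouter) are
hypothetical illustrations, not existing OpenWhisk APIs:

    // A "flavor" tags what kind of work a pool of invokers can serve.
    sealed trait Flavor
    case object Stateless extends Flavor
    case object LongRunning extends Flavor
    case object Gpu extends Flavor
    case object ConcurrentRuntime extends Flavor

    case class InvokerRef(id: Int)
    case class ActionMetadata(name: String, requiredFlavors: Set[Flavor])

    // A resource pool is a set of invokers behind one load balancer.
    trait ResourcePool {
      def flavors: Set[Flavor]
      def schedule(action: ActionMetadata): Option[InvokerRef]
    }

    // Route each action to the first pool that supports all the flavors
    // it needs; a new pool type is just another ResourcePool plugged in,
    // with no impact on how the other pools schedule their work.
    class PoolRouter(pools: Seq[ResourcePool]) {
      def route(action: ActionMetadata): Option[InvokerRef] =
        pools.find(p => action.requiredFlavors.subsetOf(p.flavors))
          .flatMap(_.schedule(action))
    }

With something like this, a GPU pool or a concurrent-execution pool is
just one more ResourcePool registered with the router, which is exactly
the "experiment with new flavors without debating the other flavors"
property Stephen describes.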


On Tue, Jul 4, 2017 at 7:05 AM Stephen Fink <fink.step...@gmail.com> wrote:

> Hi all,
>
> I’ve been lurking a bit on this thread, but haven’t had time to fully
> digest all the issues.
>
> I’d suggest that the first step is to support “multiple heterogeneous
> resource pools”, where a resource pool is a set of invokers managed by a
> load balancer. There are lots of reasons we may want to support invokers
> with different flavors: long-running actions, invokers in a VPN, invokers
> with GPUs, invokers with big memory, invokers which support concurrent
> execution, etc. If we had a general way to plug in a new resource pool,
> folks could feel free to experiment with any new flavors they like
> without having to debate the implications on other flavors.
>
> I tend to doubt that there is a “one size fits all” solution here, so I’d
> suggest we bite the bullet and engineer for heterogeneity.
>
> SJF
>
>
> > On Jul 4, 2017, at 9:55 AM, Michael Marth <mma...@adobe.com.INVALID>
> wrote:
> >
> > Hi Jeremias, all,
> >
> > Tyson and Dragos are travelling this week, so I don’t know when they
> will be able to respond. I have worked with them on this topic, so let me
> jump in and comment until they are able to reply.
> >
> > From my POV having a call like you suggest is a really good idea. Let’s
> wait for Tyson & Dragos to chime in to find a date.
> >
> > As you mention, the discussion so far has jumped across different
> topics, especially the use case, the problem to be solved, and the
> proposed solution. In preparation for the call I think we can clarify the
> use case and the problem on the list. Here’s my view:
> >
> > Use Case
> >
> > For us the use case can be summarised as “dynamic, high-performance
> websites/mobile apps”. This implies:
> > 1. High concurrency, i.e. many requests coming in at the same time.
> > 2. The code to be executed is the same across these different requests
> (as opposed to a long-tail distribution of many different actions being
> executed concurrently). In our case “many” means “hundreds” or a few
> thousand.
> > 3. Latency (time to start execution) matters, because human users are
> waiting for the response. Ideally, at these orders of magnitude of
> concurrent requests the latency should not change much.
> >
> > All three requirements need to be satisfied for this use case.
> > In the discussion so far it was mentioned that there are other use
> cases which might have similar requirements. That’s great and I obviously
> do not want to rule them out. The above is just to make clear where we
> are coming from.
> >
> > At this point I would like to mention that it is my understanding that
> this use case is within OpenWhisk’s strike zone, i.e. something that we
> all think is reasonable to support. Please speak up if you disagree.
> >
> > The Problem
> >
> > One can look at the problem in two ways:
> > Either you keep the resources of the OW system constant (i.e. no
> scaling). In that case latency increases very quickly, as demonstrated by
> Tyson’s tests.
> > Or you increase the system’s capacity. In that case the number of
> machines needed to satisfy this use case quickly becomes prohibitively
> expensive for the OW operator, where expensive means “compared to
> traditional web servers” (in our case a standard Node.js server).
> Meaning, you need 100-1000 concurrent action containers to serve what can
> be served by 1 or 2 Node.js containers.
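
A back-of-the-envelope illustration of that gap, with made-up numbers
rather than figures from Tyson's tests: by Little's law, the number of
in-flight requests is the arrival rate times the per-request latency, and
with one activation per container that is also the number of warm
containers required.

    in-flight requests = arrival rate x per-request latency
                       = 5,000 req/s  x 0.2 s  =  1,000
    one activation per container  ->  ~1,000 warm containers
    event-loop multiplexing       ->  1-2 Node.js containers

A Node.js server multiplexes those same 1,000 in-flight requests on one or
two event loops, which is where the 100-1000x cost gap comes from.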
> >
> > Of course, the proposed solution is not a fundamental “fix” for the
> above. It would only move the needle by ~2 orders of magnitude, so that
> the current problem would no longer be a problem in practice (and would
> simply remain a theoretical one). For me that would be good enough.
> >
> > The solution approach
> >
> > I would not like to comment on the proposed solution’s details (and
> will leave that to Dragos and Tyson). However, it was mentioned that the
> approach would change the programming model for users:
> > Our mindset and approach was that we explicitly do not want to change
> how OpenWhisk exposes itself to users. Meaning, users should still be
> able to use NPMs, etc., i.e. this would be an internal implementation
> detail that is not visible to users. (We could make things more explicit
> and e.g. have users request a special concurrent runtime if we wish to do
> so; so far we have tried to keep it transparent to users, though.)
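
For readers new to the thread, the core of the proposal is dispatching
several activations into one warm container instead of one at a time. A
toy Scala sketch of the difference, purely illustrative and not code from
the actual proposal:

    import scala.concurrent.{ExecutionContext, Future}

    // run: the user's action function, e.g. a Node.js handler behind HTTP.
    class ActionProxy(run: Map[String, String] => String)
                     (implicit ec: ExecutionContext) {

      // Today: a container serves one activation at a time, so N
      // concurrent requests for the same action need N warm containers.
      def invokeSerial(params: Map[String, String]): String =
        synchronized { run(params) }

      // Proposed: overlap activations inside the same container, so
      // hundreds of requests share one warm runtime. From the outside the
      // action looks exactly the same, matching the point above that this
      // stays an internal implementation detail.
      def invokeConcurrent(params: Map[String, String]): Future[String] =
        Future(run(params))
    }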
> >
> > Many thanks
> > Michael
> >
> >
> >
> > On 03/07/17 14:48, "Jeremias Werner" <jeremias.wer...@gmail.com<mailto:
> jeremias.wer...@gmail.com>> wrote:
> >
> > Hi
> >
> > Thanks for the write-up and the proposal. I think this is a nice idea
> > and sounds like a promising way of increasing throughput. Reading
> > through the thread, it feels like different topics/problems are mixed
> > up and the discussion is already becoming very complex.
> >
> > Therefore I would like to suggest that we streamline the discussion a
> > bit, maybe in a zoom.us session where we first give Tyson and Dragos
> > the chance to walk through the proposal and answer questions from the
> > audience. Once we are all on the same page we could discuss the
> > benefits (improved throughput, latency) vs. the challenges (resource
> > sharing, crash model, container lifetime, programming model) of the
> > core of the proposal: running multiple activations in a single user
> > container. Once we have a common understanding on that part we could
> > step up in the architecture and discuss what's needed in higher
> > components like the invoker/load balancer to get this integrated.
> >
> > (I said zoom.us session since I liked the one we had a few weeks ago.
> > It was efficient and interactive. If you like, I could volunteer to set
> > up the session and/or write the script/summary.)
> >
> > What do you think?
> >
> > Many thanks in advance!
> >
> > Jeremias
> >
> >
> > On Sun, Jul 2, 2017 at 5:43 PM, Rodric Rabbah <rod...@gmail.com<mailto:
> rod...@gmail.com>> wrote:
> >
> > With “event driven” you're discounting all the use cases that are
> > still latency sensitive because they complete a response by callback or
> > actuation at completion. IoT, chatbots, and notifications are all
> > examples, in addition to UI, that are latency sensitive, and having
> > uniform expectations on queuing time is of value.
> >
> > -r
> >
>
>
