Yes, exactly.

Sorry if my poor English bothers you :( I'll try my best to correct the
text. I don't have a precise model in mind, just sharing some thoughts
that might be helpful:

As you said before, there are some preconditions for the scheduling
decision: unbounded vs. bounded system, fair vs. unfair scheduling, etc.

For an unbounded system, providers may not care that much about the
over-estimation problem. A resource-bounded system, on the contrary, cares
about overall throughput and keeping bounded resource utilization stable,
which potentially leads to fair scheduling decisions of the form "pay more
penalty as you consume more". Therefore, the following mechanism is based
on the assumption of a bounded system.

I once read an academic paper quite relevant to this, but I forget some of
the details and will re-read it later. The basic idea is to split the
queue into a warm queue and a cold-start queue, and to add a delay
(penalty) on pulling from the cold-start queue. In the context of OW (a
rough code sketch follows the list):

1. ContainerRouters duplicate each activation (reference) and enqueue it
into both the warm queue and the cold-start queue.
2. A ContainerRouter pulls an activation (reference) from the warm queue
once a container becomes available again, and drops the corresponding
activation (reference) from the cold-start queue.
3. The ContainerManager pulls activations (references) from the cold-start
queue as creation requests, with an "incremental delay".
4. Continuing (3), the ContainerManager does not pull activations
(references) from the warm queue. If an activation (reference) is stolen
by a ContainerRouter while its container is still being created, an
"over-estimate" has occurred.
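
To make the dual-queue idea above concrete, here is a minimal,
hypothetical sketch in Scala. All names (DualQueue, ActivationRef,
pullWarm, pullCold) are made up for illustration; the real
ContainerRouter/ContainerManager interfaces are still to be designed:

import java.util.concurrent.ConcurrentLinkedQueue
import scala.collection.concurrent.TrieMap

// An activation reference; `id` stands in for the real activation metadata.
final case class ActivationRef(id: String)

class DualQueue {
  private val warm = new ConcurrentLinkedQueue[ActivationRef]()
  private val cold = new ConcurrentLinkedQueue[ActivationRef]()
  // Activations already served warm; their cold-start copies must be dropped.
  private val servedWarm = TrieMap.empty[String, Unit]

  // Step 1: a ContainerRouter duplicates the activation into both queues.
  def enqueue(a: ActivationRef): Unit = { warm.add(a); cold.add(a) }

  // Step 2: a ContainerRouter with a free container pulls from the warm
  // queue and marks the activation so its cold-start copy is dropped later.
  def pullWarm(): Option[ActivationRef] =
    Option(warm.poll()).map { a => servedWarm.put(a.id, ()); a }

  // Step 3: the ContainerManager pulls a creation request from the
  // cold-start queue (after its incremental delay has elapsed), skipping
  // copies whose warm counterpart was already served.
  def pullCold(): Option[ActivationRef] = {
    var next = Option(cold.poll())
    while (next.exists(a => servedWarm.remove(a.id).isDefined)) {
      next = Option(cold.poll())
    }
    next
  }
}

Detecting the over-estimate of step 4 (an activation stolen from the warm
queue while its container is still being created) would additionally
require remembering which cold-start pulls are in flight; I left that out
to keep the sketch short.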

I believe scheduling in the serverless model should take into account how
many queued events are associated with the available resource slots, and
even how many slots we have already allocated for a specific action,
namespace, etc. in a bounded system. The critical point is how we set the
incremental delay, but I think the ContainerManager potentially has enough
information to make a smarter decision based on these metrics. In
addition, since this is not on the critical path, we can afford slightly
higher latency here in exchange for better system throughput.

E.g., an intuitive approach: *NextPollDelay = DelayFactor *
IncrementalFactor * CurrentAllocatedSlotsRatio * NumOfOverEstimate /
QueuedEvents*

We could also let users configure the delay factor, e.g. setting it to 0
for zero poll delay in a system that doesn't really care about this (so
that we can have a unified model for either a bounded or an unbounded
system), or customizing how much penalty they would like to pay when a
burst occurs.
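
As a hedged illustration (not a definitive implementation), the
ContainerManager might compute that delay like this; all parameter and
field names are hypothetical and simply mirror the formula above:

import scala.concurrent.duration._

// Hypothetical inputs, mirroring the factors in the formula above.
final case class SchedulingStats(
  allocatedSlots: Int, // slots already allocated for this action/namespace
  totalSlots: Int,     // total slots in the bounded system
  overEstimates: Int,  // over-estimates observed so far (step 4 above)
  queuedEvents: Int    // events waiting in the cold-start queue
)

// NextPollDelay = DelayFactor * IncrementalFactor
//   * CurrentAllocatedSlotsRatio * NumOfOverEstimate / QueuedEvents
def nextPollDelay(stats: SchedulingStats,
                  delayFactor: Double,      // user-configured; 0 = no delay
                  incrementalFactor: Double // grows with consecutive polls
                 ): FiniteDuration =
  if (delayFactor == 0 || stats.queuedEvents == 0) Duration.Zero
  else {
    val slotsRatio = stats.allocatedSlots.toDouble / stats.totalSlots
    val millis = delayFactor * incrementalFactor * slotsRatio *
      stats.overEstimates / stats.queuedEvents
    millis.millis
  }

With delayFactor = 0 the cold-start queue degenerates to today's eager
container creation, which is what gives us the unified model for bounded
and unbounded systems.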

This is quite straightforward and I think it may still have plenty of
problems, e.g.: 1. serverless workloads have uncertain elapsed times; 2.
the latency the OW system needs to acquire this information and make a
decision; 3. message queue operation latency; 4. it gets more complex once
the pre-warmed model and priorities join the scheduling; 5. will this
break the serverless pricing model? 6. ...

I think this doesn't significantly change the big picture of the future
architecture and should not stop us from going forward now. If folks take
more interest in the over-estimation problem, we can work out a proper
solution once we have more details on the future architecture, since
throttling already helps us avoid this situation.

Thanks!

On Mon, Aug 20, 2018 at 4:03 PM Markus Thömmes <markusthoem...@apache.org>
wrote:

> Am So., 19. Aug. 2018 um 18:59 Uhr schrieb TzuChiao Yeh <
> su3g4284zo...@gmail.com>:
>
> > On Sun, Aug 19, 2018 at 7:13 PM Markus Thömmes <
> markusthoem...@apache.org>
> > wrote:
> >
> > > Hi Tzu-Chiao,
> > >
> > > Am Sa., 18. Aug. 2018 um 06:56 Uhr schrieb TzuChiao Yeh <
> > > su3g4284zo...@gmail.com>:
> > >
> > > > Hi Markus,
> > > >
> > > > Nice thoughts on separating the logic in this revision! I'm not
> > > > sure if this question has already been clarified; sorry if it's a
> > > > duplicate.
> > > >
> > > > Same question on cluster singleton:
> > > >
> > > > I think there will be two possibilities for container deletion: 1.
> > > > the ContainerRouter removes it (on error or idle state); 2. the
> > > > ContainerManager decides to remove it (i.e. to clear space for a
> > > > new creation).
> > > >
> > > > For case 2, how do we ensure safe deletion in the ContainerManager?
> > > > If there's still a similar busy/free/prewarmed pool model, it might
> > > > require additional state transitions for containers from busy to
> > > > free state; then we can safely remove a container, or reject the
> > > > request if nothing is found (system overloaded).
> > > >
> > > > Via the paused state or other states/messages? There might be some
> > > > trade-offs on granularity (the time-slice in scheduling) and a
> > > > performance bottleneck on the ClusterSingleton.
> > > >
> >
> > > I'm not sure if I quite got the point, but here's an attempt at an
> > > explanation:
> > >
> > > Yes, Container removal in case 2 is triggered from the
> ContainerManager.
> > To
> > > be able to safely remove it, it requests all ContainerRouters owning
> that
> > > container to stop serving it and hand it back. Once it's been handed
> > back,
> > > the ContainerManager can safely delete it. The contract should also
> say:
> > A
> > > container must be handed back in unpaused state, so it can be deleted
> > > safely. Since the ContainerRouters handle pause/unpause, they'll need
> to
> > > stop serving the container, unpause it, remove it from their state and
> > > acknowledge to the ContainerManager that they handed it back.
> > >
> >
> > Thank you, it's clear to me.
> >
> >
> > > There is an open question on when to consider a system to be in
> overflow
> > > state, or rather: How to handle the edge-situation. If you cannot
> > generate
> > > more containers, we need to decide whether we remove another container
> > (the
> > > case you're describing) or if we call it quits and say "503,
> overloaded,
> > go
> > > away for now". The logic deciding this is up for discussion as well.
> The
> > > heuristic could take into account how many resources in the whole
> system
> > > you already own, how many resources do others own and if we want to
> > decide
> > > to share those fairly or not-fairly. Note that this is also very much
> > > related to being able to scale the resources up in themselves (to be
> able
> > > to generate new containers). If we assume a bounded system though, yes,
> > > we'll need to find a strategy on how to handle this case. I believe
> with
> > > the state the ContainerManager has, it can provide a more eloquent
> answer
> > > to that question than what we can do today (nothing really, we just
> keep
> > on
> > > churning through containers).
> > >
> >
> > I agree. An additional problem is that in the case of burst requests,
> > the ContainerManager will "over-estimate" the container allocation,
> > whether work-stealing between ContainerRouters is enabled or not. For a
> > bounded system, we had better handle this carefully to avoid frequent
> > creation/deletion. I'm wondering whether sharing a message queue with
> > the ContainerManager (since it's not on the critical path) or some
> > mechanism for checking queue size (i.e. checking Kafka lag) could
> > possibly eliminate this? However, this may only happen for
> > short-running tasks, and throttling is already helpful there.
> >
>
> Are you saying: It will over-estimate container allocation because it will
> create a container for each request as they arrive if there are no
> containers around currently and the actual number of containers needed
> might be lower for very short running use-cases where requests arrive in
> short bursts?
>
> If so: I agree, I don't see how any system can possibly solve this without
> taking the estimated runtime of each request into account though. Can you
> elaborate on your thoughts on checking queue-size etc.?
>
>
> >
> >
> > > Does that answer the question?
> >
> >
> > > >
> > > > Thanks!
> > > >
> > > > Tzu-Chiao
> > > >
> > > > On Sat, Aug 18, 2018 at 5:55 AM Tyson Norris
> <tnor...@adobe.com.invalid
> > >
> > > > wrote:
> > > >
> > > > > Ugh my reply formatting got removed!!! Trying this again with some
> >>
> > > > >
> > > > > On Aug 17, 2018, at 2:45 PM, Tyson Norris
> > > > > <tnor...@adobe.com.INVALID> wrote:
> > > > >
> > > > >
> > > > > If the failover of the singleton is too long (I think it will be
> > based
> > > on
> > > > > cluster size, oldest node becomes the singleton host iirc), I think
> > we
> > > > need
> > > > > to consider how containers can launch in the meantime. A first step
> > > might
> > > > > be to test out the singleton behavior in the cluster of various
> > sizes.
> > > > >
> > > > >
> > > > > I agree this bit of design is crucial, a few thoughts:
> > > > > Pre-warm wouldn't help here, the ContainerRouters only know warm
> > > > > containers. Pre-warming is managed by the ContainerManager.
> > > > >
> > > > >
> > > > > >> Ah right
> > > > >
> > > > >
> > > > >
> > > > > Considering a fail-over scenario: We could consider sharing the
> state
> > > via
> > > > > EventSourcing. That is: All state lives inside of frequently
> > > snapshotted
> > > > > events and thus can be shared between multiple instances of the
> > > > > ContainerManager seamlessly. Alternatively, we could also think
> about
> > > > only
> > > > > working on persisted state. That way, a cold-standby model could
> fly.
> > > We
> > > > > should make sure that the state is not "slightly stale" but rather
> > both
> > > > > instances see the same state at any point in time. I believe on
> that
> > > > > cold-path of generating new containers, we can live with the
> > > > extra-latency
> > > > > of persisting what we're doing as the path will still be dominated
> by
> > > the
> > > > > container creation latency.
> > > > >
> > > > >
> > > > >
> > > > > >> Wasn’t clear if you mean not using ClusterSingleton? To be clear
> > in
> > > > > ClusterSingleton case there are 2 issues:
> > > > > - time it takes for akka ClusterSingletonManager to realize it
> needs
> > to
> > > > > start a new actor
> > > > > - time it takes for the new actor to assume a usable state
> > > > >
> > > > > EventSourcing (or ext persistence) may help with the latter, but we
> > > will
> > > > > need to be sure the former is tolerable to start with.
> > > > > Here is an example test from akka source that may be useful
> > (multi-jvm,
> > > > > but all local):
> > > > >
> > > > >
> > > > > https://github.com/akka/akka/blob/009214ae07708e8144a279e71d06c4a504907e31/akka-cluster-tools/src/multi-jvm/scala/akka/cluster/singleton/ClusterSingletonManagerChaosSpec.scala
> > > > >
> > > > > Some things to consider, that I don’t know details of:
> > > > > - will the size of cluster affect the singleton behavior in case of
> > > > > failure? (I think so, but not sure, and what extent); in the simple
> > > test
> > > > > above it takes ~6s for the replacement singleton to begin startup,
> > but
> > > if
> > > > > we have 100s of nodes, I’m not sure how much time it will take. (I
> > > don’t
> > > > > think this should be hard to test, but I haven’t done it)
> > > > > - in case of hard crash, what is the singleton behavior? In
> graceful
> > > jvm
> > > > > termination, I know the cluster behavior is good, but there is
> always
> > > > this
> > > > > question about how downing nodes will be handled. If this critical
> > > piece
> > > > of
> > > > > the system relies on akka cluster functionality, we will need to
> make
> > > > sure
> > > > > that the singleton can be reconstituted, both in case of graceful
> > > > > termination (restart/deployment events) and non-graceful
> termination
> > > > (hard
> > > > > vm crash, hard container crash) . This is ignoring more complicated
> > > cases
> > > > > of extended network partitions, which will also have bad effects on
> > > many
> > > > of
> > > > > the downstream systems.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > Handover time as you say is crucial, but I'd say as it only impacts
> > > > > container creation, we could live with, let's say, 5 seconds of
> > > > > failover-downtime on this path? What's your experience been on
> > > singleton
> > > > > failover? How long did it take?
> > > > >
> > > > >
> > > > >
> > > > > >> Seconds in the simplest case, so I think we need to test it in a
> > > > scaled
> > > > > case (100s of cluster nodes), as well as the hard crash case (where
> > not
> > > > > downing the node may affect the cluster state).
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Aug 16, 2018, at 11:01 AM, Tyson Norris
> > > > > <tnor...@adobe.com.INVALID> wrote:
> > > > >
> > > > > A couple comments on singleton:
> > > > > - use of cluster singleton will introduce a new single point of
> > failure
> > > > > - from time of singleton node failure, to single resurrection on a
> > > > > different instance, will be an outage from the point of view of any
> > > > > ContainerRouter that does not already have a warm+free container to
> > > > service
> > > > > an activation
> > > > > - resurrecting the singleton will require transferring or
> rebuilding
> > > the
> > > > > state when recovery occurs - in my experience this was tricky, and
> > > > requires
> > > > > replicating the data (which will be slightly stale, but better than
> > > > > rebuilding from nothing); I don’t recall the handover delay (to
> > > transfer
> > > > > singleton to a new akka cluster node) when I tried last, but I
> think
> > it
> > > > was
> > > > > not as fast as I hoped it would be.
> > > > >
> > > > > I don’t have a great suggestion for the singleton failure case, but
> > > > > would like to consider this carefully, and discuss the
> ramifications
> > > > (which
> > > > > may or may not be tolerable) before pursuing this particular aspect
> > of
> > > > the
> > > > > design.
> > > > >
> > > > >
> > > > > On prioritization:
> > > > > - if concurrency is enabled for an action, this is another
> > > > > prioritization aspect, of sorts - if the action supports
> concurrency,
> > > > there
> > > > > is no reason (except for destruction coordination…) that it cannot
> be
> > > > > shared across shards. This could be added later, but may be worth
> > > > > considering since there is a general reuse problem where a series
> of
> > > > > activations that arrives at different ContainerRouters will create
> a
> > > new
> > > > > container in each, while they could be reused (and avoid creating
> new
> > > > > containers) if concurrency is tolerated in that container. This
> would
> > > > only
> > > > > (ha ha) require changing how container destroy works, where it
> cannot
> > > be
> > > > > destroyed until the last ContainerRouter is done with it. And if
> > > > container
> > > > > destruction is coordinated in this way to increase reuse, it would
> > also
> > > > be
> > > > > good to coordinate construction (don’t concurrently construct the
> > same
> > > > > container for multiple containerRouters IFF a single container
> would
> > > > enable
> > > > > concurrent activations once it is created). I’m not sure if others
> > are
> > > > > desiring this level of container reuse, but if so, it would be
> worth
> > > > > considering these aspects (sharding/isolation vs
> > sharing/coordination)
> > > as
> > > > > part of any redesign.
> > > > >
> > > > >
> > > > > Yes, I can see where you're heading here. I think this can be
> > > > generalized:
> > > > >
> > > > > Assume intra-container concurrency C and number of ContainerRouters
> > R.
> > > > > If C > R: Shard the "slots" on this container evenly across R. The
> > > > > container can only be destroyed after you receive R
> acknowledgements
> > of
> > > > > doing so.
> > > > > If C < R: Hand out 1 slot to C Routers, point the remaining Routers
> > to
> > > > the
> > > > > ones that got slots.
> > > > >
> > > > >
> > > > >
> > > > > >>Yes, mostly - I think there is also a case where destruction
> > message
> > > is
> > > > > revoked by the same router (receiving a new activation for the
> > > container
> > > > > which it previously requested destruction of). But I think this is
> > > > covered
> > > > > in the details of tracking “after you receive R acks of
> destructions”
> > > > >
> > > > >
> > > > >
> > > > > Concurrent creation: Batch creation requests while one container is
> > > being
> > > > > created. Say you received a request for a new container that has C
> > > slots.
> > > > > If there are more requests for that container arriving while it is
> > > being
> > > > > created, don't act on them and fold the creation into the first
> one.
> > > Only
> > > > > start creating a new container if the number of resource requests
> > > exceed
> > > > C.
> > > > >
> > > > > Does that make sense? I think in that model you can set C=1 and it
> > > works
> > > > as
> > > > > I envisioned it to work, or set it to C=200 and things will be
> shared
> > > > even
> > > > > across routers.
> > > > >
> > > > >
> > > > > >> Side note: One detail about the pending concurrency impl today
> is
> > > that
> > > > > due to the async nature of tracking the active activations within
> the
> > > > > container, there is no guarantee (when C>1) that the number is
> exact,
> > > so
> > > > if
> > > > > you specify C=200, you may actually get a different container at
> 195
> > or
> > > > > 205. This is not really related to this discussion, but is based on
> > the
> > > > > current messaging/future behavior in ContainerPool/ContainerProxy,
> so
> > > > > wanted to mention it explicitly, in case it matters to anyone.
> > > > >
> > > > > Thanks
> > > > > Tyson
> > > > >
> > > > >
> > > >
> > > > --
> > > > Tzu-Chiao Yeh (@tz70s)
> > > >
> > >
> >
> >
> > --
> > Tzu-Chiao Yeh (@tz70s)
> >
>


-- 
Tzu-Chiao Yeh (@tz70s)
