Hi Tyson,

On Wed, Aug 22, 2018 at 22:49, Tyson Norris
<tnor...@adobe.com.invalid> wrote:

> Yes, agreed this makes sense, same as Carlos is saying.
>
> Let's ignore async for now, I think that one is simpler - does "A
> blocking request can still be put onto the work-stealing queue" mean that
> it wouldn't always be put on the queue?
>
> If there is existing warm container capacity in the ContainerRouter
> receiving the activation, ideally it would skip the queue - right?
>

Exactly, it should skip the queue whenever possible.
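Roughly sketched in Scala (all names here, like localWarmPool or
enqueueForStealing, are made up for illustration and are not actual interfaces
of the proposal), the decision is: use local warm capacity if it exists,
otherwise request more resources and make the work stealable:

case class ActionId(name: String)
case class Activation(action: ActionId, payload: Array[Byte])

trait Container {
  def hasFreeSlot: Boolean
  def invoke(activation: Activation): Unit
}

class ContainerRouter(localWarmPool: ActionId => List[Container],
                      requestMoreContainers: ActionId => Unit,
                      enqueueForStealing: Activation => Unit) {

  def route(activation: Activation): Unit =
    localWarmPool(activation.action).find(_.hasFreeSlot) match {
      case Some(container) =>
        // Warm capacity in this Router: skip the queue entirely.
        container.invoke(activation)
      case None =>
        // No free slot here: ask the ContainerManager for more resources
        // and make the work stealable by other Routers.
        requestMoreContainers(activation.action)
        enqueueForStealing(activation)
    }
}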


>
> When exactly is the case that a ContainerRouter should put a blocking
> activation to a queue for stealing? Since a) it is not spawning containers
> and b) it is not parsing request/response bodies, can we say this would
> only happen when a ContainerRouter maxes out its incoming request handling?
>

That's exactly the idea! The work-stealing queue will only be used if the
Router where the request landed cannot serve the demand right now. For
example, if it has maxed out the slots it has for a certain action (all
containers are working at full capacity), it requests more resources and
puts a request-token on the work-stealing queue.

That request-token will then be taken by any Router that has free capacity
for that action (note: this is not simple with Kafka, but might be simpler
with other MQ technologies). Since new resources have been requested, it is
guaranteed that one Router will eventually become free.
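To make that concrete, here is a very rough sketch of such a queue (an
in-memory stand-in; RequestToken and the method names are assumptions for
illustration, and the per-action matching in steal() is exactly the part that
is hard to map onto Kafka):

import java.util.concurrent.ConcurrentLinkedQueue

case class ActionId(name: String)

// The token carries no payload, only enough information for the stealing
// Router to fetch the actual request from the Router where it first landed.
case class RequestToken(action: ActionId, owningRouter: String, requestId: String)

class WorkStealingQueue {
  private val tokens = new ConcurrentLinkedQueue[RequestToken]()

  def offer(token: RequestToken): Unit = tokens.add(token)

  // Called by a Router that has a free slot; it only takes tokens for
  // actions it actually has capacity for. (A real implementation would need
  // an atomic claim so two Routers cannot steal the same token.)
  def steal(hasCapacityFor: ActionId => Boolean): Option[RequestToken] = {
    val it = tokens.iterator()
    while (it.hasNext) {
      val token = it.next()
      if (hasCapacityFor(token.action)) {
        it.remove()
        return Some(token)
      }
    }
    None
  }
}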


>
> If ContainerManager has enough awareness of ContainerRouters' states, I'm
> not sure when a queue would be used (for redirecting to other
> ContainerRouters) vs ContainerManager responding with a ContainerRouter
> reference (instead of an action container reference) - I'm not following
> the logic of the edge case in the proposal - there is mention of "which
> controller the request needs to go", but maybe this is a typo and should
> say ContainerRouter?
>

Indeed, that's a typo; it should say ContainerRouter.

The ContainerManager only knows which Router has which Container. It does
not know whether the respective Router has capacity on that container (the
capacity metric is very hard to share since it's ever-changing).

Hence, in an edge-case where there are fewer Containers than Routers, the
ContainerManager can hand out references to the Routers it gave Containers
to, to the Routers that have none. (This is the edge-case described in the
proposal.)
The work-stealing queue, though, is used to rebalance work in case one of the
Routers gets overloaded.
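To sketch that edge-case handling (with an assumed allocate call; the real
interface will differ): the manager decides based on ownership only, and
leaves balancing of load to the queue:

case class ActionId(name: String)
case class RouterRef(id: String)
case class ContainerRef(id: String)

sealed trait Allocation
case class GiveContainer(container: ContainerRef) extends Allocation
case class RedirectToRouter(router: RouterRef) extends Allocation

// `owners` records, per action, which Router each Container was handed to.
// The manager knows ownership only, not the current load of any Container.
class ContainerManager(owners: Map[ActionId, Map[ContainerRef, RouterRef]],
                       canCreateMore: ActionId => Boolean) {

  def allocate(action: ActionId, requester: RouterRef): Allocation = {
    val existing = owners.getOrElse(action, Map.empty[ContainerRef, RouterRef])
    if (canCreateMore(action) || existing.isEmpty)
      GiveContainer(ContainerRef(s"${action.name}-${existing.size}")) // stand-in for real creation
    else
      // Edge case: fewer Containers than Routers and no room to create more,
      // so hand the requesting Router a reference to a Router that already
      // owns a Container. Whether that Router has spare capacity is unknown
      // here; rebalancing load is what the work-stealing queue is for.
      RedirectToRouter(existing.values.head)
  }
}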


>
> Thanks
> Tyson
>
> On 8/21/18, 1:16 AM, "Markus Thömmes" <markusthoem...@apache.org> wrote:
>
>     Hi Tyson,
>
>     if we take the concerns apart as I proposed above, timeouts should only
>     ever be triggered after a request is scheduled, as you say, that is: as
>     soon as it crosses the user-container mark. With the concern separation, it
>     is plausible that blocking invocations are never buffered anywhere,
> which
>     makes a lot of sense, because you cannot persist the open HTTP
> connection
>     to the client anyway.
>
>     To make the distinction clear: A blocking request can still be put
> onto the
>     work-stealing queue to be balanced between different ContainerRouters.
>
>     A blocking request though would never be written to a persistent buffer
>     that's used to efficiently handle async invocations and backpressure
>     them. That buffer should be entirely separate and could
>     possibly be placed outside of the execution system to make the
> distinction
>     more explicit. The execution system itself would then only deal with
>     request-response style invocations and asynchronous invocations are
> done by
>     having a separate queue and a consumer that creates HTTP requests to
> the
>     execution system.
>
>     Cheers,
>     Markus
>
>     On Mon, Aug 20, 2018 at 23:30, Tyson Norris
>     <tnor...@adobe.com.invalid> wrote:
>
>     > Thanks for summarizing Markus.
>     >
>     > Yes, this is confusing in the context of the current system, which
>     > stores activations in Kafka but does not wait indefinitely, since the
>     > timeout begins immediately.
>     > So, I think the problem of buffering/queueing is: when does the timeout
>     > begin? If not everything is buffered the same way, their timeout should
>     > not begin until processing begins.
>     >
>     > Maybe it would make sense to:
>     > * always buffer (indefinitely) to queue for async, never for sync
>     > * timeout for async not started till read from queue - which may be
>     > delayed from time of trigger or http request
>     > * this should also come with some system monitoring to indicate the
> queue
>     > processing is not keeping up with some configurable max delay
> threshold ("I
>     > can’t tolerate delays of > 5 minutes", etc)
>     > * ContainerRouters can only pull from async queue when
>     >         * increasing the number of pending activations won’t exceed
> some
>     > threshold (prevent excessive load of async on ContainerRouters)
>     >         * ContainerManager is not overloaded (can still create
> containers,
>     > or has some configurable way to indicate the cluster is healthy
> enough to
>     > cope with extra processing)
>     >
>     > We could of course make this configurable so that operators can
> choose to:
>     > * treat async/sync activations the same (the overloaded system fails
>     > when either ContainerManager or ContainerRouters are at max capacity)
>     > * treat async/sync with preference for:
>     >         * sync - where async is buffered for an unknown period before
>     > processing, depending on incoming sync traffic (or lack thereof)
>     >         * async - where sync is sent to the queue, to be processed in
>     > order of receipt interleaved with async traffic (similar to today, I
> think)
>     >
>     > I think the impact here (aside from technical) is the timing
> difference if
>     > we introduce latency in side effects based on the activation being
> sync vs
>     > async.
>     >
>     > I’m also not sure prioritizing message processing between sync/async
>     > internally in ContainerRouter is better than just having some dedicated
>     > ContainerRouters that receive all async activations, and others that
>     > receive all sync activations, but the end result is the same, I
> think.
>     >
>     >
>     > > On Aug 19, 2018, at 4:29 AM, Markus Thömmes <
> markusthoem...@apache.org>
>     > wrote:
>     > >
>     > > Hi Tyson, Carlos,
>     > >
>     > > FWIW I should change that to no longer say "Kafka" but "buffer" or
>     > "message
>     > > queue".
>     > >
>     > > I see two use-cases for a queue here:
>     > > 1. What you two are alluding to: Buffering asynchronous requests
> because
>     > of
>     > > a different notion of "latency sensitivity" if the system is in an
>     > overload
>     > > scenario.
>     > > 2. As a work-stealing type balancing layer between the
> ContainerRouters.
>     > If
>     > > we assume round-robin/least-connected (essentially random)
> scheduling
>     > > between ContainerRouters, we will get load discrepancies between
> them. To
>     > > smooth those out, a ContainerRouter can put the work on a queue
> to be
>     > > stolen by a Router that actually has space for that work (for
> example:
>     > > Router1 requests a new container, puts the work on the queue while
> it
>     > waits
>     > > for that container, Router2 already has a free container and
> executes the
>     > > action by stealing it from the queue). This does have the added
> complexity
>     > > of breaking a streaming communication between User and Container
> (to
>     > > support essentially unbounded payloads). A nasty wrinkle that might
>     > render
>     > > this design alternative invalid! We could come up with something
> smarter
>     > > here, i.e. only putting a reference to the work on the queue; the
>     > > stealer then connects to the initial owner directly, which streams the
>     > > payload through to the stealer rather than persisting it somewhere.
>     > >
>     > > It is important to note that in this design, blocking invokes
> could
>     > > potentially gain the ability to have unbounded entities, where
>     > > trigger/non-blocking invokes might need to be subject to a bound
> here to
>     > be
>     > > able to support eventual execution efficiently.
>     > >
>     > > Personally, I'm much more torn to the work-stealing type case. It
>     > implies a
>     > > wholly different notion of using the queue though and doesn't have
> much to
>     > > do with the way we use it today, which might be confusing. It
> could also
>     > > well be the case that work-stealing type algorithms are easier to
> back
>     > on
>     > > a proper MQ vs. trying to make it work on Kafka.
>     > >
>     > > It might also be important to note that those two use-cases might
> require
>     > > different technologies (buffering vs. queue-backend for
> work-stealing)
>     > and
>     > > could well be separated in the design as well. For instance, buffering
>     > > trigger fires etc. does not necessarily need to be done on the
> execution
>     > > layer but could instead be pushed to another layer. Having the
> notion of
>     > > "async" vs "sync" in the execution layer could be beneficial for
>     > > load balancing itself though. Something worth exploring imho.
>     > >
>     > > Sorry for the wall of text, I hope this clarifies things!
>     > >
>     > > Cheers,
>     > > Markus
>     > >
>     > > On Sat, Aug 18, 2018 at 02:36, Carlos Santana <
>     > > csantan...@gmail.com> wrote:
>     > >
>     > >> triggers get responded to right away (202) with an activation id and
>     > >> are then sent to the queue to be processed async, same as async
>     > >> action invokes.
>     > >>
>     > >> I think we would keep the same contract as today for this type of
>     > >> activations that are processed eventually, different from blocking
>     > >> invokes, including web Actions where the http client holds a
>     > >> connection waiting for the result back.
>     > >>
>     > >> - Carlos Santana
>     > >> @csantanapr
>     > >>
>     > >>> On Aug 17, 2018, at 6:14 PM, Tyson Norris
> <tnor...@adobe.com.INVALID>
>     > >> wrote:
>     > >>>
>     > >>> Hi -
>     > >>> Separate thread regarding the proposal: when routing activations,
>     > >>> what is considered overload and destined for Kafka?
>     > >>>
>     > >>> In general, if Kafka is not on the blocking activation path, why
> would
>     > >> it be used at all, if the timeouts and processing expectations of
>     > blocking
>     > >> and non-blocking are the same?
>     > >>>
>     > >>> One case I can imagine: triggers + non-blocking invokes, but
> only in
>     > the
>     > >> case where those have some different timeout characteristics.
> e.g. if a
>     > >> trigger fires an action, is there any case where the activation
> should
>     > be
>     > >> buffered to Kafka if it will time out the same as a blocking
> activation?
>     > >>>
>     > >>> Sorry if I’m missing something obvious.
>     > >>>
>     > >>> Thanks
>     > >>> Tyson
>     > >>>
>     > >>>
>     > >>
>     >
>     >
>
>
>
