It would be cool to implement some of these algorithms synthetically, with
mocks and stubs and simulated latency: e.g., have container creation in the
ContainerManager take a random delay between 1 and 2 seconds and return a
URL that just points to a mock server, etc.

Then hit it with load tests using different patterns (bursty, noisy
neighbor, a mix of web action and trigger fires, etc.) and graph the
behavior.
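A minimal sketch of what that harness could look like, in Scala (all names
here are hypothetical, not existing OpenWhisk APIs): container "creation"
just sleeps for a random 1-2 seconds and hands back a mock-server URL.

    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.Future
    import scala.util.Random

    // Stand-in for the ContainerManager: "creating" a container just waits
    // a random 1-2 seconds and returns a URL pointing at a mock server.
    // (Blocking sleep is fine here; it's only a test harness.)
    class MockContainerManager(mockServerUrl: String) {
      def createContainer(action: String): Future[String] = Future {
        val delayMs = 1000 + Random.nextInt(1001) // uniform in [1000, 2000] ms
        Thread.sleep(delayMs.toLong)
        s"$mockServerUrl/containers/$action"
      }
    }

    // One bursty load step: fire n creations at once and collect the URLs.
    // Other patterns (noisy neighbor, mixed traffic) just vary the schedule.
    def burst(manager: MockContainerManager, n: Int): Future[Seq[String]] =
      Future.sequence((1 to n).map(i => manager.createContainer(s"action-$i")))

Recording the completion time of each Future per pattern would give the
graphs.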

-- Carlos

On Fri, Aug 17, 2018 at 7:21 AM Markus Thömmes <markusthoem...@apache.org>
wrote:

> Hi Tyson,
>
> thanks for the great input!
>
> Am Do., 16. Aug. 2018 um 23:14 Uhr schrieb Tyson Norris
> <tnor...@adobe.com.invalid>:
>
> > Thinking more about the singleton aspect, I guess this is mostly an
> > issue for blackbox containers, where manifest/managed containers will
> > mitigate at least some of the singleton failure delays with
> > prewarm/stemcell containers.
> >
> > So in the case of singleton failure, the impacts would be:
> > - managed containers once prewarms are exhausted (may be improved by
> > being more intelligent about prewarm pool sizing based on load, etc.)
> > - managed containers that don’t match any prewarms (similarly, if the
> > prewarm pool is dynamically configured based on load, this is less of a
> > problem)
> > - blackbox containers (no help)
> >
> > If the failover of the singleton takes too long (I think it will depend
> > on cluster size; the oldest node becomes the singleton host, IIRC), I
> > think we need to consider how containers can launch in the meantime. A
> > first step might be to test the singleton behavior in clusters of
> > various sizes.
> >
>
> I agree this bit of design is crucial; a few thoughts:
> Pre-warm wouldn't help here; the ContainerRouters only know about warm
> containers. Pre-warming is managed by the ContainerManager.
>
> Considering a fail-over scenario: we could consider sharing the state via
> event sourcing. That is: all state lives inside frequently snapshotted
> events and thus can be shared between multiple instances of the
> ContainerManager seamlessly. Alternatively, we could also think about only
> working on persisted state. That way, a cold-standby model could fly. We
> should make sure that the state is not "slightly stale" but rather that
> both instances see the same state at any point in time. I believe that on
> the cold path of generating new containers we can live with the extra
> latency of persisting what we're doing, as the path will still be
> dominated by the container-creation latency.
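>
> To make the event-sourced variant concrete, here is a rough sketch using
> Akka Persistence; the command and events are made-up names, not part of
> the current codebase:
>
>     import akka.persistence.PersistentActor
>
>     // hypothetical command and events
>     case class CreateContainer(id: String, url: String)
>     sealed trait Event
>     case class ContainerCreated(id: String, url: String) extends Event
>     case class ContainerRemoved(id: String) extends Event
>
>     class ContainerManagerActor extends PersistentActor {
>       override def persistenceId: String = "container-manager"
>
>       // in-memory view, rebuilt from the journal on fail-over
>       // (frequent snapshotting elided for brevity)
>       private var containers = Map.empty[String, String]
>
>       private def updateState(e: Event): Unit = e match {
>         case ContainerCreated(id, url) => containers += id -> url
>         case ContainerRemoved(id)      => containers -= id
>       }
>
>       override def receiveRecover: Receive = { case e: Event => updateState(e) }
>
>       override def receiveCommand: Receive = {
>         case CreateContainer(id, url) =>
>           // persist before mutating state: a standby instance replaying
>           // the same journal arrives at exactly the same state
>           persist(ContainerCreated(id, url))(updateState)
>       }
>     }
>
> The extra persist round-trip only sits on the container-creation path, so
> it should be dwarfed by the creation latency itself.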
>
> Handover time, as you say, is crucial, but since it only impacts container
> creation, could we live with, say, 5 seconds of failover downtime on this
> path? What's your experience been with singleton failover? How long did it
> take?
>
>
> >
> > > On Aug 16, 2018, at 11:01 AM, Tyson Norris <tnor...@adobe.com.INVALID>
> > wrote:
> > >
> > > A couple comments on singleton:
> > > - use of cluster singleton will introduce a new single point of failure
> > > - from the time of singleton node failure to singleton resurrection on
> > > a different instance, there will be an outage from the point of view of
> > > any ContainerRouter that does not already have a warm+free container to
> > > service an activation
> > > - resurrecting the singleton will require transferring or rebuilding
> > > the state when recovery occurs - in my experience this was tricky, and
> > > it requires replicating the data (which will be slightly stale, but
> > > better than rebuilding from nothing); I don’t recall the handover delay
> > > (to transfer the singleton to a new Akka cluster node) from when I last
> > > tried, but I think it was not as fast as I hoped it would be.
> > >
> > > I don’t have a great suggestion for the singleton failure case, but
> > > would like to consider this carefully and discuss the ramifications
> > > (which may or may not be tolerable) before pursuing this particular
> > > aspect of the design.
> > >
> > >
> > > On prioritization:
> > > - if concurrency is enabled for an action, this is another
> > > prioritization aspect, of sorts - if the action supports concurrency,
> > > there is no reason (except for destruction coordination…) that it
> > > cannot be shared across shards. This could be added later, but may be
> > > worth considering, since there is a general reuse problem where a
> > > series of activations arriving at different ContainerRouters will
> > > create a new container in each, while the containers could be reused
> > > (avoiding the creation of new ones) if concurrency is tolerated in that
> > > container. This would only (ha ha) require changing how container
> > > destruction works, so that a container cannot be destroyed until the
> > > last ContainerRouter is done with it. And if container destruction is
> > > coordinated in this way to increase reuse, it would also be good to
> > > coordinate construction (don’t concurrently construct the same
> > > container for multiple ContainerRouters iff a single container would
> > > enable concurrent activations once it is created). I’m not sure if
> > > others desire this level of container reuse, but if so, it would be
> > > worth considering these aspects (sharding/isolation vs.
> > > sharing/coordination) as part of any redesign.
> >
>
> Yes, I can see where you're heading here. I think this can be generalized:
>
> Assume intra-container concurrency C and number of ContainerRouters R.
> If C > R: Shard the "slots" on this container evenly across R. The
> container can only be destroyed after you receive R acknowledgements of
> doing so.
> If C < R: Hand out 1 slot to each of C Routers and point the remaining
> Routers to the ones that got slots.
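>
> As a rough sketch of that distribution rule (illustrative only, the
> function name is made up):
>
>     // How many of a container's c slots each of r routers gets. Routers
>     // that end up with 0 slots are pointed at routers that have some, and
>     // the container is destroyed only after all slot-holders acknowledge.
>     def shareSlots(c: Int, r: Int): Seq[Int] =
>       if (c >= r) {
>         // C > R: spread the slots evenly, handing out the remainder
>         val base = c / r
>         Seq.tabulate(r)(i => base + (if (i < c % r) 1 else 0))
>       } else {
>         // C < R: the first c routers get one slot each, the rest get none
>         Seq.tabulate(r)(i => if (i < c) 1 else 0)
>       }
>
>     shareSlots(200, 8) // 8 routers with 25 slots each
>     shareSlots(1, 8)   // one router owns the container exclusively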
>
> Concurrent creation: Batch creation requests while one container is being
> created. Say you received a request for a new container that has C slots.
> If there are more requests for that container arriving while it is being
> created, don't act on them and fold the creation into the first one. Only
> start creating a new container if the number of resource requests exceeds
> C.
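>
> A sketch of that folding logic (hypothetical, not a settled design):
>
>     // While creations are in flight, fold further requests into them; only
>     // start another container once demand exceeds the slots already being
>     // created.
>     class CreationBatcher(c: Int) {
>       private var pendingSlots = 0 // slots covered by in-flight creations
>       private var demand = 0       // resource requests seen so far
>
>       // returns true iff a new container creation should be started
>       def onRequest(): Boolean = synchronized {
>         demand += 1
>         if (demand > pendingSlots) { pendingSlots += c; true }
>         else false // fold into the creation already running
>       }
>     }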
>
> Does that make sense? I think in that model you can set C=1 and it works
> as I envisioned, or set C=200 and things will be shared even across
> routers.
>
>
> > >
> > >
> > > WDYT?
> > >
> > > Thanks
> > > Tyson
> > >
> > > On Aug 15, 2018, at 8:55 AM, Carlos Santana <csantan...@gmail.com> wrote:
> > >
> > > I think we should add a section on prioritization for blocking vs.
> > > async invokes (non-blocking actions and triggers).
> > >
> > > The front door has the luxury of knowing some intent from the incoming
> > > request. I feel it would make sense to give high priority to blocking
> > > invokes, while async ones go straight to the queue to be picked up by
> > > the system to eventually run, even if that takes 10 times longer than a
> > > blocking invoke - for example, a web action would take 10ms vs. a DB
> > > trigger fire or an async webhook taking 100ms.
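> > >
> > > A tiny sketch of what that front-door classification could look like
> > > (hypothetical names, just to make the idea concrete):
> > >
> > >     // blocking invokes ride the hot path; async invokes and trigger
> > >     // fires go straight to the queue
> > >     sealed trait Path
> > >     case object HotPath extends Path
> > >     case object QueuedPath extends Path
> > >
> > >     def classify(blocking: Boolean, isTriggerFire: Boolean): Path =
> > >       if (blocking && !isTriggerFire) HotPath else QueuedPath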
> > >
> > > Also, the controller takes time to convert a trigger and process the
> > > rules; this is something that can also be taken out of the hot path.
> > >
> > > So I'm just saying we could optimize the system because we know
> > > whether the incoming request is on a hot or hotter path :-)
> > >
> > > -- Carlos
> > >
> > >
> >
> >
>
