It would be cool to implement some of these algorithms in a synthetic way with mocks and stubs, and with simulated latency - e.g. have container creation in the ContainerManager take a random delay between 1 and 2 seconds and return a URL that just points to a mock server, etc.
then hit it with a load test with different patterns (bursty, noisy neighbor, a mix of webaction and trigger fires, etc.) and graph the behavior.

-- Carlos

On Fri, Aug 17, 2018 at 7:21 AM Markus Thömmes <markusthoem...@apache.org> wrote:
> Hi Tyson,
>
> thanks for the great input!
>
> On Thu, Aug 16, 2018 at 23:14, Tyson Norris <tnor...@adobe.com.invalid> wrote:
> >
> > Thinking more about the singleton aspect, I guess this is mostly an issue
> > for blackbox containers, where manifest/managed containers will mitigate
> > at least some of the singleton failure delays by prewarm/stemcell
> > containers.
> >
> > So in the case of singleton failure, the impacts would be:
> > - managed containers once prewarms are exhausted (may be improved by
> >   being more intelligent about prewarm pool sizing based on load, etc.)
> > - managed containers that don't match any prewarms (similar - if the
> >   prewarm pool is dynamically configured based on load, this is less of a
> >   problem)
> > - blackbox containers (no help)
> >
> > If the failover of the singleton is too long (I think it will be based on
> > cluster size; the oldest node becomes the singleton host, iirc), I think
> > we need to consider how containers can launch in the meantime. A first
> > step might be to test out the singleton behavior in clusters of various
> > sizes.
>
> I agree this bit of design is crucial; a few thoughts:
> Pre-warm wouldn't help here, the ContainerRouters only know warm
> containers. Pre-warming is managed by the ContainerManager.
>
> Considering a fail-over scenario: We could consider sharing the state via
> EventSourcing. That is: all state lives inside of frequently snapshotted
> events and thus can be shared between multiple instances of the
> ContainerManager seamlessly. Alternatively, we could also think about only
> working on persisted state. That way, a cold-standby model could fly.
> We should make sure that the state is not "slightly stale" but rather that
> both instances see the same state at any point in time. I believe that on
> that cold path of generating new containers, we can live with the extra
> latency of persisting what we're doing, as the path will still be dominated
> by the container creation latency.
>
> Handover time, as you say, is crucial, but as it only impacts container
> creation, I'd say we could live with, let's say, 5 seconds of failover
> downtime on this path? What's your experience been on singleton failover?
> How long did it take?
>
> > On Aug 16, 2018, at 11:01 AM, Tyson Norris <tnor...@adobe.com.INVALID> wrote:
> > >
> > > A couple comments on singleton:
> > > - use of a cluster singleton will introduce a new single point of
> > >   failure: from the time of singleton node failure to singleton
> > >   resurrection on a different instance, there will be an outage from the
> > >   point of view of any ContainerRouter that does not already have a
> > >   warm+free container to service an activation
> > > - resurrecting the singleton will require transferring or rebuilding the
> > >   state when recovery occurs - in my experience this was tricky, and
> > >   requires replicating the data (which will be slightly stale, but
> > >   better than rebuilding from nothing); I don't recall the handover
> > >   delay (to transfer the singleton to a new akka cluster node) when I
> > >   tried last, but I think it was not as fast as I hoped it would be.
> > >
> > > I don't have a great suggestion for the singleton failure case, but I
> > > would like to consider this carefully, and discuss the ramifications
> > > (which may or may not be tolerable) before pursuing this particular
> > > aspect of the design.
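Carlos's synthetic-testbed idea from the top of the thread could be used to put rough numbers on this singleton concern. Below is a minimal toy model, not anything from the OpenWhisk codebase: the function name, the warm-hit latency, and the 1-2 s creation delay are all made-up assumptions for illustration.

```python
import random

def simulate_failover_window(n_activations, warm_fraction, failover_s,
                             creation_s=(1.0, 2.0), seed=42):
    """Toy model (assumptions, not OpenWhisk code): during a singleton
    ContainerManager outage, an activation succeeds quickly only if a
    ContainerRouter already holds a warm+free container; cold starts must
    wait out the failover window and then pay container creation latency."""
    rng = random.Random(seed)
    latencies = []
    for _ in range(n_activations):
        if rng.random() < warm_fraction:
            latencies.append(0.005)  # warm hit: a few ms, unaffected by the outage
        else:
            # cold start: blocked until the singleton resurrects, plus 1-2 s creation
            latencies.append(failover_s + rng.uniform(*creation_s))
    return latencies
```

Sweeping `failover_s` and `warm_fraction` and graphing the resulting latency distribution (as Carlos suggests) would show how painful a given failover window is for routers that lack warm containers.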
> > > On prioritization:
> > > - if concurrency is enabled for an action, this is another
> > >   prioritization aspect, of sorts - if the action supports concurrency,
> > >   there is no reason (except for destruction coordination…) that it
> > >   cannot be shared across shards. This could be added later, but may be
> > >   worth considering, since there is a general reuse problem where a
> > >   series of activations that arrive at different ContainerRouters will
> > >   create a new container in each, while they could be reused (avoiding
> > >   creating new containers) if concurrency is tolerated in that
> > >   container. This would only (ha ha) require changing how container
> > >   destroy works, such that a container cannot be destroyed until the
> > >   last ContainerRouter is done with it. And if container destruction is
> > >   coordinated in this way to increase reuse, it would also be good to
> > >   coordinate construction (don't concurrently construct the same
> > >   container for multiple ContainerRouters IFF a single container would
> > >   enable concurrent activations once it is created). I'm not sure if
> > >   others are desiring this level of container reuse, but if so, it would
> > >   be worth considering these aspects (sharding/isolation vs.
> > >   sharing/coordination) as part of any redesign.
>
> Yes, I can see where you're heading here. I think this can be generalized:
>
> Assume intra-container concurrency C and number of ContainerRouters R.
> If C > R: shard the "slots" on this container evenly across R. The
> container can only be destroyed after you receive R acknowledgements of
> doing so.
> If C < R: hand out 1 slot each to C Routers, and point the remaining
> Routers to the ones that got slots.
>
> Concurrent creation: batch creation requests while one container is being
> created. Say you received a request for a new container that has C slots.
> If there are more requests for that container arriving while it is being
> created, don't act on them and fold the creation into the first one.
> Only start creating a new container if the number of resource requests
> exceeds C.
>
> Does that make sense? I think in that model you can set C=1 and it works as
> I envisioned it to work, or set it to C=200 and things will be shared even
> across routers.
>
> > > WDYT?
> > >
> > > Thanks
> > > Tyson
> > >
> > > On Aug 15, 2018, at 8:55 AM, Carlos Santana <csantan...@gmail.com> wrote:
> > >
> > > I think we should add a section on prioritization for blocking vs.
> > > async invokes (non-blocking actions and triggers).
> > >
> > > The front door has the luxury of knowing some intent from the incoming
> > > request. I feel it would make sense to give high priority to blocking
> > > invokes, while async ones go straight to the queue to be picked up by
> > > the system to eventually run, even if that takes 10 times longer to
> > > execute than a blocking invoke - for example, a webaction would take
> > > 10ms vs. a DB trigger fire, or an async webhook taking 100ms.
> > >
> > > Also, the controller takes time to convert a trigger and process the
> > > rules; this is something that can also be taken out of the hot path.
> > >
> > > So I'm just saying we could optimize the system because we know if the
> > > incoming request is a hot or hotter path :-)
> > >
> > > -- Carlos
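Markus's C-vs-R generalization and the creation-batching rule above could be sketched roughly as follows. This is plain Python for illustration only: `assign_slots`, `CreationBatcher`, and the zero-slots-means-redirect convention are assumptions, not actual OpenWhisk code.

```python
def assign_slots(c, r):
    """Distribute C intra-container slots across R ContainerRouters.
    Returns per-router slot counts; a router with 0 slots is assumed to
    redirect to a router that holds slots (the C < R case above)."""
    if c >= r:
        # C >= R: shard slots evenly; the first C % R routers get one extra
        base, extra = divmod(c, r)
        return [base + (1 if i < extra else 0) for i in range(r)]
    # C < R: one slot each for C routers, the rest get none and redirect
    return [1] * c + [0] * (r - c)

class CreationBatcher:
    """Sketch of the creation-batching rule: requests arriving while a
    container with C slots is in flight are folded into that creation; a new
    creation starts only once outstanding requests exceed the slots already
    being created."""
    def __init__(self, slots_per_container):
        self.c = slots_per_container
        self.pending = 0    # slot requests seen so far
        self.in_flight = 0  # container creations started

    def request_slot(self):
        """Returns True if this request should trigger a new container creation."""
        self.pending += 1
        if self.pending > self.in_flight * self.c:
            self.in_flight += 1
            return True
        return False
```

With C=1 each router gets at most one isolated slot, and with C=200 slots are spread even across routers, matching the two extremes Markus describes.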