"...When there are no more activation messages, ContainerProxy will wait for a given (configurable) time and just stop...."
How does the system allocate and de-allocate resources when it's
congested? I'm thinking of the use case where the system receives a batch
of activations that requires 60% of all cluster resources. Once those
activations finish, a different batch of activations is received, and this
time the new batch requires new actions to be cold-started; these new
activations require a total of 80% of the overall cluster resources.
Unless the previous actions are removed, the cluster is over-allocated. In
the current model, would the cluster process 1/2 of the new activations
because it needs to wait for the previous actions to stop by themselves?

On Sun, Apr 7, 2019 at 7:34 PM Dominic Kim <[email protected]> wrote:

> Hi Mingyu
>
> Thank you for the good questions.
>
> Before answering your questions, let me first explain the Lease in ETCD.
> ETCD has a data model, the Lease: data attached to a lease disappears
> after a given time if there is no relevant keep-alive on it.
>
> So once you grant a new lease, you can attach it to data in each
> operation such as put, putTxn (transaction), etc.
> If there is no keep-alive for the given (configurable) time, the
> inserted data will be gone.
>
> In my proposal, most of the data in ETCD relies on a lease.
> For example, each scheduler stores its endpoint information (for queue
> creation) with a lease. Each queue stores its information (for
> activations) in ETCD with a lease.
> (Since it is overhead to do keep-alive in each memory queue separately,
> I introduced EtcdKeepAliveService to share one global lease among all
> queues in the same scheduler.)
> Each ContainerProxy stores its information in ETCD with a lease so that
> when a queue tries to create a container, it can easily count the number
> of existing containers with the "Count" API.
> Since both kinds of data are stored with a lease, if a scheduler or
> invoker fails, keep-alive for the given lease stops, and eventually that
> data will be removed.
>
> Follower queues are watching the leader queue information.
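[Editor's note] The lease semantics described above (data vanishes unless a keep-alive renews it, one shared lease can back many keys, and a prefix count reveals how many live containers exist) can be illustrated with a small in-memory toy model. This is a sketch under stated assumptions, not the real ETCD client API; every name in it (`ToyLeaseStore`, `grant`, `keepalive`, `count`) is invented for illustration.

```python
import time

class ToyLeaseStore:
    """In-memory sketch of ETCD-style leases: keys attached to a lease
    disappear once the lease's TTL elapses without a keep-alive."""

    def __init__(self):
        self._leases = {}   # lease_id -> (ttl_seconds, last_keepalive)
        self._data = {}     # key -> (value, lease_id)
        self._next_id = 0

    def grant(self, ttl_seconds):
        self._next_id += 1
        self._leases[self._next_id] = (ttl_seconds, time.monotonic())
        return self._next_id

    def put(self, key, value, lease_id):
        self._data[key] = (value, lease_id)

    def keepalive(self, lease_id):
        # A live component renews the shared lease periodically.
        ttl, _ = self._leases[lease_id]
        self._leases[lease_id] = (ttl, time.monotonic())

    def _alive(self, lease_id):
        ttl, last = self._leases[lease_id]
        return time.monotonic() - last < ttl

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, lease_id = entry
        if not self._alive(lease_id):
            del self._data[key]     # expired: the key is gone for good
            return None
        return value

    def count(self, prefix):
        # Analogous to a queue counting live ContainerProxy entries
        # with the "Count" API over a key prefix.
        return sum(1 for k in list(self._data)
                   if k.startswith(prefix) and self.get(k) is not None)

store = ToyLeaseStore()
# One shared lease per scheduler (cf. EtcdKeepAliveService in the proposal).
lease = store.grant(ttl_seconds=0.1)
store.put("container/ns/action/0", "10.0.0.1", lease)
store.put("container/ns/action/1", "10.0.0.2", lease)
print(store.count("container/ns/action/"))  # 2 while keep-alives continue
time.sleep(0.15)                            # scheduler "dies": no keep-alive
print(store.count("container/ns/action/"))  # 0: entries expired with lease
```

The point of the sketch is the failure path: nothing ever deletes the keys explicitly; they disappear solely because keep-alives stopped, which is how a crashed scheduler or invoker cleans itself up.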
> If there are any changes (the data will be removed upon scheduler
> failure), they can receive the notification and start a new leader
> election.
> When a scheduler fails, the ContainerProxys which were communicating
> with a queue in that scheduler will receive a connection error.
> Then they are designed to access ETCD again to figure out the endpoint
> of the leader queue.
> As one of the followers becomes the new leader, the ContainerProxys can
> connect to the new leader.
>
> One thing to note here is that there is only one QueueManager in each
> scheduler.
> One QueueManager holds all queues and delegates requests to the proper
> queue in response to "fetch" requests.
>
> In short, all endpoint data is stored in ETCD and renewed based on
> keep-alive and lease.
> Each component is designed to access ETCD when a failure is detected
> and connect to the new (failed-over) scheduler.
>
> I hope this is useful to you.
> And I think when I and my colleagues open PRs, we need to add a more
> detailed design along with them.
>
> If you have any further questions, kindly let me know.
>
> Thanks
> Best regards
> Dominic
>
>
> On Sat, Apr 6, 2019 at 11:28 AM Mingyu Zhou <[email protected]> wrote:
>
> > Dear Dominic,
> >
> > Thanks for your proposal. It is very inspirational and it looks
> > promising.
> >
> > I'd like to ask some questions about the failover/failure recovery
> > mechanism of the scheduler component.
> >
> > IIUC, a scheduler instance hosts multiple queue managers. If a
> > scheduler is down, we will lose multiple queue managers. Thus, there
> > should be some form of failure recovery of queue managers, and it
> > raises the following questions:
> >
> > 1. In your proposal, how is the failure of a scheduler detected?
> > I.e., when a scheduler instance is down and some queue managers
> > become unreachable, which component will be aware of this
> > unavailability and then trigger the recovery procedure?
> >
> > 2.
> > How should the failure be recovered and lost queue managers be
> > brought back to life? Specifically, in your proposal, you designed a
> > hot standby pairing of queue managers (one leader/two followers).
> > Then how should the new leader be selected in the face of a scheduler
> > crash? And do we need to allocate a new queue manager to maintain the
> > one-leader-two-follower configuration?
> >
> > 3. How will the other components in the system learn the new
> > configuration after a failover? For example, how will the pool
> > balancer discover the new state of the scheduler it manages and
> > change its policy to distribute queue creation requests?
> >
> > Thanks
> > Mingyu Zhou
> >
> > On Fri, Apr 5, 2019 at 2:56 PM Dominic Kim <[email protected]> wrote:
> >
> > > Dear David, Matt, and Dascalita.
> > > Thank you for your interest in my proposal.
> > >
> > > Let me answer your questions one by one.
> > >
> > > @David
> > > Yes, I will (and actually already did) implement all components
> > > based on SPI.
> > > The reason why I said "breaking changes" is that my proposal will
> > > affect most components drastically.
> > > For example, InvokerReactive will become an SPI and the current
> > > InvokerReactive will become one of its concrete implementations.
> > > My load balancer and throttler are also based on the current SPI.
> > > So even though my implementation would be included in OpenWhisk,
> > > downstreams can still take advantage of existing implementations
> > > such as ShardingPoolBalancer.
> > >
> > > Regarding Leader/Follower, a fair point.
> > > The reason why I introduced such a model is to prepare for future
> > > enhancements.
> > > Actually, I reached the conclusion that memory-based activation
> > > passing would be enough for OpenWhisk in terms of message
> > > persistence.
> > > But it is just my own opinion and the community may want a more
> > > rigid level of persistence.
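[Editor's note] The Leader/Follower takeover that the thread discusses, with followers idling, watching the leader's lease-backed key, and one of them claiming leadership when the key disappears, can be condensed into a toy in-memory election. This is a hedged sketch, not the proposal's scheduler code or the ETCD election API; all names (`ToyElection`, `campaign`, `leader_fails`) are hypothetical.

```python
class ToyElection:
    """Followers watch the leader key; when it vanishes (lease expiry on
    scheduler failure), the first watcher to claim the key becomes leader."""

    def __init__(self):
        self._leader = None      # endpoint of the current leader, or None
        self._watchers = []      # follower callbacks, fired on key deletion

    def campaign(self, endpoint):
        # Succeeds only if there is no live leader, analogous to a
        # transactional "put if absent" on the leader key.
        if self._leader is None:
            self._leader = endpoint
            return True
        return False

    def watch(self, callback):
        self._watchers.append(callback)

    def leader_fails(self):
        # Models lease expiry: the leader key vanishes and watchers fire.
        self._leader = None
        for cb in list(self._watchers):
            cb()

    @property
    def leader(self):
        return self._leader

election = ToyElection()
election.campaign("scheduler-0:8080")          # initial leader

# Two followers stay idle; each re-campaigns when notified.
for follower in ("scheduler-1:8080", "scheduler-2:8080"):
    election.watch(lambda ep=follower: election.campaign(ep))

election.leader_fails()
print(election.leader)  # scheduler-1:8080 wins; scheduler-2 stays a follower
```

After such a takeover, the ContainerProxys that hit a connection error would re-read the leader key to discover the new endpoint, which is why no component needs to be told about the failover explicitly.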
> > > I naively thought we could add replication and HA logic to the
> > > queue, similar to what Kafka does.
> > > And Leader/Follower would be a good base building block for this.
> > >
> > > Currently only a leader fetches activation messages from Kafka.
> > > Followers will be idle while watching for the leadership change.
> > > Once the leadership is changed, one of the followers will become
> > > the new leader, and at that time a Kafka consumer for the new
> > > leader will be created.
> > > This is to minimize the failure handling time from the clients'
> > > perspective, as you mentioned. It is also correct that this flow
> > > does not prevent activation messages from being lost on scheduler
> > > failure.
> > > But it's not that complex, as activation messages are not
> > > replicated to followers and the number of followers is
> > > configurable.
> > > If we want, we can configure the number of required queues to 1, so
> > > there will be only one leader queue.
> > > (If we are OK with the current level of persistence, I think we may
> > > not need more than 1 queue, as you said.)
> > >
> > > Regarding pulling activation messages, each action will have its
> > > own Kafka topic.
> > > It is the same as what I proposed in my previous proposals.
> > > When an action is created, a Kafka topic for the action will be
> > > created.
> > > So each leader queue (consumer) will fetch activation messages from
> > > its own Kafka topic and there would be no interference among
> > > actions.
> > >
> > > When I and my colleagues open PRs for each component, we will add
> > > the detailed component design.
> > > It would help you guys understand the proposal better.
> > >
> > > @Matt
> > > Thank you for the suggestion.
> > > If I change the name of it now, it would break the link in this
> > > thread.
> > > I will use the name you suggested when I open a PR.
> > >
> > >
> > > @Dascalita
> > >
> > > Interesting idea.
> > > Which GC patterns do you have in mind to apply to OpenWhisk?
> > > In my proposal, the container GC is similar to what OpenWhisk does
> > > these days.
> > > Each container will autonomously fetch activations from the queue.
> > > Whenever it finishes the invocation of one activation, it will
> > > fetch the next request and invoke it.
> > > In this sense, we can maximize container reuse.
> > >
> > > When there are no more activation messages, a ContainerProxy will
> > > wait for a given (configurable) time and just stop.
> > > One difference is that containers are no longer paused; they are
> > > just removed.
> > > Instead of pausing them, containers wait for subsequent requests
> > > for a longer time (5~10s) than in the current implementation.
> > > This is because pausing is also a relatively expensive operation in
> > > comparison to a short-running invocation.
> > >
> > > The container lifecycle is managed in this way:
> > > 1. When a container is created, it adds its information to ETCD.
> > > 2. A queue counts the existing number of containers using the
> > > above information.
> > > 3. Under heavy load, the queue requests more containers if the
> > > number of existing containers is less than its resource limit.
> > > 4. When the container is removed, it deletes its information from
> > > ETCD.
> > >
> > >
> > > Again, I really appreciate all your feedback and questions.
> > > If you have any further questions, kindly let me know.
> > >
> > > Best regards
> > > Dominic
> > >
> > >
> > >
> > > On Fri, Apr 5, 2019 at 1:24 AM Dascalita Dragos
> > > <[email protected]> wrote:
> > >
> > > > Hi Dominic,
> > > > Thanks for sharing your ideas. IIUC (and please keep me honest),
> > > > the goal of the new design is to improve activation performance.
> > > > I personally love this; performance is a critical non-functional
> > > > feature of any FaaS system.
> > > >
> > > > There's something I'd like to call out: the management of
> > > > containers in a FaaS system could be compared to a JVM.
> > > > A JVM allocates objects in memory, and GCs them. A FaaS system
> > > > allocates containers to run actions, and it GCs them when they
> > > > become idle. If we could look at OW's scheduling from this
> > > > perspective, we could reuse the proven patterns in the JVM vs
> > > > inventing something new. I'd be interested in any GC implications
> > > > in the new design, meaning how idle actions get removed, and how
> > > > that is orchestrated.
> > > >
> > > > Thanks,
> > > > dragos
> > > >
> > > >
> > > > On Thu, Apr 4, 2019 at 8:40 AM Matt Sicker <[email protected]> wrote:
> > > >
> > > > > Would it make sense to define an OpenWhisk
> > > > > Improvement/Enhancement Proposal or similar that various other
> > > > > Apache projects do? We could call them WHIPs or something. :)
> > > > >
> > > > > On Thu, 4 Apr 2019 at 09:09, David P Grove <[email protected]> wrote:
> > > > > >
> > > > > > Dominic Kim <[email protected]> wrote on 04/04/2019 04:37:19 AM:
> > > > > > >
> > > > > > > I have proposed a new architecture.
> > > > > > > https://cwiki.apache.org/confluence/display/OPENWHISK/New+architecture+proposal
> > > > > > >
> > > > > > > It includes many controversial agendas and breaking changes.
> > > > > > > So I would like to form a general consensus on them.
> > > > > >
> > > > > > Hi Dominic,
> > > > > >
> > > > > > There's much to like about the proposal. Thank you for
> > > > > > writing it up.
> > > > > >
> > > > > > One meta-comment is that the work will have to be done in a
> > > > > > way so there are no actual "breaking changes". It has to be
> > > > > > possible to continue to configure the system using the
> > > > > > existing architectures while this work proceeds. I would
> > > > > > expect this could be done via a new LoadBalancer and some
> > > > > > deployment options (similar to how Lean OpenWhisk was done).
> > > > > > If work needs to be done to generalize the LoadBalancer SPI,
> > > > > > that could be done early in the process.
> > > > > >
> > > > > > On the proposal itself, I wonder if the complexity of
> > > > > > Leader/Follower is actually needed? If a Scheduler crashes,
> > > > > > it could be restarted and then resume handling its assigned
> > > > > > load. I think there should be enough information in etcd for
> > > > > > it to recover its current set of assigned ContainerProxys and
> > > > > > carry on. Activations in its in-memory queues would be lost
> > > > > > (bigger blast radius than the current architecture), but I
> > > > > > don't see that the Leader/Follower changes that (it seems way
> > > > > > too expensive to be replicating every activation in the
> > > > > > Follower Queues). The Leader/Follower would allow for shorter
> > > > > > downtime for those actions assigned to the downed Scheduler,
> > > > > > but at the cost of significant complexity. Is it worth it?
> > > > > >
> > > > > > Perhaps related to the Leader/Follower, it's not clear to me
> > > > > > how activation messages are being pulled from the action
> > > > > > topic in Kafka during the Queue creation window. I think they
> > > > > > have to go somewhere (because there is a mix of actions on a
> > > > > > single Kafka topic and we can't stall other actions while
> > > > > > waiting for a Queue to be created for a new action), but if
> > > > > > you don't know yet which Scheduler is going to win the race
> > > > > > to be a Leader, how do you know where to put them?
> > > > > >
> > > > > > --dave
> > > > >
> > > > >
> > > > > --
> > > > > Matt Sicker <[email protected]>
> >
> > --
> > 周明宇
>
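[Editor's note] The four container-lifecycle steps Dominic lists earlier in the thread (register in ETCD on creation, count the existing containers, request more under load up to the resource limit, deregister on removal) can be condensed into a toy reconciliation function. A minimal sketch under invented names (`ToyRegistry`, `reconcile`); it is not the actual OpenWhisk implementation.

```python
class ToyRegistry:
    """Stands in for ETCD: containers register/deregister themselves here."""

    def __init__(self):
        self._keys = set()

    def put(self, key):
        self._keys.add(key)          # step 1: container registers itself

    def delete(self, key):
        self._keys.discard(key)      # step 4: container deregisters itself

    def count(self, prefix):
        # step 2: the queue's "Count" over a key prefix
        return sum(1 for k in self._keys if k.startswith(prefix))

def reconcile(registry, action, queued, per_container_slots, limit):
    """Step 3: under load, decide how many extra containers to request,
    never exceeding the queue's resource limit."""
    existing = registry.count(f"container/{action}/")
    wanted = -(-queued // per_container_slots)   # ceiling division
    return max(0, min(wanted, limit) - existing)

reg = ToyRegistry()
reg.put("container/myAction/0")                  # one warm container exists
extra = reconcile(reg, "myAction", queued=10, per_container_slots=2, limit=4)
print(extra)  # 3: ceil(10/2)=5 wanted, capped at limit 4, minus 1 existing
```

Because steps 1 and 4 are performed by the containers themselves (with lease-backed keys, per the earlier lease discussion), the queue's view in step 2 stays correct even when a container dies without deregistering.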
