"...When there are no more activation messages, ContainerProxy will wait for a given (configurable) time and just stop...."
How does the system allocate and de-allocate resources when it's
congested? I'm thinking of the use case where the system receives a batch
of activations that requires 60% of all cluster resources. Once those
activations finish, a different batch of activations is received, and this
time the new batch requires new actions to be cold-started; these new
activations require a total of 80% of the overall cluster resources.
Unless the previous actions are removed, the cluster is over-allocated. In
the current model, would the cluster process 1/2 of the new activations
because it needs to wait for the previous actions to stop by themselves?

On Sun, Apr 7, 2019 at 7:34 PM Dominic Kim <[email protected]> wrote:

> Hi Mingyu
>
> Thank you for the good questions.
>
> Before answering your questions, let me first explain the Lease in ETCD.
> ETCD has a data model, the Lease: data attached to a lease disappears
> after a given time if there is no relevant keep-alive on it.
>
> So once you grant a new lease, you can attach it to data in each
> operation such as put, putTxn (transaction), etc.
> If there is no keep-alive for the given (configurable) time, the
> inserted data will be gone.
>
> In my proposal, most of the data in ETCD relies on a lease.
> For example, each scheduler stores its endpoint information (for queue
> creation) with a lease. Each queue stores its information (for
> activations) in ETCD with a lease.
> (Since it is overhead to do keep-alive in each memory queue separately,
> I introduced EtcdKeepAliveService to share one global lease among all
> queues in the same scheduler.)
> Each ContainerProxy stores its information in ETCD with a lease so that
> when a queue tries to create a container, it can easily count the number
> of existing containers with the "Count" API.
> Since both kinds of data are stored with a lease, if a scheduler or
> invoker fails, keep-alive for the given lease stops, and eventually that
> data will be removed.
>
> Follower queues are watching the leader queue information.
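[Editor's note] The lease semantics described above (data vanishes unless a keep-alive renews it, one shared lease can back many keys, and a prefix count reveals how many live containers exist) can be illustrated with a small in-memory toy model. This is a sketch under stated assumptions, not the real ETCD client API; every name in it (`ToyLeaseStore`, `grant`, `keepalive`, `count`) is invented for illustration.

```python
import time

class ToyLeaseStore:
    """In-memory sketch of ETCD-style leases: keys attached to a lease
    disappear once the lease's TTL elapses without a keep-alive."""

    def __init__(self):
        self._leases = {}   # lease_id -> (ttl_seconds, last_keepalive)
        self._data = {}     # key -> (value, lease_id)
        self._next_id = 0

    def grant(self, ttl_seconds):
        self._next_id += 1
        self._leases[self._next_id] = (ttl_seconds, time.monotonic())
        return self._next_id

    def put(self, key, value, lease_id):
        self._data[key] = (value, lease_id)

    def keepalive(self, lease_id):
        # A live component renews the shared lease periodically.
        ttl, _ = self._leases[lease_id]
        self._leases[lease_id] = (ttl, time.monotonic())

    def _alive(self, lease_id):
        ttl, last = self._leases[lease_id]
        return time.monotonic() - last < ttl

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, lease_id = entry
        if not self._alive(lease_id):
            del self._data[key]     # expired: the key is gone for good
            return None
        return value

    def count(self, prefix):
        # Analogous to a queue counting live ContainerProxy entries
        # with the "Count" API over a key prefix.
        return sum(1 for k in list(self._data)
                   if k.startswith(prefix) and self.get(k) is not None)

store = ToyLeaseStore()
# One shared lease per scheduler (cf. EtcdKeepAliveService in the proposal).
lease = store.grant(ttl_seconds=0.1)
store.put("container/ns/action/0", "10.0.0.1", lease)
store.put("container/ns/action/1", "10.0.0.2", lease)
print(store.count("container/ns/action/"))  # 2 while keep-alives continue
time.sleep(0.15)                            # scheduler "dies": no keep-alive
print(store.count("container/ns/action/"))  # 0: entries expired with lease
```

The point of the sketch is the failure path: nothing ever deletes the keys explicitly; they disappear solely because keep-alives stopped, which is how a crashed scheduler or invoker cleans itself up.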
> If there are any changes (the data will be removed upon scheduler
> failure), they can receive the notification and start a new leader
> election.
> When a scheduler fails, the ContainerProxys which were communicating
> with a queue in that scheduler will receive a connection error.
> Then they are designed to access ETCD again to figure out the endpoint
> of the leader queue.
> As one of the followers becomes the new leader, the ContainerProxys can
> connect to the new leader.
>
> One thing to note here is that there is only one QueueManager in each
> scheduler.
> One QueueManager holds all queues and delegates requests to the proper
> queue in response to "fetch" requests.
>
> In short, all endpoint data is stored in ETCD and renewed based on
> keep-alive and lease.
> Each component is designed to access ETCD when a failure is detected
> and connect to the new (failed-over) scheduler.
>
> I hope this is useful to you.
> And I think when I and my colleagues open PRs, we need to add a more
> detailed design along with them.
>
> If you have any further questions, kindly let me know.
>
> Thanks
> Best regards
> Dominic
>
>
> On Sat, Apr 6, 2019 at 11:28 AM Mingyu Zhou <[email protected]> wrote:
>
> > Dear Dominic,
> >
> > Thanks for your proposal. It is very inspirational and it looks
> > promising.
> >
> > I'd like to ask some questions about the failover/failure recovery
> > mechanism of the scheduler component.
> >
> > IIUC, a scheduler instance hosts multiple queue managers. If a
> > scheduler is down, we will lose multiple queue managers. Thus, there
> > should be some form of failure recovery of queue managers, and it
> > raises the following questions:
> >
> > 1. In your proposal, how is the failure of a scheduler detected?
> > I.e., when a scheduler instance is down and some queue managers
> > become unreachable, which component will be aware of this
> > unavailability and then trigger the recovery procedure?
> >
> > 2.
> > How should the failure be recovered and lost queue managers be
> > brought back to life? Specifically, in your proposal, you designed a
> > hot standby pairing of queue managers (one leader/two followers).
> > Then how should the new leader be selected in the face of a scheduler
> > crash? And do we need to allocate a new queue manager to maintain the
> > one-leader-two-follower configuration?
> >
> > 3. How will the other components in the system learn the new
> > configuration after a failover? For example, how will the pool
> > balancer discover the new state of the scheduler it manages and
> > change its policy to distribute queue creation requests?
> >
> > Thanks
> > Mingyu Zhou
> >
> > On Fri, Apr 5, 2019 at 2:56 PM Dominic Kim <[email protected]> wrote:
> >
> > > Dear David, Matt, and Dascalita.
> > > Thank you for your interest in my proposal.
> > >
> > > Let me answer your questions one by one.
> > >
> > > @David
> > > Yes, I will (and actually already did) implement all components
> > > based on SPI.
> > > The reason why I said "breaking changes" is that my proposal will
> > > affect most components drastically.
> > > For example, InvokerReactive will become an SPI and the current
> > > InvokerReactive will become one of its concrete implementations.
> > > My load balancer and throttler are also based on the current SPI.
> > > So even though my implementation would be included in OpenWhisk,
> > > downstreams can still take advantage of existing implementations
> > > such as ShardingPoolBalancer.
> > >
> > > Regarding Leader/Follower, a fair point.
> > > The reason why I introduced such a model is to prepare for future
> > > enhancements.
> > > Actually, I reached the conclusion that memory-based activation
> > > passing would be enough for OpenWhisk in terms of message
> > > persistence.
> > > But it is just my own opinion and the community may want a more
> > > rigid level of persistence.
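[Editor's note] The Leader/Follower takeover that the thread discusses, with followers idling, watching the leader's lease-backed key, and one of them claiming leadership when the key disappears, can be condensed into a toy in-memory election. This is a hedged sketch, not the proposal's scheduler code or the ETCD election API; all names (`ToyElection`, `campaign`, `leader_fails`) are hypothetical.

```python
class ToyElection:
    """Followers watch the leader key; when it vanishes (lease expiry on
    scheduler failure), the first watcher to claim the key becomes leader."""

    def __init__(self):
        self._leader = None      # endpoint of the current leader, or None
        self._watchers = []      # follower callbacks, fired on key deletion

    def campaign(self, endpoint):
        # Succeeds only if there is no live leader, analogous to a
        # transactional "put if absent" on the leader key.
        if self._leader is None:
            self._leader = endpoint
            return True
        return False

    def watch(self, callback):
        self._watchers.append(callback)

    def leader_fails(self):
        # Models lease expiry: the leader key vanishes and watchers fire.
        self._leader = None
        for cb in list(self._watchers):
            cb()

    @property
    def leader(self):
        return self._leader

election = ToyElection()
election.campaign("scheduler-0:8080")          # initial leader

# Two followers stay idle; each re-campaigns when notified.
for follower in ("scheduler-1:8080", "scheduler-2:8080"):
    election.watch(lambda ep=follower: election.campaign(ep))

election.leader_fails()
print(election.leader)  # scheduler-1:8080 wins; scheduler-2 stays a follower
```

After such a takeover, the ContainerProxys that hit a connection error would re-read the leader key to discover the new endpoint, which is why no component needs to be told about the failover explicitly.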
> > > I naively thought we could add replication and HA logic to the
> > > queue, similar to what Kafka does.
> > > And Leader/Follower would be a good base building block for this.
> > >
> > > Currently only a leader fetches activation messages from Kafka.
> > > Followers will be idle while watching for the leadership change.
> > > Once the leadership is changed, one of the followers will become
> > > the new leader, and at that time a Kafka consumer for the new
> > > leader will be created.
> > > This is to minimize the failure handling time from the clients'
> > > perspective, as you mentioned. It is also correct that this flow
> > > does not prevent activation messages from being lost on scheduler
> > > failure.
> > > But it's not that complex, as activation messages are not
> > > replicated to followers and the number of followers is
> > > configurable.
> > > If we want, we can configure the number of required queues to 1, so
> > > there will be only one leader queue.
> > > (If we are OK with the current level of persistence, I think we may
> > > not need more than 1 queue, as you said.)
> > >
> > > Regarding pulling activation messages, each action will have its
> > > own Kafka topic.
> > > It is the same as what I proposed in my previous proposals.
> > > When an action is created, a Kafka topic for the action will be
> > > created.
> > > So each leader queue (consumer) will fetch activation messages from
> > > its own Kafka topic and there would be no interference among
> > > actions.
> > >
> > > When I and my colleagues open PRs for each component, we will add
> > > the detailed component design.
> > > It would help you guys understand the proposal better.
> > >
> > > @Matt
> > > Thank you for the suggestion.
> > > If I change the name of it now, it would break the link in this
> > > thread.
> > > I will use the name you suggested when I open a PR.
> > >
> > >
> > > @Dascalita
> > >
> > > Interesting idea.
> > > Which GC patterns do you have in mind to apply to OpenWhisk?
> > > In my proposal, the container GC is similar to what OpenWhisk does
> > > these days.
> > > Each container will autonomously fetch activations from the queue.
> > > Whenever it finishes the invocation of one activation, it will
> > > fetch the next request and invoke it.
> > > In this sense, we can maximize container reuse.
> > >
> > > When there are no more activation messages, a ContainerProxy will
> > > wait for a given (configurable) time and just stop.
> > > One difference is that containers are no longer paused; they are
> > > just removed.
> > > Instead of pausing them, containers wait for subsequent requests
> > > for a longer time (5~10s) than in the current implementation.
> > > This is because pausing is also a relatively expensive operation in
> > > comparison to a short-running invocation.
> > >
> > > The container lifecycle is managed in this way:
> > > 1. When a container is created, it adds its information to ETCD.
> > > 2. A queue counts the existing number of containers using the
> > > above information.
> > > 3. Under heavy load, the queue requests more containers if the
> > > number of existing containers is less than its resource limit.
> > > 4. When the container is removed, it deletes its information from
> > > ETCD.
> > >
> > >
> > > Again, I really appreciate all your feedback and questions.
> > > If you have any further questions, kindly let me know.
> > >
> > > Best regards
> > > Dominic
> > >
> > >
> > >
> > > On Fri, Apr 5, 2019 at 1:24 AM Dascalita Dragos
> > > <[email protected]> wrote:
> > >
> > > > Hi Dominic,
> > > > Thanks for sharing your ideas. IIUC (and please keep me honest),
> > > > the goal of the new design is to improve activation performance.
> > > > I personally love this; performance is a critical non-functional
> > > > feature of any FaaS system.
> > > >
> > > > There's something I'd like to call out: the management of
> > > > containers in a FaaS system could be compared to a JVM.
> > > > A JVM allocates objects in memory, and GCs them. A FaaS system
> > > > allocates containers to run actions, and it GCs them when they
> > > > become idle. If we could look at OW's scheduling from this
> > > > perspective, we could reuse the proven patterns in the JVM vs
> > > > inventing something new. I'd be interested in any GC implications
> > > > in the new design, meaning how idle actions get removed, and how
> > > > that is orchestrated.
> > > >
> > > > Thanks,
> > > > dragos
> > > >
> > > >
> > > > On Thu, Apr 4, 2019 at 8:40 AM Matt Sicker <[email protected]> wrote:
> > > >
> > > > > Would it make sense to define an OpenWhisk
> > > > > Improvement/Enhancement Proposal or similar that various other
> > > > > Apache projects do? We could call them WHIPs or something. :)
> > > > >
> > > > > On Thu, 4 Apr 2019 at 09:09, David P Grove <[email protected]> wrote:
> > > > > >
> > > > > > Dominic Kim <[email protected]> wrote on 04/04/2019 04:37:19 AM:
> > > > > > >
> > > > > > > I have proposed a new architecture.
> > > > > > > https://cwiki.apache.org/confluence/display/OPENWHISK/New+architecture+proposal
> > > > > > >
> > > > > > > It includes many controversial agendas and breaking changes.
> > > > > > > So I would like to form a general consensus on them.
> > > > > >
> > > > > > Hi Dominic,
> > > > > >
> > > > > > There's much to like about the proposal. Thank you for
> > > > > > writing it up.
> > > > > >
> > > > > > One meta-comment is that the work will have to be done in a
> > > > > > way so there are no actual "breaking changes". It has to be
> > > > > > possible to continue to configure the system using the
> > > > > > existing architectures while this work proceeds. I would
> > > > > > expect this could be done via a new LoadBalancer and some
> > > > > > deployment options (similar to how Lean OpenWhisk was done).
> > > > > > If work needs to be done to generalize the LoadBalancer SPI,
> > > > > > that could be done early in the process.
> > > > > >
> > > > > > On the proposal itself, I wonder if the complexity of
> > > > > > Leader/Follower is actually needed? If a Scheduler crashes,
> > > > > > it could be restarted and then resume handling its assigned
> > > > > > load. I think there should be enough information in etcd for
> > > > > > it to recover its current set of assigned ContainerProxys and
> > > > > > carry on. Activations in its in-memory queues would be lost
> > > > > > (bigger blast radius than the current architecture), but I
> > > > > > don't see that the Leader/Follower changes that (it seems way
> > > > > > too expensive to be replicating every activation in the
> > > > > > Follower Queues). The Leader/Follower would allow for shorter
> > > > > > downtime for those actions assigned to the downed Scheduler,
> > > > > > but at the cost of significant complexity. Is it worth it?
> > > > > >
> > > > > > Perhaps related to the Leader/Follower, it's not clear to me
> > > > > > how activation messages are being pulled from the action
> > > > > > topic in Kafka during the Queue creation window. I think they
> > > > > > have to go somewhere (because there is a mix of actions on a
> > > > > > single Kafka topic and we can't stall other actions while
> > > > > > waiting for a Queue to be created for a new action), but if
> > > > > > you don't know yet which Scheduler is going to win the race
> > > > > > to be a Leader, how do you know where to put them?
> > > > > >
> > > > > > --dave
> > > > >
> > > > >
> > > > > --
> > > > > Matt Sicker <[email protected]>
> >
> > --
> > 周明宇
>
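[Editor's note] The four container-lifecycle steps Dominic lists earlier in the thread (register in ETCD on creation, count the existing containers, request more under load up to the resource limit, deregister on removal) can be condensed into a toy reconciliation function. A minimal sketch under invented names (`ToyRegistry`, `reconcile`); it is not the actual OpenWhisk implementation.

```python
class ToyRegistry:
    """Stands in for ETCD: containers register/deregister themselves here."""

    def __init__(self):
        self._keys = set()

    def put(self, key):
        self._keys.add(key)          # step 1: container registers itself

    def delete(self, key):
        self._keys.discard(key)      # step 4: container deregisters itself

    def count(self, prefix):
        # step 2: the queue's "Count" over a key prefix
        return sum(1 for k in self._keys if k.startswith(prefix))

def reconcile(registry, action, queued, per_container_slots, limit):
    """Step 3: under load, decide how many extra containers to request,
    never exceeding the queue's resource limit."""
    existing = registry.count(f"container/{action}/")
    wanted = -(-queued // per_container_slots)   # ceiling division
    return max(0, min(wanted, limit) - existing)

reg = ToyRegistry()
reg.put("container/myAction/0")                  # one warm container exists
extra = reconcile(reg, "myAction", queued=10, per_container_slots=2, limit=4)
print(extra)  # 3: ceil(10/2)=5 wanted, capped at limit 4, minus 1 existing
```

Because steps 1 and 4 are performed by the containers themselves (with lease-backed keys, per the earlier lease discussion), the queue's view in step 2 stays correct even when a container dies without deregistering.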
