Dear Whiskers,

I have uploaded the material I used for my talk at the last biweekly
meeting.

https://cwiki.apache.org/confluence/display/OPENWHISK/Autonomous+Container+Scheduling

This document mainly describes the following:

1. Current implementation details
2. Potential issues with the current implementation
3. New scheduling algorithm proposal: Autonomous Container Scheduling
4. Review of the previous issues under the new scheduling algorithm
5. Pros and cons of the new scheduling algorithm
6. Performance evaluation with a prototype


I hope this helps Whiskers understand the issues and the new proposal.

Thanks
Best regards
Dominic


2018-05-18 18:44 GMT+09:00 Dominic Kim <[email protected]>:

> Hi Tyson,
> Thank you for the comments.
>
> First, the total data size in Kafka would not change, since the number of
> activation requests stays the same even though we have more topics.
> The same amount of data is simply split across different topics, so there
> would be no need for more disk space in Kafka.
> It does, however, increase parallelism at the Kafka layer, because more
> consumers will access Kafka at the same time.
>
> The reason the active/inactive distinction matters here is that the degree
> of parallelism depends on it.
> Even if we have 100k or more topics (actions), if only 10 actions are being
> invoked, there are only 10 parallel consumers.
> The number of inactive topics does not affect the performance of active
> consumers.
> If we really need to support 100k parallel action invocations, we surely
> need more nodes to handle them, and not just for Kafka.
> Kafka can scale out horizontally, and the number of active topics at any
> point in time will always be less than the total number of topics. Based on
> my benchmark results, I expect scaling out Kafka is enough to handle this.
>
> Regarding topic cleanup, we don't need to clean topics up ourselves; Kafka
> will remove expired data based on the retention configuration.
> So if a topic is no longer active (the action is no longer invoked), there
> would be no actual data in it even though the topic still exists.
> And as I said, even if data exists for some topics, it won't affect other
> active consumers as long as those topics are not active.
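>
> As a rough illustration of what I mean by relying on retention (the topic
> name, retention values and broker address below are just examples, not part
> of my PR), creating a per-action topic with bounded retention via the Kafka
> AdminClient could look like this:
>
>     import java.util.{Collections, Properties}
>     import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}
>     import scala.jdk.CollectionConverters._
>
>     object PerActionTopic {
>       def main(args: Array[String]): Unit = {
>         val props = new Properties()
>         props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // example broker
>         val admin = AdminClient.create(props)
>
>         // Example per-action topic; the retention values are illustrative, not recommendations.
>         val topic = new NewTopic("activation-guest-myAction", 4, 1.toShort)
>           .configs(Map(
>             "retention.ms"    -> "300000",    // drop data older than 5 minutes
>             "retention.bytes" -> "104857600"  // and cap the topic at ~100 MB
>           ).asJava)
>
>         admin.createTopics(Collections.singletonList(topic)).all().get()
>         admin.close()
>       }
>     }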
>
> Regarding the concurrent activation PR, it is a very worthwhile change, and
> I recognize it is orthogonal to my PR.
> It will not only alleviate the current performance issue but can also be
> used together with my changes.
>
> In the current logic, the controller schedules activations based on a hash;
> your PR would be even more effective if some changes were made to that
> scheduling logic.
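>
> (As a side note, roughly what I mean by hash-based scheduling; the function
> below is only an illustration, not the actual loadbalancer code:)
>
>     object HashScheduling {
>       // Illustrative only: pick a "home" invoker from a stable hash of the
>       // action's fully qualified name, so repeated invocations of the same
>       // action tend to land on the same invoker and reuse its warm containers.
>       def scheduleByHash(namespace: String, actionName: String, invokerCount: Int): Int = {
>         val fqn = s"$namespace/$actionName"
>         ((fqn.hashCode % invokerCount) + invokerCount) % invokerCount
>       }
>       // e.g. scheduleByHash("guest", "hello", 4) always yields the same invoker index
>     }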
>
> Regarding bypassing Kafka, I am inclined to keep Kafka because it can act
> as a kind of buffer.
> If some activations are not processed for some reason, such as a lack of
> resources or an invoker failure, Kafka can keep them for some time and still
> guarantee `at most once` invocation, though the invocation might be a bit
> delayed.
>
> With regard to the combined approach, I think that is a great idea.
> For that, container states should be shared among invokers so that they can
> send activation requests to any container (local or remote).
> That way, invokers can utilize warm resources that are not located locally.
>
> But it will also introduce synchronization issues among controllers and
> invokers, or it requires separating resource-based scheduling in the
> controller from the real invocation.
> In the former case, since the controller schedules activations based on
> resource status, that status must be synchronized in real time.
> Because invokers can send requests to any remote container, there will be a
> mismatch in resource status between controllers and invokers.
>
> In the latter case, the controller should be able to send requests to any
> invoker, and the invoker will then schedule the activations.
> In this case too, invokers need to synchronize their container status among
> themselves.
>
> In a situation where all invokers have the same resource status, if two
> invokers receive the same action invocation requests, it is not easy to
> control the traffic between them, because they will schedule the requests to
> the same containers. And if we take an approach similar to what you
> suggested, sending an intent to use the containers first, it will introduce
> increasing latency overhead as more and more invokers join the cluster.
> I haven't found a good way to handle this yet, and this is why I proposed an
> autonomous ContainerProxy to enable location-free scheduling.
>
> Finally, regarding the SPI: yes, you are correct, ContainerProxy is highly
> dependent on ContainerPool. I will update my PR as you suggested.
>
> Thanks
> Regards
> Dominic.
>
>
> 2018-05-18 2:22 GMT+09:00 Tyson Norris <[email protected]>:
>
>> Hi Dominic -
>>
>> I share similar concerns about an unbounded number of topics, despite your
>> testing with 10k topics. I'm not sure whether a topic being considered
>> active vs inactive makes a difference from the broker/consumer perspective.
>> I think there would minimally have to be some topic cleanup, and I'm not
>> sure what impact deleting topics in bulk would have on the system either.
>>
>> A couple of tangential notes related to container reuse to improve
>> performance (a rough sketch of the backpressure idea follows after these
>> notes):
>> - I'm putting together the concurrent activation PR[1] (to allow reuse of
>> a warm container for multiple activations concurrently); this can improve
>> throughput for those actions that can tolerate it (FYI, per-action config
>> is not implemented yet). It suffers from the same inaccuracy of Kafka
>> message ingestion at the invoker ("how many messages should I read?"). But
>> I think we can tune this a bit by adding some intelligence to
>> Invoker/MessageFeed, like "if I never see ContainerPool indicate it is
>> busy, read more next time" - that is, allow ContainerPool to backpressure
>> MessageFeed based on its ability to consume, and not (as today) strictly on
>> consume+process.
>>
>> - Another variant we are investigating is putting a ContainerPool into the
>> Controller. This will prevent container reuse across controllers (bad!),
>> but will bypass Kafka (good!). I think this will be plausible for actions
>> that support concurrency, and may be useful for anything that runs as
>> blocking to shave off a few ms of latency, but I'm not sure of all the
>> ramifications yet.
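>>
>> To make the backpressure idea a bit more concrete, here is a rough sketch
>> with made-up names (this is not the actual Invoker/MessageFeed code, only
>> the shape of the feedback loop):
>>
>>     // Hypothetical names only; not the actual Invoker/MessageFeed code.
>>     trait Pool { def freeSlots(): Int } // how many activations can start right now
>>
>>     class CapacityAwareFeed(pool: Pool, fetch: Int => Seq[String], handle: String => Unit) {
>>       // Pull at most as many messages as the pool says it can absorb,
>>       // instead of blocking on a fixed outstanding-message window
>>       // (consume + process).
>>       def pump(): Unit = {
>>         val free = pool.freeSlots()
>>         if (free > 0) fetch(free).foreach(handle)
>>       }
>>     }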
>>
>>
>> Another (more far-out) approach that combines some of these is changing the
>> "scheduling" concept to be more about resource reservation and garbage
>> collection. Specifically, the ContainerPool could be a combination of
>> self-managed resources AND remotely managed resources. If no proper (warm)
>> container exists locally or remotely, a self-managed one is created and
>> advertised. Other ContainerPool instances can leverage the remote resources
>> (containers). To pause or remove a container requires advertising the
>> intent to change state and giving clients time to veto. So there is some
>> added complication in the start/reserve/pause/rm container lifecycle, but
>> the case for reuse is maximized in the best-case scenario
>> (concurrency-tolerant actions), and concurrency-intolerant actions have a
>> chance to leverage a broader pool of containers (iff the ability to reserve
>> a shared available container is fast enough, compared to starting a new
>> cold one). There is a lot wrapped up in there (how are resources
>> advertised, what are the new states of the lifecycle, etc.), so take this
>> idea with a grain of salt.
>>
>>
>> Specific to your PR: do you need an SPI for ContainerProxy? Or can it
>> just be designated by the ContainerPool impl to use a specific
>> ContainerProxy variant? I think these are now, and will continue to be,
>> tied closely together, so I would manage them as a single SPI.
>>
>> Thanks
>> Tyson
>>
>> [1] https://github.com/apache/incubator-openwhisk/pull/2795
>> <https://github.com/apache/incubator-openwhisk/pull/2795#pullrequestreview-115830270>
>>
>> On May 16, 2018, at 10:42 PM, Dominic Kim <[email protected]> wrote:
>>
>> Dear all.
>>
>> Does anyone have any comments on this?
>> Any comments or opinions would be very welcome : )
>>
>> I think we need roughly the following changes to take this in:
>>
>> 1. SPI support for ContainerProxy and ContainerPool (I already opened a PR
>> for this: https://github.com/apache/incubator-openwhisk/pull/3663)
>> 2. SPI support for throttling/entitlement logic
>> 3. New loadbalancer
>> 4. New ContainerPool and ContainerProxy
>> 5. New notions of limits and logic to handle more fine-grained limits (they
>> should be able to coexist with the existing limits)
>>
>> If there are no comments, I will open PRs for these one by one.
>> Then it will be easier for us to discuss, because we can discuss directly
>> over a basic implementation.
>>
>> Thanks
>> Regards
>> Dominic.
>>
>>
>>
>> 2018-05-09 11:42 GMT+09:00 Dominic Kim <[email protected]>:
>>
>> One more thing to clarify is invoker parallelism.
>>
>> ## Invoker coordinates all requests.
>>
>> I tend to disagree with the "cannot take advantage of parallel
>> processing" bit. Everything in the invoker is parallelized after updating
>> its central state (which should take a **very** short amount of time
>> relative to actual action runtime). It is not really optimized to scale to
>> a lot of containers *yet*.
>>
>> More precisely, I think it is related to the Kafka consumer rather than the
>> invoker. Invoker logic can run in parallel, but `MessageFeed` seemingly
>> cannot. Once the number of outstanding messages reaches the maximum, it
>> waits for messages to be processed. If a few activation messages for one
>> action are not handled promptly, `MessageFeed` does not consume more
>> messages from Kafka (until some messages are processed).
>> So subsequent messages for other actions cannot be fetched, or are delayed,
>> because of the unprocessed messages. This is why I mentioned invoker
>> parallelism; I should rephrase it as `MessageFeed` parallelism.
>>
>> As you know, the partition is the unit of parallelism in Kafka. If we have
>> multiple partitions for the activation topic, we can set up multiple
>> consumers, which enables parallel processing of Kafka messages as well.
>> Since the logic in the invoker can already run in parallel, with this
>> change we can process messages entirely in parallel.
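>>
>> To illustrate with the plain Kafka consumer API (the topic name, group id
>> and broker address below are just examples, not my actual implementation):
>> each instance of the loop below is one consumer in a consumer group, and
>> with N partitions up to N such consumers process messages in parallel,
>> since Kafka assigns each partition to exactly one consumer in the group.
>>
>>     import java.time.Duration
>>     import java.util.{Collections, Properties}
>>     import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
>>     import scala.jdk.CollectionConverters._
>>
>>     object ActivationWorker {
>>       def main(args: Array[String]): Unit = {
>>         val props = new Properties()
>>         props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")    // example broker
>>         props.put(ConsumerConfig.GROUP_ID_CONFIG, "activation-guest-myAction")  // example group
>>         props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
>>           "org.apache.kafka.common.serialization.StringDeserializer")
>>         props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
>>           "org.apache.kafka.common.serialization.StringDeserializer")
>>
>>         val consumer = new KafkaConsumer[String, String](props)
>>         consumer.subscribe(Collections.singletonList("activation-guest-myAction"))
>>         while (true) {
>>           val records = consumer.poll(Duration.ofMillis(500))
>>           records.asScala.foreach(r => println(s"would run activation: ${r.value}"))
>>         }
>>       }
>>     }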
>>
>> In my proposal, I split activation messages from container coordination
>> messages (action parallelism), assign more partitions to activation
>> messages (in-topic parallelism), and enable parallel processing with
>> multiple consumers (containers).
>>
>>
>> Thanks
>> Regards
>> Dominic.
>>
>>
>>
>>
>>
>> 2018-05-08 19:34 GMT+09:00 Dominic Kim <[email protected]>:
>>
>> Thank you for the response Markus and Christian.
>>
>> Yes, I agree that we need to discuss this proposal in an abstract way
>> rather than in conjunction with any specific technology, because we can
>> then adopt a better software stack if possible.
>> Let me answer your questions one by one.
>>
>>
>> ## Does not wait for previous run.
>>
>> Yes, that is a valid point. If we keep accumulating requests in the queue,
>> latency can be spiky, especially when the execution time of an action is
>> long.
>> So if we want to take this in, we need to find a proper way to balance
>> creating more containers (for latency) against letting existing containers
>> handle the requests.
>>
>>
>> ## Not able to accurately control concurrent invocation.
>>
>> OK, I originally thought this was about concurrent containers rather than
>> concurrent activations.
>> But I am still inclined toward the concurrent-containers approach.
>> In the current logic, the limit depends on factors other than the real
>> number of concurrent invocations.
>> If the RTT between controllers and invokers becomes high for some reason,
>> the controller will reject new requests even though the invokers are
>> actually idle.
>>
>> ## TPS is not deterministic.
>>
>> I did not mean deterministic TPS for just one user; rather, I meant
>> system-wide deterministic TPS.
>> Surely TPS can vary when heterogeneous actions (with different execution
>> times) are invoked.
>> But currently it is not easy to figure out what the TPS is even with only
>> one kind of action, because it changes based not only on the heterogeneity
>> of actions but also on the number of users and namespaces.
>>
>> I think we should at least be able to state an official spec of this kind:
>> when actions with a 20 ms execution time are invoked, our system delivers
>> 20,000 TPS (no matter how many users or namespaces are involved).
>>
>>
>> Your understanding of my proposal is perfectly correct.
>> One small thing to add: the controller sends a `ContainerCreation` request
>> based on the processing speed of containers rather than the availability of
>> existing containers.
>>
>> BTW, regarding your concern about Kafka topics, I think we may be fine
>> because, while the number of topics will be unbounded, the number of active
>> topics will be bounded.
>>
>> If we take this approach, it is mandatory to limit the retention bytes and
>> retention duration for each topic.
>> Then the number of active topics is limited and the actual data in them is
>> also limited, so I think that would be fine.
>>
>> But it is necessary to find optimal retention configurations and to run
>> many benchmarks to confirm this.
>>
>>
>> And I didn't quite get the point about eventual consistency of the consumer
>> lag. Did you mean it is eventually consistent because it changes very
>> quickly, even within one second?
>>
>>
>> Thanks
>> Regards
>> Dominic
>>
>>
>>
>> 2018-05-08 17:25 GMT+09:00 Markus Thoemmes <[email protected]>:
>>
>> Hey Dominic,
>>
>> Thank you for the very detailed writeup. Since there is a lot in here,
>> please allow me to rephrase some of your proposals to see if I understood
>> correctly. I'll go through point-by-point to try to keep it close to your
>> proposal.
>>
>> **Note:** This is a result of an extensive discussion of Christian
>> Bickel (@cbickel) and myself on this proposal. I used "I" throughout the
>> writeup for easier readability, but all of it can be read as "we".
>>
>> # Issues:
>>
>> ## Interventions of actions.
>>
>> That's a valid concern when using today's loadbalancer. This is
>> noisy-neighbor behavior that can happen today under the circumstances you
>> describe.
>>
>> ## Does not wait for previous run.
>>
>> True today as well. The algorithms used until today value correctness
>> over performance. You're right that you could track the expected queue
>> occupancy and schedule accordingly. That does have its own risks, though
>> (what if your action has very spiky latency behavior?).
>>
>> I'd generally propose to break this out into a separate discussion. It
>> doesn't really correlate with the other points, WDYT?
>>
>> ## Invoker coordinates all requests.
>>
>> I tend to disagree with the "cannot take advantage of parallel
>> processing" bit. Everything in the invoker is parallelized after updating
>> its central state (which should take a **very** short amount of time
>> relative to actual action runtime). It is not really optimized to scale to
>> a lot of containers *yet*.
>>
>> ## Not able to accurately control concurrent invocation.
>>
>> Well, the limits are "concurrent actions in the system". You should be
>> able to get 5 activations on the queue with today's mechanism. You should
>> get as many containers as needed to handle your load. For very
>> short-running actions, you might not need N containers to handle N
>> messages
>> in the queue.
>>
>> ## TPS is not deterministic.
>>
>> I'm wondering: has TPS been deterministic for just one user? I'd argue
>> that this is a valid metric in its own right. I agree that these numbers
>> can drop significantly under heterogeneous load.
>>
>> # Proposal:
>>
>> I'll try to rephrase and add some bits of abstraction here and there to
>> see if I understood this correctly:
>>
>> The controller should schedule based on individual actions. It should
>> not send those to an arbitrary invoker but rather to something that
>> identifies the actions themselves (a Kafka topic in your example). I'll
>> call this *PerActionContainerPool*. Those calls from the controller will be
>> handled by each *ContainerProxy* directly rather than being threaded
>> through another "centralized" component (the invoker). The *ContainerProxy*
>> is responsible for handling the "aftermath": writing activation records,
>> collecting logs, etc. (like today).
>>
>> Iff the controller thinks that the existing containers cannot sustain
>> the load (i.e. if all containers are currently in use), it advises a
>> *ContainerCreationSystem* (all invokers combined in your case) to create a
>> new container. This container will be added to the
>> *PerActionContainerPool*.
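>>
>> Purely to illustrate that decision (with made-up names; this is not how the
>> code is actually structured), it could look roughly like this:
>>
>>     object SchedulingSketch {
>>       // Illustrative only; these types do not exist under these names.
>>       final case class PoolState(containers: Int, busy: Int)
>>
>>       def onActivation(action: String,
>>                        pools: Map[String, PoolState],
>>                        publish: String => Unit,
>>                        createContainer: String => Unit): Unit = {
>>         val state = pools.getOrElse(action, PoolState(0, 0))
>>         publish(action)                       // enqueue on the per-action topic
>>         if (state.busy >= state.containers)   // all existing containers are in use
>>           createContainer(action)             // ask the ContainerCreationSystem for one more
>>       }
>>     }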
>>
>> The invoker in your proposal has no scheduling logic at all (which is
>> sound given the issues laid out above) other than container creation
>> itself.
>>
>> # Conclusion:
>>
>> I like the proposal in the abstract way I've tried to phrase above. It
>> indeed amplifies warm-container usage and in general should be superior to
>> the more statistical approach of today's loadbalancer.
>>
>> I think we should discuss this proposal in an abstract,
>> non-technology-bound way. I do think that having so many Kafka topics,
>> including all the rebalancing needed, can become an issue, especially
>> because the sheer number of Kafka topics is unbounded. I also think that
>> the consumer lag is subject to eventual consistency and, depending on how
>> eventual that is, it can turn into queueing in your system, even though
>> that wouldn't be necessary from a capacity perspective.
>>
>> I don't want to ditch the proposal because of those concerns though!
>>
>> As I said: The proposal itself makes a lot of sense and I like it a lot!
>> Let's not trap ourselves in the technology used today though. You're
>> proposing a major restructuring so we might as well think more
>> green-fieldy. WDYT?
>>
>> Cheers,
>> Christian and Markus
>>
>>
>>
>>
>>
>>
>
