Dear Markus,
Thank you for the great work!

I think this is a good approach in the big picture.


I have a few questions.

1. Buffering of activations and failure handling.

As of now, Kafka acts as a kind of buffer in case activation processing is
delayed for some reason, such as an invoker failure.
If Kafka is only used for the overflow case, how can we still guarantee "at
least once" activation processing?
For example, if a controller receives a request and crashes before the
activation is complete, how can the other live controllers, or the restored
controller, handle it?
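
To make the concern concrete, here is a rough, self-contained sketch (not
OpenWhisk code; the broker address, group id, and topic name are made up)
of the at-least-once pattern Kafka gives us today, where the offset is
committed only after the activation has been processed:

import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

object AtLeastOnceSketch {
  def processActivation(activation: String): Unit =
    println(s"processing $activation") // stand-in for running the action

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed local broker
    props.put("group.id", "invokers")                // made-up group id
    props.put("enable.auto.commit", "false")         // commit manually, later
    props.put("key.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(List("activations").asJava)   // made-up topic name

    while (true) {
      val records = consumer.poll(Duration.ofMillis(500))
      records.asScala.foreach(r => processActivation(r.value()))
      // The offset is committed only after processing, so a crash before
      // this point means the records are redelivered to another consumer.
      consumer.commitSync()
    }
  }
}

If a consumer crashes before the commit, the records are simply redelivered
to another member of the group. If Kafka only sees the overflow traffic,
requests held in a crashed controller's memory have no such replay path.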


2. A bottleneck in ContainerManager.

Right now the ContainerManager holds a lot of critical logic.
It takes care of the container lifecycle, logging, scheduling, and so on.
It also has to be aware of the state of every container as well as the
status of every controller, and distribute containers among the live
controllers.
It might need to do health checking to/from all controllers and containers
(or the orchestrator, such as Kubernetes).

I think the ContainerManager can become a bottleneck as the size of the
cluster grows.
And if we add more nodes to scale out the ContainerManager, or to avoid a
SPOF, then all of the ContainerManager's state has to be shared among those
nodes.
If we take a master/slave approach, the master becomes a bottleneck at some
point, and if we take a clustering approach, we need some mechanism to
synchronize the cluster state among the ContainerManagers.
And this synchronization has to complete within one or two orders of
magnitude of a millisecond (roughly 10-100 ms).
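
Just to illustrate how much state I mean, here is a rough sketch (all names
are made up; this is not a proposed data model) of what each
ContainerManager node would have to keep consistent with its peers:

// Rough sketch of the state a ContainerManager node would have to keep
// consistent with its peers. All names here are made up for illustration.
final case class ContainerRef(id: String, node: String)

final case class ActionPool(
  action: String,                             // fully qualified action name
  containers: Set[ContainerRef],              // all warm containers for it
  assignments: Map[String, Set[ContainerRef]] // controllerId -> containers
)

final case class ManagerState(
  controllers: Set[String],        // live controllers, from health checks
  pools: Map[String, ActionPool],  // per-action container pools
  capacity: Map[String, Int]       // free slots per cluster node
)

// Every container creation or deletion, every controller joining or
// failing, and every capacity change mutates this state, and the replicas
// have to agree on it before the next scheduling decision.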

Do you have anything in mind to handle this?


3. Imbalance among controllers.

I think there could be some imbalance among controllers.
For example, say there are 3 controllers owning 3, 1, and 1 containers
respectively for a given action.
In some cases the single container in one controller might be overloaded
while 2 containers in another controller are sitting idle.
If the number of containers each controller owns for a given action varies,
this can happen even more easily.
This is because controllers are not aware of the status of other
controllers.
So in some cases a few action containers are overloaded while the others
handle only a moderate load.
Each controller may then request more containers instead of utilizing
existing containers (owned by other controllers), and this can lead to a
waste of resources.
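
A tiny sketch of the purely local decision I am worried about (names and
numbers are made up for illustration):

object ImbalanceSketch {
  // Each controller decides purely from its own local view; it cannot see
  // idle containers owned by other controllers.
  final case class LocalView(ownContainers: Int,
                             busyContainers: Int,
                             queuedRequests: Int)

  def needsMoreContainers(view: LocalView): Boolean =
    view.busyContainers == view.ownContainers && view.queuedRequests > 0

  def main(args: Array[String]): Unit = {
    // controller1: 1 container, busy, 5 requests queued -> asks for more
    val controller1 = LocalView(1, 1, 5)
    // controller2: 2 idle containers for the same action -> they stay unused
    val controller2 = LocalView(2, 0, 0)

    println(needsMoreContainers(controller1)) // true
    println(needsMoreContainers(controller2)) // false
  }
}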


4. How do controllers determine whether to create more containers?

Let's say a controller has only one container for a given action.
How does the controller recognize that this container is overloaded and
that more containers need to be created?
If the action execution time is short, it can look at the number of
buffered activations for the given action.
But if the action execution time is long, say 1 or 2 minutes, then even if
there is only 1 activation request in the buffer, the controller needs to
create more containers.
(Because subsequent activation requests would otherwise be delayed by 1 or
2 minutes.)
Since we cannot know the execution time of an action in advance, we may
need a sort of timeout (on the activation response) for all actions.
But even then, we cannot know how much execution time remains for the given
action once the timeout has fired.
Further, if a user sends 100 or 200 concurrent invocations of a 2-minute
action, all subsequent requests will suffer the latency overhead of that
timeout.
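
Just to illustrate the last point, here is a rough sketch of the kind of
timeout-based heuristic I have in mind (the threshold and names are made
up; this is not a proposal):

object TimeoutScaleUpSketch {
  // A buffered activation waiting for the (single) container to free up.
  final case class Buffered(activationId: String, enqueuedAtMs: Long)

  // Made-up threshold: how long we wait for an activation response before
  // assuming the existing container is overloaded.
  val responseTimeoutMs: Long = 1000L

  // Ask for another container once the oldest buffered activation has
  // waited longer than the timeout.
  def shouldCreateContainer(buffer: List[Buffered], nowMs: Long): Boolean =
    buffer.headOption.exists(b => nowMs - b.enqueuedAtMs > responseTimeoutMs)
}

With a 2-minute action and 100 or 200 concurrent invocations, every
buffered request first has to sit behind this timeout (and then behind
container creation), which is the latency overhead I mentioned above.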



Thanks
Best regards
Dominic.




2018-07-18 22:45 GMT+09:00 Martin Gencur <mgen...@redhat.com>:

> On 18.7.2018 14:41, Markus Thoemmes wrote:
>
>> Hi Martin,
>>
>> thanks for the great questions :)
>>
>> thinking about scalability and the edge case. When there are not
>>> enough
>>> containers and new controllers are being created, and all of them
>>> redirect traffic to the controllers with containers, doesn't it mean
>>> overloading the available containers a lot? I'm curious how we
>>> throttle the traffic in this case.
>>>
>> True, the first few requests will overload the controller that owns the
>> very first container. That one will request new containers immediately,
>> which will then be distributed to all existing Controllers by the
>> ContainerManager. An interesting wrinkle here is that you'd want the
>> overloading requests to be completed by the Controllers that sent them to
>> the "single-owning-Controller".
>>
>
> Ah, got it. So it is a pretty common scenario. Scaling out controllers and
> containers. I thought this is a case where we reach a limit of created
> containers and no more containers can be created.
>
>
>   What we could do here is:
>>
>> Controller0 owns ContainerA1
>> Controller1 relays requests for A to Controller0
>> Controller0 has more requests than it can handle, so it requests
>> additional containers. All requests coming from Controller1 will be
>> completed with a predefined message (for example "HTTP 503 overloaded" with
>> a specific header say "X-Return-To-Sender-By: Controller0")
>> Controller1 recognizes this as "okay, I'll wait for containers to
>> appear", which will eventually happen (because Controller0 has already
>> requested them) so it can route and complete those requests on its own.
>> Controller1 will now no longer relay requests to Controller0 but will
>> request containers itself (acknowledging that Controller0 is already
>> overloaded).
>>
>
> Yeah, I think it makes sense.
>
>
>> I guess the other approach would be to block creating new controllers
>>> when there are no containers available as long as we don't want to
>>> overload the existing containers. And keep the overflowing workload
>>> in Kafka as well.
>>>
>> Right, the second possibility is to use a pub/sub (not necessarily Kafka)
>> queue between Controllers. Controller0 subscribes to a topic for action A
>> because it owns a container for it. Controller1 doesn't own a container
>> (yet) and publishes a message as overflow to topic A. The wrinkle in this
>> case is that Controller0 can't complete the request but needs to send it
>> back to Controller1 (where the HTTP connection is opened from the client).
>>
>> Does that make sense?
>>
>
> I was rather thinking about blocking the creation of Controller1 in this
> case and responding to the client that the system is overloaded. But the
> first approach seems better because it's a pretty common use case (not
> reaching the limit of created containers).
>
> Thanks!
> Martin
>
>
>> Cheers,
>> Markus
>>
>>
>
