Hi,

Today, we execute the user action in the invoker, send the active-ack
back to the controller, and collect logs afterwards.
This has the following implications:
- The controller receives the active-ack, so it thinks the slot on the
invoker is free again.
- BUT the invoker is still collecting logs, which means that the next
activation scheduled to that slot has to wait until log collection is
finished.
Especially when log collection takes a long time (e.g. because of high
CPU load on the invoker machine), user actions have to wait longer and
longer over time.

If this happens, you will see the following message in the invoker log:
`Rescheduling Run message, too many message in the pool, freePoolSize:
0 containers and 0 MB, busyPoolSize ...`

But it definitely makes sense to send the active-ack (at least for
blocking activations) to the controller as fast as possible, because
the controller should answer the request as fast as possible.

So my proposal is to differentiate between blocking and non-blocking
activations. The invoker already knows today whether an activation is
blocking or not.
If the activation is non-blocking, we hold back the active-ack until
log collection is finished.
If the activation is blocking, we send a first active-ack at the same
point as today, with a field indicating that log collection is not
finished yet, and a second active-ack after log collection is
finished.
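
To make the proposed ordering concrete, here is a rough sketch of the
invoker side. It is not OpenWhisk code: `ActiveAck`,
`log_collection_done`, `complete_activation` and the callbacks are
hypothetical names chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ActiveAck:
    activation_id: str
    # Hypothetical field: False means "result is ready, logs still pending".
    log_collection_done: bool


def complete_activation(
    activation_id: str,
    blocking: bool,
    run_action: Callable[[], None],
    collect_logs: Callable[[], None],
    send_ack: Callable[[ActiveAck], None],
) -> None:
    """Run one activation and emit active-acks in the proposed order."""
    run_action()
    if blocking:
        # Early ack so the controller can answer the waiting request now.
        send_ack(ActiveAck(activation_id, log_collection_done=False))
    collect_logs()
    # Final ack: only this one tells the controller the slot is free.
    send_ack(ActiveAck(activation_id, log_collection_done=True))
```

A blocking activation therefore produces two acks and a non-blocking
one produces a single ack, sent only after log collection.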

With this behaviour, users get their response as fast as possible on
blocking activations, and the load balancer holds off dispatching
until the slot is actually freed up.
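
On the controller side, the idea could be sketched roughly as follows.
Again, this is only an illustration under my own assumptions:
`InvokerSlots` and its methods are hypothetical and do not mirror the
real load balancer code.

```python
class InvokerSlots:
    """Hypothetical controller-side view of one invoker's slots."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.in_flight = set()   # activations still occupying a slot
        self.answered = set()    # blocking requests already answered

    def can_dispatch(self) -> bool:
        return len(self.in_flight) < self.capacity

    def dispatch(self, activation_id: str) -> None:
        if not self.can_dispatch():
            raise RuntimeError("no free slot on this invoker")
        self.in_flight.add(activation_id)

    def on_active_ack(self, activation_id: str,
                      log_collection_done: bool, blocking: bool) -> None:
        if blocking:
            # Answer the user on the first ack that arrives.
            self.answered.add(activation_id)
        if log_collection_done:
            # Only the final ack releases the slot for new work.
            self.in_flight.discard(activation_id)
```

With a capacity of 1, the first ack of a blocking activation answers
the user but `can_dispatch()` stays false until the second ack arrives.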

I also did a test to verify performance.
For this test, I took a system with 100 invokers, each with space for
32 actions of 256 MB (two controllers, one Kafka).
I used our Gatling test `BlockingInvokeOneActionSimulation`. The
action of the test writes one log line and returns the input
parameters again.
The test executed all activations as blocking, which means that two
active-acks were sent per activation.
I used 2880 parallel connections, which should result in 90% system
utilisation (blackbox-fraction is set to 0).
As you can see, this scenario generates the maximum possible number of
active-acks.
The results:
The throughput per second is at 97% compared to the current master.
The response times are also nearly the same.
So there is nearly no regression in the worst case scenario.
In addition, I looked for the log message I mentioned above in the
invoker. It was never written in the test with my changes, but was
written thousands of times on master.
For non-blocking requests I don't expect any regression, but the
waiting time on the invoker should be shorter.
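
For reference, the 90% utilisation figure for the test setup is plain
arithmetic on the numbers above. The variable names are only for
illustration.

```python
invokers = 100
slots_per_invoker = 32        # 256 MB actions per invoker
parallel_connections = 2880   # concurrent blocking requests

total_slots = invokers * slots_per_invoker          # 3200
utilisation = parallel_connections / total_slots    # 0.9

# Every activation in the test is blocking, so each one produces
# two active-acks under the proposed change.
acks_per_activation = 2
max_acks_in_flight = parallel_connections * acks_per_activation  # 5760

print(f"{total_slots} slots, {utilisation:.0%} utilisation")
```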

Another valid approach would be to hold back the active-ack until log
collection is finished (independent of blocking or non-blocking).
If the action is executed blocking, we could say that it is the user's
responsibility not to log too much, or to set the log limit to 0, to
get fast responses.

Does anyone have an opinion on which of the two approaches we should
pursue? Or does anyone have another idea?

Greetings
Christian
