Hi Team,

When the rate of activations is high (especially with concurrency enabled) on a specific invoker, it is possible that the rate at which activations are stored in the ArtifactStore lags behind the rate at which activation records are generated.
For CouchDB this was somewhat mitigated by a Batcher implementation which internally uses the CouchDB bulk-insert API (#2812) [1]. However, the Batcher is currently configured with a queue size of Int.MaxValue [2], which can potentially lead to the invoker going OOM. We tried to implement similar support for CosmosDB (#4513) [3]. In our tests, even a queue size of 100k filled up quickly under higher load.

For #2812 Rodric had mentioned the need to support backpressure [4]:

> we should perhaps open an issue to refactor the relevant code so that we can
> backpressure the invoker feed when the activations can't be drained fast
> enough.

Currently the storeActivation call is not waited upon in ContainerProxy, and hence there is no backpressure. I wanted to check what options we could try when activations can't be drained fast enough.

Option A - Wait till enqueued
-----------------------------------

When calling storeActivation, wait until the call gets "enqueued". If it gets rejected because the queue is full (assuming here that the ArtifactStore has a queued storage path), then wait and retry a few times. If it gets queued, simply complete the call. With this we would not occupy the invoker slot until storage is complete; instead we occupy it a bit longer, until the activation gets enqueued into the in-memory buffer.

Option B - Overflow to Kafka and a new OverflownActivationRecorderService
----------------------------------------------------------------------------------------------------

With in-memory enqueuing there is always a chance of increased heap pressure, especially if the activation size is large (a user-controlled aspect). So another option would be to overflow to Kafka for storing activations: if the internal queue is full (the queue size can now be small), we would enqueue the record to Kafka. Kafka would in general have a higher throughput rate than the normal storage.
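A minimal sketch of Option B's producer side, assuming a small bounded in-memory buffer in front of the ArtifactStore writer. All names here are hypothetical: `buffer` stands in for the Batcher's internal queue, and `overflowToKafka` stands in for a real Kafka producer publishing to the "overflowActivations" topic.

```scala
import java.util.concurrent.ArrayBlockingQueue

// Hypothetical sketch only; not the actual Batcher/ArtifactStore API.
object OverflowSketch {
  // Small bounded buffer in front of the ArtifactStore writer;
  // capacity 2 just for illustration.
  val buffer = new ArrayBlockingQueue[String](2)

  // Stand-in for a Kafka producer publishing to the
  // "overflowActivations" topic; a real implementation would use
  // a KafkaProducer and serialize the activation record.
  var overflowed: List[String] = Nil
  def overflowToKafka(record: String): Unit =
    overflowed = overflowed :+ record

  // Try the bounded buffer first (offer returns false when full);
  // overflow to Kafka only when the buffer is full.
  def store(record: String): Unit =
    if (!buffer.offer(record)) overflowToKafka(record)
}
```

The key point is that `store` never blocks and never grows the heap beyond the small buffer; everything beyond the buffer's capacity is handed to Kafka.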
Then on the other end we can have a new microservice which would poll this "overflowActivations" topic and persist the records in the ArtifactStore. We could even use a single but partitioned topic if we need to scale out the queue processing across multiple service instances.

Any feedback on which option to pursue would be helpful!

Chetan Mehrotra

[1]: https://github.com/apache/incubator-openwhisk/pull/2812
[2]: https://github.com/apache/incubator-openwhisk/blob/master/common/scala/src/main/scala/org/apache/openwhisk/core/database/Batcher.scala#L56
[3]: https://github.com/apache/incubator-openwhisk/pull/4513
[4]: https://github.com/apache/incubator-openwhisk/pull/2812#pullrequestreview-67378126