Hi Team,

When the rate of activations is high (especially with concurrency enabled) on a specific invoker, it is possible that the rate at which activations are stored in the ArtifactStore lags behind the rate at which activation records are generated.
For CouchDB this was somewhat mitigated by a Batcher implementation which internally uses the CouchDB bulk-insert API (#2812) [1]. However, the Batcher is currently configured with a queue size of Int.MaxValue [2], which can potentially lead to the invoker going OOM. We tried to implement similar support for CosmosDB (#4513) [3]. In our tests, even a queue size of 100k filled up quickly under higher load.

For #2812 Rodric had mentioned the need to support backpressure [4]:

> we should perhaps open an issue to refactor the relevant code so that we can
> backpressure the invoker feed when the activations can't be drained fast
> enough.

Currently the storeActivation call is not waited upon in ContainerProxy, and hence there is no backpressure. I wanted to check what options we could try when activations can't be drained fast enough.

Option A - Wait till enqueued
-----------------------------------

When calling storeActivation, wait until the call gets "enqueued". If it gets rejected because the queue is full (assuming here that the ArtifactStore has a queued storage path), then wait and retry a few times. If it gets queued, simply complete the call. With this we would not occupy the invoker slot until storage is complete; instead we occupy it a bit longer, until the activation gets enqueued into the in-memory buffer.

Option B - Overflow to Kafka and a new OverflownActivationRecorderService
----------------------------------------------------------------------------------------------------

With in-memory enqueuing there is always a chance of increased heap pressure, especially if the activation size is large (a user-controlled aspect). So another option would be to overflow to Kafka for storing activations: if the internal queue is full (the queue size can now be small), we would enqueue the record to Kafka. Kafka would in general have a higher throughput rate than the normal storage.
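A minimal sketch of Option B's producer side, assuming a small bounded in-memory buffer in front of the ArtifactStore writer. All names here are hypothetical: `buffer` stands in for the Batcher's internal queue, and `overflowToKafka` stands in for a real Kafka producer publishing to the "overflowActivations" topic.

```scala
import java.util.concurrent.ArrayBlockingQueue

// Hypothetical sketch only; not the actual Batcher/ArtifactStore API.
object OverflowSketch {
  // Small bounded buffer in front of the ArtifactStore writer;
  // capacity 2 just for illustration.
  val buffer = new ArrayBlockingQueue[String](2)

  // Stand-in for a Kafka producer publishing to the
  // "overflowActivations" topic; a real implementation would use
  // a KafkaProducer and serialize the activation record.
  var overflowed: List[String] = Nil
  def overflowToKafka(record: String): Unit =
    overflowed = overflowed :+ record

  // Try the bounded buffer first (offer returns false when full);
  // overflow to Kafka only when the buffer is full.
  def store(record: String): Unit =
    if (!buffer.offer(record)) overflowToKafka(record)
}
```

The key point is that `store` never blocks and never grows the heap beyond the small buffer; everything beyond the buffer's capacity is handed to Kafka.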
Then on the other end we can have a new microservice which would poll this "overflowActivations" topic and persist the records in the ArtifactStore. We could even use a single but partitioned topic if we need to scale out the queue processing across multiple service instances.

Any feedback on which option to pursue would be helpful!

Chetan Mehrotra

[1]: https://github.com/apache/incubator-openwhisk/pull/2812
[2]: https://github.com/apache/incubator-openwhisk/blob/master/common/scala/src/main/scala/org/apache/openwhisk/core/database/Batcher.scala#L56
[3]: https://github.com/apache/incubator-openwhisk/pull/4513
[4]: https://github.com/apache/incubator-openwhisk/pull/2812#pullrequestreview-67378126