[ 
https://issues.apache.org/jira/browse/OAK-5433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15818161#comment-15818161
 ] 

Thomas Mueller commented on OAK-5433:
-------------------------------------

> However each would not know about the other

Yes. I don't think they should be used at the same time; I was just trying to 
avoid _removing_ the CommitRateLimiter right now. I understand the 
CommitRateLimiter has many limits (well, _limit_ is part of its name), and I 
don't question the need for a better solution. But until we have a better 
solution, I think we should keep it.

> System Pacing Service
> ---------------------
>
>                 Key: OAK-5433
>                 URL: https://issues.apache.org/jira/browse/OAK-5433
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: core
>            Reporter: Stefan Eissing
>         Attachments: obs-pacing.diff
>
>
> h3. tl;dr
> By adding Pacing, suitable to the application {{oak}} is running in, a system 
> will dynamically adapt the load to its own capabilities. This effectively, in 
> tests, keeps the system stable and responsive under stress.
> h3. The Situation
> During experimental lab tests on large clusters, it became clear that a 
> web system based on oak is challenged by fluctuating load in relation to 
> its own capabilities. 
> When the load increases "too much" it shows the following symptoms:
> * event observation queues grow
> * maintenance tasks (on master) take too long
> * async tasks, triggered by requests, (e.g. workflows) accumulate
> and eventually
> * login sessions complain about freshness
> * revision diffs are old and no longer in caches
> and sometimes
> * database lease times out and oak-core shuts down
> This problem can arise when outside requests increase, or when local 
> maintenance tasks occupy resources, or when available CPU diminishes due to 
> other processes or page faults, or from any number of other causes.
> Unfortunately, whenever the system becomes overburdened, the secondary 
> effects make the system even slower and, thus, more overburdened. This can 
> end in a vicious circle, making the system totally unresponsive. Eventual 
> recovery is an option, not a guarantee.
> h3. Pacing
> By _Pacing_ I mean a system behaviour that tries to balance load in relation 
> to capabilities. If the latter one drops, the load must be reduced until the 
> system recovers. This is related to what the {{CommitRateLimiter}} wanted to 
> achieve by monitoring observation queues.
> The design of the {{CommitRateLimiter}} could be very efficient, if it only 
> knew _which_ commits to delay. But it does not know the application that oak 
> runs in. I propose replacing the Limiter by a {{PacingService}} that can be 
> provided by the application using oak. The service will get the data about 
> the current commit, queue length and limits. Whatever else it does remains 
> opaque. It may raise a proper exception to indicate that the commit shall 
> fail. But mostly, it is expected to delay those commits that would negatively 
> affect system stability.
> h3. An Example
> In a proof of concept, an AEM system was blasted with endless uploads on 
> multiple connections in order to eventually overwhelm queues. Then a pacer 
> was patched into oak-core that delayed commits from servlet requests and from 
> certain workflows for some milliseconds until the queue length shrank again. 
> The pacing had a maximum wait time; exceeding it would make the commit fail.
> The pacer was configured to trigger at 75% of maximum queue length and the 
> system was blasted with uploads again. In the tests:
> # the max queue length stayed under 80%
> # no upload reached the maximum wait time; all succeeded
> The system adapted the external load to its capabilities successfully. 
>   
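
To make the proposal in the quoted description more concrete, here is a minimal Java sketch of what an application-provided pacer along those lines might look like. It is not taken from obs-pacing.diff or from oak-core. The names PacingService, CommitInfo, PacingException and ThresholdPacer, the method signature, and the "origin" label are all assumptions made for illustration. The sketch delays commits from selected origins while the observation queue is at or above a trigger ratio (75% in the description) and fails the commit once a maximum wait time has passed.

```java
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.function.IntSupplier;

/** Hypothetical SPI (not an Oak API): the application decides how to pace a commit. */
interface PacingService {
    /** Returns when the commit may proceed; throws PacingException if it should fail. */
    void pace(CommitInfo commit, IntSupplier queueLength, int maxQueueLength);
}

/** Hypothetical stand-in for the commit data handed to the pacer. */
final class CommitInfo {
    final String origin; // assumed label, e.g. "servlet", "workflow", "maintenance"
    CommitInfo(String origin) { this.origin = origin; }
}

/** Signals that the commit shall fail (unchecked here only to keep the sketch short). */
class PacingException extends RuntimeException {
    PacingException(String message) { super(message); }
}

/** Delays commits from selected origins while the queue is at or above a trigger ratio. */
final class ThresholdPacer implements PacingService {
    private final double triggerRatio;      // e.g. 0.75 for "75% of maximum queue length"
    private final long maxWaitMillis;       // waiting longer than this fails the commit
    private final long stepMillis;          // pause between two queue length checks
    private final Set<String> pacedOrigins; // only commits from these origins are delayed

    ThresholdPacer(double triggerRatio, long maxWaitMillis, long stepMillis,
                   Set<String> pacedOrigins) {
        this.triggerRatio = triggerRatio;
        this.maxWaitMillis = maxWaitMillis;
        this.stepMillis = stepMillis;
        this.pacedOrigins = pacedOrigins;
    }

    @Override
    public void pace(CommitInfo commit, IntSupplier queueLength, int maxQueueLength) {
        if (!pacedOrigins.contains(commit.origin)) {
            return; // e.g. maintenance commits are never held back
        }
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMillis);
        // Re-read the queue length on every iteration so the pacer sees it shrink.
        while (queueLength.getAsInt() >= triggerRatio * maxQueueLength) {
            if (System.nanoTime() - deadline >= 0) {
                throw new PacingException("queue still above threshold after "
                        + maxWaitMillis + " ms");
            }
            try {
                Thread.sleep(stepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
                throw new PacingException("interrupted while pacing");
            }
        }
    }
}
```

Passing the queue length as a supplier rather than a fixed number is a design choice of this sketch: it lets the pacer observe the queue draining while it waits. Which commits to delay stays entirely with the application, which is the point of the proposal.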



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
