[ 
https://issues.apache.org/jira/browse/OAK-5433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15837624#comment-15837624
 ] 

Stefan Egli commented on OAK-5433:
----------------------------------

IIUC the delaying happens in {{onEventQueuing}}, which is invoked within 
{{BackgroundObserver.contentChanged}}. The latter is one of the (synchronously 
called) observers. Thus if you block one observer you block the entire commit 
'subsystem', i.e. any other thread that wants to do a commit.
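
To illustrate the concern, here is a minimal, self-contained sketch. It is not Oak code: {{SketchStore}}, {{SketchObserver}} and the single {{commitLock}} are hypothetical stand-ins, assuming only that observers are notified synchronously on the commit path. Under that assumption, a delay inside one observer is paid by every thread that commits.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical stand-in for an observer; not Oak's actual Observer interface. */
interface SketchObserver {
    void contentChanged(String change);
}

/** Simplified model of a commit path that notifies observers synchronously. */
final class SketchStore {
    // Assumption for this sketch: one lock serializes all commits.
    private final Object commitLock = new Object();
    private final List<SketchObserver> observers = new CopyOnWriteArrayList<>();

    void addObserver(SketchObserver observer) {
        observers.add(observer);
    }

    /** Performs a "commit" and returns how long the caller was held up, in ms. */
    long commit(String change) {
        long start = System.nanoTime();
        synchronized (commitLock) {
            // Observers run on the committing thread while the lock is held.
            for (SketchObserver observer : observers) {
                observer.contentChanged(change);
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}

final class ObserverBlockingSketch {
    /**
     * Runs two concurrent commits against a store whose only observer sleeps
     * for delayMillis, and returns the duration of the slower commit in ms.
     */
    static long slowestOfTwoCommits(long delayMillis) {
        SketchStore store = new SketchStore();
        store.addObserver(change -> sleep(delayMillis));
        long[] elapsed = new long[2];
        Thread a = new Thread(() -> elapsed[0] = store.commit("a"));
        Thread b = new Thread(() -> elapsed[1] = store.commit("b"));
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return Math.max(elapsed[0], elapsed[1]);
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("slowest commit took ms: " + slowestOfTwoCommits(100));
    }
}
```

In this model the second committer waits for the first observer delay and then pays its own, so the slower commit takes roughly twice the configured delay even though neither thread asked to be paced.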

> System Pacing Service
> ---------------------
>
>                 Key: OAK-5433
>                 URL: https://issues.apache.org/jira/browse/OAK-5433
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: core
>            Reporter: Stefan Eissing
>         Attachments: ing-6k-t500-7500-6-wflauncher.png, 
> oak-trunk-pacer-v1.diff, obs-pacing.diff
>
>
> h3. tl;dr
> By adding Pacing suited to the application {{oak}} is running in, a system 
> will dynamically adapt the load to its own capabilities. In tests, this 
> effectively keeps the system stable and responsive under stress.
> h3. The Situation
> During experimental lab tests on large clusters, it became clear that a 
> web system based on oak is challenged by fluctuating load in relation to 
> its own capabilities. 
> When the load increases "too much" it shows the following symptoms:
> * event observation queues grow
> * maintenance tasks (on master) take too long
> * async tasks, triggered by requests, (e.g. workflows) accumulate
> and eventually
> * login sessions complain about freshness
> * revisions diffs are old and no longer in caches
> and sometimes
> * database lease times out and oak-core shuts down
> This problem can arise when outside requests increase, when local 
> maintenance tasks occupy resources, or when available CPU diminishes due to 
> other processes, page faults, and so on.
> Unfortunately, whenever the system becomes overburdened, the secondary 
> effects make the system even slower and, thus, more overburdened. This can 
> end in a vicious circle, making the system totally unresponsive. Eventual 
> recovery is an option, not a guarantee.
> h3. Pacing
> By _Pacing_ I mean a system behaviour that tries to balance load in relation 
> to capabilities. If the latter one drops, the load must be reduced until the 
> system recovers. This is related to what the {{CommitRateLimiter}} wanted to 
> achieve by monitoring observation queues.
> The design of the {{CommitRateLimiter}} could be very efficient, if it only 
> knew _which_ commits to delay. But it does not know the application that oak 
> runs in. I propose replacing the Limiter by a {{PacingService}} that can be 
> provided by the application using oak. The service will get the data about 
> the current commit, queue length and limits. Whatever else it does remains 
> opaque. It may raise a proper exception to indicate that the commit shall 
> fail. But mostly, it is expected to delay those commits that would negatively 
> affect system stability.
> h3. An Example
> In a proof of concept, an AEM system was blasted with endless uploads on 
> multiple connections in order to eventually overwhelm queues. Then a pacer 
> was patched into oak-core that delayed commits from servlet requests and from 
> certain workflows for some milliseconds until the queue length shrank again. 
> The pacing had a maximum wait time that would make the commit fail.
> The pacer was configured to trigger at 75% of maximum queue length and the 
> system was blasted with uploads again. In the tests:
> # the max queue length stayed under 80%
> # no upload reached the maximum wait time; all succeeded
> The system adapted the external load to its capabilities successfully. 
>   
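
For reference, the pacing behaviour described in the quoted example (delay a commit while the queue is at or above a trigger threshold, fail it once a maximum wait time is exceeded) could be sketched roughly as below. This is only an illustration under stated assumptions: {{PacingServiceSketch}}, {{ThresholdPacer}}, {{PacingTimeoutException}} and all parameter names are hypothetical and are not taken from the attached patches.

```java
import java.util.function.IntSupplier;

/** Hypothetical exception signalling that a paced commit shall fail. */
class PacingTimeoutException extends RuntimeException {
    PacingTimeoutException(String message) {
        super(message);
    }
}

/** Hypothetical callback an application could provide; not the patch's API. */
interface PacingServiceSketch {
    void beforeCommit(IntSupplier queueLength, int maxQueueLength);
}

/** Delays commits while the queue is at or above a trigger ratio. */
final class ThresholdPacer implements PacingServiceSketch {
    private final double triggerRatio;  // e.g. 0.75 for "75% of maximum queue length"
    private final long maxWaitMillis;   // beyond this the commit fails
    private final long pollMillis;      // how long to sleep between checks

    ThresholdPacer(double triggerRatio, long maxWaitMillis, long pollMillis) {
        this.triggerRatio = triggerRatio;
        this.maxWaitMillis = maxWaitMillis;
        this.pollMillis = pollMillis;
    }

    @Override
    public void beforeCommit(IntSupplier queueLength, int maxQueueLength) {
        long deadline = System.nanoTime() + maxWaitMillis * 1_000_000L;
        // Re-read the queue length on every pass; it shrinks as events drain.
        while (queueLength.getAsInt() >= triggerRatio * maxQueueLength) {
            if (System.nanoTime() - deadline >= 0) {
                throw new PacingTimeoutException(
                        "queue still above threshold after " + maxWaitMillis + " ms");
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new PacingTimeoutException("interrupted while pacing");
            }
        }
    }
}
```

A caller would invoke {{beforeCommit}} only for the commits the application considers safe to delay (in the quoted test, servlet requests and certain workflows) and let all other commits pass through unpaced.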



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
