Stefan Eissing created OAK-5433:
-----------------------------------

             Summary: System Pacing Service
                 Key: OAK-5433
                 URL: https://issues.apache.org/jira/browse/OAK-5433
             Project: Jackrabbit Oak
          Issue Type: New Feature
          Components: core
            Reporter: Stefan Eissing
         Attachments: obs-pacing.diff

h3. tl;dr

Adding _Pacing_ that is suited to the application {{oak}} is running in lets a 
system dynamically adapt its load to its own capabilities. In tests, this kept 
the system stable and responsive under stress.

h3. The Situation
During experimental lab tests on large clusters, it became clear that a web 
system based on oak is challenged by load that fluctuates in relation to its 
own capabilities.

When the load increases "too much" it shows the following symptoms:
* event observation queues grow
* maintenance tasks (on master) take too long
* async tasks, triggered by requests, (e.g. workflows) accumulate
and eventually
* login sessions complain about freshness
* revision diffs are old and no longer in caches
and sometimes
* database lease times out and oak-core shuts down

This problem can arise when outside requests increase, when local maintenance 
tasks occupy resources, or when available CPU diminishes due to other 
processes, page faults and so on.

Unfortunately, whenever the system becomes overburdened, the secondary effects 
make the system even slower and thus more overburdened. This can end in a 
vicious circle that makes the system totally unresponsive. Eventual recovery is 
an option, not a guarantee.

h3. Pacing

By _Pacing_ I mean a system behaviour that tries to balance load in relation to 
capabilities. If the latter drop, the load must be reduced until the system 
recovers. This is related to what the {{CommitRateLimiter}} wanted to achieve 
by monitoring observation queues.

The design of the {{CommitRateLimiter}} could be very efficient if it only 
knew _which_ commits to delay. But it does not know the application that oak 
runs in. I propose replacing the limiter with a {{PacingService}} that can be 
provided by the application using oak. The service receives data about the 
current commit, the queue length and its limits. Whatever else it does remains 
opaque to oak. It may raise a proper exception to indicate that the commit 
shall fail, but mostly it is expected to delay those commits that would 
negatively affect system stability.
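
A minimal sketch of what such a service contract might look like. Everything in it is an assumption made for illustration: the names {{CommitPacingInfo}}, {{CommitPacingException}}, {{beforeCommit}} and the {{origin}} field are hypothetical and are neither existing oak API nor necessarily what the attached patch uses.

```java
/**
 * Hypothetical snapshot of what oak would hand to the pacer for one commit.
 * Field names are assumptions, not existing oak API.
 */
final class CommitPacingInfo {
    final String origin;    // assumed tag for the commit source, e.g. "servlet-request"
    final int queueLength;  // current observation queue length
    final int queueLimit;   // maximum observation queue length

    CommitPacingInfo(String origin, int queueLength, int queueLimit) {
        this.origin = origin;
        this.queueLength = queueLength;
        this.queueLimit = queueLimit;
    }

    /** Fill level of the queue as a fraction between 0.0 and 1.0 (0.0 if no limit is set). */
    double queueFill() {
        return queueLimit <= 0 ? 0.0 : (double) queueLength / queueLimit;
    }
}

/** Hypothetical exception a pacer raises when the commit shall fail. */
class CommitPacingException extends Exception {
    CommitPacingException(String message) {
        super(message);
    }
}

/** Hypothetical contract implemented by the application that embeds oak. */
interface PacingService {
    /**
     * Called before a commit is applied. The implementation may return at once,
     * block for a while to delay the commit, or throw to make the commit fail.
     */
    void beforeCommit(CommitPacingInfo info)
            throws CommitPacingException, InterruptedException;
}

/** Default that never delays anything, for applications that provide no pacer. */
final class NoPacing implements PacingService {
    @Override
    public void beforeCommit(CommitPacingInfo info) {
        // intentionally empty: every commit proceeds immediately
    }
}
```

With a contract of this shape, oak only supplies the numbers. The decision about which commits to delay, and for how long, stays with the application.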

h3. An Example

In a proof of concept, an AEM system was blasted with endless uploads on 
multiple connections in order to eventually overwhelm the queues. Then a pacing 
mechanism was patched into oak-core that delayed commits from servlet requests 
and from certain workflows for some milliseconds until the queue length shrank 
again. The pacing had a maximum wait time, after which the commit would fail.

The pacer was configured to trigger at 75% of maximum queue length and the 
system was blasted with uploads again. In the tests:
# the max queue length stayed under 80%
# no upload reached the maximum wait time; all succeeded

The system adapted the external load to its capabilities successfully. 
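
The behaviour of the proof-of-concept pacer can be approximated as follows. This is only a sketch under stated assumptions, not the code from the attached patch: the class name {{QueueThresholdPacer}}, the origin tags and the polling approach are hypothetical, and the queue length is read through a plain {{IntSupplier}} rather than any real oak interface.

```java
import java.util.Set;
import java.util.concurrent.TimeoutException;
import java.util.function.IntSupplier;

/**
 * Delays selected commits while the observation queue is at or above a trigger
 * level and fails them once a maximum wait time has passed.
 * All names are hypothetical; none of this is existing oak-core API.
 */
final class QueueThresholdPacer {
    private final IntSupplier queueLength;   // reads the current queue length
    private final int queueLimit;            // maximum queue length
    private final double triggerRatio;       // e.g. 0.75 to trigger at 75%
    private final long pollMillis;           // pause between two queue checks
    private final long maxWaitMillis;        // after this wait the commit fails
    private final Set<String> pacedOrigins;  // commit sources that may be delayed

    QueueThresholdPacer(IntSupplier queueLength, int queueLimit, double triggerRatio,
                        long pollMillis, long maxWaitMillis, Set<String> pacedOrigins) {
        this.queueLength = queueLength;
        this.queueLimit = queueLimit;
        this.triggerRatio = triggerRatio;
        this.pollMillis = pollMillis;
        this.maxWaitMillis = maxWaitMillis;
        this.pacedOrigins = pacedOrigins;
    }

    /**
     * Blocks until the queue is below the trigger level.
     * Returns the milliseconds spent waiting, or throws TimeoutException
     * when the maximum wait time is reached, so the caller can fail the commit.
     */
    long awaitCapacity(String origin) throws TimeoutException, InterruptedException {
        if (!pacedOrigins.contains(origin)) {
            return 0L; // commits from other sources are never delayed
        }
        long startNanos = System.nanoTime();
        while (queueLength.getAsInt() >= triggerRatio * queueLimit) {
            long waitedMillis = (System.nanoTime() - startNanos) / 1_000_000L;
            if (waitedMillis >= maxWaitMillis) {
                throw new TimeoutException(
                        "queue still above trigger level after " + waitedMillis + " ms");
            }
            // sleep for the poll interval, but never past the maximum wait time
            Thread.sleep(Math.min(pollMillis, maxWaitMillis - waitedMillis));
        }
        return (System.nanoTime() - startNanos) / 1_000_000L;
    }
}
```

A caller would construct it with something like {{new QueueThresholdPacer(queue::size, 1000, 0.75, 5, 2000, Set.of("servlet-request", "workflow"))}} and invoke {{awaitCapacity}} before each commit. The numeric values here are placeholders, not the ones used in the lab tests.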

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
