[ 
https://issues.apache.org/jira/browse/OAK-5433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15837602#comment-15837602
 ] 

Stefan Egli commented on OAK-5433:
----------------------------------

In oak, commits can happen at arbitrary speed, while the downstream observation 
part is asynchronous and therefore goes through queues. We know there is a risk 
that too high an input rate overwhelms those queues, so I'm in favor of some 
sort of throttling (flow control) of the commits.

Now the question, I think, comes down to what exactly we want this flow control 
to look like. I'd like to suggest the following classification:

# indiscriminate stop-the-world approach: this is what the CommitRateLimiter 
currently provides: being part of the commit (synchronously), it temporarily 
blocks any further commit when the queues are too big. It does this 
independently (indiscriminately) of the action that caused the commit
# qualified stop-the-world approach: this is how I interpret the suggested 
Pacing Service: also being part of the commit (thus synchronously) it 
temporarily blocks any further commit, but does that only when certain 
(qualified) actions are the cause of the commit. Once a qualified action is identified 
however, the blocking affects all further commits, ie also those done by 
non-suspicious actions.
# (qualified) stop-the-thread approach: this would be an approach whereby only 
the calling thread is blocked, not the whole commit-subsystem. Ideally the 
blocking would happen similar to the Pacing Service, ie dependent on what type 
of action is doing the commit. But the main difference from 2 is that it has no 
side-effects on other threads.
# prevent-the-start-of-an-action approach: going further than 3. could be that 
we would not even start an action (eg a job) if queues are too long, thus not 
even having the need to delay (waste) a thread in the first place. This I think 
would work fine for eg Sling Jobs, but might not be trivial/possible for HTTP 
uploads for example.

I guess what I'm trying to say is that perhaps the stop-the-world approaches of 
1 and 2 are quite intrusive and should maybe only be used as a fallback if all 
other attempts to throttle fail. Meaning that perhaps we should rather try to 
aim for 3 and/or 4 than 2 - and that those might behave more deterministically 
than stop-the-world.

Implementation-wise we might still implement 3 as part of oak eg on the JCR 
session level (and 4 perhaps via a utility method eg getObservationLength).
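
A rough sketch of how 3 and 4 might look. This is purely illustrative: 
ThreadPacer, mayStart, beforeCommit and the queue-length supplier are made-up 
names, not existing oak API, and the threshold and wait times are assumed 
parameters.

```java
import java.util.Set;
import java.util.function.IntSupplier;

/** Hypothetical sketch only; none of these names exist in oak. */
final class ThreadPacer {
    private final IntSupplier queueLength;      // assumed source, e.g. a getObservationLength-style utility
    private final int threshold;                // queue length at which pacing kicks in
    private final Set<String> qualifiedActions; // actions that are allowed to be delayed
    private final long pauseMillis;             // sleep per iteration, must be > 0
    private final long maxWaitMillis;           // upper bound on the total delay

    ThreadPacer(IntSupplier queueLength, int threshold, Set<String> qualifiedActions,
                long pauseMillis, long maxWaitMillis) {
        this.queueLength = queueLength;
        this.threshold = threshold;
        this.qualifiedActions = qualifiedActions;
        this.pauseMillis = pauseMillis;
        this.maxWaitMillis = maxWaitMillis;
    }

    /** Approach 4: ask before starting an action (eg a job) at all. */
    boolean mayStart() {
        return queueLength.getAsInt() < threshold;
    }

    /**
     * Approach 3: delays only the calling thread, and only for qualified
     * actions. No shared lock is taken, so other threads keep committing.
     * Returns the number of milliseconds the caller was delayed.
     */
    long beforeCommit(String action) {
        if (!qualifiedActions.contains(action)) {
            return 0L;
        }
        long waited = 0L;
        while (waited < maxWaitMillis && queueLength.getAsInt() >= threshold) {
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt visible to the caller
                break;
            }
            waited += pauseMillis;
        }
        return waited;
    }
}
```

A job scheduler would call mayStart() before picking up work, while the 
session-level code would call beforeCommit(...) with whatever identifies the 
action. What happens once the maximum wait is reached (proceed or fail the 
commit) is left open here.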

Whatever we decide upon, I'd say the flow control strategy should indeed become 
configurable in one place.

> System Pacing Service
> ---------------------
>
>                 Key: OAK-5433
>                 URL: https://issues.apache.org/jira/browse/OAK-5433
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: core
>            Reporter: Stefan Eissing
>         Attachments: ing-6k-t500-7500-6-wflauncher.png, 
> oak-trunk-pacer-v1.diff, obs-pacing.diff
>
>
> h3. tl;dr
> By adding Pacing, suitable to the application {{oak}} is running in, a system 
> will dynamically adapt the load to its own capabilities. This effectively, in 
> tests, keeps the system stable and responsive under stress.
> h3. The Situation
> During experimental Lab tests on large clusters, it became clear that a 
> web system based on oak is challenged by fluctuating load in relation to 
> its own capabilities. 
> When the load increases "too much" it shows the following symptoms:
> * event observation queues grow
> * maintenance tasks (on master) take too long
> * async tasks, triggered by requests, (e.g. workflows) accumulate
> and eventually
> * login sessions complain about freshness
> * revision diffs are old and no longer in caches
> and sometimes
> * database lease times out and oak-core shuts down
> This problem can arise when outside requests increase, or when local 
> maintenance tasks occupy resources, or when available CPU diminishes due to 
> other processes or page faults or, or, or.
> Unfortunately, whenever the system becomes overburdened, the secondary 
> effects make the system even slower and, thus, more overburdened. This can 
> end in a vicious circle, making the system totally unresponsive. Eventual 
> recovery is an option, not a guarantee.
> h3. Pacing
> By _Pacing_ I mean a system behaviour that tries to balance load in relation 
> to capabilities. If the latter one drops, the load must be reduced until the 
> system recovers. This is related to what the {{CommitRateLimiter}} wanted to 
> achieve by monitoring observation queues.
> The design of the {{CommitRateLimiter}} could be very efficient, if it only 
> knew _which_ commits to delay. But it does not know the application that oak 
> runs in. I propose replacing the Limiter by a {{PacingService}} that can be 
> provided by the application using oak. The service will get the data about 
> the current commit, queue length and limits. Whatever else it does remains 
> opaque. It may raise a proper exception to indicate that the commit shall 
> fail. But mostly, it is expected to delay those commits that would negatively 
> affect system stability.
> h3. An Example
> In a proof of concept, an AEM system was blasted with endless uploads on 
> multiple connections in order to eventually overwhelm queues. Then a pacing 
> was patched into oak-core that delayed commits from servlet requests and from 
> certain workflows for some milliseconds until the queue length shrank again. 
> The pacing had a maximum wait time that would make the commit fail.
> The pacer was configured to trigger at 75% of maximum queue length and the 
> system was blasted with uploads again. In the tests:
> # the max queue length stayed under 80%
> # no upload reached the maximum wait time, all succeeded
> The system adapted the external load to its capabilities successfully. 
>   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
