[ https://issues.apache.org/jira/browse/OAK-5433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15837624#comment-15837624 ]
Stefan Egli commented on OAK-5433:
----------------------------------

IIUC then the delaying happens in {{onEventQueuing}}, which is invoked within {{BackgroundObserver.contentChanged}}. The latter is part of the (synchronously called) observers. Thus if you block one observer you block the entire commit 'subsystem', i.e. any other thread that wants to do a commit.

> System Pacing Service
> ---------------------
>
>                 Key: OAK-5433
>                 URL: https://issues.apache.org/jira/browse/OAK-5433
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: core
>            Reporter: Stefan Eissing
>         Attachments: ing-6k-t500-7500-6-wflauncher.png,
> oak-trunk-pacer-v1.diff, obs-pacing.diff
>
>
> h3. tl;dr
> By adding Pacing, suitable to the application {{oak}} is running in, a system
> will dynamically adapt the load to its own capabilities. In tests, this
> effectively keeps the system stable and responsive under stress.
> h3. The Situation
> During experimental lab tests on large clusters, it became clear that a
> web system based on oak is challenged by fluctuating load in relation to
> its own capabilities.
> When the load increases "too much" it shows the following symptoms:
> * event observation queues grow
> * maintenance tasks (on master) take too long
> * async tasks, triggered by requests, (e.g. workflows) accumulate
> and eventually
> * login sessions complain about freshness
> * revision diffs are old and no longer in caches
> and sometimes
> * database lease times out and oak-core shuts down
> This problem can arise when outside requests increase, or when local
> maintenance tasks occupy resources, or when available CPU diminishes due to
> other processes or page faults or, or, or.
> Unfortunately, whenever the system becomes overburdened, the secondary
> effects make the system even slower and, thus, more overburdened. This can
> end in a vicious circle, making the system totally unresponsive. Eventual
> recovery is an option, not a guarantee.
> h3. Pacing
> By _Pacing_ I mean a system behaviour that tries to balance load in relation
> to capabilities. If the latter drops, the load must be reduced until the
> system recovers. This is related to what the {{CommitRateLimiter}} wanted to
> achieve by monitoring observation queues.
> The design of the {{CommitRateLimiter}} could be very efficient, if it only
> knew _which_ commits to delay. But it does not know the application that oak
> runs in. I propose replacing the Limiter by a {{PacingService}} that can be
> provided by the application using oak. The service will get the data about
> the current commit, queue length and limits. Whatever else it does remains
> opaque. It may raise a proper exception to indicate that the commit shall
> fail. But mostly, it is expected to delay those commits that would negatively
> affect system stability.
> h3. An Example
> In a proof of concept, an AEM system was blasted with endless uploads on
> multiple connections in order to eventually overwhelm queues. Then a pacing
> was patched into oak-core that delayed commits from servlet requests and from
> certain workflows for some milliseconds until the queue length shrank again.
> The pacing had a maximum wait time; exceeding it would make the commit fail.
> The pacer was configured to trigger at 75% of maximum queue length and the
> system was blasted with uploads again. In the tests:
> # the max queue length stayed under 80%
> # no upload reached the maximum wait time, all succeeded
> The system adapted the external load to its capabilities successfully.
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
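
For illustration only, here is a minimal Java sketch of the pacing idea from the quoted description: delay selected commits while the observation queue is above a trigger threshold, and fail the commit once a maximum wait time is used up. Everything in it is an assumption made up for this sketch and is not taken from the attached patches or from oak-core: the names {{PacingSketch}}, {{PacingService}}, {{ThresholdPacer}} and {{PacingTimeoutException}}, the {{beforeCommit}} signature, and the use of a plain string to identify where a commit comes from.

```java
import java.util.function.IntSupplier;
import java.util.function.Predicate;

final class PacingSketch {

    /** Hypothetical exception telling the caller that the paced commit shall fail. */
    static final class PacingTimeoutException extends RuntimeException {
        PacingTimeoutException(String message) {
            super(message);
        }
    }

    /** Hypothetical application-provided service; not an existing Oak interface. */
    interface PacingService {
        /**
         * Called before a commit. May block the calling thread.
         * Returns the number of milliseconds the commit was delayed.
         */
        long beforeCommit(String commitSource, IntSupplier queueLength, int queueLimit);
    }

    /** Delays selected commits while the queue is at or above a trigger ratio. */
    static final class ThresholdPacer implements PacingService {
        private final double triggerRatio;            // e.g. 0.75 for 75% of the queue limit
        private final long stepMillis;                // length of one delay step
        private final long maxWaitMillis;             // total delay after which the commit fails
        private final Predicate<String> shouldPace;   // application knowledge: which commits to delay

        ThresholdPacer(double triggerRatio, long stepMillis, long maxWaitMillis,
                       Predicate<String> shouldPace) {
            this.triggerRatio = triggerRatio;
            this.stepMillis = stepMillis;
            this.maxWaitMillis = maxWaitMillis;
            this.shouldPace = shouldPace;
        }

        @Override
        public long beforeCommit(String commitSource, IntSupplier queueLength, int queueLimit) {
            if (!shouldPace.test(commitSource)) {
                return 0; // commits the application does not want paced pass through at once
            }
            long waited = 0;
            // Re-read the queue length after every step, since it shrinks as events are delivered.
            while (queueLength.getAsInt() >= triggerRatio * queueLimit) {
                if (waited >= maxWaitMillis) {
                    throw new PacingTimeoutException(
                            "queue still above threshold after " + waited + " ms");
                }
                try {
                    Thread.sleep(stepMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new PacingTimeoutException("interrupted while pacing");
                }
                waited += stepMillis;
            }
            return waited;
        }
    }
}
```

Note that this sketch blocks the calling thread. Per the comment above, where such a call is made matters: if the delay runs inside a synchronously called observer, it holds up every other committing thread as well, not only the commit that was meant to be paced.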