[
https://issues.apache.org/jira/browse/SLING-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stefan Egli closed SLING-10353.
-------------------------------
> ClusterSyncServiceChain can leak a background thread
> ----------------------------------------------------
>
> Key: SLING-10353
> URL: https://issues.apache.org/jira/browse/SLING-10353
> Project: Sling
> Issue Type: Bug
> Components: Discovery
> Affects Versions: Discovery Commons 1.0.20
> Reporter: Stefan Egli
> Assignee: Stefan Egli
> Priority: Major
> Fix For: Discovery Commons 1.0.24
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Invoking a {{ClusterSyncService.sync()}} should ensure that if a background
> thread is running, it gets terminated. In the case of a chain of sync
> services implemented by {{ClusterSyncServiceChain}} there is a race-condition
> which can leak a background thread running unintentionally.
> The order of events is eg:
> * setup: ClusterSyncServiceChain is configured with a chain of a
> OakBacklogClusterSyncService followed by a SyncTokenService
> * ViewStateManagerImpl.handleNewViewNonDelayed can be called consecutively,
> as it is protected by a lock. Let's assume on first invocation it starts a
> sync process by invoking the ClusterSyncServiceChain.sync()
> * the ClusterSyncServiceChain.sync() will first invoke the
> OakBacklogClusterSyncService.sync()
> * let's assume there is a backlog (ie a recovery hasn't finished yet) with a
> particular instance in the cluster - so OakBacklogClusterSyncService.sync()
> will spawn a background thread that checks for that backlog every 2 seconds
> (handled by AbstractServiceWithBackgroundCheck.BackgroundCheckRunnable.run()).
> * the OakBacklogClusterSyncService background check determines that the
> backlog is gone and that it can proceed. It will do so by continuing in the
> ClusterSyncServiceChain - ie it will invoke the SyncTokenService.sync() -
> that in turn invokes syncToken() - which invokes startBackgroundThread. At
> this point, let's assume, the check() returns false - ie there are sync
> tokens it needs to wait for. Hence it will start a new background Thread for
> the SyncTokenService.
> * let's now assume, at some later point
> ViewStateManagerImpl.handleNewViewNonDelayed is again invoked. So it will
> again call ClusterSyncServiceChain.sync(). Now it is the responsibility of
> ClusterSyncServiceChain.sync() to cancel any currently running background
> thread - but the first element in the chain (the
> OakBacklogClusterSyncService) doesn't actually have any background thread
> anymore - it has already "handed over" to the SyncTokenService - so the
> OakBacklogClusterSyncService.cancelPreviousBackgroundCheck() will not have
> anything to cancel. Hence it goes ahead and follows the sync() chain process
> - which is first checking for backlog (that has cleared now, as we assumed),
> and then the sync token (which is not complete, as we assumed). So it will
> now create a *second* background thread.
> The above resulting in 2 SyncTokenService background threads.
> This can have a few consequences:
> * a behaviour was noticed, whereby {{SyncTokenService.storeMySyncToken}}
> keeps being invoked with toggling values, eg first it stores {{627}}, then
> {{628}}, then again {{627}}, then {{628}} etc. The following is an example of
> the generated data in DocumentNodeStore as a result ( excerpt from the
> {{"_id" : "4:/var/discovery/oak/syncTokens"}} document ) :
> {noformat}
> "<slingId>" : {
> "r179378d10da-0-4" : "\"627\"",
> "r179378d14ce-0-4" : "\"628\"",
> "r179378d18bc-0-4" : "\"627\"",
> "r179378d1ca1-0-4" : "\"628\"",
> "r179378d209c-0-4" : "\"627\"",
> "r179378d2477-0-4" : "\"628\"",
> "r179378d287f-0-4" : "\"627\"",
> "r179378d2c4b-0-4" : "\"628\"",
> "r179378d3068-0-4" : "\"627\"",
> "r179378d3421-0-4" : "\"628\""
> },
> {noformat}
> This eventually should clear out - but it can lead to
> ** unnecessary lengthening the sync procedure as other cluster nodes will
> potentially have to wait longer
> ** confusion when reading such log files
> * it can also result in conflict exceptions if the two threads want to update
> simultaneously:
> {noformat}
> 04.05.2021 13:22:27.741 *ERROR* [Discovery-AsyncEventSender]
> org.apache.sling.discovery.commons.providers.spi.base.SyncTokenService
> storeMySyncToken: got PersistenceException while storing my syncToken:
> org.apache.sling.api.resource.PersistenceException: Unable to commit changes
> to session.
> org.apache.sling.api.resource.PersistenceException: Unable to commit changes
> to session.
> at
> org.apache.sling.jcr.resource.internal.helper.jcr.JcrResourceProvider.commit(JcrResourceProvider.java:516)
> [org.apache.sling.jcr.resource:3.0.22]
> at
> org.apache.sling.resourceresolver.impl.providers.stateful.AuthenticatedResourceProvider.commit(AuthenticatedResourceProvider.java:215)
> [org.apache.sling.resourceresolver:1.7.2]
> at
> org.apache.sling.resourceresolver.impl.helper.ResourceResolverControl.commit(ResourceResolverControl.java:424)
> [org.apache.sling.resourceresolver:1.7.2]
> at
> org.apache.sling.resourceresolver.impl.ResourceResolverImpl.commit(ResourceResolverImpl.java:1000)
> [org.apache.sling.resourceresolver:1.7.2]
> at
> org.apache.sling.discovery.commons.providers.spi.base.SyncTokenService.storeMySyncToken(SyncTokenService.java:174)
> [org.apache.sling.discovery.commons:1.0.20]
> at
> org.apache.sling.discovery.commons.providers.spi.base.SyncTokenService.access$000(SyncTokenService.java:53)
> [org.apache.sling.discovery.commons:1.0.20]
> at
> org.apache.sling.discovery.commons.providers.spi.base.SyncTokenService$1.check(SyncTokenService.java:137)
> [org.apache.sling.discovery.commons:1.0.20]
> at
> org.apache.sling.discovery.commons.providers.spi.base.AbstractServiceWithBackgroundCheck.startBackgroundCheck(AbstractServiceWithBackgroundCheck.java:199)
> [org.apache.sling.discovery.commons:
> 1.0.20]
> at
> org.apache.sling.discovery.commons.providers.spi.base.SyncTokenService.syncToken(SyncTokenService.java:131)
> [org.apache.sling.discovery.commons:1.0.20]
> at
> org.apache.sling.discovery.commons.providers.spi.base.SyncTokenService.sync(SyncTokenService.java:124)
> [org.apache.sling.discovery.commons:1.0.20]
> at
> org.apache.sling.discovery.commons.providers.spi.base.ClusterSyncServiceChain.chainedSync(ClusterSyncServiceChain.java:65)
> [org.apache.sling.discovery.commons:1.0.20]
> at
> org.apache.sling.discovery.commons.providers.spi.base.ClusterSyncServiceChain.access$000(ClusterSyncServiceChain.java:34)
> [org.apache.sling.discovery.commons:1.0.20]
> at
> org.apache.sling.discovery.commons.providers.spi.base.ClusterSyncServiceChain$1.run(ClusterSyncServiceChain.java:69)
> [org.apache.sling.discovery.commons:1.0.20]
> at
> org.apache.sling.discovery.commons.providers.spi.base.AbstractServiceWithBackgroundCheck.startBackgroundCheck(AbstractServiceWithBackgroundCheck.java:204)
> [org.apache.sling.discovery.commons:
> 1.0.20]
> at
> org.apache.sling.discovery.commons.providers.spi.base.OakBacklogClusterSyncService.waitWhileBacklog(OakBacklogClusterSyncService.java:142)
> [org.apache.sling.discovery.commons:1.0.20]
> at
> org.apache.sling.discovery.commons.providers.spi.base.OakBacklogClusterSyncService.sync(OakBacklogClusterSyncService.java:136)
> [org.apache.sling.discovery.commons:1.0.20]
> at
> org.apache.sling.discovery.commons.providers.spi.base.ClusterSyncServiceChain.chainedSync(ClusterSyncServiceChain.java:65)
> [org.apache.sling.discovery.commons:1.0.20]
> at
> org.apache.sling.discovery.commons.providers.spi.base.ClusterSyncServiceChain.sync(ClusterSyncServiceChain.java:54)
> [org.apache.sling.discovery.commons:1.0.20]
> at
> org.apache.sling.discovery.commons.providers.base.ViewStateManagerImpl$1.trigger(ViewStateManagerImpl.java:548)
> [org.apache.sling.discovery.commons:1.0.20]
> at
> org.apache.sling.discovery.commons.providers.base.AsyncEventSender.run(AsyncEventSender.java:118)
> [org.apache.sling.discovery.commons:1.0.20]
> at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: javax.jcr.InvalidItemStateException: OakState0001: Unresolved
> conflicts in /var/discovery/oak/syncTokens
> at
> org.apache.jackrabbit.oak.api.CommitFailedException.asRepositoryException(CommitFailedException.java:238)
> ...
> {noformat}
> * the 2 threads can also be seen in a thread-dump of course:
> {noformat}
> "SyncTokenService" #504 daemon prio=5 os_prio=0 cpu=258.15ms elapsed=47.01s
> tid=0x000000000f315800 nid=0x449c in Object.wait() [0x00007f74fd56c000]
> java.lang.Thread.State: TIMED_WAITING (on object monitor)
> at java.lang.Object.wait([email protected]/Native Method)
> - waiting on <no object reference available>
> at
> org.apache.sling.discovery.commons.providers.spi.base.AbstractServiceWithBackgroundCheck$BackgroundCheckRunnable.run(AbstractServiceWithBackgroundCheck.java:98)
> - waiting to re-lock in wait() <0x00000007a8516420> (a
> java.lang.Object)
> at java.lang.Thread.run([email protected]/Thread.java:834)
> "SyncTokenService" #530 daemon prio=5 os_prio=0 cpu=0.78ms elapsed=0.96s
> tid=0x000000000a638800 nid=0x4870 in Object.wait() [0x00007f74f0bdf000]
> java.lang.Thread.State: TIMED_WAITING (on object monitor)
> at java.lang.Object.wait([email protected]/Native Method)
> - waiting on <no object reference available>
> at
> org.apache.sling.discovery.commons.providers.spi.base.AbstractServiceWithBackgroundCheck$BackgroundCheckRunnable.run(AbstractServiceWithBackgroundCheck.java:98)
> - waiting to re-lock in wait() <0x00000007bfe29640> (a
> java.lang.Object)
> at java.lang.Thread.run([email protected]/Thread.java:834)
> {noformat}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)