[ 
https://issues.apache.org/jira/browse/SLING-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefan Egli closed SLING-10353.
-------------------------------

> ClusterSyncServiceChain can leak a background thread
> ----------------------------------------------------
>
>                 Key: SLING-10353
>                 URL: https://issues.apache.org/jira/browse/SLING-10353
>             Project: Sling
>          Issue Type: Bug
>          Components: Discovery
>    Affects Versions: Discovery Commons 1.0.20
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>            Priority: Major
>             Fix For: Discovery Commons 1.0.24
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Invoking a {{ClusterSyncService.sync()}} should ensure that if a background 
> thread is running, it gets terminated. In the case of a chain of sync 
> services implemented by {{ClusterSyncServiceChain}} there is a race-condition 
> which can leak a background thread running unintentionally.
> The order of events is eg:
> * setup: ClusterSyncServiceChain is configured with a chain of a 
> OakBacklogClusterSyncService followed by a SyncTokenService
> * ViewStateManagerImpl.handleNewViewNonDelayed can be called consecutively, 
> as it is protected by a lock. Let's assume on first invocation it starts a 
> sync process by invoking the ClusterSyncServiceChain.sync()
> * the ClusterSyncServiceChain.sync() will first invoke the 
> OakBacklogClusterSyncService.sync()
> * let's assume there is a backlog (ie a recovery hasn't finished yet) with a 
> particular instance in the cluster - so OakBacklogClusterSyncService.sync() 
> will spawn a background thread that checks for that backlog every 2 seconds 
> (handled by AbstractServiceWithBackgroundCheck.BackgroundCheckRunnable.run()).
> * the OakBacklogClusterSyncService background check determines that the 
> backlog is gone and that it can proceed. It will do so by continuing in the 
> ClusterSyncServiceChain - ie it will invoke the SyncTokenService.sync() - 
> that in turn invokes syncToken() - which invokes startBackgroundThread. At 
> this point, let's assume, the check() returns false - ie there are sync 
> tokens it needs to wait for. Hence it will start a new background Thread for 
> the SyncTokenService.
> * let's now assume, at some later point 
> ViewStateManagerImpl.handleNewViewNonDelayed is again invoked. So it will 
> again call ClusterSyncServiceChain.sync(). Now it is the responsibility of 
> ClusterSyncServiceChain.sync() to cancel any currently running background 
> thread - but  the first element in the chain (the 
> OakBacklogClusterSyncService) doesn't actually have any background thread 
> anymore - it has already "handed over" to the SyncTokenService - so the 
> OakBacklogClusterSyncService.cancelPreviousBackgroundCheck() will not have 
> anything to cancel. Hence it goes ahead and follows the sync() chain process 
> - which is first checking for backlog (that has cleared now, as we assumed), 
> and then the sync token (which is not complete, as we assumed). So it will 
> now create a *second* background thread.
> The above resulting in 2 SyncTokenService background threads.
> This can have a few consequences:
> * a behaviour was noticed, whereby {{SyncTokenService.storeMySyncToken}} 
> keeps being invoked with toggling values, eg first it stores {{627}}, then 
> {{628}}, then again {{627}}, then {{628}} etc. The following is an example of 
> the generated data in DocumentNodeStore as a result ( excerpt from the 
> {{"_id" : "4:/var/discovery/oak/syncTokens"}} document ) :
> {noformat}
>       "<slingId>" : {
>               "r179378d10da-0-4" : "\"627\"",
>               "r179378d14ce-0-4" : "\"628\"",
>               "r179378d18bc-0-4" : "\"627\"",
>               "r179378d1ca1-0-4" : "\"628\"",
>               "r179378d209c-0-4" : "\"627\"",
>               "r179378d2477-0-4" : "\"628\"",
>               "r179378d287f-0-4" : "\"627\"",
>               "r179378d2c4b-0-4" : "\"628\"",
>               "r179378d3068-0-4" : "\"627\"",
>               "r179378d3421-0-4" : "\"628\""
>       },
> {noformat}
> This eventually should clear out - but it can lead to 
> ** unnecessary lengthening the sync procedure as other cluster nodes will 
> potentially have to wait longer
> ** confusion when reading such log files
> * it can also result in conflict exceptions if the two threads want to update 
> simultaneously:
> {noformat}
> 04.05.2021 13:22:27.741 *ERROR* [Discovery-AsyncEventSender] 
> org.apache.sling.discovery.commons.providers.spi.base.SyncTokenService 
> storeMySyncToken: got PersistenceException while storing my syncToken:
>  org.apache.sling.api.resource.PersistenceException: Unable to commit changes 
> to session.
> org.apache.sling.api.resource.PersistenceException: Unable to commit changes 
> to session.
>         at 
> org.apache.sling.jcr.resource.internal.helper.jcr.JcrResourceProvider.commit(JcrResourceProvider.java:516)
>  [org.apache.sling.jcr.resource:3.0.22]
>         at 
> org.apache.sling.resourceresolver.impl.providers.stateful.AuthenticatedResourceProvider.commit(AuthenticatedResourceProvider.java:215)
>  [org.apache.sling.resourceresolver:1.7.2]
>         at 
> org.apache.sling.resourceresolver.impl.helper.ResourceResolverControl.commit(ResourceResolverControl.java:424)
>  [org.apache.sling.resourceresolver:1.7.2]
>         at 
> org.apache.sling.resourceresolver.impl.ResourceResolverImpl.commit(ResourceResolverImpl.java:1000)
>  [org.apache.sling.resourceresolver:1.7.2]
>         at 
> org.apache.sling.discovery.commons.providers.spi.base.SyncTokenService.storeMySyncToken(SyncTokenService.java:174)
>  [org.apache.sling.discovery.commons:1.0.20]
>         at 
> org.apache.sling.discovery.commons.providers.spi.base.SyncTokenService.access$000(SyncTokenService.java:53)
>  [org.apache.sling.discovery.commons:1.0.20]
>         at 
> org.apache.sling.discovery.commons.providers.spi.base.SyncTokenService$1.check(SyncTokenService.java:137)
>  [org.apache.sling.discovery.commons:1.0.20]
>         at 
> org.apache.sling.discovery.commons.providers.spi.base.AbstractServiceWithBackgroundCheck.startBackgroundCheck(AbstractServiceWithBackgroundCheck.java:199)
>  [org.apache.sling.discovery.commons:
> 1.0.20]
>         at 
> org.apache.sling.discovery.commons.providers.spi.base.SyncTokenService.syncToken(SyncTokenService.java:131)
>  [org.apache.sling.discovery.commons:1.0.20]
>         at 
> org.apache.sling.discovery.commons.providers.spi.base.SyncTokenService.sync(SyncTokenService.java:124)
>  [org.apache.sling.discovery.commons:1.0.20]
>         at 
> org.apache.sling.discovery.commons.providers.spi.base.ClusterSyncServiceChain.chainedSync(ClusterSyncServiceChain.java:65)
>  [org.apache.sling.discovery.commons:1.0.20]
>         at 
> org.apache.sling.discovery.commons.providers.spi.base.ClusterSyncServiceChain.access$000(ClusterSyncServiceChain.java:34)
>  [org.apache.sling.discovery.commons:1.0.20]
>         at 
> org.apache.sling.discovery.commons.providers.spi.base.ClusterSyncServiceChain$1.run(ClusterSyncServiceChain.java:69)
>  [org.apache.sling.discovery.commons:1.0.20]
>         at 
> org.apache.sling.discovery.commons.providers.spi.base.AbstractServiceWithBackgroundCheck.startBackgroundCheck(AbstractServiceWithBackgroundCheck.java:204)
>  [org.apache.sling.discovery.commons:
> 1.0.20]
>         at 
> org.apache.sling.discovery.commons.providers.spi.base.OakBacklogClusterSyncService.waitWhileBacklog(OakBacklogClusterSyncService.java:142)
>  [org.apache.sling.discovery.commons:1.0.20]
>         at 
> org.apache.sling.discovery.commons.providers.spi.base.OakBacklogClusterSyncService.sync(OakBacklogClusterSyncService.java:136)
>  [org.apache.sling.discovery.commons:1.0.20]
>         at 
> org.apache.sling.discovery.commons.providers.spi.base.ClusterSyncServiceChain.chainedSync(ClusterSyncServiceChain.java:65)
>  [org.apache.sling.discovery.commons:1.0.20]
>         at 
> org.apache.sling.discovery.commons.providers.spi.base.ClusterSyncServiceChain.sync(ClusterSyncServiceChain.java:54)
>  [org.apache.sling.discovery.commons:1.0.20]
>         at 
> org.apache.sling.discovery.commons.providers.base.ViewStateManagerImpl$1.trigger(ViewStateManagerImpl.java:548)
>  [org.apache.sling.discovery.commons:1.0.20]
>         at 
> org.apache.sling.discovery.commons.providers.base.AsyncEventSender.run(AsyncEventSender.java:118)
>  [org.apache.sling.discovery.commons:1.0.20]
>         at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: javax.jcr.InvalidItemStateException: OakState0001: Unresolved 
> conflicts in /var/discovery/oak/syncTokens
>         at 
> org.apache.jackrabbit.oak.api.CommitFailedException.asRepositoryException(CommitFailedException.java:238)
> ...
> {noformat}
> * the 2 threads can also be seen in a thread-dump of course:
> {noformat}
> "SyncTokenService" #504 daemon prio=5 os_prio=0 cpu=258.15ms elapsed=47.01s 
> tid=0x000000000f315800 nid=0x449c in Object.wait()  [0x00007f74fd56c000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>         at java.lang.Object.wait([email protected]/Native Method)
>         - waiting on <no object reference available>
>         at 
> org.apache.sling.discovery.commons.providers.spi.base.AbstractServiceWithBackgroundCheck$BackgroundCheckRunnable.run(AbstractServiceWithBackgroundCheck.java:98)
>         - waiting to re-lock in wait() <0x00000007a8516420> (a 
> java.lang.Object)
>         at java.lang.Thread.run([email protected]/Thread.java:834)
> "SyncTokenService" #530 daemon prio=5 os_prio=0 cpu=0.78ms elapsed=0.96s 
> tid=0x000000000a638800 nid=0x4870 in Object.wait()  [0x00007f74f0bdf000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>         at java.lang.Object.wait([email protected]/Native Method)
>         - waiting on <no object reference available>
>         at 
> org.apache.sling.discovery.commons.providers.spi.base.AbstractServiceWithBackgroundCheck$BackgroundCheckRunnable.run(AbstractServiceWithBackgroundCheck.java:98)
>         - waiting to re-lock in wait() <0x00000007bfe29640> (a 
> java.lang.Object)
>         at java.lang.Thread.run([email protected]/Thread.java:834)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to