[ 
https://issues.apache.org/jira/browse/SLING-5280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefan Egli resolved SLING-5280.
--------------------------------
    Resolution: Fixed

done in rev 1713491

> HeartbeatHandler can also be blocked while holding AnnouncementRegistry lock
> ----------------------------------------------------------------------------
>
>                 Key: SLING-5280
>                 URL: https://issues.apache.org/jira/browse/SLING-5280
>             Project: Sling
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: Discovery Impl 1.2.0
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>            Priority: Critical
>             Fix For: Discovery Impl 1.2.2
>
>         Attachments: testSlowSessionSaves.fail.txt
>
>
> SLING-5195 introduces a separate thread that periodically checks if the local 
> HeartbeatHandler is properly functional and has written a heartbeat 'recently 
> enough'. If not, it marks the local topology as TOPOLOGY_CHANGING and 
> listeners are thus stepping back correctly (as the local instance is likely 
> to be treated as dead by other instances in the local cluster soon).
> Now this works fine as long as the new, separate thread can operate 
> uninterruptively. But besides the necessity that it is properly scheduled 
> regularly (which it might not, thus we should probably rather use a dedicated 
> thread than rely on the scheduler's pool to be available) it also requires 
> that the check can go ahead without running into a locked monitor.
> https://builds.apache.org/job/sling-trunk-1.8/org.apache.sling$org.apache.sling.discovery.impl/1960/testReport/org.apache.sling.discovery.impl.cluster/RepositoryDelaysTest/testSlowSessionSaves/
>  however reports a situation where the blocked commit was not in 
> {{HeartbeatHandler.issueClusterLocalHeartbeat}} - but was in 
> {{AnnouncementRegistry.checkExpiredAnnouncements}} - and that one does a 
> {{synchronized(this)}} - which in turn blocks precisely this 2nd 
> HeartbeatHandler's monitoring thread - yielding it useless..
> The following log excerpt (grepping thread-3/-4/main) proofs the above 
> described scenario:
> {code}
> 06.11.2015 23:35:22.864 *INFO * [main] RepositoryDelaysTest: <main> both 
> instances marked as delaying 1min - but with new background checks we should 
> go changing within 3sec
> 06.11.2015 23:35:22.870 *DEBUG* [pool-1-thread-4] HeartbeatHandler: 
> issueConnectorPings: not issuing remote heartbeat yet, startup not yet 
> finished
> 06.11.2015 23:35:22.870 *INFO * [pool-1-thread-4] ArtificialDelay: delay: 
> delaying [secondInstance] 'pre.commit' for 60000ms...
> 06.11.2015 23:35:22.870 *INFO * [pool-1-thread-4] ArtificialDelay: delay: 
> delaying [secondInstance] 'pre.commit' now for 60000ms...
> 06.11.2015 23:35:22.909 *INFO * [Test-ViewCheckerRunner [firstInstance]] 
> ArtificialDelay: delay: delaying [firstInstance] 'pre.commit' for 59876ms...
> 06.11.2015 23:35:22.909 *INFO * [Test-ViewCheckerRunner [firstInstance]] 
> ArtificialDelay: delay: delaying [firstInstance] 'pre.commit' now for 
> 59876ms...
> 06.11.2015 23:35:23.613 *DEBUG* [pool-1-thread-3] ViewHelper: 
> getEstablishedView: no established view found: 
> /var/discovery/clusterC/establishedView
> 06.11.2015 23:35:23.613 *DEBUG* [pool-1-thread-3] ClusterViewServiceImpl: 
> getClusterView: no view established at the moment. isolated mode
> 06.11.2015 23:35:23.613 *INFO * [pool-1-thread-3] BaseDiscoveryService: 
> getTopology: undefined cluster view: NO_ESTABLISHED_VIEW] 
> org.apache.sling.discovery.base.commons.UndefinedClusterViewException: no 
> established view at the moment
> 06.11.2015 23:35:23.613 *INFO * [pool-1-thread-3] MinEventDelayHandler: 
> asyncDelay.run: done delaying. new view (still/again) not current, delaying 
> again
> 06.11.2015 23:35:23.613 *TRACE* [pool-1-thread-3] MinEventDelayHandler: 
> runAfter: trying with scheduler.fireJob
> 06.11.2015 23:35:23.613 *INFO * [pool-1-thread-3] MinEventDelayHandler: 
> triggerAsyncDelaying: asynch delaying of 1 triggered: true
> 06.11.2015 23:35:23.705 *INFO * [pool-1-thread-3] ClusterViewServiceImpl: 
> getClusterView: the local instance (1b6e7530-0d41-46e7-8471-6ea7f5d3c2bc) is 
> currently not included in the existing established view! This is normal at 
> startup. At other times is pseudo-network-partitioning is an indicator for 
> repository/network-delays or clocks-out-of-sync (SLING-3432). (increasing the 
> heartbeatTimeout can help as a workaround too) The local instance will stay 
> in TOPOLOGY_CHANGING or pre _INIT mode until a new vote was successful.
> 06.11.2015 23:35:23.705 *INFO * [pool-1-thread-3] BaseDiscoveryService: 
> getTopology: undefined cluster view: ISOLATED_FROM_TOPOLOGY] 
> org.apache.sling.discovery.base.commons.UndefinedClusterViewException: 
> established view does not include local instance - isolated
> 06.11.2015 23:35:23.705 *INFO * [pool-1-thread-3] MinEventDelayHandler: 
> asyncDelay.run: done delaying. new view (still/again) not current, delaying 
> again
> 06.11.2015 23:35:23.705 *TRACE* [pool-1-thread-3] MinEventDelayHandler: 
> runAfter: trying with scheduler.fireJob
> 06.11.2015 23:35:23.705 *INFO * [pool-1-thread-3] MinEventDelayHandler: 
> triggerAsyncDelaying: asynch delaying of 1 triggered: true
> 06.11.2015 23:35:23.840 *DEBUG* [pool-1-thread-3] HeartbeatHandler: 
> checkForTopologyChange/.run: going to check for topology change...
> 06.11.2015 23:35:30.865 *DEBUG* [main] ClusterViewServiceImpl: 
> getClusterView: current establishedView is marked as invalid: 
> d5f1101c-f693-44a4-9c79-f7277230591f
> 06.11.2015 23:35:30.865 *INFO * [main] BaseDiscoveryService: getTopology: 
> undefined cluster view: NO_ESTABLISHED_VIEW] 
> org.apache.sling.discovery.base.commons.UndefinedClusterViewException: 
> current established view was marked as invalid
> 06.11.2015 23:36:22.786 *INFO * [Test-ViewCheckerRunner [firstInstance]] 
> ArtificialDelay: delay: delaying [firstInstance]'pre.commit' for 59876ms done.
> 06.11.2015 23:36:22.809 *INFO * [Test-ViewCheckerRunner [firstInstance]] 
> ArtificialDelay: delay: delaying [firstInstance] 'pre.commit' for 59876ms...
> 06.11.2015 23:36:22.809 *INFO * [Test-ViewCheckerRunner [firstInstance]] 
> ArtificialDelay: delay: delaying [firstInstance] 'pre.commit' now for 
> 59876ms...
> 06.11.2015 23:36:22.871 *INFO * [pool-1-thread-4] ArtificialDelay: delay: 
> delaying [secondInstance]'pre.commit' for 60000ms done.
> 06.11.2015 23:36:22.871 *DEBUG* [pool-1-thread-4] 
> 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb: analyzeVotings: start. slingId: 
> 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb
> 06.11.2015 23:36:22.871 *DEBUG* [main] AnnouncementRegistryImpl: 
> listInstances: start. localClusterView: a ClusterView[2 instances: 
> 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb, 8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5]
> 06.11.2015 23:36:22.871 *DEBUG* [main] AnnouncementRegistryImpl: 
> listAnnouncementsInSameCluster: start. localClusterView: a ClusterView[2 
> instances: 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb, 
> 8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5]
> 06.11.2015 23:36:22.871 *DEBUG* [pool-1-thread-4] VotingHelper: listVotings: 
> votings found: 0
> 06.11.2015 23:36:22.871 *DEBUG* [pool-1-thread-4] 
> 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb: analyzeVotings: no ongoing votings at 
> the moment. done.
> 06.11.2015 23:36:22.871 *DEBUG* [main] AnnouncementRegistryImpl: 
> listAnnouncementsInSameCluster: handling clusterInstance: 
> 8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5
> 06.11.2015 23:36:22.871 *DEBUG* [main] AnnouncementRegistryImpl: 
> listAnnouncementsInSameCluster: instance has announcements: 
> 8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5
> 06.11.2015 23:36:22.872 *DEBUG* [main] AnnouncementRegistryImpl: 
> listAnnouncementsInSameCluster: handling clusterInstance: 
> 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb
> 06.11.2015 23:36:22.872 *DEBUG* [main] AnnouncementRegistryImpl: 
> listAnnouncementsInSameCluster: matched localInstance, filling with cache: 
> 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb
> 06.11.2015 23:36:22.872 *INFO * [pool-1-thread-4] HeartbeatHandler: 
> doCheckView: established view does not match. (details: old: 
> 8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5, old: 
> 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb)
> 06.11.2015 23:36:22.872 *INFO * [pool-1-thread-4] HeartbeatHandler: 
> doCheckViewWith: no matching established view, marking topology as changing
> 06.11.2015 23:36:22.872 *INFO * [pool-1-thread-4] HeartbeatHandler: 
> invalidateCurrentEstablishedView: invalidating 
> slingId=8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb, 
> lastEstablishedViewId=d5f1101c-f693-44a4-9c79-f7277230591f
> 06.11.2015 23:36:22.872 *INFO * [pool-1-thread-4] ClusterViewServiceImpl: 
> invalidateEstablishedViewId: marking established view as invalid: 
> d5f1101c-f693-44a4-9c79-f7277230591f
> 06.11.2015 23:36:22.872 *DEBUG* [pool-1-thread-4] DiscoveryServiceImpl: 
> handleTopologyChanging: invoking viewStateManager.handlechanging
> 06.11.2015 23:36:22.872 *TRACE* [pool-1-thread-4] ViewStateManagerImpl: 
> handleChanging: start
> 06.11.2015 23:36:22.872 *DEBUG* [main] AnnouncementRegistryImpl: 
> listAnnouncementsInSameCluster: result: 0
> 06.11.2015 23:36:22.872 *DEBUG* [main] AnnouncementRegistryImpl: 
> listInstances: announcements added. end. instances: []
> 06.11.2015 23:36:22.873 *DEBUG* [pool-1-thread-3] AnnouncementRegistryImpl: 
> listInstances: start. localClusterView: a ClusterView[2 instances: 
> 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb, 8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5]
> 06.11.2015 23:36:22.873 *DEBUG* [pool-1-thread-3] AnnouncementRegistryImpl: 
> listAnnouncementsInSameCluster: start. localClusterView: a ClusterView[2 
> instances: 8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb, 
> 8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5]
> 06.11.2015 23:36:22.874 *INFO * [pool-1-thread-4] ViewStateManagerImpl: 
> enqueueForAll: sending topologyEvent TopologyEvent [type=TOPOLOGY_CHANGING, 
> oldView=DefaultTopologyView[current=false, num=2, 
> instances=8f5168bd-1bb6-4333-9b66-97cb1ae4d9d5[local=false,leader=false],8674d4b6-fe1d-4f94-b7e5-b6fba0f3affb[local=true,leader=true]],
>  newView=null], to all (0) listeners
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to