[
https://issues.apache.org/jira/browse/SLING-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stefan Egli resolved SLING-5302.
--------------------------------
Resolution: Fixed
doing {{issueHeartbeat}} calls by HeartbeatTest synchronized now: rev 1714628
> HeartbeatTest.testSlowAndFastMachine fails due to unsynchronized issuing of a
> heartbeat
> ---------------------------------------------------------------------------------------
>
> Key: SLING-5302
> URL: https://issues.apache.org/jira/browse/SLING-5302
> Project: Sling
> Issue Type: Test
> Components: Extensions
> Affects Versions: Discovery Impl 1.2.2
> Reporter: Stefan Egli
> Assignee: Stefan Egli
> Priority: Minor
> Fix For: Discovery Impl 1.2.4
>
>
> The following grep-extract of the log file of
> *
> https://builds.apache.org/job/sling-trunk-1.7/org.apache.sling$org.apache.sling.discovery.impl/2721/testReport/org.apache.sling.discovery.impl.common.heartbeat/HeartbeatTest/testSlowAndFastMachine/
> which was one of similar failing tests:
> *
> https://builds.apache.org/job/sling-trunk-1.8/org.apache.sling$org.apache.sling.discovery.impl/1998/testReport/org.apache.sling.discovery.impl.common.heartbeat/HeartbeatTest/testSlowAndFastMachine/
> *
> https://builds.apache.org/job/sling-trunk-1.8/org.apache.sling$org.apache.sling.discovery.impl/2004/testReport/org.apache.sling.discovery.impl.common.heartbeat/HeartbeatTest/testSlowAndFastMachine/
> shows that due to the unsynchronized call to {{issueHeartbeat}} - which later
> calls {{issueClusterLocalHeartbeat}} - can result in a race-condition with a
> background job and cause a false-positive from the duplicate sling Id
> detection. The background job is in this case started by
> {{triggerAsyncConnectorPing}} (which was only re-activated recently, so this
> failure pattern is rather recent):
> {code}
> 13.11.2015 10:44:07.000 *DEBUG* [main] HeartbeatHandler:
> issueClusterLocalHeartbeat: storing cluster-local heartbeat to repository for
> d27b4ec4-6530-445d-aa60-cb015611205e
> 13.11.2015 10:44:07.397 *DEBUG* [main] HeartbeatHandler:
> issueClusterLocalHeartbeat: storing cluster-local heartbeat to repository for
> 332ef30b-33df-488f-807a-456d4b773812
> 13.11.2015 10:44:07.455 *INFO * [Discovery-AsyncEventSender]
> DiscoveryServiceImpl: changePropagationListener.handleTopologyEvent: topology
> changed - propagate through connectors
> 13.11.2015 10:44:07.455 *INFO * [Discovery-AsyncEventSender]
> HeartbeatHandler: triggerAsyncConnectorPing: firing job to trigger heartbeat
> 13.11.2015 10:44:07.456 *INFO * [Discovery-AsyncEventSender]
> HeartbeatTest$SimpleTopologyEventListener: handleTopologyEvent[fast]:
> TOPOLOGY_CHANGED, newView contains 2, newView.isCurrent=true
> 13.11.2015 10:44:07.456 *DEBUG* [pool-1-thread-2] HeartbeatHandler:
> heartbeatAndCheckView: start. [for
> slingId=d27b4ec4-6530-445d-aa60-cb015611205e]
> 13.11.2015 10:44:07.467 *INFO * [Discovery-AsyncEventSender]
> DiscoveryServiceImpl: changePropagationListener.handleTopologyEvent: topology
> changed - propagate through connectors
> 13.11.2015 10:44:07.467 *INFO * [Discovery-AsyncEventSender]
> HeartbeatHandler: triggerAsyncConnectorPing: firing job to trigger heartbeat
> 13.11.2015 10:44:07.467 *DEBUG* [pool-1-thread-2] HeartbeatHandler:
> issueClusterLocalHeartbeat: storing cluster-local heartbeat to repository for
> d27b4ec4-6530-445d-aa60-cb015611205e
> 13.11.2015 10:44:07.467 *DEBUG* [pool-1-thread-1] HeartbeatHandler:
> heartbeatAndCheckView: start. [for
> slingId=332ef30b-33df-488f-807a-456d4b773812]
> 13.11.2015 10:44:07.467 *INFO * [Discovery-AsyncEventSender]
> HeartbeatTest$SimpleTopologyEventListener: handleTopologyEvent[slow]:
> TOPOLOGY_CHANGED, newView contains 2, newView.isCurrent=true
> 13.11.2015 10:44:07.469 *DEBUG* [pool-1-thread-1] HeartbeatHandler:
> issueClusterLocalHeartbeat: storing cluster-local heartbeat to repository for
> 332ef30b-33df-488f-807a-456d4b773812
> 13.11.2015 10:44:07.469 *INFO * [Discovery-AsyncEventSender]
> HeartbeatTest$SimpleTopologyEventListener: handleTopologyEvent[slow]:
> TOPOLOGY_CHANGING, oldView contained 2, oldView.isCurrent=false
> 13.11.2015 10:44:07.469 *ERROR* [pool-1-thread-1] HeartbeatHandler:
> issueClusterLocalHeartbeat: SLING-2892: Detected unexpected, concurrent
> update of:
> /var/discovery/impl/clusterInstances/332ef30b-33df-488f-807a-456d4b773812
> 'lastHeartbeat'. If not done manually, this likely indicates that there is
> more than 1 instance running in this cluster with the same sling.id. My
> sling.id is 332ef30b-33df-488f-807a-456d4b773812. Check for sling.id.file in
> your installation of all instances in this cluster to verify this! Duplicate
> sling.ids are not allowed within a cluster!
> 13.11.2015 10:44:08.528 *DEBUG* [main] HeartbeatHandler:
> issueClusterLocalHeartbeat: storing cluster-local heartbeat to repository for
> d27b4ec4-6530-445d-aa60-cb015611205e
> 13.11.2015 10:44:08.550 *DEBUG* [main] HeartbeatHandler:
> issueClusterLocalHeartbeat: storing cluster-local heartbeat to repository for
> 332ef30b-33df-488f-807a-456d4b773812
> 13.11.2015 10:44:08.577 *INFO * [Discovery-AsyncEventSender]
> DiscoveryServiceImpl: changePropagationListener.handleTopologyEvent: topology
> changed - propagate through connectors
> 13.11.2015 10:44:08.577 *INFO * [Discovery-AsyncEventSender]
> HeartbeatHandler: triggerAsyncConnectorPing: firing job to trigger heartbeat
> 13.11.2015 10:44:08.577 *INFO * [Discovery-AsyncEventSender]
> HeartbeatTest$SimpleTopologyEventListener: handleTopologyEvent[slow]:
> TOPOLOGY_CHANGED, newView contains 2, newView.isCurrent=true
> 13.11.2015 10:44:08.577 *DEBUG* [pool-1-thread-5] HeartbeatHandler:
> heartbeatAndCheckView: start. [for
> slingId=332ef30b-33df-488f-807a-456d4b773812]
> 13.11.2015 10:44:08.592 *DEBUG* [pool-1-thread-5] HeartbeatHandler:
> issueClusterLocalHeartbeat: storing cluster-local heartbeat to repository for
> 332ef30b-33df-488f-807a-456d4b773812
> 13.11.2015 10:44:09.577 *DEBUG* [main] HeartbeatHandler:
> issueClusterLocalHeartbeat: storing cluster-local heartbeat to repository for
> d27b4ec4-6530-445d-aa60-cb015611205e
> 13.11.2015 10:44:09.610 *DEBUG* [main] HeartbeatHandler:
> issueClusterLocalHeartbeat: storing cluster-local heartbeat to repository for
> 332ef30b-33df-488f-807a-456d4b773812
> 13.11.2015 10:44:10.622 *INFO * [main] HeartbeatTest:
> doTestSlowAndFastMachine: by now the two should have joined
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)