Stefan Egli created SLING-5302:
----------------------------------

             Summary: HeartbeatTest.testSlowAndFastMachine fails due to 
unsynchronized issuing of a heartbeat
                 Key: SLING-5302
                 URL: https://issues.apache.org/jira/browse/SLING-5302
             Project: Sling
          Issue Type: Test
          Components: Extensions
    Affects Versions: Discovery Impl 1.2.2
            Reporter: Stefan Egli
            Assignee: Stefan Egli
            Priority: Minor
             Fix For: Discovery Impl 1.2.4


The following grep-extract of the log file of
* 
https://builds.apache.org/job/sling-trunk-1.7/org.apache.sling$org.apache.sling.discovery.impl/2721/testReport/org.apache.sling.discovery.impl.common.heartbeat/HeartbeatTest/testSlowAndFastMachine/

which was one of similar failing tests:
* 
https://builds.apache.org/job/sling-trunk-1.8/org.apache.sling$org.apache.sling.discovery.impl/1998/testReport/org.apache.sling.discovery.impl.common.heartbeat/HeartbeatTest/testSlowAndFastMachine/
* 
https://builds.apache.org/job/sling-trunk-1.8/org.apache.sling$org.apache.sling.discovery.impl/2004/testReport/org.apache.sling.discovery.impl.common.heartbeat/HeartbeatTest/testSlowAndFastMachine/

shows that due to the unsynchronized call to {{issueHeartbeat}} - which later 
calls {{issueClusterLocalHeartbeat}} - can result in a race-condition with a 
background job and cause a false-positive from the duplicate sling Id 
detection. The background job is in this case started by 
{{triggerAsyncConnectorPing}} (which was only re-activated recently, so this 
failure pattern is rather recent):

{code}
13.11.2015 10:44:07.000 *DEBUG* [main] HeartbeatHandler: 
issueClusterLocalHeartbeat: storing cluster-local heartbeat to repository for 
d27b4ec4-6530-445d-aa60-cb015611205e
13.11.2015 10:44:07.397 *DEBUG* [main] HeartbeatHandler: 
issueClusterLocalHeartbeat: storing cluster-local heartbeat to repository for 
332ef30b-33df-488f-807a-456d4b773812
13.11.2015 10:44:07.455 *INFO * [Discovery-AsyncEventSender] 
DiscoveryServiceImpl: changePropagationListener.handleTopologyEvent: topology 
changed - propagate through connectors
13.11.2015 10:44:07.455 *INFO * [Discovery-AsyncEventSender] HeartbeatHandler: 
triggerAsyncConnectorPing: firing job to trigger heartbeat
13.11.2015 10:44:07.456 *INFO * [Discovery-AsyncEventSender] 
HeartbeatTest$SimpleTopologyEventListener: handleTopologyEvent[fast]: 
TOPOLOGY_CHANGED, newView contains 2, newView.isCurrent=true
13.11.2015 10:44:07.456 *DEBUG* [pool-1-thread-2] HeartbeatHandler: 
heartbeatAndCheckView: start. [for slingId=d27b4ec4-6530-445d-aa60-cb015611205e]
13.11.2015 10:44:07.467 *INFO * [Discovery-AsyncEventSender] 
DiscoveryServiceImpl: changePropagationListener.handleTopologyEvent: topology 
changed - propagate through connectors
13.11.2015 10:44:07.467 *INFO * [Discovery-AsyncEventSender] HeartbeatHandler: 
triggerAsyncConnectorPing: firing job to trigger heartbeat
13.11.2015 10:44:07.467 *DEBUG* [pool-1-thread-2] HeartbeatHandler: 
issueClusterLocalHeartbeat: storing cluster-local heartbeat to repository for 
d27b4ec4-6530-445d-aa60-cb015611205e
13.11.2015 10:44:07.467 *DEBUG* [pool-1-thread-1] HeartbeatHandler: 
heartbeatAndCheckView: start. [for slingId=332ef30b-33df-488f-807a-456d4b773812]
13.11.2015 10:44:07.467 *INFO * [Discovery-AsyncEventSender] 
HeartbeatTest$SimpleTopologyEventListener: handleTopologyEvent[slow]: 
TOPOLOGY_CHANGED, newView contains 2, newView.isCurrent=true
13.11.2015 10:44:07.469 *DEBUG* [pool-1-thread-1] HeartbeatHandler: 
issueClusterLocalHeartbeat: storing cluster-local heartbeat to repository for 
332ef30b-33df-488f-807a-456d4b773812
13.11.2015 10:44:07.469 *INFO * [Discovery-AsyncEventSender] 
HeartbeatTest$SimpleTopologyEventListener: handleTopologyEvent[slow]: 
TOPOLOGY_CHANGING, oldView contained 2, oldView.isCurrent=false
13.11.2015 10:44:07.469 *ERROR* [pool-1-thread-1] HeartbeatHandler: 
issueClusterLocalHeartbeat: SLING-2892: Detected unexpected, concurrent update 
of: /var/discovery/impl/clusterInstances/332ef30b-33df-488f-807a-456d4b773812 
'lastHeartbeat'. If not done manually, this likely indicates that there is more 
than 1 instance running in this cluster with the same sling.id. My sling.id is 
332ef30b-33df-488f-807a-456d4b773812. Check for sling.id.file in your 
installation of all instances in this cluster to verify this! Duplicate 
sling.ids are not allowed within a cluster!
13.11.2015 10:44:08.528 *DEBUG* [main] HeartbeatHandler: 
issueClusterLocalHeartbeat: storing cluster-local heartbeat to repository for 
d27b4ec4-6530-445d-aa60-cb015611205e
13.11.2015 10:44:08.550 *DEBUG* [main] HeartbeatHandler: 
issueClusterLocalHeartbeat: storing cluster-local heartbeat to repository for 
332ef30b-33df-488f-807a-456d4b773812
13.11.2015 10:44:08.577 *INFO * [Discovery-AsyncEventSender] 
DiscoveryServiceImpl: changePropagationListener.handleTopologyEvent: topology 
changed - propagate through connectors
13.11.2015 10:44:08.577 *INFO * [Discovery-AsyncEventSender] HeartbeatHandler: 
triggerAsyncConnectorPing: firing job to trigger heartbeat
13.11.2015 10:44:08.577 *INFO * [Discovery-AsyncEventSender] 
HeartbeatTest$SimpleTopologyEventListener: handleTopologyEvent[slow]: 
TOPOLOGY_CHANGED, newView contains 2, newView.isCurrent=true
13.11.2015 10:44:08.577 *DEBUG* [pool-1-thread-5] HeartbeatHandler: 
heartbeatAndCheckView: start. [for slingId=332ef30b-33df-488f-807a-456d4b773812]
13.11.2015 10:44:08.592 *DEBUG* [pool-1-thread-5] HeartbeatHandler: 
issueClusterLocalHeartbeat: storing cluster-local heartbeat to repository for 
332ef30b-33df-488f-807a-456d4b773812
13.11.2015 10:44:09.577 *DEBUG* [main] HeartbeatHandler: 
issueClusterLocalHeartbeat: storing cluster-local heartbeat to repository for 
d27b4ec4-6530-445d-aa60-cb015611205e
13.11.2015 10:44:09.610 *DEBUG* [main] HeartbeatHandler: 
issueClusterLocalHeartbeat: storing cluster-local heartbeat to repository for 
332ef30b-33df-488f-807a-456d4b773812
13.11.2015 10:44:10.622 *INFO * [main] HeartbeatTest: doTestSlowAndFastMachine: 
by now the two should have joined
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to