Aled Sage created BROOKLYN-560:
----------------------------------

             Summary: MachineEntity rebind sometimes fails to add machine metrics feed
                 Key: BROOKLYN-560
                 URL: https://issues.apache.org/jira/browse/BROOKLYN-560
             Project: Brooklyn
          Issue Type: Bug
            Reporter: Aled Sage


{{MachineEntityJcloudsRebindTest.testRebind}} fails non-deterministically in 1.0.0-SNAPSHOT:
{noformat}
2017-11-10 21:22:12,481 INFO  TESTNG FAILED: "Surefire test" - org.apache.brooklyn.entity.machine.MachineEntityJcloudsRebindTest.testRebind() finished in 30882 ms
java.lang.AssertionError: failed succeeds-eventually, 75 attempts, 30001ms elapsed: AssertionError: Commands (/etc/os-release) not contain in [ExecCmd{...},...]
        at org.apache.brooklyn.entity.machine.MachineEntityJcloudsRebindTest.assertRecordedSshCmdContainsEventually(MachineEntityJcloudsRebindTest.java:117)
        at org.apache.brooklyn.entity.machine.MachineEntityJcloudsRebindTest.testRebind(MachineEntityJcloudsRebindTest.java:103)
Caused by: java.lang.AssertionError: 
        at org.apache.brooklyn.entity.machine.MachineEntityJcloudsRebindTest.assertRecordedSshCmdContainsEventually(MachineEntityJcloudsRebindTest.java:117)
        at org.apache.brooklyn.entity.machine.MachineEntityJcloudsRebindTest.testRebind(MachineEntityJcloudsRebindTest.java:103)
{noformat}
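The failure message above comes from a "succeeds-eventually" style assertion. It re-runs a check until the check passes or a timeout expires (here 75 attempts over about 30 seconds), and then reports the last failure. The following is a minimal, hypothetical sketch of that retry pattern in plain Java. It is not Brooklyn's actual {{Asserts}} API. The names {{EventuallySketch}}, {{assertEventually}} and {{assertRecordedCmdContains}}, and the sample command string, are all illustrative assumptions.

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical retry-until-timeout helper; NOT Brooklyn's actual Asserts API. */
public final class EventuallySketch {

    /** Re-runs the assertion every period until it passes or the timeout elapses. */
    static void assertEventually(Runnable assertion, Duration timeout, Duration period)
            throws InterruptedException {
        long start = System.nanoTime();
        long deadline = start + timeout.toNanos();
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                assertion.run();
                return; // passed on this attempt
            } catch (AssertionError e) {
                // Give up once the deadline has passed, keeping the last failure as the cause.
                if (System.nanoTime() - deadline >= 0) {
                    long elapsedMs = Duration.ofNanos(System.nanoTime() - start).toMillis();
                    throw new AssertionError("failed succeeds-eventually, " + attempts
                            + " attempts, " + elapsedMs + "ms elapsed: " + e.getMessage(), e);
                }
            }
            Thread.sleep(period.toMillis());
        }
    }

    /** Hypothetical check: some recorded ssh command must mention the expected text. */
    static void assertRecordedCmdContains(List<String> recorded, String expected) {
        if (recorded.stream().noneMatch(cmd -> cmd.contains(expected))) {
            throw new AssertionError("Commands (" + expected + ") not contain in " + recorded);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> recorded = new CopyOnWriteArrayList<>();
        // Simulate a matching command (illustrative string) being recorded shortly afterwards.
        new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            }
            recorded.add("cat /etc/os-release");
        }).start();
        assertEventually(() -> assertRecordedCmdContains(recorded, "/etc/os-release"),
                Duration.ofSeconds(2), Duration.ofMillis(10));
        System.out.println("passed");
    }
}
```

In the failing test run, the expected command is never recorded because the feed never executes. Every attempt therefore fails until the timeout is reached.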


In terms of a bug in production code: when you rebind to a {{MachineEntity}}, 
it sometimes fails to re-add the {{machineMetricsFeed}} feed to the entity. 
As a result, sensors such as {{machine.loadAverage}} and {{machine.cpu}} are 
no longer updated.

The problem is that {{SoftwareProcessImpl.callRebindHooks()}} schedules a task 
to call {{connectSensors}} at a random time within the next 10 seconds. 
However, if this task executes very soon (before the current thread has 
progressed far enough through the rest of rebind), then {{connectSensors}} 
ends up calling {{JcloudsSshMachineLocation.inferMachineDetails}}, which fails 
because {{JcloudsSshMachineLocation.isManaged()}} still returns false. It 
therefore creates an empty {{BasicOsDetails}} rather than executing the 
{{os-details.sh}} script.
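The race described above can be modelled in a few lines of plain Java. This is a hypothetical sketch, not Brooklyn code. {{FakeMachineLocation}} stands in for {{JcloudsSshMachineLocation}}, {{inferOsDetails}} for {{inferMachineDetails}}, and a {{ScheduledExecutorService}} for the execution manager. The "rest of rebind" is assumed to be a simple sleep, after which the location is marked as managed. If the scheduled task runs first, it sees the location as unmanaged and returns empty details without running the script.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

/** Hypothetical model of the BROOKLYN-560 race; none of these classes exist in Brooklyn. */
public final class RebindRaceSketch {

    /** Stand-in for JcloudsSshMachineLocation. */
    static final class FakeMachineLocation {
        volatile boolean managed = false;
        final List<String> executedScripts = new CopyOnWriteArrayList<>();

        /** Models inferMachineDetails(): if unmanaged, return empty details and run nothing. */
        String inferOsDetails() {
            if (!managed) {
                return ""; // analogous to an empty BasicOsDetails
            }
            executedScripts.add("os-details.sh"); // analogous to running the script over ssh
            return "inferred-os-details";
        }
    }

    /** Models choosing a random delay in [0, maxDelayMillis]; 0 means "run immediately". */
    static long randomDelay(long maxDelayMillis) {
        return maxDelayMillis <= 0 ? 0 : ThreadLocalRandom.current().nextLong(maxDelayMillis + 1);
    }

    /** Schedules "connectSensors" after delayMillis, then finishes "rebind" after rebindWorkMillis. */
    static String simulateRebind(long delayMillis, long rebindWorkMillis) throws Exception {
        FakeMachineLocation location = new FakeMachineLocation();
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        try {
            Callable<String> connectSensors = location::inferOsDetails;
            ScheduledFuture<String> result =
                    executor.schedule(connectSensors, delayMillis, TimeUnit.MILLISECONDS);
            Thread.sleep(rebindWorkMillis); // current thread is still part-way through rebind
            location.managed = true;        // only now does the location count as managed
            return result.get();
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Task fires before rebind has finished: details come back empty.
        System.out.println("early task: '" + simulateRebind(0, 200) + "'");
        // Task fires after the location is managed: details are inferred.
        System.out.println("late task:  '" + simulateRebind(300, 0) + "'");
        // A random delay against some rebind work: the outcome varies from run to run.
        System.out.println("random:     '" + simulateRebind(randomDelay(100), 50) + "'");
    }
}
```

In this model, a maximum delay of 0 makes the early case happen essentially every time, while a random delay makes the outcome depend on thread timing.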

You can make this test fail consistently if you change 
{{MachineEntity.MAXIMUM_REBIND_SENSOR_CONNECT_DELAY}} to {{0}} (rather than the 
{{100ms}} that is set in the test).

The stacktrace of the job scheduled by {{SoftwareProcessImpl.callRebindHooks}} 
is shown below:
{noformat}
Daemon Thread [brooklyn-execmanager-Vv9IUPme-1] (Suspended)     
        owns: Object  (id=571)  
        JcloudsSshMachineLocation.inferMachineDetails() line: 533       
        JcloudsSshMachineLocation(SshMachineLocation).getMachineDetails() line: 1037
        JcloudsSshMachineLocation(SshMachineLocation).getOsDetails() line: 1018 
        MachineEntityImpl.connectSensors() line: 58     
        SoftwareProcessImpl$2.call() line: 402  
        SoftwareProcessImpl$2.call() line: 1    
        BasicExecutionManager$ScheduledTaskCallable$1.call() line: 476  
        BasicExecutionManager$SubmissionCallable<T>.call() line: 565    
        FutureTask<V>.run() line: 266   
        ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) line: 1142      
        ThreadPoolExecutor$Worker.run() line: 617       
        Thread.run() line: 745  
{noformat}

Note this relates to https://issues.apache.org/jira/browse/BROOKLYN-425, which 
reported similar symptoms of the feed not being registered, but where it was 
happening all the time.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
