[ 
https://issues.apache.org/jira/browse/BROOKLYN-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16362071#comment-16362071
 ] 

ASF GitHub Bot commented on BROOKLYN-580:
-----------------------------------------

Github user aledsage commented on a diff in the pull request:

    https://github.com/apache/brooklyn-server/pull/947#discussion_r167810983
  
    --- Diff: 
software/base/src/main/java/org/apache/brooklyn/entity/software/base/SoftwareProcessImpl.java
 ---
    @@ -385,33 +384,76 @@ protected void callRebindHooks() {
                 connectSensors();
             } else {
                 long delay = (long) (Math.random() * 
configuredMaxDelay.toMilliseconds());
    -            LOG.debug("Scheduled reconnection of sensors on {} in {}ms", 
this, delay);
    +            LOG.debug("Scheduling reconnection of sensors on {} in {}ms", 
this, delay);
                 
    -            // This is functionally equivalent to new 
scheduledExecutor.schedule(job, delay, TimeUnit.MILLISECONDS).
    -            // It uses the entity's execution context to schedule and thus 
execute the job.
    -            Map<?,?> flags = MutableMap.of("delay", 
Duration.millis(delay), "maxIterations", 1, "cancelOnException", true);
    -            Callable<Void> job = new Callable<Void>() {
    -                public Void call() {
    -                    try {
    -                        if (getManagementSupport().isNoLongerManaged()) {
    -                            LOG.debug("Entity {} no longer managed; 
ignoring scheduled connect sensors on rebind", SoftwareProcessImpl.this);
    -                            return null;
    -                        }
    -                        connectSensors();
    -                    } catch (Throwable e) {
    -                        LOG.warn("Problem connecting sensors on rebind of 
"+SoftwareProcessImpl.this, e);
    -                        Exceptions.propagateIfFatal(e);
    -                    }
    -                    return null;
    -                }};
    -            ScheduledTask scheduledTask = new ScheduledTask(flags, new 
BasicTask<Void>(job));
    -            getExecutionContext().submit(scheduledTask);
    +            scheduleConnectSensorsOnRebind(Duration.millis(delay));
             }
    +        
             // don't wait here - it may be long-running, e.g. if remote entity 
has died, and we don't want to block rebind waiting or cause it to fail
             // the service will subsequently show service not up and thus 
failure
     //        waitForServiceUp();
         }
     
    +    protected void scheduleConnectSensorsOnRebind(Duration delay) {
    +        Callable<Void> job = new Callable<Void>() {
    +            public Void call() {
    +                try {
    +                    if (!getManagementContext().isRunning()) {
    +                        LOG.debug("Management context not running; entity 
{} ignoring scheduled connect-sensors on rebind", SoftwareProcessImpl.this);
    +                        return null;
    +                    }
    +                    if (getManagementSupport().isNoLongerManaged()) {
    +                        LOG.debug("Entity {} no longer managed; ignoring 
scheduled connect-sensors on rebind", SoftwareProcessImpl.this);
    +                        return null;
    +                    }
    +                    
    +                    // Don't call connectSensors until the entity is 
actually managed.
    +                    // See 
https://issues.apache.org/jira/browse/BROOKLYN-580
    +                    boolean rebindActive = 
getManagementContext().getRebindManager().isRebindActive();
    +                    if (!getManagementSupport().wasDeployed()) {
    +                        if (rebindActive) {
    +                            // We are still rebinding, and entity not yet 
managed - reschedule.
    +                            Duration configuredMaxDelay = 
getConfig(MAXIMUM_REBIND_SENSOR_CONNECT_DELAY);
    +                            if (configuredMaxDelay == null) 
configuredMaxDelay = Duration.millis(100);
    +                            long delay = (long) (Math.random() * 
configuredMaxDelay.toMilliseconds());
    +                            delay = Math.max(10, delay);
    +                            LOG.debug("Entity {} not yet managed; 
re-scheduling connect-sensors on rebind in {}ms", SoftwareProcessImpl.this, 
delay);
    +                            
    +                            
scheduleConnectSensorsOnRebind(Duration.millis(delay));
    --- End diff --
    
    Good question. I think it's safe because `rebindActive` is cleared in a 
finally block 
(https://github.com/apache/brooklyn-server/blob/master/core/src/main/java/org/apache/brooklyn/core/mgmt/rebind/RebindIteration.java#L302).
    
    Were you thinking of a circuit breaker based on a max time (given there are 
already "circuit breakers" based on rebind having finished, etc)?
    
    I don't really like setting max times because it's hard to know what a 
sensible limit is. For example, when I was load testing Brooklyn with 10s of 
thousands of entities then rebind took many minutes - but it did eventually 
work.
    
    Therefore I'm going to merge this. But if you think a time-based circuit 
breaker is still worth it, or another kind of circuit breaker, then please do 
let me know.


> Rebinding to MachineEntity: sometimes fails to reconnect sensor feeds
> ---------------------------------------------------------------------
>
>                 Key: BROOKLYN-580
>                 URL: https://issues.apache.org/jira/browse/BROOKLYN-580
>             Project: Brooklyn
>          Issue Type: Bug
>    Affects Versions: 0.12.0
>            Reporter: Aled Sage
>            Priority: Major
>
> On rebind, sometimes \{{MachineEntity}} instances do not have their feeds 
> recreated. This is illustrated by non-deterministic test failure in 
> \{{MachineEntityJcloudsRebindTest}}.
> The problem is that \{{SoftwareProcessImpl.callRebindHooks}} schedules a task 
> to call \{{connectSensors}} in something between 0 and 10 seconds time, which 
> will try to recreate the feeds. However, if this executes too soon (while 
> rebind is still happening), the \{{SshMachineLocation}} may not yet be 
> managed. If that is the case, the feed is not created.
> This is most likely to happen if there are a lot of entities/locations, so 
> iterating over them for rebind takes longer. It is random in that the delay 
> in calling \{{connectSensors}} can sometimes be extremely short (the 
> randomness there is to avoid the thundering herd problem on rebind).
> Although the symptoms are similar to 
> https://issues.apache.org/jira/browse/BROOKLYN-425, the underlying cause is 
> different - therefore treating this as a new issue rather than reopening the 
> old one.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to