Stefan Egli created SLING-4139:
----------------------------------

             Summary: regression: stale topology announcements possible after crash/reconfig
                 Key: SLING-4139
                 URL: https://issues.apache.org/jira/browse/SLING-4139
             Project: Sling
          Issue Type: Bug
          Components: Extensions
    Affects Versions: Discovery Impl 1.0.12
            Reporter: Stefan Egli
            Assignee: Stefan Egli
             Fix For: Discovery Impl 1.0.14


discovery.impl 1.0.4, with SLING-3389, introduced a bug that makes it possible 
for a stale topology announcement to remain in the system (without being 
cleaned up) when a combination of crash/restart and reconfiguration/switch-over 
of topology connectors occurs.

SLING-3726 describes one symptom of this problem, which resulted in a duplicate 
instance in the topology tree reported by discovery.impl.

Another case where this can be reproduced is the following scenario:
 * consider 3 instances A, B and C. A and B are in the same cluster. C has a 
topology connector to A.
 * now A crashes - which leaves B and C not seeing each other through the 
topology (which is correct since the connector C-A is not possible)
 * now consider C removing the topology connector (config change) - hence C 
will see itself isolated in a topology (which is correct)
 * now consider A to restart
 ** at this point the announcement from C is still stored under 
/var/discovery/impl/clusterInstance/A/announcements/C
 ** there is a filter which reports an incoming announcement (i.e. one held by 
A in this case) only if the connector client (C in this case) is actually 
connected. As a result, A reports a topology which consists of only 1 cluster 
containing A and B (which is correct).
 ** the above-mentioned filter, however, does not apply to B. SLING-3389 
stopped announcement timestamps from being written to the repository in order 
to reduce write activity (which was thought to be unnecessary). Thus, since 
SLING-3389, the idea is that it is A's responsibility to make sure all the 
announcements it contains (from C in this case) are current/alive/correct.
 ** unfortunately (and that's the bug in this case) there is only a filter 
(which applies to A) but no actual removal of outdated announcements. Thus B 
will report a topology consisting of 1 cluster containing A and B - plus it 
reports C in the topology as well (as it 'sees' C through the announcement 
stored at A/announcements/C).

Hence the filter mechanism which replaced timestamps in SLING-3389 introduced a 
regression and must be replaced with a proper cleanup mechanism for outdated 
announcements.
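
To illustrate the difference between filtering and cleanup, here is a minimal 
in-memory sketch of the scenario above. It is a toy model only, not the 
discovery.impl code: the class StaleAnnouncementSketch and all of its methods 
are hypothetical names, and two plain maps stand in for the announcements 
stored in the repository and for the connector state held in memory.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of the scenario; all names are hypothetical, not discovery.impl API. */
class StaleAnnouncementSketch {

    // owner -> remote ids whose announcements are persisted under the owner,
    // standing in for .../clusterInstance/<owner>/announcements/<remote>
    private final Map<String, Set<String>> stored = new HashMap<>();

    // owner -> remote ids with a live connector (in-memory state, lost on crash)
    private final Map<String, Set<String>> connected = new HashMap<>();

    void storeAnnouncement(String owner, String remote) {
        stored.computeIfAbsent(owner, k -> new TreeSet<>()).add(remote);
    }

    void connect(String owner, String remote) {
        connected.computeIfAbsent(owner, k -> new TreeSet<>()).add(remote);
    }

    /** Models a crash of the owner or removal of its connectors. */
    void disconnectAll(String owner) {
        connected.remove(owner);
    }

    /** The owner's own view: stored announcements, filtered by live connectors. */
    Set<String> seenByOwner(String owner) {
        Set<String> result = new TreeSet<>(
                stored.getOrDefault(owner, Collections.<String>emptySet()));
        result.retainAll(connected.getOrDefault(owner, Collections.<String>emptySet()));
        return result;
    }

    /** A cluster peer's view: it reads the stored announcements unfiltered. */
    Set<String> seenByClusterPeer(String owner) {
        return new TreeSet<>(stored.getOrDefault(owner, Collections.<String>emptySet()));
    }

    /** Cleanup: actually delete announcements that have no live connector. */
    void cleanup(String owner) {
        Set<String> announcements = stored.get(owner);
        if (announcements != null) {
            announcements.retainAll(
                    connected.getOrDefault(owner, Collections.<String>emptySet()));
        }
    }

    public static void main(String[] args) {
        StaleAnnouncementSketch repo = new StaleAnnouncementSketch();
        repo.connect("A", "C");            // C has a topology connector to A
        repo.storeAnnouncement("A", "C");  // C's announcement is stored under A
        repo.disconnectAll("A");           // A crashes, C removes the connector, A restarts

        System.out.println("A sees: " + repo.seenByOwner("A"));       // A sees: []
        System.out.println("B sees: " + repo.seenByClusterPeer("A")); // B sees: [C]

        repo.cleanup("A");                 // remove outdated announcements
        System.out.println("B sees: " + repo.seenByClusterPeer("A")); // B sees: []
    }
}
```

In this model the filter only hides the stale entry from A, while B keeps 
reading it until cleanup() deletes it, which is the behaviour the issue asks 
for.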



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
