Stefan Egli created SLING-4139:
----------------------------------

             Summary: regression: stale topology announcements possible after crash/reconfig
                 Key: SLING-4139
                 URL: https://issues.apache.org/jira/browse/SLING-4139
             Project: Sling
          Issue Type: Bug
          Components: Extensions
    Affects Versions: Discovery Impl 1.0.12
            Reporter: Stefan Egli
            Assignee: Stefan Egli
             Fix For: Discovery Impl 1.0.14

discovery.impl 1.0.4, with SLING-3389, introduced a bug whereby a stale topology announcement could remain in the system (and not be cleaned up) when a combination of crash/restart and reconfiguration/switch-over of topology connectors occurred. SLING-3726 describes one symptom of this problem, which resulted in a duplicate instance in the topology tree reported by discovery.impl.

Another case where this can be reproduced is the following scenario:
* consider 3 instances A, B and C. A and B are in the same cluster. C has a topology connector to A.
* now A crashes, which leaves B and C not seeing each other through the topology (which is correct, since the connector C-A is not possible)
* now consider C removing the topology connector (config change), hence C will see itself isolated in a topology (which is correct)
* now consider A restarting
** at this point the announcement from C is still stored under /var/discovery/impl/clusterInstance/A/announcements/C
** there is a filter which only reports incoming announcements (i.e. A's announcements in this case) if the connector client (C in this case) is really connected. This results in A reporting a topology which consists of only 1 cluster containing A and B (which is correct).
** the above-mentioned filter, however, does not apply to B. SLING-3389 stopped writing announcement timestamps to the repository in order to reduce write activity (which was thought to be unnecessary). Thus, after SLING-3389, the idea is that it is A's responsibility to make sure all the announcements it contains (from C in this case) are current/alive/correct.
** now unfortunately (and that's the bug in this case) there is only a filter (which applies to A) but no actual removal of outdated announcements. Thus B will report a topology consisting of 1 cluster containing A and B, plus it reports C in the topology as well (as it 'sees' C through the announcement stored at A/announcements/C).

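To make the difference between filtering and removal concrete, the scenario above can be sketched as a small toy model. This is only an illustration under stated assumptions, not the discovery.impl code: the class and method names are hypothetical, the repository is replaced by an in-memory map, and "connected" is reduced to a simple set of connector clients known only to the owning instance.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of the A/B/C scenario. All names are hypothetical, not discovery.impl API. */
final class StaleAnnouncementSketch {

    // Stand-in for <owner>/announcements/<sender> in the repository (visible to all cluster peers).
    private final Map<String, Set<String>> announcementsByOwner = new HashMap<>();
    // Connector clients currently connected to an owner (assumed to be known to the owner only).
    private final Map<String, Set<String>> connectedClientsByOwner = new HashMap<>();

    /** The sender connects to the owner and its announcement gets stored at the owner. */
    void storeAnnouncement(String owner, String sender) {
        announcementsByOwner.computeIfAbsent(owner, k -> new HashSet<>()).add(sender);
        connectedClientsByOwner.computeIfAbsent(owner, k -> new HashSet<>()).add(sender);
    }

    /** The connector goes away (owner crash or config change); the stored announcement stays. */
    void disconnect(String owner, String sender) {
        Set<String> clients = connectedClientsByOwner.get(owner);
        if (clients != null) {
            clients.remove(sender);
        }
    }

    /** View of the owner itself: stored announcements, filtered by live connector clients. */
    Set<String> remoteInstancesSeenByOwner(String owner) {
        Set<String> result = new TreeSet<>(
                announcementsByOwner.getOrDefault(owner, Collections.emptySet()));
        result.retainAll(connectedClientsByOwner.getOrDefault(owner, Collections.emptySet()));
        return result;
    }

    /** View of a cluster peer: it reads what is stored at the owner and has no filter to apply. */
    Set<String> remoteInstancesSeenByPeer(String owner) {
        return new TreeSet<>(announcementsByOwner.getOrDefault(owner, Collections.emptySet()));
    }

    /** Cleanup instead of filtering: the owner removes announcements without a live connector. */
    void cleanupOutdatedAnnouncements(String owner) {
        Set<String> stored = announcementsByOwner.get(owner);
        if (stored != null) {
            stored.retainAll(connectedClientsByOwner.getOrDefault(owner, Collections.emptySet()));
        }
    }

    public static void main(String[] args) {
        StaleAnnouncementSketch sketch = new StaleAnnouncementSketch();
        sketch.storeAnnouncement("A", "C"); // C has a topology connector to A
        sketch.disconnect("A", "C");        // A crashes, C removes the connector, A restarts
        System.out.println("A sees remote instances: " + sketch.remoteInstancesSeenByOwner("A"));
        System.out.println("B sees remote instances: " + sketch.remoteInstancesSeenByPeer("A"));
        sketch.cleanupOutdatedAnnouncements("A");
        System.out.println("B after cleanup:         " + sketch.remoteInstancesSeenByPeer("A"));
    }
}
```

In this model A's filtered view no longer contains C, while B's unfiltered view still does until the cleanup actually removes the stored entry, which mirrors the inconsistency described in the scenario.
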
Hence the filter mechanism which replaced timestamps in SLING-3389 introduced a regression and must be replaced with a proper cleanup mechanism for outdated announcements.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)