[ https://issues.apache.org/jira/browse/SOLR-13072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hoss Man reopened SOLR-13072:
-----------------------------

[~ab] - if you fixed the underlying problem (and updated {{NodeMarkersRegistrationTest}} to know about the new mechanism for dealing with these markers), is there any reason why {{NodeMarkersRegistrationTest}} still needs the {{@AwaitsFix(bugUrl="https://issues.apache.org/jira/browse/SOLR-13072")}} I added when I created this issue?

Also: your commit seems to have caused {{TestSimTriggerIntegration.testNodeMarkersRegistration}} to start failing fairly reliably ...

I'm guessing this is because, with your changes, the only thing that _should_ clear up these paths is the {{.scheduled_maintenance}} trigger's {{inactive_markers_plan}} action - but {{TestSimTriggerIntegration.setupTest()}} explicitly disables {{.scheduled_maintenance}} ... but there is also still code in {{OverseerTriggerThread}} that deals with these markers -- so frankly I'm not really sure what the "correct" behavior is...

{code}
log.debug("-- cleaning old nodeLost / nodeAdded markers");
removeMarkers(ZkStateReader.SOLR_AUTOSCALING_NODE_LOST_PATH);
removeMarkers(ZkStateReader.SOLR_AUTOSCALING_NODE_ADDED_PATH);
{code}

> Management of markers for nodeLost / nodeAdded events is broken
> ---------------------------------------------------------------
>
>                 Key: SOLR-13072
>                 URL: https://issues.apache.org/jira/browse/SOLR-13072
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>          Components: AutoScaling
>    Affects Versions: 7.5, 7.6, master (8.0)
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>            Priority: Major
>             Fix For: master (8.0), 7.7
>
>
> In order to prevent {{nodeLost}} events from being lost when it's the
> Overseer leader that is the node that was lost, a mechanism was added to
> record markers for these events by any other live node, in
> {{ZkController.registerLiveNodesListener()}}. A similar mechanism also
> exists for {{nodeAdded}} events.
> On Overseer leader restart, if the autoscaling configuration didn't contain
> any triggers that consume {{nodeLost}} events then these markers are removed.
> If there are 1 or more trigger configs that consume {{nodeLost}} events then
> these triggers would read the markers, remove them and generate appropriate
> events.
> However, as the {{NodeMarkersRegistrationTest}} shows, this mechanism is
> broken and susceptible to race conditions.
> It's not unusual to have more than 1 {{nodeLost}} trigger because in addition
> to any user-defined triggers there's always one that is automatically defined
> if missing: {{.auto_add_replicas}}. However, if there's more than 1
> {{nodeLost}} trigger then the process of consuming and removing the markers
> becomes non-deterministic - each trigger may pick up (and delete) all, none,
> or some of the markers.
> So as it is now this mechanism is broken if more than 1 {{nodeLost}} or more
> than 1 {{nodeAdded}} trigger is defined.
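
The failure mode described in the issue can be illustrated with a small, self-contained sketch. This is not Solr code: {{MarkerStore}}, {{readAndRemoveAll()}}, {{readAll()}} and {{removeAll()}} are hypothetical names, and an in-memory set stands in for the ZooKeeper marker paths. It only shows why a destructive read shared by two consumers leaves the second one with whatever the first did not take, and how separating the read from a single cleanup step (which is roughly what delegating cleanup to one maintenance action would amount to, as an assumption) avoids that.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class MarkerConsumptionSketch {

    /** Hypothetical in-memory stand-in for the ZooKeeper path that holds nodeLost markers. */
    static class MarkerStore {
        private final Set<String> markers = new TreeSet<>();

        synchronized void add(String nodeName) {
            markers.add(nodeName);
        }

        /** Destructive read: the caller gets every marker currently present and deletes them. */
        synchronized List<String> readAndRemoveAll() {
            List<String> seen = new ArrayList<>(markers);
            markers.clear();
            return seen;
        }

        /** Non-destructive read: the caller gets a copy, the markers stay in place. */
        synchronized List<String> readAll() {
            return new ArrayList<>(markers);
        }

        /** Cleanup step meant to be run by exactly one owner; returns how many markers were removed. */
        synchronized int removeAll() {
            int removed = markers.size();
            markers.clear();
            return removed;
        }
    }

    /** Builds a store pre-populated with the given (made-up) node names. */
    static MarkerStore storeWith(String... nodeNames) {
        MarkerStore store = new MarkerStore();
        for (String nodeName : nodeNames) {
            store.add(nodeName);
        }
        return store;
    }

    public static void main(String[] args) {
        // Two triggers sharing a destructive read: whichever runs first takes everything.
        MarkerStore destructive = storeWith("node1", "node2");
        List<String> firstTrigger = destructive.readAndRemoveAll();
        List<String> secondTrigger = destructive.readAndRemoveAll();
        System.out.println("destructive read: first=" + firstTrigger + " second=" + secondTrigger);

        // Same two triggers with a non-destructive read and one separate cleanup step.
        MarkerStore shared = storeWith("node1", "node2");
        List<String> firstView = shared.readAll();
        List<String> secondView = shared.readAll();
        int cleaned = shared.removeAll();
        System.out.println("shared read: first=" + firstView + " second=" + secondView
                + " cleaned=" + cleaned);
    }
}
```

In the sketch the two destructive reads run sequentially, so the outcome is fixed; with real triggers running concurrently, which trigger sees which markers would depend on scheduling, which matches the "all, none, or some" behavior described above.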