[ https://issues.apache.org/jira/browse/SOLR-13118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hoss Man reassigned SOLR-13118:
-------------------------------

    Assignee: Hoss Man
  Attachment: SOLR-13118.patch

In the attached patch, I've remedied this (in {{TestSimTriggerIntegration.testNodeLostTriggerRestoreState}}) via the following (non-test) API changes:
* {{SimCloudManager}} now exposes a public {{getOverseerTriggerThread()}} method, allowing Sim tests the same level of access to the OverseerTriggerThread/ScheduledTriggers that non-sim tests can get via the Overseer node (by way of its JettyRunner->CoreContainer)
* {{ScheduledTriggers}} now exposes a public (lucene.internal for tests only) method to {{getTrigger(String name)}}
* {{TriggerBase}} has been refactored to expose a public (lucene.internal for tests only) method to get a {{deepCopyState()}}

With these changes, the test(s) can now use the following flow...
* register an initial trigger w/ an effectively infinite {{waitFor}} configuration
* create the nodeAdd/nodeLost situation
* reach into the ScheduledTriggers in a {{TimeOut.waitFor(...)}} to inspect the state of the trigger being tested, to know once it's run() and detected the situation/event and started tracking it
** but not yet executed any actions because of the {{waitFor}} property
* update the trigger w/ {{waitFor: 0s}}
* use the latches & event queues of the "mock" trigger actions to confirm that the event state information gets preserved from one trigger instance to the next, and ultimately process()ed

----

Unless there are any concerns with this approach, I'll try to update the other similarly problematic tests ASAP...
* TestSimTriggerIntegration.testNodeAddedTriggerRestoreState
* NodeLostTriggerIntegrationTest.testNodeLostTriggerRestoreState
* NodeAddedTriggerIntegrationTest.testNodeAddedTriggerRestoreState

(I'd also like to look into re-writing TestSimTriggerIntegration.testEventFromRestoredState to use an effectively infinite {{waitFor}} and the new direct state introspection before & after restarting the overseer, rather than the current (delicately problematic) precise sleep times ... but that's a lower priority at the moment that I haven't fully thought through)

> Redesign integration tests for nodeAdded/nodeLost trigger state restoration
> ---------------------------------------------------------------------------
>
>                 Key: SOLR-13118
>                 URL: https://issues.apache.org/jira/browse/SOLR-13118
>             Project: Solr
>          Issue Type: Sub-task
>      Security Level: Public(Default Security Level. Issues are Public)
>            Reporter: Hoss Man
>            Assignee: Hoss Man
>            Priority: Major
>         Attachments: SOLR-13118.patch
>
>
> The (integration) tests related to autoscaling nodeAdd/nodeLost triggers and
> restoring their state are problematic for a lot of reasons.
> Beyond some silly implementation mistakes, a fundamental timing/concurrency
> issue is that (as designed) the tests have no way to ensure that "after"
> creating a nodeAdded/nodeLost situation, they can wait for the (first
> instance of) the trigger to run() and detect the situation (recording it in
> the trigger's internal state) so that the test can subsequently "update" the
> trigger, forcing a new instance to restore the old state and then execute the
> trigger actions. This can result in a lot of flakiness if the triggers
> don't run when "expected"

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
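
For illustration only, here is a minimal, self-contained sketch of the "wait until the trigger has recorded the event in its state" step from the flow described in the comment. It does not use the Solr classes: {{FakeNodeLostTrigger}}, its state map, and the local {{waitFor}} helper are hypothetical stand-ins for what the patch reaches via {{getOverseerTriggerThread()}}, {{getTrigger(String name)}}, {{deepCopyState()}} and {{TimeOut.waitFor(...)}}, whose real signatures and state keys are not shown here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class TriggerStateWaitSketch {

  /** Hypothetical stand-in for a trigger; NOT Solr's TriggerBase. */
  static final class FakeNodeLostTrigger {
    // Hypothetical internal state: lost node name -> time it was first seen missing.
    private final Map<String, Long> lostNodes = new ConcurrentHashMap<>();

    /** Simulates one run() that notices a lost node but fires no actions yet. */
    void run(String lostNode) {
      lostNodes.putIfAbsent(lostNode, System.nanoTime());
    }

    /** Returns a snapshot so callers cannot mutate the trigger's own state. */
    Map<String, Long> deepCopyState() {
      return new HashMap<>(lostNodes);
    }
  }

  /** Polls until the condition holds, or fails once the timeout expires. */
  static void waitFor(String what, long timeoutMs, Supplier<Boolean> condition)
      throws InterruptedException, TimeoutException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (!condition.get()) {
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException("Timed out waiting for " + what);
      }
      Thread.sleep(10);
    }
  }

  public static void main(String[] args) throws Exception {
    FakeNodeLostTrigger trigger = new FakeNodeLostTrigger();

    // Simulate the scheduler running the trigger at some unknown later time.
    Thread scheduler = new Thread(() -> {
      try {
        Thread.sleep(50);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      trigger.run("node1");
    });
    scheduler.start();

    // Instead of sleeping a "precise" amount, poll the copied state directly.
    waitFor("trigger to record node1 as lost", 5000,
        () -> trigger.deepCopyState().containsKey("node1"));
    scheduler.join();
    System.out.println("state captured: " + trigger.deepCopyState().keySet());
  }
}
```

The point of the sketch is only the design choice: the test blocks on an observable condition in a copy of the trigger's state rather than on a fixed sleep, so it does not depend on when the scheduler happens to run the trigger.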