[ 
https://issues.apache.org/jira/browse/SOLR-13118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hoss Man reassigned SOLR-13118:
-------------------------------

      Assignee: Hoss Man
    Attachment: SOLR-13118.patch

In the attached patch, I've remedied this (in 
{{TestSimTriggerIntegration.testNodeLostTriggerRestoreState}} ) via the 
following (non-test) API changes:
 * {{SimCloudManager}} now exposes a public {{getOverseerTriggerThread()}} 
method – allowing Sim tests the same level of access to the 
OverseerTriggerThread/ScheduledTriggers that non-sim tests can get via the 
Overseer node (by way of its JettyRunner->CoreContainer)
 * {{ScheduledTriggers}} now exposes a public (lucene.internal, for tests only) 
{{getTrigger(String name)}} method
 * {{TriggerBase}} has been refactored to expose a public (lucene.internal, for 
tests only) {{deepCopyState()}} method that returns a deep copy of the trigger's 
state
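
To illustrate why a deep copy matters for test introspection, here is a minimal, self-contained sketch. It is not code from the patch: {{SketchTrigger}}, {{trackLostNode}} and the "lostNodes" key are hypothetical names, and the real {{TriggerBase}} state layout may well differ. The idea is only that a test gets a snapshot it can inspect safely while the trigger thread keeps mutating the live state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a trigger with mutable internal state
// (an assumption for illustration; this is NOT Solr's TriggerBase).
class SketchTrigger {
    private final Map<String, Object> state = new HashMap<>();

    // Record the time at which a node was first seen as lost.
    @SuppressWarnings("unchecked")
    synchronized void trackLostNode(String node, long timeNs) {
        Map<String, Long> lost = (Map<String, Long>)
            state.computeIfAbsent("lostNodes", k -> new HashMap<String, Long>());
        lost.put(node, timeNs);
    }

    // Snapshot for tests: later changes to the trigger do not leak into the copy.
    @SuppressWarnings("unchecked")
    synchronized Map<String, Object> deepCopyState() {
        return (Map<String, Object>) deepCopy(state);
    }

    // Recursively copy nested maps and lists; other values are assumed immutable.
    private static Object deepCopy(Object value) {
        if (value instanceof Map) {
            Map<Object, Object> copy = new HashMap<>();
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                copy.put(e.getKey(), deepCopy(e.getValue()));
            }
            return copy;
        }
        if (value instanceof List) {
            List<Object> copy = new ArrayList<>();
            for (Object o : (List<?>) value) {
                copy.add(deepCopy(o));
            }
            return copy;
        }
        return value;
    }
}
```

A shallow copy would share the nested "lostNodes" map with the live trigger, so a test could see it change (or hit a concurrent modification) mid-assertion.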

With these changes, the test(s) can now use the following flow...
 * register an initial trigger w/an effectively infinite {{waitFor}} 
configuration
 * create the nodeAdd/nodeLost situation
 * reach into the ScheduledTriggers in a {{TimeOut.waitFor(...)}} to inspect 
the state of the trigger being tested, to know once it has run(), detected the 
situation/event, and started tracking it
 ** but not yet executed any actions because of the {{waitFor}} property
 * update the trigger w/ {{waitFor: 0s}}
 * use the latches & event queues of the "mock" trigger actions to confirm that 
the event state information gets preserved from one trigger instance to the 
next, and ultimately process()ed
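
As a rough illustration of the flow above, here is a self-contained simulation. It is a sketch under stated assumptions, not the test from the patch: {{FakeTrigger}}, {{pollUntil}} and {{runFlow}} are hypothetical names, a plain executor stands in for ScheduledTriggers, and none of the real Solr classes are used.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.BooleanSupplier;

class RestoreStateFlowSketch {

    // Hypothetical trigger: tracks lost nodes, fires once waitForNs has elapsed.
    static class FakeTrigger {
        final long waitForNs;
        final Map<String, Long> lostNodes = new ConcurrentHashMap<>();

        FakeTrigger(long waitForNs) { this.waitForNs = waitForNs; }

        // A replacement instance inherits what the old instance was tracking.
        void restoreState(FakeTrigger old) { lostNodes.putAll(old.lostNodes); }

        // One scheduled run: start tracking new losses, fire those that waited long enough.
        void run(List<String> currentlyLost, long nowNs, BlockingQueue<Long> events) {
            for (String node : currentlyLost) lostNodes.putIfAbsent(node, nowNs);
            for (Map.Entry<String, Long> e : lostNodes.entrySet()) {
                if (nowNs - e.getValue() >= waitForNs) {
                    events.add(e.getValue()); // the "event" carries the original detection time
                    lostNodes.remove(e.getKey());
                }
            }
        }
    }

    // Poll until the condition holds or the timeout expires.
    static boolean pollUntil(long timeoutMs, BooleanSupplier condition) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (System.nanoTime() < deadline) {
            if (condition.getAsBoolean()) return true;
            Thread.sleep(10);
        }
        return condition.getAsBoolean();
    }

    static boolean runFlow() {
        BlockingQueue<Long> events = new LinkedBlockingQueue<>();
        List<String> lost = new CopyOnWriteArrayList<>();
        // Step 1: initial trigger with an effectively infinite waitFor.
        AtomicReference<FakeTrigger> active = new AtomicReference<>(new FakeTrigger(Long.MAX_VALUE));
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(
            () -> active.get().run(lost, System.nanoTime(), events), 0, 20, TimeUnit.MILLISECONDS);
        try {
            // Step 2: create the nodeLost situation.
            lost.add("node1");
            // Step 3: wait until the first instance has run and is tracking the event...
            if (!pollUntil(5000, () -> active.get().lostNodes.containsKey("node1"))) return false;
            Long trackedSince = active.get().lostNodes.get("node1");
            // ...but has not yet fired anything, because of the huge waitFor.
            if (!events.isEmpty()) return false;
            // Step 4: "update" the trigger with waitFor 0; the new instance restores the old state.
            FakeTrigger updated = new FakeTrigger(0);
            updated.restoreState(active.get());
            active.set(updated);
            // Step 5: the fired event should carry the time recorded by the first instance.
            Long fired = events.poll(5, TimeUnit.SECONDS);
            return trackedSince.equals(fired);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            scheduler.shutdownNow();
        }
    }
}
```

The point of step 3 is that the test no longer guesses when the first instance has run; it polls the trigger's state directly, so the update in step 4 cannot race ahead of the initial detection.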

----

Unless there are any concerns with this approach, I'll try to update the other 
similarly problematic tests ASAP...
 * TestSimTriggerIntegration.testNodeAddedTriggerRestoreState
 * NodeLostTriggerIntegrationTest.testNodeLostTriggerRestoreState
 * NodeAddedTriggerIntegrationTest.testNodeAddedTriggerRestoreState

(I'd also like to look into rewriting 
TestSimTriggerIntegration.testEventFromRestoredState to use an effectively 
infinite {{waitFor}} and the new direct state introspection before & after 
restarting the overseer, rather than the current (delicately problematic) 
precise sleep times ... but that's a lower priority at the moment that I 
haven't fully thought through)

> Redesign integration tests for nodeAdded/nodeLost trigger state restoration
> ---------------------------------------------------------------------------
>
>                 Key: SOLR-13118
>                 URL: https://issues.apache.org/jira/browse/SOLR-13118
>             Project: Solr
>          Issue Type: Sub-task
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Assignee: Hoss Man
>            Priority: Major
>         Attachments: SOLR-13118.patch
>
>
> The (integration) tests related to autoscaling nodeAdd/nodeLost triggers and 
> restoring their state are problematic for a lot of reasons.
> Beyond some silly implementation mistakes, a fundamental timing/concurrency 
> issue is that (as designed) the tests have no way to ensure that "after" 
> creating a nodeAdded/nodeLost situation, they can wait for the (first 
> instance of) the trigger to run() and detect the situation (recording it in 
> the trigger's internal state) so that the test can subsequently "update" the 
> trigger, forcing a new instance to restore the old state and then execute the 
> trigger actions.  This can result in a lot of flakiness if the triggers 
> don't run when "expected"



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

