[ 
https://issues.apache.org/jira/browse/QPID-6560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keith Wall updated QPID-6560:
-----------------------------
    Summary: [Java Broker] BDB HA JE environment close on intruder detection 
might block the execution of VHN children tasks thus causing unnecessary delays 
in shutdown of ReplicatedEnvironmentFacade executors  (was: [Java Broker] BDB 
HA JE environment close on intruder detection might block the execution of VHN 
children tasks thus causing unecessary delays in shutdown of 
ReplicatedEnvironmentFacade executors)

> [Java Broker] BDB HA JE environment close on intruder detection might block 
> the execution of VHN children tasks thus causing unnecessary delays in 
> shutdown of ReplicatedEnvironmentFacade executors
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: QPID-6560
>                 URL: https://issues.apache.org/jira/browse/QPID-6560
>             Project: Qpid
>          Issue Type: Bug
>    Affects Versions: 0.32
>            Reporter: Alex Rudyy
>            Assignee: Alex Rudyy
>             Fix For: qpid-java-6.0
>
>
> On intruder detection, a task to close VHN children and set the VHN state to 
> ERRORED is scheduled in the Broker configuration thread. Immediately after 
> scheduling the task, ReplicatedEnvironmentFacade.close() is invoked.
> The ReplicatedEnvironmentFacade executors are shut down in the close method.
> If any of the ReplicatedEnvironmentFacade executors has pending work (tasks to 
> run) and that work needs to be performed synchronously in the VHN 
> configuration thread or the Broker configuration thread (blocking the 
> ReplicatedEnvironmentFacade executor threads), the executor shutdown is 
> blocked and eventually times out.
> Test BDBHAVirtualHostNodeRestTest.testIntruderProtection fails sporadically, 
> as indicated by the stack trace below:
> {noformat}
> junit.framework.AssertionFailedError: Attribute state did not reach expected 
> value within permitted timeout 5000ms. expected:<ERRORED> but was:<ACTIVE>
>       at junit.framework.Assert.fail(Assert.java:57)
>       at junit.framework.Assert.failNotEquals(Assert.java:329)
>       at junit.framework.Assert.assertEquals(Assert.java:78)
>       at junit.framework.TestCase.assertEquals(TestCase.java:244)
>       at 
> org.apache.qpid.systest.rest.QpidRestTestCase.waitForAttributeChanged(QpidRestTestCase.java:117)
>       at 
> org.apache.qpid.server.store.berkeleydb.replication.BDBHAVirtualHostNodeRestTest.testIntruderProtection(BDBHAVirtualHostNodeRestTest.java:311)
> {noformat}
> Log analysis showed that the issue occurs in the following scenario:
> * a 2-node cluster is created
> * an intruder node is connected
> * node1 is shut down by intruder protection
> * node2 intruder protection is triggered and a task to close VHN children is 
> scheduled in the Broker configuration thread. At the same time, a STATE event 
> is issued by JE on the transition from REPLICA to UNKNOWN (as the majority is 
> lost). The state change logic is invoked in the ReplicatedEnvironmentFacade 
> StateChange executor, which in turn performs the VH close in the VHN 
> configuration thread and blocks until the VH close is completed.
> * As a result, the VHN configuration thread is performing the VHN children 
> close caused by intruder protection; the StateChange executor thread is 
> waiting for completion of the VH close task, which is scheduled as a separate 
> task; and the Broker configuration thread is performing REF.close, waiting 
> for shutdown of the StateChange executor. When the task to close VHN children 
> completes, it schedules a task in the Broker configuration thread to close 
> the configuration store. The latter can only be performed after the intruder 
> protection logic has completed.
> * Thus, we have an effective deadlock, in which tasks block each other's 
> threads. It seems that REF.close in the intruder protection functionality is 
> not only redundant but harmful, as it causes the effective deadlock. The 
> deadlock resolves by timeout while waiting for a task executor shutdown.
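
The blocking pattern described above can be reproduced in isolation with plain java.util.concurrent executors. The following is only a minimal sketch under stated assumptions, not the broker code: the class, method, and variable names are hypothetical, and the three threads from the report (Broker configuration thread, VHN configuration thread, StateChange executor) are collapsed into two single-threaded executors for brevity. It shows a "configuration" thread shutting down a second executor and awaiting its termination while that executor's pending task is synchronously waiting on work queued back onto the same configuration thread.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class EffectiveDeadlockSketch {

    /**
     * Returns true if the stand-in "state change" executor terminated within
     * the timeout after being shut down from inside the "configuration" thread.
     */
    public static boolean closeFromConfigThread(long timeoutMillis) throws Exception {
        // Hypothetical stand-ins for the configuration thread and the
        // ReplicatedEnvironmentFacade StateChange executor.
        ExecutorService configThread = Executors.newSingleThreadExecutor();
        ExecutorService stateChangeExecutor = Executors.newSingleThreadExecutor();
        try {
            Future<Boolean> closeResult = configThread.submit(() -> {
                // Pending state-change work: it hands a task to the configuration
                // thread and blocks until that task has run.
                stateChangeExecutor.submit(() -> {
                    Future<?> vhClose = configThread.submit(() -> {
                        // placeholder for the virtual host close
                    });
                    vhClose.get(); // cannot finish while configThread is busy below
                    return null;
                });

                // Stand-in for REF.close(): shut the executor down and wait for it,
                // while still occupying the configuration thread.
                stateChangeExecutor.shutdown();
                return stateChangeExecutor.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
            });
            return closeResult.get();
        } finally {
            // Clean up whatever is still running or queued.
            stateChangeExecutor.shutdownNow();
            configThread.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("terminated in time: " + closeFromConfigThread(500));
    }
}
```

In this sketch the wait can only end by timeout, because the task the second executor is waiting for is queued behind the very task that is waiting for the executor to terminate. That mirrors the observation above that the deadlock resolves only when the executor shutdown wait times out.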



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
