[ 
https://issues.apache.org/jira/browse/QPID-6560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Rudyy updated QPID-6560:
-----------------------------
    Description: 
On intruder detection, a task to close the VHN children and set the VHN state 
to ERRORED is scheduled in the Broker configuration thread. Immediately after 
scheduling the task, ReplicatedEnvironmentFacade.close() is invoked.
The close method shuts down the ReplicatedEnvironmentFacade executors.
If any of these executors has pending work (tasks to run) that must be 
performed synchronously in the VHN configuration thread or the Broker 
configuration thread (blocking the executor's thread), the executor shutdown 
is blocked and eventually times out.

The test BDBHAVirtualHostNodeRestTest.testIntruderProtection fails sporadically, 
as indicated by the stack trace below:

{noformat}
junit.framework.AssertionFailedError: Attribute state did not reach expected value within permitted timeout 5000ms. expected:<ERRORED> but was:<ACTIVE>
        at junit.framework.Assert.fail(Assert.java:57)
        at junit.framework.Assert.failNotEquals(Assert.java:329)
        at junit.framework.Assert.assertEquals(Assert.java:78)
        at junit.framework.TestCase.assertEquals(TestCase.java:244)
        at org.apache.qpid.systest.rest.QpidRestTestCase.waitForAttributeChanged(QpidRestTestCase.java:117)
        at org.apache.qpid.server.store.berkeleydb.replication.BDBHAVirtualHostNodeRestTest.testIntruderProtection(BDBHAVirtualHostNodeRestTest.java:311)
{noformat}

Log analysis showed that the issue occurs in the following scenario:
* A 2-node cluster is created.
* An intruder node is connected.
* node1 is shut down by intruder protection.
* Intruder protection is triggered on node2, and a task to close the VHN 
children is scheduled in the Broker configuration thread. At the same time, JE 
issues a STATE event on the transition from REPLICA to UNKNOWN (as the majority 
is lost). The state change logic is invoked in the ReplicatedEnvironmentFacade 
StateChange executor, which in turn performs the VH close in the VHN 
configuration thread and blocks until the VH close is completed.
* As a result, the VHN configuration thread is performing the VHN children 
close caused by intruder protection; the StateChange executor thread is waiting 
for completion of the VH close task, which is scheduled as a separate task; and 
the Broker configuration thread is performing REF.close, waiting for shutdown 
of the StateChange executor. When the task to close the VHN children completes, 
it schedules a task in the Broker configuration thread to close the 
configuration store. The latter can only be performed after the intruder 
protection logic has completed.
* Thus, we have an effective deadlock, in which the tasks block each other's 
threads.

It seems that REF.close in the intruder protection functionality is not only 
redundant but harmful, as it causes the effective deadlock. The deadlock is 
resolved only when the wait for the task executor shutdown times out.
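
The blocking pattern can be reproduced in miniature with plain 
java.util.concurrent executors. What follows is a minimal sketch and not Qpid 
code: the class ShutdownStallSketch, the method closeTerminatesInTime, and the 
two single-thread executors standing in for a configuration thread and the 
StateChange executor are all assumptions made for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; none of these names exist in Qpid.
class ShutdownStallSketch {

    /**
     * Simulates a close() that shuts down a "state change" executor whose pending
     * task needs the "configuration thread" synchronously. Returns true if the
     * executor terminated within the timeout.
     */
    static boolean closeTerminatesInTime(boolean closeOnConfigThread, long timeoutMillis) {
        ExecutorService configThread = Executors.newSingleThreadExecutor();
        ExecutorService stateChangeExecutor = Executors.newSingleThreadExecutor();

        Callable<Boolean> close = () -> {
            // Pending state change work: hand a task to the configuration thread
            // and block until it completes (stand-in for the synchronous VH close).
            stateChangeExecutor.submit(() -> {
                configThread.submit(() -> { }).get();
                return null;
            });
            // Stand-in for the close method: shut the executor down and wait for it.
            stateChangeExecutor.shutdown();
            return stateChangeExecutor.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        };

        try {
            // When close runs on the configuration thread, the task queued above
            // cannot start until close returns, so the wait can only time out.
            return closeOnConfigThread ? configThread.submit(close).get() : close.call();
        } catch (Exception e) {
            throw new IllegalStateException("sketch failed unexpectedly", e);
        } finally {
            stateChangeExecutor.shutdownNow();
            configThread.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("close on config thread terminated in time: "
                + closeTerminatesInTime(true, 500));
        System.out.println("close on another thread terminated in time: "
                + closeTerminatesInTime(false, 5000));
    }
}
```

In this sketch the first call can only end by timeout, because the thread that 
waits for the executor is the same thread the executor's task is waiting on. 
The second call returns promptly because the configuration thread stays free 
to run the handed-over task.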


  was:
On intruder detection a task to close VHN children and set VHN state to ERRORED 
is scheduled in Broker configuration thread. Immediately after scheduling the 
task,  ReplicatedEnvironmentFacade.close() is invoked.
ReplicatedEnvironmentFacade executors are shutdown in close method.
If any of ReplicatedEnvironmentFacade executors has a pending work (tasks to 
run) and that work needs to be performed in VHN configuration thread or Broker 
configuration thread in synchronous manner (blocking 
ReplicatedEnvironmentFacade executors threads), the executors shutdown would be 
blocked and eventually times out.

Test BDBHAVirtualHostNodeRestTest.testIntruderProtection fails sporadically as 
indicated by stack trace below:

{noformat}
junit.framework.AssertionFailedError: Attribute state did not reach expected 
value within permitted timeout 5000ms. expected:<ERRORED> but was:<ACTIVE>
        at junit.framework.Assert.fail(Assert.java:57)
        at junit.framework.Assert.failNotEquals(Assert.java:329)
        at junit.framework.Assert.assertEquals(Assert.java:78)
        at junit.framework.TestCase.assertEquals(TestCase.java:244)
        at 
org.apache.qpid.systest.rest.QpidRestTestCase.waitForAttributeChanged(QpidRestTestCase.java:117)
        at 
org.apache.qpid.server.store.berkeleydb.replication.BDBHAVirtualHostNodeRestTest.testIntruderProtection(BDBHAVirtualHostNodeRestTest.java:311)
{noformat}

The log analysis showed that the issue occurs in the following scenario:
* 2-node cluster is created
* intruder node is connected
* node1 is shutdown by intruder protection
* node2 intruder protection is triggered and task to close VHN children  is 
scheduled in Broker configuration thread.  At the same time STATE event is 
issued by JE on transition from REPLICA into UNKNOWN (as majority is lost). The 
state change logic is invoked in the ReplicatedEnvironmentFacade StateShange 
executor which in turns caused VH close in VHN configuration thread and blocks 
until VH close is completed.
* As result, VHN configuration thread will be performing VHN children close 
caused by intruder protection, StateChange executor thread will be waiting for 
completion of VH close task which is scheduled as a separate task, Broker 
configuration thread will be performing REF.close waiting for shutdown of 
StateChange executor. When task to close VHN children is complete is schedules 
task in broker configuration thread to close configuration store. The latter 
can only be performed after intruder protection logic is completed.
* Thus, we have an effective dead lock, when tasks block each other threads. 

It seems that REF.close in intruder protection functionality is not only 
redundant but harmful as it causes the effective dead lock. The deadlock 
resolves by timeout on waiting for a task executor shutdown.



> [Java Broker] BDB HA JE environment close on intruder detection might block 
> the execution of VHN children tasks thus causing unnecessary delays in 
> shutdown of ReplicatedEnvironmentFacade executors
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: QPID-6560
>                 URL: https://issues.apache.org/jira/browse/QPID-6560
>             Project: Qpid
>          Issue Type: Bug
>    Affects Versions: 0.32
>            Reporter: Alex Rudyy
>            Assignee: Alex Rudyy
>             Fix For: 6.0 [Java]
>
>
> On intruder detection, a task to close the VHN children and set the VHN 
> state to ERRORED is scheduled in the Broker configuration thread. Immediately 
> after scheduling the task, ReplicatedEnvironmentFacade.close() is invoked.
> The close method shuts down the ReplicatedEnvironmentFacade executors.
> If any of these executors has pending work (tasks to run) that must be 
> performed synchronously in the VHN configuration thread or the Broker 
> configuration thread (blocking the executor's thread), the executor shutdown 
> is blocked and eventually times out.
> The test BDBHAVirtualHostNodeRestTest.testIntruderProtection fails 
> sporadically, as indicated by the stack trace below:
> {noformat}
> junit.framework.AssertionFailedError: Attribute state did not reach expected value within permitted timeout 5000ms. expected:<ERRORED> but was:<ACTIVE>
>       at junit.framework.Assert.fail(Assert.java:57)
>       at junit.framework.Assert.failNotEquals(Assert.java:329)
>       at junit.framework.Assert.assertEquals(Assert.java:78)
>       at junit.framework.TestCase.assertEquals(TestCase.java:244)
>       at org.apache.qpid.systest.rest.QpidRestTestCase.waitForAttributeChanged(QpidRestTestCase.java:117)
>       at org.apache.qpid.server.store.berkeleydb.replication.BDBHAVirtualHostNodeRestTest.testIntruderProtection(BDBHAVirtualHostNodeRestTest.java:311)
> {noformat}
> Log analysis showed that the issue occurs in the following scenario:
> * A 2-node cluster is created.
> * An intruder node is connected.
> * node1 is shut down by intruder protection.
> * Intruder protection is triggered on node2, and a task to close the VHN 
> children is scheduled in the Broker configuration thread. At the same time, 
> JE issues a STATE event on the transition from REPLICA to UNKNOWN (as the 
> majority is lost). The state change logic is invoked in the 
> ReplicatedEnvironmentFacade StateChange executor, which in turn performs the 
> VH close in the VHN configuration thread and blocks until the VH close is 
> completed.
> * As a result, the VHN configuration thread is performing the VHN children 
> close caused by intruder protection; the StateChange executor thread is 
> waiting for completion of the VH close task, which is scheduled as a separate 
> task; and the Broker configuration thread is performing REF.close, waiting 
> for shutdown of the StateChange executor. When the task to close the VHN 
> children completes, it schedules a task in the Broker configuration thread to 
> close the configuration store. The latter can only be performed after the 
> intruder protection logic has completed.
> * Thus, we have an effective deadlock, in which the tasks block each other's 
> threads.
> It seems that REF.close in the intruder protection functionality is not only 
> redundant but harmful, as it causes the effective deadlock. The deadlock is 
> resolved only when the wait for the task executor shutdown times out.


