[ https://issues.apache.org/jira/browse/GEODE-7152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kirk Lund updated GEODE-7152:
-----------------------------
    Description: 
AlertAppender uses a ThreadLocal to prevent recursive calls from actually doing 
anything. However, a recent upgrade to our log4j dependencies seems to have 
changed the behavior such that log4j refuses to invoke {{doAppend}} if the 
thread is currently handling a {{sendAlert}} initiated from {{doAppend}}. To 
fix this bug, we will need to change {{sendAlert}} to be asynchronous.

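For context, the ThreadLocal guard described above looks roughly like the 
following sketch (hypothetical names; not the actual AlertAppender source). 
With the upgraded log4j, the nested logging call triggered inside 
{{sendAlert}} appears to be rejected by log4j itself before it ever reaches 
this guard, which is where the "Recursive call to appender ALERT" errors 
shown further down come from.
{code:java}
// Simplified sketch of a ThreadLocal recursion guard in an alert appender.
// Hypothetical names; not the actual AlertAppender source.
import org.apache.logging.log4j.core.LogEvent;
import org.apache.logging.log4j.core.appender.AbstractAppender;
import org.apache.logging.log4j.core.config.Property;

public class GuardedAlertAppender extends AbstractAppender {

  private final ThreadLocal<Boolean> sendingAlert =
      ThreadLocal.withInitial(() -> Boolean.FALSE);

  protected GuardedAlertAppender(String name) {
    super(name, null, null, true, Property.EMPTY_ARRAY);
  }

  @Override
  public void append(LogEvent event) {
    if (sendingAlert.get()) {
      // Re-entrant call caused by logging inside sendAlert: do nothing.
      return;
    }
    sendingAlert.set(Boolean.TRUE);
    try {
      // sendAlert may itself log, which re-enters append() on this thread.
      sendAlert(event);
    } finally {
      sendingAlert.set(Boolean.FALSE);
    }
  }

  private void sendAlert(LogEvent event) {
    // Deliver the alert to any registered listeners; may log on failure.
  }
}
{code}
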
In this run we are expecting ForcedDisconnects, but they don't occur until 
after the hang is declared.

The appearance of these "Recursive call to appender" messages is new:
{noformat}
wan_bridgeNetworkPartition1-0812-213411/bgexec15919_14066.log: 2019-08-12 
21:37:47,554 Geode Failure Detection thread 6 ERROR Recursive call to appender 
ALERT
{noformat}
\*** Test is dropping the network
{noformat}
wan_bridgeNetworkPartition1-0812-213411/vm_9_locator_1_2_host2_13435.log: [info 
2019/08/12 21:37:46.265 PDT <vm_9_thr_13_locator_1_2_host2_13435> tid=0x94] 
Dropping network connection from 
rs-FullRegression13040513a0i3large-hydra-client-46 to 
rs-FullRegression13040513a0i3large-hydra-client-1 and from 
rs-FullRegression13040513a0i3large-hydra-client-1 to 
rs-FullRegression13040513a0i3large-hydra-client-46
{noformat}
\*** Here are the appender messages
{noformat}
wan_bridgeNetworkPartition1-0812-213411/bgexec15919_14066.log: 2019-08-12 
21:37:47,554 Geode Failure Detection thread 6 ERROR Recursive call to appender 
ALERT

wan_bridgeNetworkPartition1-0812-213411/bgexec15919_14066.log: 2019-08-12 
21:37:47,554 Geode Failure Detection thread 4 ERROR Recursive call to appender 
ALERT

wan_bridgeNetworkPartition1-0812-213411/bgexec15919_14066.log: 2019-08-12 
21:37:47,554 Geode Failure Detection thread 5 ERROR Recursive call to appender 
ALERT

wan_bridgeNetworkPartition1-0812-213411/bgexec15919_14066.log: 2019-08-12 
21:37:47,556 Geode Failure Detection thread 5 ERROR Recursive call to appender 
ALERT

wan_bridgeNetworkPartition1-0812-213411/bgexec15919_14066.log: 2019-08-12 
21:37:47,558 Geode Failure Detection thread 4 ERROR Recursive call to appender 
ALERT

wan_bridgeNetworkPartition1-0812-213411/bgexec15919_14066.log: 2019-08-12 
21:37:47,559 Geode Failure Detection thread 4 ERROR Recursive call to appender 
ALERT
{noformat}
\*** network is dropped
{noformat}
wan_bridgeNetworkPartition1-0812-213411/vm_9_locator_1_2_host2_13435.log: [info 
2019/08/12 21:37:47.136 PDT <vm_9_thr_13_locator_1_2_host2_13435> tid=0x94] 
Dropped network connection from 
rs-FullRegression13040513a0i3large-hydra-client-46 to 
rs-FullRegression13040513a0i3large-hydra-client-1 and from 
rs-FullRegression13040513a0i3large-hydra-client-1 to 
rs-FullRegression13040513a0i3large-hydra-client-46
{noformat}
\*** hang is declared
{noformat}
wan_bridgeNetworkPartition1-0812-213411/taskmaster_15326.log: [severe 
2019/08/12 21:40:44.626 PDT <master_15326> tid=0x1] Result for 
vm_12_thr_16_wan1Lose_host1_14084: TASK[0] 
splitBrain.NetworkPartitionTest.HydraTask_doEntryOperations: HANG a client 
exceeded max result wait sec: 180
{noformat}
\*** now we see ForcedDisconnects
{noformat}
[fatal 2019/08/12 21:40:57.826 PDT <unicast 
receiver,rs-FullRegression13040513a0i3large-hydra-client-46-43284> tid=0x24] 
Membership service failure: Membership coordinator 
10.32.110.145(gemfire3_host1_14066:14066:locator)<ec><v2>:41000 has declared 
that a network partition has occurred
org.apache.geode.ForcedDisconnectException: Membership coordinator 
10.32.110.145(gemfire3_host1_14066:14066:locator)<ec><v2>:41000 has declared 
that a network partition has occurred
        at 
org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.forceDisconnect(GMSMembershipManager.java:2506)
        at 
org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.forceDisconnect(GMSJoinLeave.java:1106)
        at 
org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processMessage(GMSJoinLeave.java:1481)
        at 
org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1328)
        at 
org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1266)
        at org.jgroups.JChannel.invokeCallback(JChannel.java:816)
        at org.jgroups.JChannel.up(JChannel.java:741)
        at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1030)
        at org.jgroups.protocols.FRAG2.up(FRAG2.java:165)
        at org.jgroups.protocols.FlowControl.up(FlowControl.java:390)
        at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1077)
        at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:792)
        at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:433)
        at 
org.apache.geode.distributed.internal.membership.gms.messenger.StatRecorder.up(StatRecorder.java:73)
        at 
org.apache.geode.distributed.internal.membership.gms.messenger.AddressManager.up(AddressManager.java:72)
        at org.jgroups.protocols.TP.passMessageUp(TP.java:1658)
        at org.jgroups.protocols.TP$SingleMessageHandler.run(TP.java:1876)
        at org.jgroups.util.DirectExecutor.execute(DirectExecutor.java:10)
        at org.jgroups.protocols.TP.handleSingleMessage(TP.java:1789)
        at org.jgroups.protocols.TP.receive(TP.java:1714)
        at 
org.apache.geode.distributed.internal.membership.gms.messenger.Transport.receive(Transport.java:152)
        at org.jgroups.protocols.UDP$PacketReceiver.run(UDP.java:701)
        at java.lang.Thread.run(Thread.java:748)
{noformat}
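
A minimal sketch of the asynchronous {{sendAlert}} proposed above (hypothetical 
names and executor choice, assuming a log4j2 {{AbstractAppender}}; not the 
actual Geode change). The appender copies the event and hands alert delivery to 
a dedicated thread, so {{append}} returns without logging and the calling 
thread never re-enters the appender:
{code:java}
// Hypothetical sketch: alert delivery moved off the logging thread so that
// append() never logs, directly or indirectly, on the calling thread.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.logging.log4j.core.LogEvent;
import org.apache.logging.log4j.core.appender.AbstractAppender;
import org.apache.logging.log4j.core.config.Property;

public class AsyncAlertAppender extends AbstractAppender {

  // Single daemon thread that delivers alerts asynchronously.
  private final ExecutorService alertExecutor =
      Executors.newSingleThreadExecutor(runnable -> {
        Thread thread = new Thread(runnable, "AlertDispatcher");
        thread.setDaemon(true);
        return thread;
      });

  protected AsyncAlertAppender(String name) {
    super(name, null, null, true, Property.EMPTY_ARRAY);
  }

  @Override
  public void append(LogEvent event) {
    // The event passed to append() may be mutable and reused; copy it first.
    LogEvent immutableEvent = event.toImmutable();
    // Hand off and return immediately: any logging performed while sending
    // the alert happens on the dispatcher thread, so log4j never sees a
    // recursive call to this appender from the originating thread.
    alertExecutor.execute(() -> sendAlert(immutableEvent));
  }

  @Override
  public boolean stop(long timeout, TimeUnit timeUnit) {
    // Stop accepting new alerts and let in-flight deliveries finish.
    alertExecutor.shutdown();
    return super.stop(timeout, timeUnit);
  }

  private void sendAlert(LogEvent event) {
    // Deliver the alert to any registered listeners; may log safely here.
  }
}
{code}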

  was:
AlertAppender uses a ThreadLocal to prevent recursive calls from actually doing 
anything. However, a recent upgrade to our log4j dependencies seems to have 
changed the behavior such that log4j refuses to invoke {{doAppend}} if the 
thread is currently handling a {{sendAlert}} initiated from {{doAppend}}. To 
fix this bug, we will need to remove the use of our ThreadLocal from 
{{doAppend}} and change {{sendAlert}} to be asynchronous.

In this run we are expecting ForcedDisconnects, but they don't occur until 
after the hang is declared.

The appearance of these "Recursive call to appender" messages is new:
{noformat}
wan_bridgeNetworkPartition1-0812-213411/bgexec15919_14066.log: 2019-08-12 
21:37:47,554 Geode Failure Detection thread 6 ERROR Recursive call to appender 
ALERT
{noformat}
\*** Test is dropping the network
{noformat}
wan_bridgeNetworkPartition1-0812-213411/vm_9_locator_1_2_host2_13435.log: [info 
2019/08/12 21:37:46.265 PDT <vm_9_thr_13_locator_1_2_host2_13435> tid=0x94] 
Dropping network connection from 
rs-FullRegression13040513a0i3large-hydra-client-46 to 
rs-FullRegression13040513a0i3large-hydra-client-1 and from 
rs-FullRegression13040513a0i3large-hydra-client-1 to 
rs-FullRegression13040513a0i3large-hydra-client-46
{noformat}
\*** Here are the appender messages
{noformat}
wan_bridgeNetworkPartition1-0812-213411/bgexec15919_14066.log: 2019-08-12 
21:37:47,554 Geode Failure Detection thread 6 ERROR Recursive call to appender 
ALERT

wan_bridgeNetworkPartition1-0812-213411/bgexec15919_14066.log: 2019-08-12 
21:37:47,554 Geode Failure Detection thread 4 ERROR Recursive call to appender 
ALERT

wan_bridgeNetworkPartition1-0812-213411/bgexec15919_14066.log: 2019-08-12 
21:37:47,554 Geode Failure Detection thread 5 ERROR Recursive call to appender 
ALERT

wan_bridgeNetworkPartition1-0812-213411/bgexec15919_14066.log: 2019-08-12 
21:37:47,556 Geode Failure Detection thread 5 ERROR Recursive call to appender 
ALERT

wan_bridgeNetworkPartition1-0812-213411/bgexec15919_14066.log: 2019-08-12 
21:37:47,558 Geode Failure Detection thread 4 ERROR Recursive call to appender 
ALERT

wan_bridgeNetworkPartition1-0812-213411/bgexec15919_14066.log: 2019-08-12 
21:37:47,559 Geode Failure Detection thread 4 ERROR Recursive call to appender 
ALERT
{noformat}
\*** network is dropped
{noformat}
wan_bridgeNetworkPartition1-0812-213411/vm_9_locator_1_2_host2_13435.log: [info 
2019/08/12 21:37:47.136 PDT <vm_9_thr_13_locator_1_2_host2_13435> tid=0x94] 
Dropped network connection from 
rs-FullRegression13040513a0i3large-hydra-client-46 to 
rs-FullRegression13040513a0i3large-hydra-client-1 and from 
rs-FullRegression13040513a0i3large-hydra-client-1 to 
rs-FullRegression13040513a0i3large-hydra-client-46
{noformat}
\*** hang is declared
{noformat}
wan_bridgeNetworkPartition1-0812-213411/taskmaster_15326.log: [severe 
2019/08/12 21:40:44.626 PDT <master_15326> tid=0x1] Result for 
vm_12_thr_16_wan1Lose_host1_14084: TASK[0] 
splitBrain.NetworkPartitionTest.HydraTask_doEntryOperations: HANG a client 
exceeded max result wait sec: 180
{noformat}
\*** now we see ForcedDisconnects
{noformat}
[fatal 2019/08/12 21:40:57.826 PDT <unicast 
receiver,rs-FullRegression13040513a0i3large-hydra-client-46-43284> tid=0x24] 
Membership service failure: Membership coordinator 
10.32.110.145(gemfire3_host1_14066:14066:locator)<ec><v2>:41000 has declared 
that a network partition has occurred
org.apache.geode.ForcedDisconnectException: Membership coordinator 
10.32.110.145(gemfire3_host1_14066:14066:locator)<ec><v2>:41000 has declared 
that a network partition has occurred
        at 
org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.forceDisconnect(GMSMembershipManager.java:2506)
        at 
org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.forceDisconnect(GMSJoinLeave.java:1106)
        at 
org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processMessage(GMSJoinLeave.java:1481)
        at 
org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1328)
        at 
org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1266)
        at org.jgroups.JChannel.invokeCallback(JChannel.java:816)
        at org.jgroups.JChannel.up(JChannel.java:741)
        at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1030)
        at org.jgroups.protocols.FRAG2.up(FRAG2.java:165)
        at org.jgroups.protocols.FlowControl.up(FlowControl.java:390)
        at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1077)
        at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:792)
        at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:433)
        at 
org.apache.geode.distributed.internal.membership.gms.messenger.StatRecorder.up(StatRecorder.java:73)
        at 
org.apache.geode.distributed.internal.membership.gms.messenger.AddressManager.up(AddressManager.java:72)
        at org.jgroups.protocols.TP.passMessageUp(TP.java:1658)
        at org.jgroups.protocols.TP$SingleMessageHandler.run(TP.java:1876)
        at org.jgroups.util.DirectExecutor.execute(DirectExecutor.java:10)
        at org.jgroups.protocols.TP.handleSingleMessage(TP.java:1789)
        at org.jgroups.protocols.TP.receive(TP.java:1714)
        at 
org.apache.geode.distributed.internal.membership.gms.messenger.Transport.receive(Transport.java:152)
        at org.jgroups.protocols.UDP$PacketReceiver.run(UDP.java:701)
        at java.lang.Thread.run(Thread.java:748)
{noformat}


> AlertAppender recursion causes ForcedDisconnect to hang
> -------------------------------------------------------
>
>                 Key: GEODE-7152
>                 URL: https://issues.apache.org/jira/browse/GEODE-7152
>             Project: Geode
>          Issue Type: Bug
>          Components: management
>            Reporter: Kirk Lund
>            Assignee: Kirk Lund
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
