[ https://issues.apache.org/jira/browse/GEODE-7152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kirk Lund updated GEODE-7152:
-----------------------------
Description:
AlertAppender uses a ThreadLocal to prevent recursive calls from actually doing anything. However, a recent upgrade to our log4j dependencies seems to have changed the behavior such that log4j refuses to invoke {{doAppend}} if the thread is currently handling a {{sendAlert}} initiated from {{doAppend}}. To fix this bug, we will need to change {{sendAlert}} to be asynchronous.

In this run we are expecting ForcedDisconnects, but they don't occur until after the hang is declared. The appearance of these "Recursive call to appender" messages is new:

{noformat}
wan_bridgeNetworkPartition1-0812-213411/bgexec15919_14066.log: 2019-08-12 21:37:47,554 Geode Failure Detection thread 6 ERROR Recursive call to appender ALERT
{noformat}

\*** Test is dropping the network

{noformat}
wan_bridgeNetworkPartition1-0812-213411/vm_9_locator_1_2_host2_13435.log: [info 2019/08/12 21:37:46.265 PDT <vm_9_thr_13_locator_1_2_host2_13435> tid=0x94] Dropping network connection from rs-FullRegression13040513a0i3large-hydra-client-46 to rs-FullRegression13040513a0i3large-hydra-client-1 and from rs-FullRegression13040513a0i3large-hydra-client-1 to rs-FullRegression13040513a0i3large-hydra-client-46
{noformat}

\*** Here are the appender messages

{noformat}
wan_bridgeNetworkPartition1-0812-213411/bgexec15919_14066.log: 2019-08-12 21:37:47,554 Geode Failure Detection thread 6 ERROR Recursive call to appender ALERT
wan_bridgeNetworkPartition1-0812-213411/bgexec15919_14066.log: 2019-08-12 21:37:47,554 Geode Failure Detection thread 4 ERROR Recursive call to appender ALERT
wan_bridgeNetworkPartition1-0812-213411/bgexec15919_14066.log: 2019-08-12 21:37:47,554 Geode Failure Detection thread 5 ERROR Recursive call to appender ALERT
wan_bridgeNetworkPartition1-0812-213411/bgexec15919_14066.log: 2019-08-12 21:37:47,556 Geode Failure Detection thread 5 ERROR Recursive call to appender ALERT
wan_bridgeNetworkPartition1-0812-213411/bgexec15919_14066.log: 2019-08-12 21:37:47,558 Geode Failure Detection thread 4 ERROR Recursive call to appender ALERT
wan_bridgeNetworkPartition1-0812-213411/bgexec15919_14066.log: 2019-08-12 21:37:47,559 Geode Failure Detection thread 4 ERROR Recursive call to appender ALERT
{noformat}

\*** network is dropped

{noformat}
wan_bridgeNetworkPartition1-0812-213411/vm_9_locator_1_2_host2_13435.log: [info 2019/08/12 21:37:47.136 PDT <vm_9_thr_13_locator_1_2_host2_13435> tid=0x94] Dropped network connection from rs-FullRegression13040513a0i3large-hydra-client-46 to rs-FullRegression13040513a0i3large-hydra-client-1 and from rs-FullRegression13040513a0i3large-hydra-client-1 to rs-FullRegression13040513a0i3large-hydra-client-46
{noformat}

\*** hang is declared

{noformat}
wan_bridgeNetworkPartition1-0812-213411/taskmaster_15326.log: [severe 2019/08/12 21:40:44.626 PDT <master_15326> tid=0x1] Result for vm_12_thr_16_wan1Lose_host1_14084: TASK[0] splitBrain.NetworkPartitionTest.HydraTask_doEntryOperations: HANG a client exceeded max result wait sec: 180
{noformat}

\*** now we see ForcedDisconnects

{noformat}
[fatal 2019/08/12 21:40:57.826 PDT <unicast receiver,rs-FullRegression13040513a0i3large-hydra-client-46-43284> tid=0x24] Membership service failure: Membership coordinator 10.32.110.145(gemfire3_host1_14066:14066:locator)<ec><v2>:41000 has declared that a network partition has occurred
org.apache.geode.ForcedDisconnectException: Membership coordinator 10.32.110.145(gemfire3_host1_14066:14066:locator)<ec><v2>:41000 has declared that a network partition has occurred
    at org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.forceDisconnect(GMSMembershipManager.java:2506)
    at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.forceDisconnect(GMSJoinLeave.java:1106)
    at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processMessage(GMSJoinLeave.java:1481)
    at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1328)
    at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1266)
    at org.jgroups.JChannel.invokeCallback(JChannel.java:816)
    at org.jgroups.JChannel.up(JChannel.java:741)
    at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1030)
    at org.jgroups.protocols.FRAG2.up(FRAG2.java:165)
    at org.jgroups.protocols.FlowControl.up(FlowControl.java:390)
    at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1077)
    at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:792)
    at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:433)
    at org.apache.geode.distributed.internal.membership.gms.messenger.StatRecorder.up(StatRecorder.java:73)
    at org.apache.geode.distributed.internal.membership.gms.messenger.AddressManager.up(AddressManager.java:72)
    at org.jgroups.protocols.TP.passMessageUp(TP.java:1658)
    at org.jgroups.protocols.TP$SingleMessageHandler.run(TP.java:1876)
    at org.jgroups.util.DirectExecutor.execute(DirectExecutor.java:10)
    at org.jgroups.protocols.TP.handleSingleMessage(TP.java:1789)
    at org.jgroups.protocols.TP.receive(TP.java:1714)
    at org.apache.geode.distributed.internal.membership.gms.messenger.Transport.receive(Transport.java:152)
    at org.jgroups.protocols.UDP$PacketReceiver.run(UDP.java:701)
    at java.lang.Thread.run(Thread.java:748)
{noformat}

was:
AlertAppender uses a ThreadLocal to prevent recursive calls from actually doing anything. However, a recent upgrade to our log4j dependencies seems to have changed the behavior such that log4j refuses to invoke {{doAppend}} if the thread is currently handling a {{sendAlert}} initiated from {{doAppend}}. To fix this bug, we will need to remove the use of our ThreadLocal from {{doAppend}} and change {{sendAlert}} to be asynchronous.
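For context, the two approaches contrasted above can be sketched in a few lines: the existing ThreadLocal guard that turns a recursive {{doAppend}} into a no-op, and the proposed fix of handing the alert off to a background thread so the logging thread never re-enters the appender. This is an illustrative sketch only, not Geode's actual AlertAppender code; every name besides {{doAppend}} and {{sendAlert}} is made up.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the patterns discussed in GEODE-7152.
public class AlertAppenderSketch {

  // Old approach: a per-thread guard so that a recursive doAppend
  // (e.g. triggered by logging inside sendAlert) does nothing.
  private static final ThreadLocal<Boolean> alerting =
      ThreadLocal.withInitial(() -> Boolean.FALSE);

  // Proposed approach: deliver alerts on a dedicated background thread,
  // so sendAlert never runs on the thread that is inside the appender.
  private final ExecutorService alertExecutor =
      Executors.newSingleThreadExecutor();

  // Visible counter so callers can observe delivered alerts (sketch only).
  public final AtomicInteger alertsSent = new AtomicInteger();

  public void doAppend(String event) {
    if (alerting.get()) {
      return; // old guard: swallow the recursive call
    }
    alerting.set(Boolean.TRUE);
    try {
      sendAlert(event);
    } finally {
      alerting.set(Boolean.FALSE);
    }
  }

  // Asynchronous sendAlert: submit the work instead of doing it inline.
  private void sendAlert(String event) {
    alertExecutor.submit(() -> alertsSent.incrementAndGet());
  }

  // Drain the executor so pending alerts are delivered before exit.
  public void shutdown() throws InterruptedException {
    alertExecutor.shutdown();
    alertExecutor.awaitTermination(5, TimeUnit.SECONDS);
  }
}
```

The key property is that any logging done while delivering the alert happens on the executor thread, so log4j's own recursion detection on the appending thread is never tripped.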
> AlertAppender recursion causes ForcedDisconnect to hang
> -------------------------------------------------------
>
>                 Key: GEODE-7152
>                 URL: https://issues.apache.org/jira/browse/GEODE-7152
>             Project: Geode
>          Issue Type: Bug
>          Components: management
>            Reporter: Kirk Lund
>            Assignee: Kirk Lund
>            Priority: Major
>
> AlertAppender uses a ThreadLocal to prevent recursive calls from actually
> doing anything. However, a recent upgrade to our log4j dependencies seems to
> have changed the behavior such that log4j refuses to invoke {{doAppend}} if
> the thread is currently handling a {{sendAlert}} initiated from {{doAppend}}.
> To fix this bug, we will need to change {{sendAlert}} to be asynchronous.
> In this run we are expecting ForcedDisconnects, but they don't occur until
> after the hang is declared.
--
This message was sent by Atlassian Jira
(v8.3.2#803003)