[
https://issues.apache.org/jira/browse/GEODE-9000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17297734#comment-17297734
]
ASF GitHub Bot commented on GEODE-9000:
---------------------------------------
echobravopapa commented on a change in pull request #6100:
URL: https://github.com/apache/geode/pull/6100#discussion_r589799254
##########
File path:
geode-membership/src/main/java/org/apache/geode/distributed/internal/membership/gms/membership/GMSJoinLeave.java
##########
@@ -1456,12 +1456,18 @@ void processNetworkPartitionMessage(NetworkPartitionMessage<ID> msg) {
return;
}
ID sender = msg.getSender();
- if (getView().getMembers().contains(sender)) {
- String str = "Membership coordinator " + msg.getSender()
- + " has declared that a network partition has occurred";
- forceDisconnect(str);
+
+ if (getView() != null && isJoined) {
+ if (getView().getMembers().contains(sender)) {
+ String str = "Membership coordinator " + msg.getSender()
+ + " has declared that a network partition has occurred";
+ forceDisconnect(str);
+ } else {
+ logger.warn("Ignoring the network partition message from a non-member: " + msg.getSender());
+ }
} else {
- logger.warn("Ignoring the network partition message from a non-member: "
+ msg.getSender());
+ logger.info(
+ "Ignoring, likely this message was intended for the previous Membership service... ");
Review comment:
wilco
> NPE During Reconnect After Network Split
> ----------------------------------------
>
> Key: GEODE-9000
> URL: https://issues.apache.org/jira/browse/GEODE-9000
> Project: Geode
> Issue Type: Bug
> Components: membership
> Affects Versions: 1.14.0
> Reporter: Juan Ramos
> Assignee: Ernest Burghardt
> Priority: Major
> Labels: blocks-1.14.0, pull-request-available
>
> During a full network split in which every member is shut down by the
> partition, one of the servers continually fails to reconnect due to a
> {{NullPointerException}}. When using persistent regions, this also prevents
> the remaining members from starting up correctly, as they might be waiting
> for the stuck member to recover the latest data.
> The issue was introduced by the fix for GEODE-8901: no {{currentView}} is
> installed when the new implementation of
> {{GMSJoinLeave.processNetworkPartitionMessage}} runs during the reconnect
> phase ({{getView() == null}}), and the following is shown in the logs:
> {noformat}
> [fatal 2021/03/04 03:32:02.744 GMT gemfire-cluster-server-0 <ReconnectThread> tid=0x8a] Unexpected exception while booting membership services
> java.lang.NullPointerException
> at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processNetworkPartitionMessage(GMSJoinLeave.java:1459)
> at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1343)
> at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.started(JGroupsMessenger.java:428)
> at org.apache.geode.distributed.internal.membership.gms.Services.start(Services.java:210)
> at org.apache.geode.distributed.internal.membership.gms.GMSMembership.start(GMSMembership.java:1782)
> at org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:171)
> at org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:222)
> at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:464)
> at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:497)
> at org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:326)
> at org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:779)
> at org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135)
> at org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034)
> at org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290)
> at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605)
> at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2424)
> at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1275)
> at org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
> at org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1239)
> at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1951)
> at java.base/java.lang.Thread.run(Thread.java:834)
> [error 2021/03/04 03:32:02.747 GMT gemfire-cluster-server-0 <ReconnectThread> tid=0x8a] Unexpected problem starting up membership services
> java.lang.NullPointerException
> at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processNetworkPartitionMessage(GMSJoinLeave.java:1459)
> at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1343)
> at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger.started(JGroupsMessenger.java:428)
> at org.apache.geode.distributed.internal.membership.gms.Services.start(Services.java:210)
> at org.apache.geode.distributed.internal.membership.gms.GMSMembership.start(GMSMembership.java:1782)
> at org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:171)
> at org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:222)
> at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:464)
> at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:497)
> at org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:326)
> at org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:779)
> at org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135)
> at org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034)
> at org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290)
> at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605)
> at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2424)
> at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1275)
> at org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
> at org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1239)
> at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1951)
> at java.base/java.lang.Thread.run(Thread.java:834)
> [warn 2021/03/04 03:32:02.748 GMT gemfire-cluster-server-0 <ReconnectThread> tid=0x8a] Caught SystemConnectException in reconnect
> org.apache.geode.SystemConnectException: Problem starting up membership services: null. Consult log file for more details
> at org.apache.geode.distributed.internal.DistributionImpl.start(DistributionImpl.java:189)
> at org.apache.geode.distributed.internal.DistributionImpl.createDistribution(DistributionImpl.java:222)
> at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:464)
> at org.apache.geode.distributed.internal.ClusterDistributionManager.<init>(ClusterDistributionManager.java:497)
> at org.apache.geode.distributed.internal.ClusterDistributionManager.create(ClusterDistributionManager.java:326)
> at org.apache.geode.distributed.internal.InternalDistributedSystem.initialize(InternalDistributedSystem.java:779)
> at org.apache.geode.distributed.internal.InternalDistributedSystem.access$200(InternalDistributedSystem.java:135)
> at org.apache.geode.distributed.internal.InternalDistributedSystem$Builder.build(InternalDistributedSystem.java:3034)
> at org.apache.geode.distributed.internal.InternalDistributedSystem.connectInternal(InternalDistributedSystem.java:290)
> at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2605)
> at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2424)
> at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1275)
> at org.apache.geode.distributed.internal.ClusterDistributionManager$DMListener.membershipFailure(ClusterDistributionManager.java:2315)
> at org.apache.geode.distributed.internal.membership.gms.GMSMembership.uncleanShutdown(GMSMembership.java:1239)
> at org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.lambda$forceDisconnect$0(GMSMembership.java:1951)
> at java.base/java.lang.Thread.run(Thread.java:834)
> [info 2021/03/04 03:32:02.749 GMT gemfire-cluster-server-0 <ReconnectThread> tid=0x8a] Disconnecting old DistributedSystem to prepare for a reconnect attempt
> {noformat}
> The above keeps happening during further reconnect attempts and the server
> member can't re-join the distributed system.
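
For illustration only, the null-view condition described above and the guard proposed in the pull request can be modelled with a small, self-contained sketch. Everything here is a hypothetical stand-in rather than Geode code: the class `PartitionGuardSketch`, the `Outcome` enum, the `installView` helper, and the use of plain strings for member IDs are all assumptions made to keep the example runnable on its own.

```java
import java.util.List;

// Hypothetical, simplified stand-in for the membership handler; NOT the real GMSJoinLeave.
class PartitionGuardSketch {

  /** What the handler decided to do with a network partition message. */
  enum Outcome { DISCONNECT, IGNORED_NON_MEMBER, IGNORED_NOT_JOINED }

  // Stays null until a view has been installed, mirroring getView() == null during reconnect.
  private volatile List<String> view;
  private volatile boolean isJoined;

  /** Assumed helper: simulates joining by installing a membership view. */
  void installView(List<String> members) {
    this.view = members;
    this.isJoined = true;
  }

  Outcome processNetworkPartitionMessage(String sender) {
    // Read the view once so the null check and the lookup use the same reference.
    List<String> current = view;
    if (current != null && isJoined) {
      // Only a message from a current member is treated as a real partition declaration.
      return current.contains(sender) ? Outcome.DISCONNECT : Outcome.IGNORED_NON_MEMBER;
    }
    // No view yet: the message was likely meant for the previous membership service.
    return Outcome.IGNORED_NOT_JOINED;
  }

  public static void main(String[] args) {
    PartitionGuardSketch handler = new PartitionGuardSketch();
    // Before any view is installed, the message is ignored instead of throwing an NPE.
    System.out.println(handler.processNetworkPartitionMessage("coordinator"));
    handler.installView(List.of("coordinator", "server-1"));
    System.out.println(handler.processNetworkPartitionMessage("coordinator"));
    System.out.println(handler.processNetworkPartitionMessage("stranger"));
  }
}
```

Running `main` prints `IGNORED_NOT_JOINED`, `DISCONNECT`, and `IGNORED_NON_MEMBER` in that order. The sketch reads the view into a local variable before checking it, which is a design choice of this example and not something the diff itself does.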