Bruce Schuchardt created GEODE-7031:
---------------------------------------

             Summary: Attempts to send messages to alert listeners delays 
network partition detection
                 Key: GEODE-7031
                 URL: https://issues.apache.org/jira/browse/GEODE-7031
             Project: Geode
          Issue Type: Improvement
          Components: membership
            Reporter: Bruce Schuchardt


In a number of recent regression test runs in AWS, network partition detection 
tests have failed to detect the partition in a reasonable amount of time.  Logs 
show membership services attempting to send alerts to other processes that are 
no longer reachable.  Each attempt takes 6 * the member-timeout setting, which 
works out to 30 seconds per attempt at a member-timeout of 5 seconds.  It would 
be nice to have a different connection-formation timeout for alert messages, 
since alert notification is built into the logging system that membership 
services have to use.  The alert system in turn depends on membership services 
functioning properly, so this creates a circular dependency that has 
historically caused hangs and delays such as the one in the log excerpt below.
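
To make the arithmetic concrete, here is a minimal sketch of the timeout calculation and of what a separate, shorter bound for alert connections might look like.  It is not Geode code: the class and method names are hypothetical, the 5000 ms value is the member-timeout implied by the 30-second figure above, and using 1 * member-timeout for alerts is only an assumed policy for illustration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.time.Duration;

// Hypothetical illustration only; none of these names exist in Geode.
public class AlertConnectTimeoutSketch {
    // member-timeout implied by the issue text (30 s / 6).
    static final Duration MEMBER_TIMEOUT = Duration.ofMillis(5000);
    // Multiplier reported in the issue for connection formation.
    static final int CONNECT_MULTIPLIER = 6;

    // Current behavior as described: 6 * member-timeout per attempt.
    static Duration connectTimeout(Duration memberTimeout) {
        return memberTimeout.multipliedBy(CONNECT_MULTIPLIER);
    }

    // Assumed policy: alert connections get a single member-timeout.
    static Duration alertConnectTimeout(Duration memberTimeout) {
        return memberTimeout;
    }

    // Attempts a TCP connect that gives up after the supplied timeout.
    static boolean tryConnect(InetSocketAddress peer, Duration timeout) {
        try (Socket socket = new Socket()) {
            socket.connect(peer, (int) timeout.toMillis());
            return true;
        } catch (IOException e) {
            // Includes SocketTimeoutException when the peer is unreachable.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("regular: " + connectTimeout(MEMBER_TIMEOUT).getSeconds() + " s");
        System.out.println("alert:   " + alertConnectTimeout(MEMBER_TIMEOUT).getSeconds() + " s");
    }
}
```

Under these assumptions, an unreachable alert listener would block the calling thread for about 5 seconds rather than 30.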
{noformat}
[debug 2019/07/29 14:35:03.824 PDT <Geode Failure Detection thread 5> tid=0xc3] Sending (Alert "Unable to send message to 10.32.108.136(gemfire3_host2_12249:12249)<v3>:41003" level WARNING) to 1 peers ([10.32.108.136(gemfire4_host2_12220:12220:locator)<ec><v1>:41001]) via tcp/ip

[debug 2019/07/29 14:35:03.825 PDT <Geode Failure Detection thread 5> tid=0xc3] created PendingConnection org.apache.geode.internal.tcp.ConnectionTable$PendingConnection@4f4c8630 created by Geode Failure Detection thread 5

[info 2019/07/29 14:35:33.847 PDT <Geode Failure Detection thread 5> tid=0xc3] Connection: shared=true ordered=true failed to connect to peer 10.32.108.136(gemfire4_host2_12220:12220:locator)<ec><v1>:41001 because: java.net.SocketTimeoutException

[debug 2019/07/29 14:35:33.852 PDT <Geode Failure Detection thread 5> tid=0xc3] Giving up connecting to alert listener 10.32.108.136(gemfire4_host2_12220:12220:locator)<ec><v1>:41001
{noformat}
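
As a cross-check of the 30-second figure, the gap between the "created PendingConnection" entry and the "failed to connect" entry can be computed from the timestamps in the log above.  This is a small sketch with a hypothetical class name; it assumes both timestamps are in the same time zone, so the PDT suffix is left out of the parse.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for reading the log excerpt; not part of Geode.
public class AlertAttemptElapsed {
    // Matches the date/time portion of the log prefix, e.g. 2019/07/29 14:35:03.825
    private static final DateTimeFormatter LOG_TIME =
        DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm:ss.SSS");

    // Returns the milliseconds between two log timestamps.
    static long elapsedMillis(String start, String end) {
        LocalDateTime s = LocalDateTime.parse(start, LOG_TIME);
        LocalDateTime e = LocalDateTime.parse(end, LOG_TIME);
        return Duration.between(s, e).toMillis();
    }

    public static void main(String[] args) {
        long ms = elapsedMillis("2019/07/29 14:35:03.825", "2019/07/29 14:35:33.847");
        System.out.println(ms + " ms"); // prints "30022 ms"
    }
}
```

The failure detection thread was blocked for just over 30 seconds on a single alert, which matches 6 * member-timeout.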
