[ 
https://issues.apache.org/jira/browse/GEODE-7031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Schuchardt resolved GEODE-7031.
-------------------------------------
    Resolution: Fixed

> Attempts to send messages to alert listeners delays network partition 
> detection
> -------------------------------------------------------------------------------
>
>                 Key: GEODE-7031
>                 URL: https://issues.apache.org/jira/browse/GEODE-7031
>             Project: Geode
>          Issue Type: Improvement
>          Components: membership
>            Reporter: Bruce Schuchardt
>            Assignee: Bruce Schuchardt
>            Priority: Major
>          Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> In a number of recent regression test runs in AWS, network partition 
> detection tests have failed to detect the partition in a reasonable amount 
> of time.  Logs show membership services attempting to send alerts to other 
> processes that are no longer reachable.  Each attempt takes 6 * the 
> member-timeout setting, or 30 seconds per attempt.  It would be nice to 
> have a separate connection-formation timeout for alerts, since alert 
> notification is built into the logging system that membership services 
> have to use.  Because the alert system in turn depends on membership 
> services functioning properly, this creates a circular dependency that has 
> historically caused hangs and delays such as the one described here.
> {noformat}
> [debug 2019/07/29 14:35:03.824 PDT <Geode Failure Detection thread 5> 
> tid=0xc3] Sending (Alert "Unable to send message to 
> 10.32.108.136(gemfire3_host2_12249:12249)<v3>:41003" level WARNING) to 1 
> peers ([10.32.108.136(gemfire4_host2_12220:12220:locator)<ec><v1>:41001]) via 
> tcp/ip
> [debug 2019/07/29 14:35:03.825 PDT <Geode Failure Detection thread 5> 
> tid=0xc3] created PendingConnection 
> org.apache.geode.internal.tcp.ConnectionTable$PendingConnection@4f4c8630 
> created by Geode Failure Detection thread 5
> [info 2019/07/29 14:35:33.847 PDT <Geode Failure Detection thread 5> 
> tid=0xc3] Connection: shared=true ordered=true failed to connect to peer 
> 10.32.108.136(gemfire4_host2_12220:12220:locator)<ec><v1>:41001 because: 
> java.net.SocketTimeoutException
> [debug 2019/07/29 14:35:33.852 PDT <Geode Failure Detection thread 5> 
> tid=0xc3] Giving up connecting to alert listener 
> 10.32.108.136(gemfire4_host2_12220:12220:locator)<ec><v1>:41001{noformat}
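
The timeout arithmetic in the description, and the idea of a separate connection-formation timeout for alerts, can be sketched roughly as follows. This is only an illustration and not Geode's code or the fix that was applied: the class name, the method names, and the choice of a single member-timeout for alert connections are all hypothetical. The factor of 6 comes from the description above, and the 30-second figure implies a member-timeout of 5000 ms.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical sketch; none of these names exist in Geode.
class AlertConnectSketch {
    // Factor taken from the issue description: each attempt takes 6 * member-timeout.
    static final int CONNECT_TIMEOUT_MULTIPLIER = 6;

    // Connection-formation timeout as described in the issue.
    static long defaultConnectTimeoutMillis(long memberTimeoutMillis) {
        return CONNECT_TIMEOUT_MULTIPLIER * memberTimeoutMillis;
    }

    // Assumed alternative for alert delivery: wait only one member-timeout,
    // so an unreachable alert listener delays failure detection far less.
    static long alertConnectTimeoutMillis(long memberTimeoutMillis) {
        return memberTimeoutMillis;
    }

    // Tries to open a TCP connection within the given timeout.
    // Returns false on timeout or any other I/O failure instead of blocking further.
    static boolean tryConnect(InetSocketAddress peer, long timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(peer, (int) timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

With a member-timeout of 5000 ms, the first method yields the 30000 ms seen between the "created PendingConnection" and "failed to connect" log entries above, while the assumed alert-specific timeout would give up after 5000 ms.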



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
