[ 
https://issues.apache.org/jira/browse/GEODE-7031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899131#comment-16899131
 ] 

ASF subversion and git services commented on GEODE-7031:
--------------------------------------------------------

Commit a10af1ba201161c8cf3f8003a12c187728e2874e in geode's branch 
refs/heads/develop from Bruce Schuchardt
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=a10af1b ]

GEODE-7031 Attempts to send messages to alert listeners delays network 
partition detection

Decrease the socket-formation timeout for alert listeners.  Normally
we already have a connection to an alert listener, so the decreased
timeout is not used.  During network problems, though, we often have
to create a new TCP/IP connection to send an alert, and we don't want
these connection attempts to stall for too long.
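
For illustration only, the idea behind this change can be sketched in Java
as follows.  This is not the code from the commit: the class and method names
are hypothetical, the multiplier of 6 is taken from the issue description
quoted below, and the reduced multiplier of 1 is an assumed value, since the
commit message does not state the new timeout.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical sketch; these are not Geode's actual classes or constants. */
final class AlertConnectTimeoutSketch {
  // From the issue description: a connection attempt takes 6 * member-timeout.
  static final int NORMAL_MULTIPLIER = 6;
  // Assumption for illustration: the commit does not state the reduced value.
  static final int ALERT_MULTIPLIER = 1;

  /** Picks the connect timeout, using the shorter one for alert listeners. */
  static int connectTimeoutMillis(int memberTimeoutMillis, boolean forAlertListener) {
    int multiplier = forAlertListener ? ALERT_MULTIPLIER : NORMAL_MULTIPLIER;
    return Math.multiplyExact(memberTimeoutMillis, multiplier);
  }

  /** Opens a TCP connection, giving up after the selected timeout. */
  static Socket connect(InetSocketAddress peer, int memberTimeoutMillis,
      boolean forAlertListener) throws IOException {
    Socket socket = new Socket();
    try {
      // Throws java.net.SocketTimeoutException if the peer does not answer in time.
      socket.connect(peer, connectTimeoutMillis(memberTimeoutMillis, forAlertListener));
      return socket;
    } catch (IOException e) {
      socket.close();
      throw e;
    }
  }

  public static void main(String[] args) {
    int memberTimeoutMillis = 5000; // 6 * 5000 ms gives the 30 seconds seen in the issue
    System.out.println("normal peer timeout (ms): "
        + connectTimeoutMillis(memberTimeoutMillis, false));
    System.out.println("alert listener timeout (ms): "
        + connectTimeoutMillis(memberTimeoutMillis, true));
  }
}
```

With a shorter timeout applied only to alert connections, an unreachable alert
listener holds up the sending thread for less time, while connections to
other peers keep the longer timeout.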


> Attempts to send messages to alert listeners delays network partition 
> detection
> -------------------------------------------------------------------------------
>
>                 Key: GEODE-7031
>                 URL: https://issues.apache.org/jira/browse/GEODE-7031
>             Project: Geode
>          Issue Type: Improvement
>          Components: membership
>            Reporter: Bruce Schuchardt
>            Assignee: Bruce Schuchardt
>            Priority: Major
>          Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> In a number of recent regression test runs in AWS, we have seen network
> partition detection tests fail to detect the partition in a reasonable amount
> of time.  Logs show membership services attempting to send alerts to other
> processes that are no longer reachable.  Each attempt takes 6 times the
> member-timeout setting, which comes to 30 seconds per attempt.  It would be
> nice to have a different connection-formation timeout for alerts, since
> alert notification is built into the logging system that membership
> services have to use.  Because the alert system also depends on membership
> services functioning properly, this creates a circular dependency that has
> historically caused hangs and delays such as the one described here.
> {noformat}
> [debug 2019/07/29 14:35:03.824 PDT <Geode Failure Detection thread 5> 
> tid=0xc3] Sending (Alert "Unable to send message to 
> 10.32.108.136(gemfire3_host2_12249:12249)<v3>:41003" level WARNING) to 1 
> peers ([10.32.108.136(gemfire4_host2_12220:12220:locator)<ec><v1>:41001]) via 
> tcp/ip
> [debug 2019/07/29 14:35:03.825 PDT <Geode Failure Detection thread 5> 
> tid=0xc3] created PendingConnection 
> org.apache.geode.internal.tcp.ConnectionTable$PendingConnection@4f4c8630 
> created by Geode Failure Detection thread 5
> [info 2019/07/29 14:35:33.847 PDT <Geode Failure Detection thread 5> 
> tid=0xc3] Connection: shared=true ordered=true failed to connect to peer 
> 10.32.108.136(gemfire4_host2_12220:12220:locator)<ec><v1>:41001 because: 
> java.net.SocketTimeoutException
> [debug 2019/07/29 14:35:33.852 PDT <Geode Failure Detection thread 5> 
> tid=0xc3] Giving up connecting to alert listener 
> 10.32.108.136(gemfire4_host2_12220:12220:locator)<ec><v1>:41001
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
