[ https://issues.apache.org/jira/browse/GEODE-7031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899131#comment-16899131 ]
ASF subversion and git services commented on GEODE-7031:
--------------------------------------------------------
Commit a10af1ba201161c8cf3f8003a12c187728e2874e in geode's branch
refs/heads/develop from Bruce Schuchardt
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=a10af1b ]
GEODE-7031 Attempts to send messages to alert listeners delays network
partition detection
Decrease the socket-formation timeout for Alert listeners. Generally
we'll already have a connection to an alert listener so the decreased
timeout won't be used. In times where there are network problems,
though, we often have to create a new tcp/ip connection to send an alert
and we don't want these to stall for too long.
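
The idea in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the class name, method names, and the reduced multiplier of 1 are hypothetical, since the message does not say what the new timeout is. Only the factor of 6 comes from the issue description.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical sketch: pick a shorter socket-formation timeout when the
// peer is an alert listener, so a dead listener cannot stall the caller.
final class AlertConnectSketch {
    // From the issue text: a connection attempt takes 6 * member-timeout.
    static final int NORMAL_MULTIPLIER = 6;
    // Assumption for illustration only; the commit message gives no value.
    static final int ALERT_MULTIPLIER = 1;

    static int connectTimeoutMillis(int memberTimeoutMillis, boolean alertListener) {
        int multiplier = alertListener ? ALERT_MULTIPLIER : NORMAL_MULTIPLIER;
        return Math.multiplyExact(memberTimeoutMillis, multiplier);
    }

    // Opens a new TCP connection, failing with SocketTimeoutException
    // once the chosen timeout elapses.
    static Socket connect(InetSocketAddress peer, int memberTimeoutMillis,
                          boolean alertListener) throws IOException {
        Socket socket = new Socket();
        try {
            socket.connect(peer, connectTimeoutMillis(memberTimeoutMillis, alertListener));
            return socket;
        } catch (IOException e) {
            socket.close(); // do not leak the half-open socket
            throw e;
        }
    }
}
```

An already-established connection never reaches `connect`, which matches the note above that the reduced timeout only matters when a new connection has to be formed.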
> Attempts to send messages to alert listeners delays network partition
> detection
> -------------------------------------------------------------------------------
>
> Key: GEODE-7031
> URL: https://issues.apache.org/jira/browse/GEODE-7031
> Project: Geode
> Issue Type: Improvement
> Components: membership
> Reporter: Bruce Schuchardt
> Assignee: Bruce Schuchardt
> Priority: Major
> Time Spent: 2h 40m
> Remaining Estimate: 0h
>
> In a number of recent regression test runs in AWS we have seen network
> partition detection tests fail to detect the partition in a reasonable amount
> of time. Logs show membership services attempting to send alerts to other
> processes that are no longer reachable. Each attempt takes 6 * the
> member-timeout setting, which comes to 30 seconds per attempt in these runs.
> It would be nice to have a separate connection-formation timeout for alerts
> since alert notification is built into the logging system that membership
> services have to use. Since the alert system is also dependent on membership
> services functioning properly this creates a circular dependency that has
> historically caused hangs and delays such as the one described here.
> {noformat}
> [debug 2019/07/29 14:35:03.824 PDT <Geode Failure Detection thread 5>
> tid=0xc3] Sending (Alert "Unable to send message to
> 10.32.108.136(gemfire3_host2_12249:12249)<v3>:41003" level WARNING) to 1
> peers ([10.32.108.136(gemfire4_host2_12220:12220:locator)<ec><v1>:41001]) via
> tcp/ip
> [debug 2019/07/29 14:35:03.825 PDT <Geode Failure Detection thread 5>
> tid=0xc3] created PendingConnection
> org.apache.geode.internal.tcp.ConnectionTable$PendingConnection@4f4c8630
> created by Geode Failure Detection thread 5
> [info 2019/07/29 14:35:33.847 PDT <Geode Failure Detection thread 5>
> tid=0xc3] Connection: shared=true ordered=true failed to connect to peer
> 10.32.108.136(gemfire4_host2_12220:12220:locator)<ec><v1>:41001 because:
> java.net.SocketTimeoutException
> [debug 2019/07/29 14:35:33.852 PDT <Geode Failure Detection thread 5>
> tid=0xc3] Giving up connecting to alert listener
> 10.32.108.136(gemfire4_host2_12220:12220:locator)<ec><v1>:41001{noformat}
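
The 30-second stall described in the issue can be checked against the log timestamps above. The following is a small sketch with a hypothetical helper class; the 5000 ms member-timeout is inferred from the issue's "6 * member-timeout = 30 seconds" and is not stated directly in the log.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: measure the time between two log entries.
final class AlertStallCheck {
    // Matches timestamps such as "2019/07/29 14:35:03.825" (zone omitted).
    static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("uuuu/MM/dd HH:mm:ss.SSS");

    static Duration elapsed(String start, String end) {
        return Duration.between(
                LocalDateTime.parse(start, LOG_TIME),
                LocalDateTime.parse(end, LOG_TIME));
    }

    // Expected stall per the issue text: 6 * member-timeout.
    static long expectedStallMillis(long memberTimeoutMillis) {
        return 6 * memberTimeoutMillis;
    }
}
```

Passing the "created PendingConnection" timestamp (14:35:03.825) and the "failed to connect" timestamp (14:35:33.847) to `elapsed` gives the measured stall, which can be compared with `expectedStallMillis(5000)`.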