[ https://issues.apache.org/jira/browse/GEODE-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332052#comment-15332052 ]
ASF subversion and git services commented on GEODE-1542:
--------------------------------------------------------
Commit 33ceb371554a13c7643ddaf9488ffa83963de1e7 in incubator-geode's branch
refs/heads/GEODE-1372 from [~bschuchardt]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-geode.git;h=33ceb37 ]
GEODE-1542 shared/unordered tcp/ip connection times out, initiating suspicion

This disables timing out of shared/unordered TcpConduit connections. We don't
want them to time out because we are using them to initiate suspect processing
on other members.

The ticket also pointed out a problem with the "final check" mechanism in
the health monitor. I tracked that down to improper use of SocketCreator
to create the server-socket in GMSHealthMonitor. It was creating an SSL
socket if SSL is enabled, but the client side of the check uses non-SSL
sockets. I changed the server to use non-SSL sockets as well since no
useful information is sent over the final-check TCP/IP connections and they
need to be lightweight and fast.

While looking at logs I also found that the heartbeat request sent at the
beginning of a final-check had a request-ID even though it's not waiting
for a response. That causes processing of the response to do more work
than necessary, so I changed it to remove the request-ID from the message.
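
For illustration only, here is a minimal sketch of a final-check style exchange
over plain (non-SSL) sockets, as described above. It is not the GMSHealthMonitor
code: the class FinalCheckSketch and all of its method names are hypothetical,
and the timeout is an arbitrary choice. The only values taken from the ticket
are the two status bytes it names as valid (0 and 0x7B); the sketch does not
assume what each one means.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical sketch; names here are not Geode APIs.
class FinalCheckSketch {

    // The ticket says a valid final-check response is 0 or 0x7B.
    static boolean isValidStatus(int status) {
        return status == 0x00 || status == 0x7B;
    }

    // Server side: accept one plain TCP connection and write one status byte.
    static void respondOnce(ServerSocket server, int status) {
        try (Socket peer = server.accept()) {
            OutputStream out = peer.getOutputStream();
            out.write(status); // writes the low 8 bits
            out.flush();
        } catch (IOException e) {
            // Sketch only: a real responder would log this.
        }
    }

    // Client side: connect with a plain socket and read the status byte.
    // Returns -1 if the peer closed the connection without sending anything.
    static int finalCheck(InetAddress host, int port, int timeoutMs) throws IOException {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            socket.setSoTimeout(timeoutMs);
            return socket.getInputStream().read();
        }
    }

    // Runs both sides on the loopback interface and returns the byte the client read.
    static int roundTrip(int status) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread responder = new Thread(() -> respondOnce(server, status));
            responder.setDaemon(true);
            responder.start();
            return finalCheck(server.getInetAddress(), server.getLocalPort(), 2000);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("0x7B valid: " + isValidStatus(roundTrip(0x7B)));
        System.out.println("21 valid: " + isValidStatus(21));
    }
}
```

Because both ends use the same plain socket type, the client reads exactly the
byte the server wrote. If one end wrapped its socket in SSL and the other did
not, the client would instead read whatever bytes the other side's handshake
happened to produce, which is why a status check like isValidStatus is useful.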
> shared/unordered tcp/ip connection times out, initiating suspicion
> ------------------------------------------------------------------
>
> Key: GEODE-1542
> URL: https://issues.apache.org/jira/browse/GEODE-1542
> Project: Geode
> Issue Type: Bug
> Components: membership
> Reporter: Bruce Schuchardt
> Fix For: 1.0.0-incubating.M3
>
>
> I recently diagnosed a membership failure that was initiated when one member
> (N) timed out its shared/unordered tcp/ip connection to another member (M).
> Member M initiated suspect processing that led to kicking member N out of
> the system. We need to either stop timing out shared/unordered connections
> or have an orderly shutdown mechanism so that we don't initiate suspect
> processing.
> The final-check that M performed showed something odd. Member N never logged
> that it processed a final check from M. Member M logged that it had
> connected to N and read a status byte from it. The byte had the value 21,
> which is not a valid response to a final check (it should be 0 or 0x7B).
> {noformat}
> Received [21, ent(clientgemfire3_ent_19225:19225)<ec><v1>:1028]
> {noformat}
> I verified that M used the correct tcp/ip port for N, so this is very odd and
> needs to be investigated.