[ 
https://issues.apache.org/jira/browse/GEODE-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15332052#comment-15332052
 ] 

ASF subversion and git services commented on GEODE-1542:
--------------------------------------------------------

Commit 33ceb371554a13c7643ddaf9488ffa83963de1e7 in incubator-geode's branch 
refs/heads/GEODE-1372 from [~bschuchardt]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-geode.git;h=33ceb37 ]

GEODE-1542 shared/unordered tcp/ip connection times out, initiating suspicion

This disables timing out of shared/unordered TcpConduit connections.  We don't
want them to time out because we are using them to initiate suspect processing
on other members.

The ticket also pointed out a problem with the "final check" mechanism in
the health monitor.  I tracked that down to improper use of SocketCreator
to create the server-socket in GMSHealthMonitor.  It was creating an SSL
socket if SSL is enabled but the client-side of the check uses non-SSL
sockets.  I changed the server to use non-SSL sockets as well since no
useful information is sent over the final-check TCP/IP connections & they
need to be lightweight and fast.
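
As an illustration of the fix described above, here is a minimal sketch of a
final check in which both ends use plain (non-SSL) sockets.  It is not the
GMSHealthMonitor code: the class name, method names and the choice of 0x7B as
the reply are assumptions made for the example (the ticket only says a valid
reply is 0 or 0x7B), and the real exchange likely carries more than one byte.
If the server side were an SSL socket instead, a plain client would read
whatever the TLS layer sends back rather than a status byte.  The value 21
seen in the ticket is the TLS record type for an alert, which would be
consistent with that, though the ticket does not confirm it.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

class FinalCheckSketch {
    // Assumed reply value; the ticket says a valid reply is 0 or 0x7B.
    static final int STATUS_OK = 0x7B;

    /** Accepts one plain TCP connection and replies with a single status byte. */
    static Thread startResponder(ServerSocket server) {
        Thread t = new Thread(() -> {
            try (Socket s = server.accept()) {
                OutputStream out = s.getOutputStream();
                out.write(STATUS_OK);
                out.flush();
            } catch (IOException e) {
                // A failed check simply looks like "no reply" to the caller.
            }
        });
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Connects with a plain socket and returns the status byte, or -1 on EOF. */
    static int finalCheck(InetAddress host, int port, int timeoutMs) throws IOException {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMs);
            s.setSoTimeout(timeoutMs); // keep the check short-lived
            return s.getInputStream().read();
        }
    }

    /** Runs one check against a loopback responder on an ephemeral port. */
    static int demo() {
        try (ServerSocket server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            startResponder(server);
            return finalCheck(server.getInetAddress(), server.getLocalPort(), 2000);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("final-check status = " + demo());
    }
}
```

Because both sides use the same plain socket type, the byte the client reads
is exactly the byte the responder wrote, with no handshake in between.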

While looking at logs I also found that the heartbeat request sent at the
beginning of a final-check had a request-ID even though it's not waiting
for a response.  That causes processing of the response to do more work
than necessary so I changed it to remove the request-ID from the message.
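
The request-ID change can be pictured with a small sketch, under the
assumption that request IDs exist to match a response to a waiting sender.
The class, the sentinel value and the map below are hypothetical and are not
Geode's message classes.  The point is only that a fire-and-forget heartbeat
request carries no ID, so the receiver of any later response has nothing to
look up.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class HeartbeatRequestSketch {
    // Hypothetical sentinel meaning "no response is awaited".
    static final int NO_REQUEST_ID = -1;

    private final AtomicInteger nextId = new AtomicInteger();
    // Requests that someone is actually waiting on, keyed by request ID.
    private final Map<Integer, String> pending = new ConcurrentHashMap<>();

    /** Returns the ID placed in the outgoing request. */
    int send(String target, boolean awaitResponse) {
        if (!awaitResponse) {
            return NO_REQUEST_ID; // nothing registered, nothing to clean up later
        }
        int id = nextId.incrementAndGet();
        pending.put(id, target);
        return id;
    }

    /** Returns true only if the response matched a waiting request. */
    boolean onResponse(int requestId) {
        if (requestId == NO_REQUEST_ID) {
            return false; // skip the correlation work entirely
        }
        return pending.remove(requestId) != null;
    }

    int pendingCount() {
        return pending.size();
    }
}
```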


> shared/unordered tcp/ip connection times out, initiating suspicion
> ------------------------------------------------------------------
>
>                 Key: GEODE-1542
>                 URL: https://issues.apache.org/jira/browse/GEODE-1542
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>            Reporter: Bruce Schuchardt
>             Fix For: 1.0.0-incubating.M3
>
>
> I recently diagnosed a membership failure that was initiated when one member 
> (N) timed out its shared/unordered tcp/ip connection to another member (M).  
> Member M initiated suspect processing that led to kicking member N out of 
> the system.  We need to either stop timing out shared/unordered connections 
> or have an orderly shutdown mechanism so that we don't initiate suspect 
> processing.
> The final-check that M performed showed something odd.  Member N never logged 
> that it processed a final check from M.  Member M logged that it had 
> connected to N and read a status byte from it.  The byte had the value 21, 
> which is not a valid response to a final check (it should be 0 or 0x7B).
> {noformat}
> Received [21, ent(clientgemfire3_ent_19225:19225)<ec><v1>:1028]
> {noformat}
> I verified that M used the correct tcp/ip port for N, so this is very odd and 
> needs to be investigated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
