[jira] [Commented] (GEODE-8231) C++ native client keeps trying to connect to down cache server hosting a partitioned region

Ernest Burghardt (Jira) Tue, 13 Apr 2021 13:59:05 -0700


    [ 
https://issues.apache.org/jira/browse/GEODE-8231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17320535#comment-17320535
 ]


Ernest Burghardt commented on GEODE-8231:
-----------------------------------------

*This is being reverted because it causes data loss; see below for details*:

For single-hop PUTALL, Geode native breaks up the request from the app as 
follows:

i. Each entry's key is hashed to a bucket, the server corresponding to the 
bucket is looked up in the metadata, and the entry is added to the list for 
that server.

ii. Once every entry has been added to a list, Geode native spins up a thread 
for each list and sends a PUTALL to each server.
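
The grouping in step i can be sketched roughly as below. This is only an 
illustration, not the Geode native code: BucketToServer, Entry, SplitResult and 
splitByServer are hypothetical names, std::hash stands in for whatever hash 
Geode actually uses, and entries whose bucket has no known server are simply 
collected separately.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the client metadata: bucket id -> server name.
using BucketToServer = std::map<std::size_t, std::string>;
// A key/value pair from the application's PUTALL request.
using Entry = std::pair<std::string, std::string>;

struct SplitResult {
  // One list per server, to be sent as its own PUTALL.
  std::map<std::string, std::vector<Entry>> perServer;
  // Entries whose bucket has no server in the metadata.
  std::vector<Entry> unrouted;
};

// Hash each key to a bucket, look the bucket up in the metadata, and append
// the entry to the list for the server that hosts it.
SplitResult splitByServer(const std::vector<Entry>& entries,
                          const BucketToServer& metadata,
                          std::size_t bucketCount) {
  assert(bucketCount > 0);
  SplitResult result;
  for (const auto& entry : entries) {
    // Assumption: std::hash replaces the real key-to-bucket hashing.
    const std::size_t bucket =
        std::hash<std::string>{}(entry.first) % bucketCount;
    const auto it = metadata.find(bucket);
    if (it == metadata.end()) {
      result.unrouted.push_back(entry);
    } else {
      result.perServer[it->second].push_back(entry);
    }
  }
  return result;
}
```

In step ii, each list in perServer would then be handed to its own thread and 
sent as a separate PUTALL.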

 

When Geode native can't reach a server, that server's entries are removed from 
the metadata, and the bucket-to-server lookup fails for the keys it hosted.  
These "leftover keys" are handled as follows:

i. The size of the "leftover keys" list is divided by the number of servers, 
and 1 is added to compensate for any fractional piece.

ii. That many keys are added to each remaining list going to a server that is 
still reachable.

iii. We proceed normally, and send one list to each server, on its own thread.
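
A minimal sketch of the redistribution arithmetic, assuming the divisor is the 
number of lists still going to reachable servers. KeyList and 
redistributeLeftovers are hypothetical names, not Geode native identifiers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using KeyList = std::vector<std::string>;

// Spread the leftover keys over the lists of the servers that are still
// reachable: chunk = leftover / servers + 1, and each list gets up to chunk.
void redistributeLeftovers(const KeyList& leftover,
                           std::map<std::string, KeyList>& perServer) {
  assert(!perServer.empty());
  // The "+ 1" compensates for the fractional piece lost by integer division,
  // so chunk * servers always covers the whole leftover list.
  const std::size_t chunk = leftover.size() / perServer.size() + 1;
  std::size_t next = 0;
  for (auto& kv : perServer) {
    const std::size_t end = std::min(next + chunk, leftover.size());
    for (; next < end; ++next) {
      kv.second.push_back(leftover[next]);
    }
  }
}
```

For example, 5 leftover keys and 2 reachable servers give a chunk size of 
5 / 2 + 1 = 3, so the first list receives 3 keys and the second the remaining 2.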

 

_Unfortunately_, this scenario can lead to data loss, because each of the 
fractional pieces of the list going to the unreachable server has an eventId 
with the same threadId and an incrementing sequenceId.  Thus, if any of our 
PUTALL threads sends out of order, the earlier sequenceIds will be marked as 
already "seen" on the server and _dropped_.

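The effect can be modeled with a toy tracker that remembers the highest 
sequenceId seen per threadId and drops anything at or below it. This is a 
simplified model of the behavior described above, not the server's actual 
implementation; EventId, SeenTracker and countDropped are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

struct EventId {
  std::int64_t threadId;
  std::int64_t sequenceId;
};

// Toy model of server-side duplicate detection: per threadId, any event whose
// sequenceId is not above the highest one already seen is treated as a repeat.
class SeenTracker {
 public:
  // Returns true if the event is applied, false if it is dropped as "seen".
  bool accept(const EventId& id) {
    assert(id.sequenceId >= 0);  // assumption of this sketch
    const auto it = highest_.find(id.threadId);
    if (it != highest_.end() && id.sequenceId <= it->second) {
      return false;
    }
    highest_[id.threadId] = id.sequenceId;
    return true;
  }

 private:
  std::map<std::int64_t, std::int64_t> highest_;
};

// Counts how many events are dropped for a given arrival order.
std::size_t countDropped(const std::vector<EventId>& arrivals) {
  SeenTracker tracker;
  std::size_t dropped = 0;
  for (const auto& id : arrivals) {
    if (!tracker.accept(id)) {
      ++dropped;
    }
  }
  return dropped;
}
```

With three pieces carrying threadId 1 and sequenceIds 1, 2 and 3, arrival in 
order drops nothing, while arrival in the order 3, 1, 2 drops the pieces with 
sequenceIds 1 and 2 in this model.
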
> C++ native client keeps trying to connect to down cache server hosting a 
> partitioned region
> -------------------------------------------------------------------------------------------
>
>                 Key: GEODE-8231
>                 URL: https://issues.apache.org/jira/browse/GEODE-8231
>             Project: Geode
>          Issue Type: Bug
>          Components: native client
>            Reporter: Alberto Gomez
>            Assignee: Alberto Gomez
>            Priority: Major
>             Fix For: 1.14.0
>
>
> If a C++ client connected to a cluster is sending operations to a partitioned 
> region and one of the servers goes down, the client keeps trying to send 
> operations to the down server. This can be observed in the logs as a 
> continuous flow of lines containing: "IO error in handshake with endpoint..."
> Once the Java client detects that a server is down, it deletes that server 
> from the client metadata, so it makes no attempt to connect to the server 
> until the server is up again, which is signaled via a metadata refresh.
> The aim of this ticket is to align the behavior of the C++ native client with 
> that of the Java client.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
