[ 
https://issues.apache.org/jira/browse/SOLR-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15576189#comment-15576189
 ] 

Shalin Shekhar Mangar commented on SOLR-9512:
---------------------------------------------

Noble and I discussed this offline. Here is a summary of the problem and the 
solution:

There are six cases that we need to tackle. Assuming replica x is leader:
# Case 1: x is disconnected from zk, y becomes leader
** currently -- x throws an error on indexing; the client fails and keeps 
retrying against x, which keeps failing. This will continue until x re-connects 
and the client gets a stale state flag in the response.
# Case 2: x is dead, y becomes leader
** currently -- client gets a ConnectException or NoHttpResponseException (for 
in-flight requests) and keeps retrying the request to x. This will continue 
until x comes back online.
# Case 3: x is disconnected from zk, no one is leader
** currently -- client keeps sending requests to x which fail because x is 
disconnected from zk. This will continue until x re-connects and the client 
gets a stale state flag in the response.
# Case 4: x is dead, no one is leader yet
** currently -- client gets a ConnectException or NoHttpResponseException (for 
in-flight requests) and keeps retrying the request to x. This will continue 
until x comes back online.
# Case 5: x is alive but now y is leader
** currently -- client gets a stale state flag from x and refreshes cluster 
state to see y as the new leader. All further indexing requests are sent to y.
# Case 6: client is disconnected from zk
** currently -- client keeps indexing to x. If it receives a stale state error, 
it tries to refresh cluster state, fails, and continues to send further 
requests to x. Those keep failing and the client keeps trying to read from zk, 
so it is stuck in a cycle.

Cases 1-5 share a single solution: on a ConnectException, a 
NoHttpResponseException, or a "leader disconnected from zk" error, the client 
should fetch state from zk again. If the client fetches from zk and does not 
get a new version, this should be marked in a flag, and subsequent retries 
should only happen after N seconds have elapsed or if we know for a fact that 
the version has changed since the last zk fetch was made. N could be something 
as small as 2 seconds or so.
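
To make the flag-and-wait rule above concrete, here is a minimal sketch. It is 
not the attached patch and not existing SolrJ code: the class 
StateRefreshThrottle, its method names and the injectable clock are all 
hypothetical, and it assumes that "retries" refers to re-fetching state from zk 
and that the cluster state version is a monotonically increasing int.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not part of SolrJ): decides whether the client may
 * re-read cluster state from zk after a ConnectException,
 * NoHttpResponseException or "leader disconnected from zk" error.
 */
final class StateRefreshThrottle {
    private final long minIntervalNanos;   // N from the comment, in nanoseconds
    private final LongSupplier nanoClock;  // injectable so the logic can be tested
    private int cachedVersion;             // version of the state we currently hold
    private boolean lastFetchUnchanged;    // the "flag": last fetch gave no new version
    private long lastFetchNanos;

    StateRefreshThrottle(int initialVersion, long minIntervalMillis, LongSupplier nanoClock) {
        this.cachedVersion = initialVersion;
        this.minIntervalNanos = minIntervalMillis * 1_000_000L;
        this.nanoClock = nanoClock;
    }

    /**
     * @param knownNewerVersion a version learned from elsewhere (for example a
     *        stale state response), or null if nothing is known
     * @return true if a zk fetch should be attempted now
     */
    synchronized boolean shouldRefetch(Integer knownNewerVersion) {
        if (!lastFetchUnchanged) {
            return true; // no evidence yet that zk has nothing new for us
        }
        if (knownNewerVersion != null && knownNewerVersion > cachedVersion) {
            return true; // we know for a fact that the version has changed
        }
        // Otherwise wait until N has elapsed since the last fruitless fetch.
        return nanoClock.getAsLong() - lastFetchNanos >= minIntervalNanos;
    }

    /** Call after every zk fetch with the version that was read. */
    synchronized void recordFetch(int fetchedVersion) {
        lastFetchNanos = nanoClock.getAsLong();
        lastFetchUnchanged = fetchedVersion <= cachedVersion;
        if (fetchedVersion > cachedVersion) {
            cachedVersion = fetchedVersion;
        }
    }
}
```

In real use the clock would be System::nanoTime and N would be around 2000 ms. 
The caller would check shouldRefetch() in its error handler, fetch from zk only 
if it returns true, and then call recordFetch() with whatever version came back.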

Case 6 is more difficult. Either we can keep failing the indexing requests or 
we can ask a random Solr instance to return the latest cluster state. This is 
kinda dangerous because it can open us up to bugs that are very difficult to 
debug, so I am inclined to punt on this for now.

> CloudSolrClient's cluster state cache can break direct updates to leaders
> -------------------------------------------------------------------------
>
>                 Key: SOLR-9512
>                 URL: https://issues.apache.org/jira/browse/SOLR-9512
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Alan Woodward
>         Attachments: SOLR-9512.patch
>
>
> This is the root cause of SOLR-9305 and (at least some of) SOLR-9390.  The 
> process goes something like this:
> Documents are added to the cluster via a CloudSolrClient, with 
> directUpdatesToLeadersOnly set to true.  CSC caches its view of the 
> DocCollection.  The leader then goes down, and is reassigned.  Next time 
> documents are added, CSC checks its cache again, and gets the old view of the 
> DocCollection.  It then tries to send the update directly to the old, now 
> down, leader, and we get ConnectionRefused.


