[ 
https://issues.apache.org/jira/browse/SOLR-8085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905435#comment-14905435
 ] 

Mark Miller commented on SOLR-8085:
-----------------------------------

It could be a static map or something in RecoveryStrategy too, but seeing as we 
already store another variable like this in the default state, keeping it there 
made a lot of sense to me.
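
For context, the variable in question records that a previous recovery attempt 
failed, so that the next attempt is not allowed to use peer sync. A rough sketch 
of keeping such a flag on a per-core state object follows. All names here 
(CoreStateSketch, RecoveryPlanner, lastRecoveryFailed) are hypothetical and not 
taken from the patch, and falling back to full replication is an assumption.

```java
// Hypothetical per-core state object; names are illustrative, not from the patch.
class CoreStateSketch {
    // Set when the most recent recovery attempt did not succeed.
    private volatile boolean lastRecoveryFailed = false;

    boolean isLastRecoveryFailed() { return lastRecoveryFailed; }

    void setLastRecoveryFailed(boolean failed) { lastRecoveryFailed = failed; }
}

class RecoveryPlanner {
    enum Method { PEER_SYNC, FULL_REPLICATION }

    // Peer sync is only allowed when the previous attempt did not fail.
    // The fallback to full replication is an assumption of this sketch.
    static Method choose(CoreStateSketch state) {
        return state.isLastRecoveryFailed() ? Method.FULL_REPLICATION : Method.PEER_SYNC;
    }

    // Record the outcome so the next attempt can see it.
    static void finish(CoreStateSketch state, boolean succeeded) {
        state.setLastRecoveryFailed(!succeeded);
    }
}
```

Keeping the flag on the state object rather than in a static map means it lives 
and dies with the core it describes.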

With your patch applied and running on a patched version of 4.10.3, I was left 
with only one other type of fail.

Docs that came in during recovery, after publishing RECOVERING and before 
buffering started, would interfere with peer sync and cause a false pass if 
enough of them came in.

I seem to have worked around this issue by buffering docs before peer sync 
and before publishing as RECOVERING (the signal for the leader to start sending 
updates).
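
The ordering can be sketched roughly as below. This is a simplified model, not 
Solr's RecoveryStrategy code: RecoverySketch, onLeaderUpdate, and recover are 
hypothetical names, and replaying the buffer after peer sync is an assumption. 
The point is only that once buffering starts first, updates that arrive after 
RECOVERING is published land in the buffer and are not visible to peer sync.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a recovering replica; not Solr's actual API.
class RecoverySketch {
    final List<String> applied = new ArrayList<>();   // docs visible to peer sync
    final List<String> buffered = new ArrayList<>();  // docs held back during recovery
    final List<String> steps = new ArrayList<>();     // order of operations, for inspection
    private boolean buffering = false;

    // Called for every update forwarded by the leader.
    void onLeaderUpdate(String doc) {
        if (buffering) {
            buffered.add(doc);
        } else {
            applied.add(doc);
        }
    }

    // leaderTraffic simulates the updates the leader sends once it sees RECOVERING.
    // Returns how many applied docs peer sync would have compared against.
    int recover(Runnable leaderTraffic) {
        buffering = true;
        steps.add("start buffering");

        steps.add("publish RECOVERING");
        leaderTraffic.run();

        steps.add("peer sync");
        int seenByPeerSync = applied.size();

        // Assumption: buffered updates are replayed once recovery completes.
        applied.addAll(buffered);
        buffered.clear();
        buffering = false;
        steps.add("replay buffer");
        return seenByPeerSync;
    }
}
```

With the original ordering (publish, then buffer), the docs sent by the leader in 
between would be added to the applied list and counted by peer sync.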

In my current runs, which use no deletes, I have not yet found a fail with this 
change on this version of the code.

> ChaosMonkey Safe Leader Test fail with shard inconsistency.
> -----------------------------------------------------------
>
>                 Key: SOLR-8085
>                 URL: https://issues.apache.org/jira/browse/SOLR-8085
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Mark Miller
>         Attachments: SOLR-8085.patch, fail.150922_125320, fail.150922_130608
>
>
> I've been discussing this fail I found with Yonik.
> The problem seems to be that a replica tries to recover and publishes 
> recovering - the attempt then fails, but docs are now coming in from the 
> leader. The replica tries to recover again and has gotten enough docs to pass 
> peer sync.
> I'm trying a possible solution now where we won't allow peer sync after a 
> recovery that is not successful.



