[
https://issues.apache.org/jira/browse/SOLR-8085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905435#comment-14905435
]
Mark Miller commented on SOLR-8085:
-----------------------------------
This could be a static map or something in RecoveryStrategy too, but seeing as
we already store another variable like this in the default state, it made a lot
of sense to me.
With your patch and running on a patched version of 4.10.3, I was still only
seeing one other type of fail.
Docs that came in during recovery, after publishing RECOVERING and before
buffering started, would interfere with peer sync and cause a false pass if
enough of them came in.
I seem to have worked around this issue by buffering docs before peer sync
and before publishing as RECOVERING (the signal for the leader to start sending
updates).
With my current runs using no deletes, I have not yet found a fail after this
on this version of the code.
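To make the ordering issue concrete, here is a rough toy model of it. It is not Solr's RecoveryStrategy or PeerSync code: the class RecoveryOrderSketch, its fields and methods, the version numbers, and the "newest versions overlap" check are all hypothetical stand-ins, assumed only for illustration. The replica has missed versions 6..8, and the leader forwards 9 and 10 as soon as RECOVERING is published.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: every name here is hypothetical, not Solr API.
final class RecoveryOrderSketch {
    final List<Long> index = new ArrayList<>();   // versions applied to the index
    final List<Long> buffer = new ArrayList<>();  // versions held back while buffering
    boolean buffering = false;

    // An update forwarded by the leader: buffered if buffering is on, else applied.
    void receive(long version) {
        (buffering ? buffer : index).add(version);
    }

    // The last `window` entries of a version list.
    static List<Long> newest(List<Long> versions, int window) {
        int from = Math.max(0, versions.size() - window);
        return versions.subList(from, versions.size());
    }

    // Simplified stand-in for a peer sync check: the replica "looks current"
    // if its newest versions overlap the leader's newest versions.
    boolean peerSyncLooksCurrent(List<Long> leaderVersions, int window) {
        Set<Long> mine = new HashSet<>(newest(index, window));
        for (long v : newest(leaderVersions, window)) {
            if (mine.contains(v)) {
                return true;
            }
        }
        return false;
    }

    // Runs one recovery attempt; returns whether the toy peer sync passes.
    static boolean attempt(boolean bufferFirst) {
        RecoveryOrderSketch replica = new RecoveryOrderSketch();
        List<Long> leader = new ArrayList<>();
        for (long v = 1; v <= 10; v++) {
            leader.add(v);
        }
        for (long v = 1; v <= 5; v++) {
            replica.index.add(v);              // replica never saw 6..8
        }
        if (bufferFirst) {
            replica.buffering = true;          // start buffering before anything else
        }
        // Publishing RECOVERING: the leader now starts forwarding new updates.
        replica.receive(9);
        replica.receive(10);
        return replica.peerSyncLooksCurrent(leader, 2);
    }

    public static void main(String[] args) {
        System.out.println("publish first, peer sync passes: " + attempt(false));
        System.out.println("buffer first, peer sync passes: " + attempt(true));
    }
}
```

In this toy, publishing first lets 9 and 10 land in the index, so the check passes even though 6..8 are missing. Buffering first keeps them out of the index, so the check fails and the replica would fall back to a full recovery.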
> ChaosMonkey Safe Leader Test fail with shard inconsistency.
> -----------------------------------------------------------
>
> Key: SOLR-8085
> URL: https://issues.apache.org/jira/browse/SOLR-8085
> Project: Solr
> Issue Type: Bug
> Reporter: Mark Miller
> Attachments: SOLR-8085.patch, fail.150922_125320, fail.150922_130608
>
>
> I've been discussing this fail I found with Yonik.
> The problem seems to be that a replica tries to recover and publishes
> recovering - the attempt then fails, but docs are now coming in from the
> leader. The replica tries to recover again and has gotten enough docs to pass
> peer sync.
> I'm trying a possible solution now where we won't allow peer sync after a
> recovery that is not successful.