[ https://issues.apache.org/jira/browse/SOLR-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16047218#comment-16047218 ]
Pushkar Raste commented on SOLR-10873:
--------------------------------------
What if the count is the same but the actual data is different?
Could we use the index fingerprint instead to verify that replicas are in sync?
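
To illustrate the concern, here is a minimal Python sketch of how two replicas can agree on document count yet differ in content. The `toy_fingerprint` function and the replica dictionaries are hypothetical. The function hashes (id, version) pairs in an order-independent way and is only a stand-in, not Solr's actual index fingerprint computation.

```python
import hashlib


def toy_fingerprint(docs):
    """Order-independent digest of (id, version) pairs.

    Hypothetical stand-in for illustration only; not Solr's algorithm.
    """
    acc = 0
    for doc_id, version in docs.items():
        digest = hashlib.sha256(f"{doc_id}:{version}".encode()).digest()
        # Summing modulo 2**64 makes the result independent of iteration order.
        acc = (acc + int.from_bytes(digest[:8], "big")) % (1 << 64)
    return acc


# Two replicas of the same shard, modeled as {doc id: version}.
replica_a = {"doc1": 101, "doc2": 205, "doc3": 310}
replica_b = {"doc1": 101, "doc2": 205, "doc4": 311}  # same count, different doc

same_count = len(replica_a) == len(replica_b)
same_fingerprint = toy_fingerprint(replica_a) == toy_fingerprint(replica_b)

print("counts match:", same_count)
print("fingerprints match:", same_fingerprint)
```

A count-only check would report these two replicas as consistent, while a content-based digest would flag them.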
> Explore a utility for periodically checking the document counts for replicas
> of a shard
> ---------------------------------------------------------------------------------------
>
> Key: SOLR-10873
> URL: https://issues.apache.org/jira/browse/SOLR-10873
> Project: Solr
> Issue Type: Improvement
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Erick Erickson
>
> We've had several situations "in the field" and on the users' list where the
> number of documents on different replicas of the same shard differs. I've also
> seen situations where the numbers are wildly different (two orders of
> magnitude). I can force this situation by, say, taking down nodes, adding
> replicas that become the leader, then starting the nodes back up. But it
> doesn't matter whether the discrepancy is a result of "pilot error" or a
> problem with the code; in either case it would be useful to flag it.
> Straw-man proposal:
> We create a processor (modeled on DocExpirationUpdateProcessorFactory
> perhaps?) that periodically wakes up and checks that each replica in the
> given shard has the same document count (and perhaps other checks TBD?). It
> sends some kind of notification if a problem is detected.
> Issues:
> 1> this will require some way to deal with the differing commit times.
> 1a> If we require a timestamp on each document we could check the config file
> to see the autocommit interval and, say, check NOW-(2 x opensearcher
> interval). In that case the config would just require the field to use to be
> specified.
> 1b> we could require that part of the configuration is a query to use to
> check document counts. I kind of like this one.
> 2> How to let the admins know a discrepancy was found? e-mail? ERROR level
> log message? Other?
> 3> How does this fit into the autoscaling initiative? This is a "monitor the
> system and do something" item. If we go forward with this we should do it
> with an eye toward fitting it in that framework.
> 3a> Is there anything we can do to auto-correct this situation?
> Auto-correction could be tricky. Heuristics like "make the replica with the
> most documents the leader and force full index replication on all the
> replicas that don't agree" seem dangerous.
> 4> How to keep the impact minimal? The simple approach would be for each
> replica to check all other replicas in the shard. With 10 replicas on a
> single shard, that would be 90 queries. It would suffice for just one of
> them to check the other 9 rather than having all 10 check the other 9.
> Maybe restrict the checker to be the leader? Or otherwise just make it one
> replica/shard that does the checking?
> 5> It's probably useful to add a collections API call to fire this off
> manually. Or maybe as part of CHECKSTATUS?
> What do people think?
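
The cutoff idea in 1a and the query arithmetic in 4 can be sketched roughly as follows. This is an illustrative Python model, not Solr code: the function names, the in-memory timestamp lists standing in for replicas, and the 15-second interval are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def cutoff(now, open_searcher_interval):
    """Only compare documents older than NOW - (2 x interval), as in 1a."""
    return now - 2 * open_searcher_interval


def count_up_to(timestamps, cutoff_time):
    """Count documents whose timestamp is at or before the cutoff."""
    return sum(1 for ts in timestamps if ts <= cutoff_time)


def find_mismatches(replicas, checker, cutoff_time):
    """Have a single checker replica compare its count against the others."""
    expected = count_up_to(replicas[checker], cutoff_time)
    mismatches = {}
    for name, timestamps in replicas.items():
        if name == checker:
            continue
        actual = count_up_to(timestamps, cutoff_time)
        if actual != expected:
            mismatches[name] = (expected, actual)
    return mismatches


def all_pairs_queries(n):
    """Every replica checks every other replica."""
    return n * (n - 1)


def single_checker_queries(n):
    """One replica (for example the leader) checks the rest."""
    return n - 1


now = datetime(2017, 6, 13, 12, 0, 0, tzinfo=timezone.utc)
limit = cutoff(now, timedelta(seconds=15))  # hypothetical 15 s interval
old = now - timedelta(minutes=5)
recent = now - timedelta(seconds=5)  # newer than the cutoff, so ignored

replicas = {
    "leader": [old, old, old, recent],
    "replica2": [old, old, old],  # recent doc not visible yet: not a problem
    "replica3": [old, old],       # genuinely missing an older document
}

mismatches = find_mismatches(replicas, "leader", limit)
print(mismatches)
print(all_pairs_queries(10), single_checker_queries(10))
```

In this model the cutoff keeps the not-yet-visible recent document on replica2 from raising a false alarm, while replica3 is still flagged. With 10 replicas, a single checker issues 9 queries instead of 90.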