Erick Erickson created SOLR-10873:
-------------------------------------
Summary: Explore a utility for periodically checking the document
counts for replicas of a shard
Key: SOLR-10873
URL: https://issues.apache.org/jira/browse/SOLR-10873
Project: Solr
Issue Type: Improvement
Security Level: Public (Default Security Level. Issues are Public)
Reporter: Erick Erickson
We've had several situations "in the field" and on the users' list where the
number of documents on different replicas of the same shard differs. I've also
seen situations where the numbers are wildly different (two orders of
magnitude). I can force this situation by, say, taking down nodes, adding
replicas that become the leader, then starting the nodes back up. But it doesn't
matter whether the discrepancy is a result of "pilot error" or a problem with
the code; in either case it would be useful to flag it.
Straw-man proposal:
We create a processor (modeled on DocExpirationUpdateProcessorFactory perhaps?)
that periodically wakes up and checks that each replica in the given shard has
the same document count (and perhaps other checks TBD?). It would send some
kind of notification if a problem is detected.
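
A minimal sketch of the core check, in Python for brevity (a real processor
would be Java inside Solr). It assumes each replica core is queried directly
with distrib=false and rows=0 so that only that core's own count comes back,
and that the count is read from response.numFound. The function names and the
core URL in the usage note are hypothetical, not part of any existing Solr API.

```python
from typing import Dict, Optional
from urllib.parse import urlencode


def count_query_url(core_url: str, query: str = "*:*") -> str:
    """Build a count-only query that stays on one replica core.

    distrib=false stops Solr from fanning the request out to other
    replicas; rows=0 skips fetching documents, since only the count
    (response.numFound in the JSON reply) is needed.
    """
    params = {"q": query, "rows": 0, "distrib": "false", "wt": "json"}
    return f"{core_url.rstrip('/')}/select?{urlencode(params)}"


def check_shard(counts: Dict[str, int]) -> Optional[str]:
    """Compare per-replica counts for one shard.

    Returns None when every replica reports the same count, otherwise a
    message listing each replica's count. It deliberately does not guess
    which replica is "right".
    """
    if len(set(counts.values())) <= 1:
        return None
    details = ", ".join(f"{name}={n}" for name, n in sorted(counts.items()))
    return f"replica document counts differ: {details}"
```

For example, check_shard({"r1": 100, "r2": 1}) returns a message naming both
replicas, while check_shard({"r1": 100, "r2": 100}) returns None.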
Issues:
1> This will require some way to deal with the differing commit times.
1a> If we require a timestamp on each document, we could check the config file
to see the autocommit interval and, say, check NOW-(2 x openSearcher interval).
In that case the config would just require the field to use to be specified.
1b> We could require that part of the configuration is a query to use to check
document counts. I kind of like this one.
2> How to let the admins know a discrepancy was found? e-mail? ERROR level log
message? Other?
3> How does this fit into the autoscaling initiative? This is a "monitor the
system and do something" item. If we go forward with this we should do it with
an eye toward fitting it in that framework.
3a> Is there anything we can do to auto-correct this situation? Auto-correction
could be tricky. Heuristics like "make the replica with the most documents the
leader and force full index replication on all the replicas that don't agree"
seem dangerous.
4> How to keep the impact minimal? The simple approach would be for each
replica to check all other replicas in the shard. With, say, 10 replicas on a
single shard, that would be 90 queries. It would suffice for just one of
those to check the other 9, rather than having all 10 check the other 9. Maybe
restrict the checker to be the leader? Or otherwise just make it one
replica per shard that does the checking?
5> It's probably useful to add a collections API call to fire this off
manually. Or maybe as part of CHECKSTATUS?
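
Two of the points above can be made concrete with a rough sketch. For 1a, it
builds a count query that ignores documents newer than twice the commit
interval, so replicas that committed at slightly different times are compared
over the same window. For 4, it spells out the arithmetic behind 90 versus 9
queries. The field name "timestamp" and the function names are hypothetical,
and the interval is assumed to be supplied in seconds (solrconfig.xml expresses
it in milliseconds).

```python
def cutoff_query(field: str, open_searcher_interval_s: int) -> str:
    """Range query matching only docs older than 2 x the interval.

    Uses Solr date math (NOW-<n>SECONDS) so each replica evaluates the
    cutoff itself when the query is run.
    """
    cutoff_s = 2 * open_searcher_interval_s
    return f"{field}:[* TO NOW-{cutoff_s}SECONDS]"


def queries_all_check_all(replicas: int) -> int:
    """Every replica queries every other replica: n * (n - 1)."""
    return replicas * max(replicas - 1, 0)


def queries_single_checker(replicas: int) -> int:
    """One designated replica (e.g. the leader) queries the rest: n - 1."""
    return max(replicas - 1, 0)
```

With a 15 second interval, cutoff_query("timestamp", 15) yields
timestamp:[* TO NOW-30SECONDS]. For 10 replicas the two counting functions
give 90 and 9 respectively, matching the numbers in 4>.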
What do people think?