Erick Erickson created SOLR-10873:
-------------------------------------

             Summary: Explore a utility for periodically checking the document 
counts for replicas of a shard
                 Key: SOLR-10873
                 URL: https://issues.apache.org/jira/browse/SOLR-10873
             Project: Solr
          Issue Type: Improvement
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Erick Erickson


We've had several situations "in the field" and on the users' list where the 
number of documents on different replicas of the same shard differs. I've also 
seen situations where the numbers are wildly different (two orders of 
magnitude). I can force this situation by, say, taking down nodes, adding 
replicas that become the leader, then starting the nodes back up. But it doesn't 
matter whether the discrepancy is a result of "pilot error" or a problem with 
the code; in either case it would be useful to flag it.

Straw-man proposal:
We create a processor (modeled on DocExpirationUpdateProcessorFactory perhaps?) 
that periodically wakes up and checks that each replica in the given shard has 
the same document count (and perhaps other checks TBD?). It would send some 
kind of notification if a problem is detected.
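
A minimal sketch of the per-shard check described above, assuming each replica 
is asked for its own count with a non-distributed query (distrib=false, rows=0) 
and the numFound values are then compared. The function names and the core URL 
are hypothetical; this is not an existing Solr component, just an illustration 
of the comparison step.

```python
from urllib.parse import urlencode


def replica_count_url(core_url, query="*:*"):
    """Build a count-only query that is answered by one replica core alone."""
    # rows=0: only numFound is needed; distrib=false: do not fan out to the shard.
    params = {"q": query, "rows": 0, "distrib": "false", "wt": "json"}
    return f"{core_url}/select?{urlencode(params)}"


def check_shard(counts):
    """Return None if all replicas agree, else a message describing the mismatch.

    `counts` maps a replica name to the numFound value it reported.
    """
    if len(set(counts.values())) <= 1:
        return None
    details = ", ".join(f"{name}={n}" for name, n in sorted(counts.items()))
    return f"replica document counts differ: {details}"


if __name__ == "__main__":
    # Hypothetical core URL, for illustration only.
    print(replica_count_url("http://host1:8983/solr/coll_shard1_replica1"))
    print(check_shard({"replica1": 1000, "replica2": 1000}))
    print(check_shard({"replica1": 1000, "replica2": 10}))
```

How the resulting message is delivered (log, e-mail, something else) is left 
open here, as it is in the proposal.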

Issues:
1> this will require some way to deal with the differing commit times. 
1a> If we require a timestamp on each document we could check the config file 
to see the autocommit interval and, say, only count documents older than 
NOW-(2 x opensearcher interval). In that case the config would just require 
the field to use to be specified.
1b> we could require that part of the configuration is a query to use to check 
document counts. I kind of like this one.

2> How to let the admins know a discrepancy was found? e-mail? ERROR level log 
message? Other?

3> How does this fit into the autoscaling initiative? This is a "monitor the 
system and do something" item. If we go forward with this we should do it with 
an eye toward fitting it in that framework.
3a> Is there anything we can do to auto-correct this situation? Auto-correction 
could be tricky. Heuristics like "make the replica with the most documents the 
leader and force full index replication on all the replicas that don't agree" 
seem dangerous. 

4> How to keep the impact minimal? The simple approach would be for each 
replica to check all other replicas in the shard. With, say, 10 replicas on a 
single shard, that would be 90 queries (10 x 9). It would suffice for just one 
of them to check the other 9 rather than having all 10 check the other 9. Maybe 
restrict the checker to be the leader? Or otherwise just make it one 
replica/shard that does the checking?

5> It's probably useful to add a Collections API call to fire this off 
manually. Or maybe as part of CHECKSTATUS?
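
To make issues 1a and 4 concrete, here is a rough sketch, not a proposed 
implementation. It assumes the timestamp field name is supplied by 
configuration, that the opensearcher interval is known in seconds, and that the 
count query is a range query using Solr date math. The helper names are made up 
for illustration.

```python
def cutoff_query(timestamp_field, open_searcher_interval_seconds):
    """Range query matching only documents older than 2 x the interval (issue 1a)."""
    lag = 2 * open_searcher_interval_seconds
    return f"{timestamp_field}:[* TO NOW-{lag}SECONDS]"


def queries_per_check(replicas, single_checker):
    """Number of count queries one round of checking costs for a shard (issue 4)."""
    if single_checker:
        # One replica (e.g. the leader) queries each of the others once.
        return replicas - 1
    # Every replica queries every other replica.
    return replicas * (replicas - 1)


if __name__ == "__main__":
    print(cutoff_query("timestamp", 15))
    print(queries_per_check(10, single_checker=False))
    print(queries_per_check(10, single_checker=True))
```

With 10 replicas the all-pairs approach costs 90 queries per round and the 
single-checker approach costs 9, which is the motivation for restricting the 
check to one replica per shard.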

What do people think?



