[jira] [Comment Edited] (SOLR-10873) Explore a utility for periodically checking the document counts for replicas of a shard

Pushkar Raste (JIRA) Mon, 12 Jun 2017 18:03:27 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16047280#comment-16047280
 ]


Pushkar Raste edited comment on SOLR-10873 at 6/13/17 1:02 AM:
---------------------------------------------------------------

There are other advantages too:
* You can compute the index fingerprint up to any arbitrary version. Depending on 
your tolerance, you can check whether the fingerprints match up to the last 
version in the second-from-last tlog. No need to worry about differing commits 
in this case.

* The index fingerprint is cached in the SolrCore class, so even if the 
frequency of sync checks is high you may not have to recompute the fingerprint 
every single time.

`RealTimeGetComponent` already supports a `processGetFingerprint` call; I came 
across it while working on SOLR-9446.
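
The two points above can be illustrated with a toy model. This is a minimal sketch, not Solr's fingerprint implementation: the `Replica` class, the `_mix` hash, and `in_sync` are hypothetical names invented for the example, and the cache is never invalidated because the toy index never changes. It only shows the idea of comparing an order-independent hash of all versions up to a chosen maximum, and of reusing a cached result across frequent checks.

```python
MASK64 = (1 << 64) - 1


def _mix(version: int) -> int:
    """64-bit mixing function (splitmix64 finalizer); a stand-in, not Solr's hash."""
    z = (version + 0x9E3779B97F4A7C15) & MASK64
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)


class Replica:
    """Hypothetical replica holding the _version_ values of its documents."""

    def __init__(self, name, versions):
        self.name = name
        self.versions = tuple(versions)
        self._cache = {}        # max_version -> fingerprint
        self.computations = 0   # how many times a fingerprint was actually computed

    def fingerprint(self, max_version):
        """Return (doc count, summed hash) over versions <= max_version, cached."""
        if max_version not in self._cache:
            self.computations += 1
            included = [v for v in self.versions if v <= max_version]
            # A sum is order-independent, so replicas that indexed the same
            # updates in a different order still agree.
            versions_hash = sum(_mix(v) for v in included) & MASK64
            self._cache[max_version] = (len(included), versions_hash)
        return self._cache[max_version]


def in_sync(replicas, max_version):
    """True if every replica has the same fingerprint up to max_version."""
    return len({r.fingerprint(max_version) for r in replicas}) == 1


leader = Replica("leader", [1, 2, 3, 4, 5])
follower = Replica("follower", [3, 1, 2, 4])  # has not yet seen version 5

print(in_sync([leader, follower], 4))  # compare only up to version 4
print(in_sync([leader, follower], 5))  # compare including the newest update
```

Choosing a `max_version` slightly behind the newest update is what provides the tolerance: replicas that merely lag by a few recent updates still compare as equal.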

 


was (Author: praste):
There are other advantages too:
* You can compute the index fingerprint up to any arbitrary version. Depending on 
your tolerance, you can check whether the fingerprints match the last version in 
second from last version in the tlog. No need to worry about differing commits 
in this case.

* The index fingerprint is cached in the SolrCore class, so even if the 
frequency of sync checks is high you may not have to recompute the fingerprint 
every single time.

`RealTimeGetComponent` already supports a `processGetFingerprint` call; I came 
across it while working on SOLR-9446.

 

> Explore a utility for periodically checking the document counts for replicas 
> of a shard
> ---------------------------------------------------------------------------------------
>
>                 Key: SOLR-10873
>                 URL: https://issues.apache.org/jira/browse/SOLR-10873
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Erick Erickson
>
> We've had several situations "in the field" and on the users' list where the 
> number of documents on different replicas of the same shard differs. I've also 
> seen situations where the numbers are wildly different (two orders of 
> magnitude). I can force this situation by, say, taking down nodes, adding 
> replicas that become the leader, then starting the nodes back up. But it 
> doesn't matter whether the discrepancy is a result of "pilot error" or a 
> problem with the code; in either case it would be useful to flag it.
> Straw-man proposal:
> We create a processor (modeled on DocExpirationUpdateProcessorFactory 
> perhaps?) that periodically wakes up and checks that each replica in the 
> given shard has the same document count (and perhaps other checks TBD?). Send 
> some kind of notification if a problem was detected.
> Issues:
> 1> This will require some way to deal with the differing commit times. 
> 1a> If we require a timestamp on each document we could check the config file 
> to see the autocommit interval and, say, check NOW-(2 x opensearcher 
> interval). In that case the config would just require the field to use to be 
> specified.
> 1b> We could require that part of the configuration is a query to use to 
> check document counts. I kind of like this one.
> 2> How to let the admins know a discrepancy was found? e-mail? ERROR level 
> log message? Other?
> 3> How does this fit into the autoscaling initiative? This is a "monitor the 
> system and do something" item. If we go forward with this we should do it 
> with an eye toward fitting it in that framework.
> 3a> Is there anything we can do to auto-correct this situation? 
> Auto-correction could be tricky. Heuristics like "make the replica with the 
> most documents the leader and force full index replication on all the 
> replicas that don't agree" seem dangerous. 
> 4> How to keep the impact minimal? The simple approach would be for each 
> replica to check all other replicas in the shard. So say there are 10 
> replicas on a single shard, that would be 90 queries. It would suffice for 
> just one of those to check the other 9, rather than having all 10 check the 
> other 9. Maybe restrict the checker to be the leader? Or otherwise just make 
> it one replica/shard that does the checking?
> 5> It's probably useful to add a collections API call to fire this off 
> manually. Or maybe as part of CHECKSTATUS?
> What do people think?
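
A minimal sketch of the arithmetic behind items 1a and 4 and of the basic count comparison. All function names here (`count_cutoff`, `queries_needed`, `find_discrepancies`) are hypothetical and exist only for illustration; nothing below calls Solr, and the document counts are assumed to have been fetched already.

```python
from datetime import datetime, timedelta, timezone


def count_cutoff(now, open_searcher_interval):
    """Item 1a: only count documents older than NOW - (2 x openSearcher interval)."""
    return now - 2 * open_searcher_interval


def queries_needed(num_replicas, single_checker):
    """Item 4: count queries per check cycle for one shard.

    If every replica checks every other replica, that is n * (n - 1) queries;
    if a single replica (for example the leader) does the checking, it is n - 1.
    """
    others = num_replicas - 1
    return others if single_checker else num_replicas * others


def find_discrepancies(counts, checker):
    """Return the replicas whose document count differs from the checker's."""
    expected = counts[checker]
    return {name: n for name, n in counts.items() if n != expected}


now = datetime(2017, 6, 13, 1, 2, tzinfo=timezone.utc)
print(count_cutoff(now, timedelta(seconds=30)))

print(queries_needed(10, single_checker=False))  # every replica checks the rest
print(queries_needed(10, single_checker=True))   # one replica checks the rest

counts = {"replica1": 100, "replica2": 100, "replica3": 97}
print(find_discrepancies(counts, checker="replica1"))
```

Note that `find_discrepancies` treats the checker's own count as the reference, which only flags a difference; it does not say which replica is correct, in line with the caution in item 3a.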



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
