[ https://issues.apache.org/jira/browse/SOLR-7573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573708#comment-14573708 ]

Erick Erickson commented on SOLR-7573:
--------------------------------------

Additional data. I have a test harness with which I can make "things go wrong" 
fairly regularly, though not on demand. It works like this:

For (some configurable number of cycles):
    spawn a bunch of threads that create really simple documents and send
        them to the collection
    wait for all the threads to terminate
    commit(true, true)
    for each shard:
        check that q=*:* returns the same numFound on every replica
            if there is a discrepancy, report and exit
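
Something like the following SolrJ sketch captures the shape of that loop. To be 
clear, this is a rough approximation and not the actual harness: the ZK host, 
collection name, replica core URLs, thread/doc counts, and the fact that only one 
shard's replicas are checked are all placeholders.

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ConsistencyHarness {
  // All of these are placeholders, not the real harness configuration.
  static final String ZK_HOST = "localhost:2181";
  static final String COLLECTION = "eoe";
  static final String[] SHARD1_REPLICA_URLS = {
      "http://host1:8983/solr/eoe_shard1_replica1",
      "http://host2:8983/solr/eoe_shard1_replica2"
  };
  static final int CYCLES = 10, THREADS = 8, DOCS_PER_THREAD = 1000;

  public static void main(String[] args) throws Exception {
    CloudSolrClient cloud = new CloudSolrClient(ZK_HOST);
    cloud.setDefaultCollection(COLLECTION);

    for (int cycle = 0; cycle < CYCLES; cycle++) {
      // Spawn a bunch of threads that fire really simple docs at the collection.
      ExecutorService exec = Executors.newFixedThreadPool(THREADS);
      for (int t = 0; t < THREADS; t++) {
        exec.submit(() -> {
          try {
            List<SolrInputDocument> docs = new ArrayList<>();
            for (int i = 0; i < DOCS_PER_THREAD; i++) {
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", UUID.randomUUID().toString());
              docs.add(doc);
            }
            cloud.add(docs);
          } catch (Exception e) {
            e.printStackTrace();
          }
        });
      }
      // Wait for all the threads to terminate, then hard commit.
      exec.shutdown();
      exec.awaitTermination(10, TimeUnit.MINUTES);
      cloud.commit(true, true);

      // Query each replica of the shard directly (distrib=false) and compare numFound.
      long expected = -1;
      for (String url : SHARD1_REPLICA_URLS) {
        HttpSolrClient replica = new HttpSolrClient(url);
        SolrQuery q = new SolrQuery("*:*");
        q.set("distrib", false);
        q.setRows(0);
        long found = replica.query(q).getResults().getNumFound();
        replica.close();
        if (expected < 0) {
          expected = found;
        } else if (found != expected) {
          System.err.println("cycle " + cycle + ": " + url + " has " + found
              + " docs, expected " + expected);
          System.exit(1);
        }
      }
    }
    cloud.close();
  }
}

Hitting each core URL directly with distrib=false keeps the count check from being 
routed anywhere else, which is relevant to the next point.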


The interesting thing here is that I saw this error, but by the time I could 
investigate via the admin UI, the counts were identical. However, the replica 
that had the smaller count was _also_ forced into leader-initiated recovery (LIR), 
which is a symptom I saw onsite. So the working hypothesis is that the node was 
in LIR for some period but managed to respond to a query; after LIR was over it 
had re-synced and was OK. I'm not at all clear how the replica managed to respond; 
I'll add more logging to see what I can see. I am using HttpSolrClient to do the 
verification with distrib=false, so I'm not sure whether the active state in ZK 
matters at all. When I was onsite the replica didn't recover, but we didn't wait 
very long before restarting it, at which point it did a full sync from the leader, 
so that's consistent with what I just saw.

This seems like correct (eventual-consistency) behavior; the problem is that 
the replica goes into LIR in the first place, and that it manages to respond 
to a direct ping via HttpSolrClient.
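
To be explicit about what "direct ping" means here: it's a core-level request 
through HttpSolrClient, so it goes straight to the core URL rather than being 
routed by CloudSolrClient through the cluster state. A minimal fragment, reusing 
the SolrJ classes from the sketch above (plus 
org.apache.solr.client.solrj.response.SolrPingResponse); the core URL is again a 
placeholder:

HttpSolrClient core = new HttpSolrClient("http://host1:8983/solr/eoe_shard1_replica1");
SolrPingResponse rsp = core.ping();   // hits /admin/ping on that one core
System.out.println("ping status " + rsp.getStatus() + ", QTime " + rsp.getQTime());
core.close();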

This gives me some hope that if we do SOLR-7571 and have the client(s) keep 
from overwhelming Solr, we have a mechanism to at least keep the situation from 
arising in the first place. And if I incorporate that into this test harness 
and the problem goes away, it'll give me confidence that we're getting to root 
causes.

> Inconsistent numbers of docs between leader and replica
> -------------------------------------------------------
>
>                 Key: SOLR-7573
>                 URL: https://issues.apache.org/jira/browse/SOLR-7573
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 4.10.3
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>
> Once again assigning to myself to keep track. And once again not reproducible 
> at will and possibly related to firehosing updates to Solr.
> Saw a situation where things seemed to be indexed normally, but the number of 
> docs on a leader and follower were not the same. The leader had, as I 
> remember, a 4.5G index and the follower a 1.9G index. No errors in the logs, 
> no recovery initiated, etc. All nodes green.
> The very curious thing was that when the follower was bounced, it did a full 
> index replication from the leader. How that could be happening without the 
> follower ever going into a recovery state I have no idea.
> Again, if I can get this to reproduce locally I can put more diagnostics into 
> the process and see what I can see. I also have some logs to further explore.


