[
https://issues.apache.org/jira/browse/SOLR-7573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573708#comment-14573708
]
Erick Erickson commented on SOLR-7573:
--------------------------------------
Additional data. I have a test harness in which I can cause "things to go
wrong" fairly regularly, but not on demand. It works like this:
For (some configurable number of cycles)
    Spawn a bunch of threads that create really simple documents and send
    them to the collection
    Wait for all the threads to terminate
    commit(true, true)
    For each shard
        Check that q=*:* returns the same number of docs found on each replica
        If there is a discrepancy, report and exit
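The loop above can be sketched roughly as follows. This is only an illustration of the cycle, not the actual harness: the names run_harness and find_discrepancies are made up here, and send_doc, commit and count_docs are hypothetical callables standing in for whatever client code talks to Solr.

```python
from concurrent.futures import ThreadPoolExecutor


def find_discrepancies(counts_by_shard):
    """Return {shard: {replica: count}} for every shard whose replicas disagree."""
    return {
        shard: counts
        for shard, counts in counts_by_shard.items()
        if len(set(counts.values())) > 1
    }


def run_harness(cycles, threads, docs_per_thread,
                send_doc, commit, count_docs, replicas_by_shard):
    """Run index/commit/verify cycles.

    send_doc, commit and count_docs are caller-supplied (hypothetical) hooks.
    Returns (cycle, mismatches) for the first discrepancy, or None if all
    cycles finish with consistent counts.
    """
    for cycle in range(cycles):
        def index_batch(thread_id, cycle=cycle):
            # Really simple documents: just a unique id.
            for i in range(docs_per_thread):
                send_doc({"id": f"{cycle}-{thread_id}-{i}"})

        # Leaving the with-block waits for all indexing threads to terminate.
        with ThreadPoolExecutor(max_workers=threads) as pool:
            futures = [pool.submit(index_batch, t) for t in range(threads)]
            for future in futures:
                future.result()  # re-raise any indexing error

        commit()

        # Ask every replica of every shard for its own doc count.
        counts = {
            shard: {replica: count_docs(replica) for replica in replicas}
            for shard, replicas in replicas_by_shard.items()
        }
        mismatches = find_discrepancies(counts)
        if mismatches:
            return cycle, mismatches
    return None
```

Passing the Solr calls in as plain functions keeps the loop testable without a running cluster; a real run would supply functions that index, commit and count against the collection.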
The interesting thing here is that I saw this error, but by the time I could
investigate via the admin UI, the counts were identical. However, the replica
that had the smaller count was _also_ forced into leader-initiated recovery
(LIR), which is a symptom I saw onsite. So the working hypothesis is that the
node was in LIR for some period but still managed to respond to a query. After
LIR was over it had re-synced and was OK. I'm not at all clear how the replica
managed to respond; I'll add more logging to see what I can see. I am using
HttpSolrClient to do the verification with distrib=false, so I'm not sure
whether the active state in ZK matters at all. When I was onsite, the replica
didn't recover, but we didn't wait very long before restarting it, at which
point it did a full sync from the leader, so that's consistent with what I
just saw.
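For reference, a per-replica count with distrib=false can be sketched as below. The harness itself uses HttpSolrClient; this is a plain-HTTP equivalent using only the Python standard library, and the function names and the example core URL are assumptions for illustration. It relies on the standard JSON response layout, where the count is at response.numFound.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def count_url(core_url):
    """Build a match-all count query that is NOT fanned out to other shards."""
    params = {"q": "*:*", "rows": 0, "distrib": "false", "wt": "json"}
    return core_url.rstrip("/") + "/select?" + urlencode(params)


def parse_num_found(body):
    """Extract numFound from a Solr JSON select response."""
    return json.loads(body)["response"]["numFound"]


def count_docs(core_url, timeout=10):
    """Query a single core directly and return how many docs it holds."""
    with urlopen(count_url(core_url), timeout=timeout) as resp:
        return parse_num_found(resp.read().decode("utf-8"))
```

A call such as count_docs("http://localhost:8983/solr/collection1_shard1_replica1") (a hypothetical core URL) goes straight to that core, so it is answered by whatever that replica currently has in its index.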
This seems like correct (eventual consistency) behavior; the problem is that
the replica goes into LIR in the first place, and that it manages to respond
to a direct ping via HttpSolrClient.
This gives me some hope that if we do SOLR-7571 and have the client(s) keep
from overwhelming Solr, we will have a mechanism to at least avoid the
situation arising in the first place. And if I incorporate that into this test
harness and the problem goes away, it'll give me confidence that we're getting
to root causes.
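One simple way a harness could keep its own threads from firehosing Solr is to cap the number of update requests in flight. The sketch below is only an assumption about what client-side throttling might look like; it is not what SOLR-7571 specifies, and InFlightLimiter and send_doc are hypothetical names.

```python
import threading


class InFlightLimiter:
    """Cap how many update requests may be outstanding at once."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def send(self, send_doc, doc):
        # Blocks while max_in_flight requests are already outstanding,
        # then releases the slot when send_doc returns or raises.
        with self._slots:
            return send_doc(doc)
```

Each indexing thread would call limiter.send(send_doc, doc) instead of calling send_doc directly, so the thread count and the request concurrency can be tuned separately.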
> Inconsistent numbers of docs between leader and replica
> -------------------------------------------------------
>
> Key: SOLR-7573
> URL: https://issues.apache.org/jira/browse/SOLR-7573
> Project: Solr
> Issue Type: Bug
> Affects Versions: 4.10.3
> Reporter: Erick Erickson
> Assignee: Erick Erickson
>
> Once again assigning to myself to keep track. And once again not reproducible
> at will and possibly related to firehosing updates to Solr.
> Saw a situation where things seemed to be indexed normally, but the number of
> docs on a leader and follower were not the same. The leader had, as I
> remember, a 4.5G index and the follower a 1.9G index. No errors in the logs,
> no recovery initiated, etc. All nodes green.
> The very curious thing was that when the follower was bounced, it did a full
> index replication from the leader. How that could be happening without the
> follower ever going into a recovery state I have no idea.
> Again, if I can get this to reproduce locally I can put more diagnostics into
> the process and see what I can see. I also have some logs to further explore.