[
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869916#comment-13869916
]
Timothy Potter commented on SOLR-4260:
--------------------------------------
Makes sense about not waiting because of the penalty, now that I've had a chance
to get into the details of that code.
I spent a lot of time on Friday and over the weekend trying to track down the
docs getting dropped. Unfortunately, I have not been able to track down the
source of the issue yet. I'm fairly certain the issue happens before docs get
submitted to CUSS, meaning that the lost docs never seemed to hit the queue in
ConcurrentUpdateSolrServer. My original thinking was that, given the complex
nature of CUSS, there might be some sort of race condition, but after adding a
log of what hits the queue, it seems that the documents that get lost never hit
the queue. Not to mention that the actual use of CUSS is mostly single-threaded,
because StreamingSolrServers constructs them with a threadCount of 1.
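For context, here's roughly the shape of the CUSS usage in question: a minimal SolrJ 4.x sketch, not the actual StreamingSolrServers code. The replica URL, core name, and queue size below are made up, but the threadCount of 1 mirrors what StreamingSolrServers passes.
{code:java}
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CussSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical replica URL and queue size; threadCount of 1 means the
        // send path is effectively single-threaded per replica.
        ConcurrentUpdateSolrServer cuss = new ConcurrentUpdateSolrServer(
                "http://hostB:8983/solr/foo_shard1_replica2", 10, 1);

        for (int i = 1; i <= 3; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            cuss.add(doc); // enqueued; a runner thread streams it to the replica
        }

        cuss.blockUntilFinished(); // drain the queue before shutting down
        cuss.shutdown();
    }
}
{code}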
As a side note, one thing I noticed along the way is that direct updates don't
necessarily hit the correct core initially when a Solr node hosts more than one
shard of a collection. In other words, if host X has shard1 and shard3 of
collection foo, then some update requests hit shard1 on host X when they
should go to shard3 on the same host; shard1 correctly forwards them on, but
it's still an extra hop. Of course that is probably not a big deal in
production, as it would be rare to host multiple shards of the same collection
on the same Solr host unless you are over-sharding.
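For checking which core a given id should land on, the hash-range comparison is easy to do by hand. A rough sketch, assuming the default hash-based router and a plain (non-composite) id; the two-shard ranges in the comments are made-up examples of what clusterstate.json publishes:
{code:java}
import org.apache.solr.common.util.Hash;

public class ShardRangeCheck {
    public static void main(String[] args) {
        String id = "some-doc-id";

        // Murmur hash of a plain id, as the default hash-based router computes
        // it; composite ids ("a!b") are handled differently.
        int hash = Hash.murmurhash3_x86_32(id, 0, id.length(), 0);

        // Made-up two-shard ranges in the style clusterstate.json publishes:
        //   shard1: 80000000-ffffffff, shard2: 0-7fffffff
        String shard = (hash < 0) ? "shard1" : "shard2";
        System.out.printf("id=%s hash=%08x -> %s%n", id, hash, shard);
    }
}
{code}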
In terms of this issue, here's what I'm seeing (a rough SolrJ repro follows the list):
- Assume a SolrCloud environment with shard1 having replicas on hosts A and B; A is the current leader.
- The client sends a direct update request to shard1 on host A containing 3 docs (1, 2, 3), for example.
- The batch from the client gets broken up into individual docs during request parsing.
- Docs 1, 2, and 3 get indexed on host A (the leader).
- Docs 1 and 2 get queued into CUSS and sent on to the replica on host B (sometimes in the same request, sometimes in separate requests).
- Doc 3 never makes it and, from what I can tell, never hits the queue.
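A rough SolrJ sketch of that repro (4.x API; host names and core names are placeholders for the shard1 leader on host A):
{code:java}
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DirectUpdateRepro {
    public static void main(String[] args) throws Exception {
        // Direct update to the shard1 leader core on host A (placeholder URL/core name)
        HttpSolrServer leader =
                new HttpSolrServer("http://hostA:8983/solr/foo_shard1_replica1");

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 1; i <= 3; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "batch-doc-" + i);
            batch.add(doc);
        }

        leader.add(batch); // parsed into individual docs server-side
        leader.commit();   // commit so numDocs can be compared on both replicas
        leader.shutdown();
    }
}
{code}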
This may be anecdotal, but from what I can tell it's always docs at the end of
a batch and not in the middle. Meaning that I haven't seen a case where 1 and 3
make it and 2 does not ... maybe useful, maybe not. The only other thing I'll
mention is that it does seem timing / race condition related, as it's almost
impossible to reproduce this on my Mac when running 2 shards across 2 nodes, but
much easier to trigger if I ramp up to, say, 8 shards on 2 nodes, i.e. the busier
my CPU is, the easier it is to see docs getting dropped.
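For what it's worth, the way I've been spotting the deviation is to hit each replica core directly with distrib=false and compare numFound. A quick sketch with placeholder core URLs:
{code:java}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class NumDocsCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder core URLs for the two replicas of shard1
        String[] cores = {
            "http://hostA:8983/solr/foo_shard1_replica1",
            "http://hostB:8983/solr/foo_shard1_replica2"
        };

        for (String url : cores) {
            HttpSolrServer core = new HttpSolrServer(url);
            SolrQuery q = new SolrQuery("*:*");
            q.set("distrib", "false"); // ask only this core, no distributed search
            q.setRows(0);
            long numFound = core.query(q).getResults().getNumFound();
            System.out.println(url + " -> " + numFound);
            core.shutdown();
        }
    }
}
{code}
If the counts still differ after a commit and the queues have drained, the replica is missing docs.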
> Inconsistent numDocs between leader and replica
> -----------------------------------------------
>
> Key: SOLR-4260
> URL: https://issues.apache.org/jira/browse/SOLR-4260
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Environment: 5.0.0.2013.01.04.15.31.51
> Reporter: Markus Jelsma
> Assignee: Mark Miller
> Priority: Critical
> Fix For: 5.0, 4.7
>
> Attachments: 192.168.20.102-replica1.png,
> 192.168.20.104-replica2.png, clusterstate.png,
> demo_shard1_replicas_out_of_sync.tgz
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using
> CloudSolrServer we see inconsistencies between the leader and replica for
> some shards.
> Each core holds about 3.3k documents. For some reason 5 out of 10 shards have
> a small deviation in the number of documents. The leader and replica deviate
> by roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my
> attention: there were small IDF differences for exactly the same record,
> causing it to shift positions in the result set. During those tests no
> records were indexed. Consecutive catch-all queries also return different
> numDocs.
> We're running a 10-node test cluster with 10 shards and a replication factor
> of two, and we frequently reindex using a fresh build from trunk. I hadn't
> seen this issue for quite some time until a few days ago.