[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timothy Potter updated SOLR-4260:
---------------------------------

    Attachment: demo_shard1_replicas_out_of_sync.tgz

While doing some other testing of SolrCloud (branch4x - 4.7-SNAPSHOT, rev. 1556055), I hit this issue, and here's the kicker: there were no errors in my replica's log, the tlogs are identical, and there was no significant GC activity during the time when the replica got out of sync with the leader. I'm attaching the data directories (index + tlog) for both replicas (demo_shard1_replica1 [leader] and demo_shard1_replica2) and their log files. When I do a doc-by-doc comparison of the two indexes, here's the result:

>> finished querying replica1, found 33537 documents (33537)
>> finished querying replica2, found 33528 documents
Doc [82995] not found in replica2: <doc boost="1.0"><field name="id">82995</field><field name="string_s">test</field><field name="int_i">-274468088</field><field name="float_f">0.90338105</field><field name="double_d">0.6949391474539932</field><field name="text_en">this is a test</field><field name="_version_">1456683668206518274</field></doc>
Doc [82997] not found in replica2: <doc boost="1.0"><field name="id">82997</field><field name="string_s">test</field><field name="int_i">301737117</field><field name="float_f">0.6746266</field><field name="double_d">0.26034065188918565</field><field name="text_en">this is a test</field><field name="_version_">1456683668206518276</field></doc>
Doc [82996] not found in replica2: <doc boost="1.0"><field name="id">82996</field><field name="string_s">test</field><field name="int_i">-1768315588</field><field name="float_f">0.6641093</field><field name="double_d">0.23708033183534993</field><field name="text_en">this is a test</field><field name="_version_">1456683668206518275</field></doc>
Doc [82991] not found in replica2: <doc boost="1.0"><field name="id">82991</field><field name="string_s">test</field><field name="int_i">-2057280061</field><field name="float_f">0.27617514</field><field name="double_d">0.7885214691953506</field><field name="text_en">this is a test</field><field name="_version_">1456683668206518273</field></doc>
Doc [82987] not found in replica2: <doc boost="1.0"><field name="id">82987</field><field name="string_s">test</field><field name="int_i">1051456320</field><field name="float_f">0.51863414</field><field name="double_d">0.7881255443862878</field><field name="text_en">this is a test</field><field name="_version_">1456683668206518272</field></doc>
Doc [82986] not found in replica2: <doc boost="1.0"><field name="id">82986</field><field name="string_s">test</field><field name="int_i">-1356807889</field><field name="float_f">0.2762279</field><field name="double_d">0.003657816979820372</field><field name="text_en">this is a test</field><field name="_version_">1456683668205469699</field></doc>
Doc [82984] not found in replica2: <doc boost="1.0"><field name="id">82984</field><field name="string_s">test</field><field name="int_i">732678870</field><field name="float_f">0.31199205</field><field name="double_d">0.9848865821766198</field><field name="text_en">this is a test</field><field name="_version_">1456683668205469698</field></doc>
Doc [82970] not found in replica2: <doc boost="1.0"><field name="id">82970</field><field name="string_s">test</field><field name="int_i">283693979</field><field name="float_f">0.6119651</field><field name="double_d">0.04142006867388914</field><field name="text_en">this is a test</field><field name="_version_">1456683668205469696</field></doc>
Doc [82973] not found in replica2: <doc boost="1.0"><field name="id">82973</field><field name="string_s">test</field><field name="int_i">1343103920</field><field name="float_f">0.5855809</field><field name="double_d">0.6575904716584224</field><field name="text_en">this is a test</field><field name="_version_">1456683668205469697</field></doc>
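
(The comparison tool itself isn't attached; for reference, a minimal SolrJ 4.x sketch of such a doc-by-doc diff might look like the following. The core URLs are passed as arguments, and the simple start/rows paging is an assumption, not necessarily how the output above was produced.)

import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

// Hypothetical sketch: pull every doc from each core directly
// (distrib=false) and report the ids the replica is missing.
public class CompareReplicas {

  static Map<String, SolrDocument> fetchAll(String coreUrl) throws Exception {
    HttpSolrServer server = new HttpSolrServer(coreUrl);
    Map<String, SolrDocument> byId = new HashMap<String, SolrDocument>();
    int start = 0;
    final int rows = 1000;
    while (true) {
      SolrQuery q = new SolrQuery("*:*");
      q.set("distrib", "false");               // query this core only, not the collection
      q.setSort("id", SolrQuery.ORDER.asc);    // stable order across pages
      q.setStart(start);
      q.setRows(rows);
      QueryResponse rsp = server.query(q);
      for (SolrDocument doc : rsp.getResults()) {
        byId.put((String) doc.getFieldValue("id"), doc);
      }
      start += rows;
      if (start >= rsp.getResults().getNumFound()) break;
    }
    return byId;
  }

  public static void main(String[] args) throws Exception {
    Map<String, SolrDocument> r1 = fetchAll(args[0]); // leader core URL
    Map<String, SolrDocument> r2 = fetchAll(args[1]); // replica core URL
    System.out.println(">> finished querying replica1, found " + r1.size() + " documents");
    System.out.println(">> finished querying replica2, found " + r2.size() + " documents");
    for (Map.Entry<String, SolrDocument> e : r1.entrySet()) {
      if (!r2.containsKey(e.getKey())) {
        System.out.println("Doc [" + e.getKey() + "] not found in replica2: " + e.getValue());
      }
    }
  }
}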

No amount of committing or reloading these cores helps. Restarting replica2 doesn't bring it back in sync either, most likely because its tlog is identical to the leader's? Here are the log messages on replica2 after restarting it:

2014-01-08 13:28:20,112 [searcherExecutor-5-thread-1] INFO  solr.core.SolrCore  - [demo_shard1_replica2] Registered new searcher Searcher@4345de8a main{StandardDirectoryReader(segments_e:38:nrt _d(4.7):C26791 _e(4.7):C3356 _f(4.7):C3381)}
2014-01-08 13:28:21,298 [RecoveryThread] INFO  solr.cloud.RecoveryStrategy  - Attempting to PeerSync from http://ec2-54-209-223-12.compute-1.amazonaws.com:8984/solr/demo_shard1_replica1/ core=demo_shard1_replica2 - recoveringAfterStartup=true
2014-01-08 13:28:21,302 [RecoveryThread] INFO  solr.update.PeerSync  - PeerSync: core=demo_shard1_replica2 url=http://ec2-54-209-97-145.compute-1.amazonaws.com:8984/solr START replicas=[http://ec2-54-209-223-12.compute-1.amazonaws.com:8984/solr/demo_shard1_replica1/] nUpdates=100
2014-01-08 13:28:21,330 [RecoveryThread] INFO  solr.update.PeerSync  - PeerSync: core=demo_shard1_replica2 url=http://ec2-54-209-97-145.compute-1.amazonaws.com:8984/solr  Received 99 versions from ec2-54-209-223-12.compute-1.amazonaws.com:8984/solr/demo_shard1_replica1/
2014-01-08 13:28:21,331 [RecoveryThread] INFO  solr.update.PeerSync  - PeerSync: core=demo_shard1_replica2 url=http://ec2-54-209-97-145.compute-1.amazonaws.com:8984/solr  Our versions are newer. ourLowThreshold=1456683689417113603 otherHigh=1456683689602711553
2014-01-08 13:28:21,331 [RecoveryThread] INFO  solr.update.PeerSync  - PeerSync: core=demo_shard1_replica2 url=http://ec2-54-209-97-145.compute-1.amazonaws.com:8984/solr DONE. sync succeeded
2014-01-08 13:28:21,332 [RecoveryThread] INFO  solr.update.UpdateHandler  - start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
2014-01-08 13:28:21,332 [RecoveryThread] INFO  solr.update.LoggingInfoStream  - [DW][RecoveryThread]: anyChanges? numDocsInRam=0 deletes=false hasTickets:false pendingChangesInFullFlush: false
2014-01-08 13:28:21,333 [RecoveryThread] INFO  solr.update.UpdateHandler  - No uncommitted changes. Skipping IW.commit.
2014-01-08 13:28:21,334 [RecoveryThread] INFO  solr.update.LoggingInfoStream  - [DW][RecoveryThread]: anyChanges? numDocsInRam=0 deletes=false hasTickets:false pendingChangesInFullFlush: false
2014-01-08 13:28:21,334 [RecoveryThread] INFO  solr.update.LoggingInfoStream  - [IW][RecoveryThread]: nrtIsCurrent: infoVersion matches: true; DW changes: false; BD changes: false
2014-01-08 13:28:21,335 [RecoveryThread] INFO  solr.update.LoggingInfoStream  - [DW][RecoveryThread]: anyChanges? numDocsInRam=0 deletes=false hasTickets:false pendingChangesInFullFlush: false
2014-01-08 13:28:21,335 [RecoveryThread] INFO  solr.search.SolrIndexSearcher  - Opening Searcher@5fc2a9d main
2014-01-08 13:28:21,338 [searcherExecutor-5-thread-1] INFO  solr.core.SolrCore  - QuerySenderListener sending requests to Searcher@5fc2a9d main{StandardDirectoryReader(segments_e:38:nrt _d(4.7):C26791 _e(4.7):C3356 _f(4.7):C3381)}
2014-01-08 13:28:21,338 [searcherExecutor-5-thread-1] INFO  solr.core.SolrCore  - QuerySenderListener done.
2014-01-08 13:28:21,338 [searcherExecutor-5-thread-1] INFO  solr.core.SolrCore  - [demo_shard1_replica2] Registered new searcher Searcher@5fc2a9d main{StandardDirectoryReader(segments_e:38:nrt _d(4.7):C26791 _e(4.7):C3356 _f(4.7):C3381)}
2014-01-08 13:28:21,339 [RecoveryThread] INFO  solr.update.UpdateHandler  - end_commit_flush
2014-01-08 13:28:21,339 [RecoveryThread] INFO  solr.cloud.RecoveryStrategy  - PeerSync Recovery was successful - registering as Active. core=demo_shard1_replica2
2014-01-08 13:28:21,339 [RecoveryThread] INFO  solr.cloud.ZkController  - publishing core=demo_shard1_replica2 state=active
2014-01-08 13:28:21,370 [main-EventThread] INFO  common.cloud.ZkStateReader  - A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 6)
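
Worth noting: the missing documents' _version_ values (1456683668205469696 through 1456683668206518276) are all well below the ourLowThreshold=1456683689417113603 logged above, i.e., they fall outside the nUpdates=100 window that PeerSync exchanges, so this check has no way to notice they're gone. A rough sketch of that limitation (illustrative only; the real logic, with percentile-based thresholds, lives in org.apache.solr.update.PeerSync):

import java.util.List;

// Illustrative sketch -- NOT the actual Solr implementation. Each side
// exchanges at most nUpdates (100 here) of its newest update versions,
// and the sync decision is made purely on those windows.
public class PeerSyncWindowSketch {

  /** Each list holds a core's most recent versions, newest-first, capped at nUpdates. */
  static boolean looksInSync(List<Long> ourRecent, List<Long> otherRecent) {
    // If every recent update the peer reports is already present in our own
    // recent window, recovery reports "sync succeeded" without fetching any
    // documents. An update whose version predates both windows -- like the
    // 1456683668... versions missing above -- never enters this comparison,
    // so a silently lost document cannot be detected here.
    return ourRecent.containsAll(otherRecent);
  }
}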


Thus, it would seem some code path is outright losing documents (it almost feels like a last-batch-not-flushed error, but more subtle, since it's not easy to reproduce consistently).



> Inconsistent numDocs between leader and replica
> -----------------------------------------------
>
>                 Key: SOLR-4260
>                 URL: https://issues.apache.org/jira/browse/SOLR-4260
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>         Environment: 5.0.0.2013.01.04.15.31.51
>            Reporter: Markus Jelsma
>            Assignee: Mark Miller
>            Priority: Critical
>             Fix For: 5.0, 4.7
>
>         Attachments: 192.168.20.102-replica1.png, 
> 192.168.20.104-replica2.png, clusterstate.png, 
> demo_shard1_replicas_out_of_sync.tgz
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using CloudSolrServer, we see inconsistencies between the leader and replica for some shards.
> Each core holds about 3.3k documents. For some reason, 5 out of 10 shards have a small deviation in the number of documents: the leader and replica deviate by roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my attention; there were small IDF differences for exactly the same record, causing it to shift positions in the result set. No records were indexed during those tests. Consecutive catch-all queries also return different numDocs.
> We're running a 10-node test cluster with 10 shards and a replication factor of two, and we frequently reindex using a fresh build from trunk. I hadn't seen this issue for quite some time until a few days ago.



