[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Timothy Potter updated SOLR-4260:
---------------------------------
    Attachment: demo_shard1_replicas_out_of_sync.tgz

While doing some other testing of SolrCloud (branch4x - 4.7-SNAPSHOT rev. 1556055), I hit this issue, and here's the kicker: there were no errors in my replica's log, the tlogs are identical, and there was no significant GC activity during the time when the replica got out of sync with the leader. I'm attaching the data directories (index + tlog) for both replicas (demo_shard1_replica1 [leader] and demo_shard1_replica2) and their log files.

When I do a doc-by-doc comparison of the two indexes, here's the result:

>> finished querying replica1, found 33537 documents (33537)
>> finished querying replica2, found 33528 documents

Doc [82995] not found in replica2: <doc boost="1.0"><field name="id">82995</field><field name="string_s">test</field><field name="int_i">-274468088</field><field name="float_f">0.90338105</field><field name="double_d">0.6949391474539932</field><field name="text_en">this is a test</field><field name="_version_">1456683668206518274</field></doc>
Doc [82997] not found in replica2: <doc boost="1.0"><field name="id">82997</field><field name="string_s">test</field><field name="int_i">301737117</field><field name="float_f">0.6746266</field><field name="double_d">0.26034065188918565</field><field name="text_en">this is a test</field><field name="_version_">1456683668206518276</field></doc>
Doc [82996] not found in replica2: <doc boost="1.0"><field name="id">82996</field><field name="string_s">test</field><field name="int_i">-1768315588</field><field name="float_f">0.6641093</field><field name="double_d">0.23708033183534993</field><field name="text_en">this is a test</field><field name="_version_">1456683668206518275</field></doc>
Doc [82991] not found in replica2: <doc boost="1.0"><field name="id">82991</field><field name="string_s">test</field><field name="int_i">-2057280061</field><field name="float_f">0.27617514</field><field name="double_d">0.7885214691953506</field><field name="text_en">this is a test</field><field name="_version_">1456683668206518273</field></doc>
Doc [82987] not found in replica2: <doc boost="1.0"><field name="id">82987</field><field name="string_s">test</field><field name="int_i">1051456320</field><field name="float_f">0.51863414</field><field name="double_d">0.7881255443862878</field><field name="text_en">this is a test</field><field name="_version_">1456683668206518272</field></doc>
Doc [82986] not found in replica2: <doc boost="1.0"><field name="id">82986</field><field name="string_s">test</field><field name="int_i">-1356807889</field><field name="float_f">0.2762279</field><field name="double_d">0.003657816979820372</field><field name="text_en">this is a test</field><field name="_version_">1456683668205469699</field></doc>
Doc [82984] not found in replica2: <doc boost="1.0"><field name="id">82984</field><field name="string_s">test</field><field name="int_i">732678870</field><field name="float_f">0.31199205</field><field name="double_d">0.9848865821766198</field><field name="text_en">this is a test</field><field name="_version_">1456683668205469698</field></doc>
Doc [82970] not found in replica2: <doc boost="1.0"><field name="id">82970</field><field name="string_s">test</field><field name="int_i">283693979</field><field name="float_f">0.6119651</field><field name="double_d">0.04142006867388914</field><field name="text_en">this is a test</field><field name="_version_">1456683668205469696</field></doc>
Doc [82973] not found in replica2: <doc boost="1.0"><field name="id">82973</field><field name="string_s">test</field><field name="int_i">1343103920</field><field name="float_f">0.5855809</field><field name="double_d">0.6575904716584224</field><field name="text_en">this is a test</field><field name="_version_">1456683668205469697</field></doc>

No amount of committing or reloading of these cores helps.
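The doc-by-doc comparison above boils down to fetching every id (with its _version_) from each replica and diffing the two sets. This is not the actual script used in the report; it is a minimal sketch that assumes each replica's contents have already been fetched (e.g. by paging through /select with distrib=false) into an id -> _version_ map:

```python
# Sketch of a doc-by-doc replica comparison. Assumes each side's docs were
# already pulled into a dict mapping id -> _version_; in practice you would
# page through each core directly with distrib=false to build these maps.

def diff_replicas(leader_docs, replica_docs):
    """Return (missing, stale): ids absent from the replica, and ids whose
    _version_ differs between leader and replica."""
    missing = sorted(set(leader_docs) - set(replica_docs))
    stale = sorted(doc_id for doc_id in leader_docs
                   if doc_id in replica_docs
                   and leader_docs[doc_id] != replica_docs[doc_id])
    return missing, stale

# Toy data mirroring the shape of the report: replica2 lacks a few ids.
leader = {"82995": 1456683668206518274, "82996": 1456683668206518275,
          "82997": 1456683668206518276, "82998": 1456683668206518277}
replica = {"82998": 1456683668206518277}

missing, stale = diff_replicas(leader, replica)
for doc_id in missing:
    print("Doc [%s] not found in replica2" % doc_id)
```

Comparing by both id and _version_ matters here: the report shows the missing docs carry consecutive _version_ values, which is what suggests a lost batch rather than random drops.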
Also, restarting replica2 doesn't lead to it being in sync either, most likely because the tlog is identical to the leader's? Here are the log messages on replica2 after restarting it:

2014-01-08 13:28:20,112 [searcherExecutor-5-thread-1] INFO solr.core.SolrCore - [demo_shard1_replica2] Registered new searcher Searcher@4345de8a main{StandardDirectoryReader(segments_e:38:nrt _d(4.7):C26791 _e(4.7):C3356 _f(4.7):C3381)}
2014-01-08 13:28:21,298 [RecoveryThread] INFO solr.cloud.RecoveryStrategy - Attempting to PeerSync from http://ec2-54-209-223-12.compute-1.amazonaws.com:8984/solr/demo_shard1_replica1/ core=demo_shard1_replica2 - recoveringAfterStartup=true
2014-01-08 13:28:21,302 [RecoveryThread] INFO solr.update.PeerSync - PeerSync: core=demo_shard1_replica2 url=http://ec2-54-209-97-145.compute-1.amazonaws.com:8984/solr START replicas=[http://ec2-54-209-223-12.compute-1.amazonaws.com:8984/solr/demo_shard1_replica1/] nUpdates=100
2014-01-08 13:28:21,330 [RecoveryThread] INFO solr.update.PeerSync - PeerSync: core=demo_shard1_replica2 url=http://ec2-54-209-97-145.compute-1.amazonaws.com:8984/solr Received 99 versions from ec2-54-209-223-12.compute-1.amazonaws.com:8984/solr/demo_shard1_replica1/
2014-01-08 13:28:21,331 [RecoveryThread] INFO solr.update.PeerSync - PeerSync: core=demo_shard1_replica2 url=http://ec2-54-209-97-145.compute-1.amazonaws.com:8984/solr Our versions are newer. ourLowThreshold=1456683689417113603 otherHigh=1456683689602711553
2014-01-08 13:28:21,331 [RecoveryThread] INFO solr.update.PeerSync - PeerSync: core=demo_shard1_replica2 url=http://ec2-54-209-97-145.compute-1.amazonaws.com:8984/solr DONE. sync succeeded
2014-01-08 13:28:21,332 [RecoveryThread] INFO solr.update.UpdateHandler - start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
2014-01-08 13:28:21,332 [RecoveryThread] INFO solr.update.LoggingInfoStream - [DW][RecoveryThread]: anyChanges? numDocsInRam=0 deletes=false hasTickets:false pendingChangesInFullFlush: false
2014-01-08 13:28:21,333 [RecoveryThread] INFO solr.update.UpdateHandler - No uncommitted changes. Skipping IW.commit.
2014-01-08 13:28:21,334 [RecoveryThread] INFO solr.update.LoggingInfoStream - [DW][RecoveryThread]: anyChanges? numDocsInRam=0 deletes=false hasTickets:false pendingChangesInFullFlush: false
2014-01-08 13:28:21,334 [RecoveryThread] INFO solr.update.LoggingInfoStream - [IW][RecoveryThread]: nrtIsCurrent: infoVersion matches: true; DW changes: false; BD changes: false
2014-01-08 13:28:21,335 [RecoveryThread] INFO solr.update.LoggingInfoStream - [DW][RecoveryThread]: anyChanges? numDocsInRam=0 deletes=false hasTickets:false pendingChangesInFullFlush: false
2014-01-08 13:28:21,335 [RecoveryThread] INFO solr.search.SolrIndexSearcher - Opening Searcher@5fc2a9d main
2014-01-08 13:28:21,338 [searcherExecutor-5-thread-1] INFO solr.core.SolrCore - QuerySenderListener sending requests to Searcher@5fc2a9d main{StandardDirectoryReader(segments_e:38:nrt _d(4.7):C26791 _e(4.7):C3356 _f(4.7):C3381)}
2014-01-08 13:28:21,338 [searcherExecutor-5-thread-1] INFO solr.core.SolrCore - QuerySenderListener done.
2014-01-08 13:28:21,338 [searcherExecutor-5-thread-1] INFO solr.core.SolrCore - [demo_shard1_replica2] Registered new searcher Searcher@5fc2a9d main{StandardDirectoryReader(segments_e:38:nrt _d(4.7):C26791 _e(4.7):C3356 _f(4.7):C3381)}
2014-01-08 13:28:21,339 [RecoveryThread] INFO solr.update.UpdateHandler - end_commit_flush
2014-01-08 13:28:21,339 [RecoveryThread] INFO solr.cloud.RecoveryStrategy - PeerSync Recovery was successful - registering as Active. core=demo_shard1_replica2
2014-01-08 13:28:21,339 [RecoveryThread] INFO solr.cloud.ZkController - publishing core=demo_shard1_replica2 state=active
2014-01-08 13:28:21,370 [main-EventThread] INFO common.cloud.ZkStateReader - A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 6)

Thus, it would seem there might be some code that's outright losing documents. It almost feels like a last-batch-not-flushed error, but more subtle, as it's not easy to reproduce this consistently.

> Inconsistent numDocs between leader and replica
> -----------------------------------------------
>
>                 Key: SOLR-4260
>                 URL: https://issues.apache.org/jira/browse/SOLR-4260
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>         Environment: 5.0.0.2013.01.04.15.31.51
>            Reporter: Markus Jelsma
>            Assignee: Mark Miller
>            Priority: Critical
>             Fix For: 5.0, 4.7
>
>         Attachments: 192.168.20.102-replica1.png, 192.168.20.104-replica2.png, clusterstate.png, demo_shard1_replicas_out_of_sync.tgz
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using CloudSolrServer, we see inconsistencies between the leader and replica for some shards.
> Each core holds about 3.3k documents. For some reason, 5 out of 10 shards have a small deviation in the number of documents: the leader and replica deviate by roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my attention; there were small IDF differences for exactly the same record, causing the record to shift positions in the result set. During those tests no records were indexed. Consecutive catch-all queries also return different values for numDocs.
> We're running a 10-node test cluster with 10 shards and a replication factor of two, and frequently reindex using a fresh build from trunk. I've not seen this issue for quite some time until a few days ago.
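The log above hints at why the restart doesn't help: PeerSync exchanges only a fixed window of recent update versions (nUpdates=100) and, seeing "Our versions are newer", declares the sync successful without ever noticing the older missing documents. The following is a deliberately simplified, hypothetical illustration of that failure mode, not Solr's actual PeerSync code (the real check also uses thresholds like the ourLowThreshold value in the log):

```python
# Hypothetical simplification of a PeerSync-style check: each side compares
# only its N most recent update versions. If the peer's newest update is not
# newer than ours, we conclude there is nothing to fetch -- even if an OLDER
# document was silently lost from our index.

def peer_sync_ok(our_versions, peer_versions, n_updates=100):
    """True if our recent-update window suggests we need nothing from the peer."""
    ours = sorted(our_versions, reverse=True)[:n_updates]
    theirs = sorted(peer_versions, reverse=True)[:n_updates]
    if not theirs:
        return True          # peer has nothing we could be missing
    # "Our versions are newer": the peer's high-water mark is already
    # covered by our own recent window, so sync is declared successful.
    return theirs[0] <= ours[0]

# Version 5 was lost on the replica, but both sides agree on the most
# recent 100 updates, so the check still passes.
leader_versions = list(range(1, 201))                     # versions 1..200
replica_versions = [v for v in range(1, 201) if v != 5]   # version 5 lost
sync_result = peer_sync_ok(replica_versions, leader_versions)
print(sync_result)  # the lost update is outside the 100-version window
```

Under this (assumed) model, any loss older than the nUpdates window is invisible to PeerSync, which matches the observed behavior: identical recent tlogs, "sync succeeded", and a replica that stays 9 documents short.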
--
This message was sent by Atlassian JIRA (v6.1.5#6160)