[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845847#comment-13845847 ]
Timothy Potter commented on SOLR-4260:
--------------------------------------

I don't have a fix yet, but I wanted to post an update here to get some feedback on what I'm seeing ...

I have a simple SolrCloud configuration set up locally: one collection named "cloud" with 1 shard and replicationFactor=2. Here's what I use to create it:

    curl "http://localhost:8984/solr/admin/collections?action=CREATE&name=cloud&replicationFactor=$REPFACT&numShards=1&collection.configName=cloud"

The collection gets distributed across two nodes, cloud84:8984 and cloud85:8985, with cloud84 assigned as the leader.

Here's an outline of the process I used to get my collection out-of-sync during indexing:

1) start indexing docs using CloudSolrServer in SolrJ (a minimal sketch of this indexing client follows this list) - direct updates go to the leader and the replica remains in sync for as long as I let this process run
2) kill -9 the process for the replica cloud85
3) let indexing continue against cloud84 for a few seconds (just to get the leader and replica out-of-sync once I bring the replica back online)
4) kill -9 the process for the leader cloud84 ... indexing halts, of course, as there are no running servers
5) start the replica cloud85, but do not start the previous leader cloud84
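For reference, here is a minimal sketch of the kind of SolrJ indexing client used in step 1. The ZooKeeper address, field names, and document count are assumptions for illustration, not the actual test harness:

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CloudIndexer {
      public static void main(String[] args) throws Exception {
        // zkHost is an assumption - point it at the ZooKeeper ensemble backing the cluster
        CloudSolrServer server = new CloudSolrServer("localhost:2181");
        server.setDefaultCollection("cloud");
        try {
          for (int i = 0; i < 500000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("name_s", "test document " + i);
            server.add(doc);       // CloudSolrServer sends the update to the cluster, where the shard leader applies it
            if (i % 1000 == 0) {
              server.commit();     // periodic commits so doc counts are visible when polling the replicas
            }
          }
          server.commit();
        } finally {
          server.shutdown();
        }
      }
    }

Because every update goes through the leader, which forwards it to the replica, the two cores stay in sync for as long as both nodes are up.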
Here are some key log messages as cloud85 - the replica - fires up ... my annotations of the log messages are prefixed by [TJP >>

    2013-12-11 11:43:22,076 [main-EventThread] INFO common.cloud.ZkStateReader - A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 1)
    2013-12-11 11:43:23,370 [coreLoadExecutor-3-thread-1] INFO solr.cloud.ShardLeaderElectionContext - Waiting until we see more replicas up for shard shard1: total=2 found=1 timeoutin=139841

[TJP >> This looks good and is expected, because cloud85 was not the leader before it died, so it should not immediately assume it is the leader until it sees more replicas.

6) now start the previous leader cloud84 ...

Here are some key log messages from cloud85 as the previous leader cloud84 is coming up ...

    2013-12-11 11:43:24,085 [main-EventThread] INFO common.cloud.ZkStateReader - Updating live nodes... (2)
    2013-12-11 11:43:24,136 [main-EventThread] INFO solr.cloud.DistributedQueue - LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged
    2013-12-11 11:43:24,137 [Thread-13] INFO common.cloud.ZkStateReader - Updating cloud state from ZooKeeper...
    2013-12-11 11:43:24,138 [Thread-13] INFO solr.cloud.Overseer - Update state numShards=1 message={
      "operation":"state",
      "state":"down",
      "base_url":"http://cloud84:8984/solr",
      "core":"cloud_shard1_replica2",
      "roles":null,
      "node_name":"cloud84:8984_solr",
      "shard":"shard1",
      "shard_range":null,
      "shard_state":"active",
      "shard_parent":null,
      "collection":"cloud",
      "numShards":"1",
      "core_node_name":"core_node1"}

[TJP >> state of cloud84 looks correct, as it is still initializing ...

    2013-12-11 11:43:24,140 [main-EventThread] INFO solr.cloud.DistributedQueue - LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged
    2013-12-11 11:43:24,141 [main-EventThread] INFO common.cloud.ZkStateReader - A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 2)
    2013-12-11 11:43:25,878 [coreLoadExecutor-3-thread-1] INFO solr.cloud.ShardLeaderElectionContext - Enough replicas found to continue.

[TJP >> hmmmm ... cloud84 is listed in /live_nodes, but it isn't "active" yet or even recovering (see the state above - it's currently "down") ... My thinking here is that ShardLeaderElectionContext needs to take the state of the replica into account before deciding it should continue.

    2013-12-11 11:43:25,878 [coreLoadExecutor-3-thread-1] INFO solr.cloud.ShardLeaderElectionContext - I may be the new leader - try and sync
    2013-12-11 11:43:25,878 [coreLoadExecutor-3-thread-1] INFO solr.cloud.SyncStrategy - Sync replicas to http://cloud85:8985/solr/cloud_shard1_replica1/
    2013-12-11 11:43:25,880 [coreLoadExecutor-3-thread-1] INFO solr.update.PeerSync - PeerSync: core=cloud_shard1_replica1 url=http://cloud85:8985/solr START replicas=[http://cloud84:8984/solr/cloud_shard1_replica2/] nUpdates=100
    2013-12-11 11:43:25,936 [coreLoadExecutor-3-thread-1] WARN solr.update.PeerSync - PeerSync: core=cloud_shard1_replica1 url=http://cloud85:8985/solr couldn't connect to http://cloud84:8984/solr/cloud_shard1_replica2/, counting as success

[TJP >> whoops! of course it couldn't connect to cloud84, as it's still initializing ...

    2013-12-11 11:43:25,936 [coreLoadExecutor-3-thread-1] INFO solr.update.PeerSync - PeerSync: core=cloud_shard1_replica1 url=http://cloud85:8985/solr DONE. sync succeeded
    2013-12-11 11:43:25,937 [coreLoadExecutor-3-thread-1] INFO solr.cloud.SyncStrategy - Sync Success - now sync replicas to me
    2013-12-11 11:43:25,937 [coreLoadExecutor-3-thread-1] INFO solr.cloud.SyncStrategy - http://cloud85:8985/solr/cloud_shard1_replica1/: try and ask http://cloud84:8984/solr/cloud_shard1_replica2/ to sync
    2013-12-11 11:43:25,938 [coreLoadExecutor-3-thread-1] ERROR solr.cloud.SyncStrategy - Sync request error: org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://cloud84:8984/solr/cloud_shard1_replica2

[TJP >> ayep, cloud84 is still initializing, so it can't respond to you, Mr. Impatient cloud85!

    2013-12-11 11:43:25,939 [coreLoadExecutor-3-thread-1] INFO solr.cloud.SyncStrategy - http://cloud85:8985/solr/cloud_shard1_replica1/: Sync failed - asking replica (http://cloud84:8984/solr/cloud_shard1_replica2/) to recover.
    2013-12-11 11:43:25,940 [coreLoadExecutor-3-thread-1] INFO solr.cloud.ShardLeaderElectionContext - I am the new leader: http://cloud85:8985/solr/cloud_shard1_replica1/ shard1

[TJP >> oh no! the collection is now out-of-sync ... my test harness periodically polls the replicas for their doc counts, and at this point we ended up with:

    shard1: {
      http://cloud85:8985/solr/cloud_shard1_replica1/ = 300800 LEADER
      http://cloud84:8984/solr/cloud_shard1_replica2/ = 447600 diff:-146800 <--- this should be the real leader!
    }

Which of course is expected, because cloud85 should *NOT* be the leader.

So all that is interesting, but how to fix??? My first idea was to tackle the decision-making process ShardLeaderElectionContext uses to decide whether it has enough replicas to continue. It's easy enough to do something like the following:

    // count the replicas in this slice whose last published state is not "down"
    int notDownCount = 0;
    Map<String,Replica> replicasMap = slices.getReplicasMap();
    for (Replica replica : replicasMap.values()) {
      ZkCoreNodeProps replicaCoreProps = new ZkCoreNodeProps(replica);
      String replicaState = replicaCoreProps.getState();
      log.warn(">>>> State of replica "+replica.getName()+" is "+replicaState+" <<<<");
      if ("active".equals(replicaState) || "recovering".equals(replicaState)) {
        ++notDownCount;
      }
    }

I was thinking I could use the notDownCount to make a better decision (see the sketch just below), but then I ran into another issue, related to replica state being stale.
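To make that idea concrete, here is a rough sketch of the kind of check I have in mind. This is hypothetical helper code, not the actual ShardLeaderElectionContext logic; the class name, method names, and the expectedReplicas parameter are made up for illustration:

    import java.util.Map;

    import org.apache.solr.common.cloud.Replica;
    import org.apache.solr.common.cloud.Slice;
    import org.apache.solr.common.cloud.ZkCoreNodeProps;

    public class ReplicaStateHelper {

      // Count the replicas in a slice whose last published state is
      // "active" or "recovering", i.e. anything other than "down".
      public static int countNotDown(Slice slice) {
        int notDownCount = 0;
        Map<String,Replica> replicasMap = slice.getReplicasMap();
        for (Replica replica : replicasMap.values()) {
          String state = new ZkCoreNodeProps(replica).getState();
          if ("active".equals(state) || "recovering".equals(state)) {
            ++notDownCount;
          }
        }
        return notDownCount;
      }

      // Hypothetical replacement for the "enough replicas found to continue" check:
      // keep waiting until every expected replica is at least recovering, rather
      // than only counting how many replicas are registered in the slice.
      public static boolean enoughLiveReplicas(Slice slice, int expectedReplicas) {
        return countNotDown(slice) >= expectedReplicas;
      }
    }

Of course, this only helps if the replica states published in /clusterstate.json can be trusted, and that is exactly the problem described next.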
In my cluster, if I have /clusterstate.json:

    {"cloud":{
        "shards":{"shard1":{
            "range":"80000000-7fffffff",
            "state":"active",
            "replicas":{
              "core_node1":{
                "state":"active",
                "base_url":"http://cloud84:8984/solr",
                "core":"cloud_shard1_replica2",
                "node_name":"cloud84:8984_solr",
                "leader":"true"},
              "core_node2":{
                "state":"active",
                "base_url":"http://cloud85:8985/solr",
                "core":"cloud_shard1_replica1",
                "node_name":"cloud85:8985_solr"}}}},
        "maxShardsPerNode":"1",
        "router":{"name":"compositeId"},
        "replicationFactor":"2"}}

and I kill the Solr process running on 8985 (the replica) with kill -9 PID, core_node2's state remains "active" in /clusterstate.json. When tailing the log on core_node1, I do see one notification from ZooKeeper coming in on the watcher set up by ZkStateReader about live nodes having changed:

    2013-12-11 15:42:46,010 [main-EventThread] INFO common.cloud.ZkStateReader - Updating live nodes... (1)

So after killing the process, /live_nodes is updated to only have one node, but /clusterstate.json still thinks there are 2 healthy replicas for shard1 instead of just 1. Of course, if I restart 8985, it goes through a series of state changes until it is marked active again, which looks correct.

Bottom line ... it seems there is something in SolrCloud that does not update a replica's state when its node is killed. If a change to /live_nodes doesn't trigger a refresh of replica state, what does? I'm seeing this stale replica state issue in Solr 4.6.0 and in revision 1550300 of branch_4x - the latest from svn. Not having fresh state for a replica blocks my idea for fixing ShardLeaderElectionContext's decision-making process. I'm also curious about the decision to register a node under /live_nodes before it is fully initialized, but maybe that is a discussion for another time.

In any case, I wanted to get some feedback on my findings before moving forward with a solution.

> Inconsistent numDocs between leader and replica
> -----------------------------------------------
>
>                 Key: SOLR-4260
>                 URL: https://issues.apache.org/jira/browse/SOLR-4260
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>         Environment: 5.0.0.2013.01.04.15.31.51
>            Reporter: Markus Jelsma
>            Assignee: Mark Miller
>            Priority: Critical
>             Fix For: 5.0, 4.7
>
>         Attachments: 192.168.20.102-replica1.png, 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using CloudSolrServer, we see inconsistencies between the leader and replica for some shards.
> Each core holds about 3.3k documents. For some reason 5 out of 10 shards have a small deviation in the number of documents. The leader and slave deviate by roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my attention; there were small IDF differences for exactly the same record, causing a record to shift positions in the result set. During those tests no records were indexed. Consecutive catch-all queries also return different values for numDocs.
> We're running a 10-node test cluster with 10 shards and a replication factor of two, and we frequently reindex using a fresh build from trunk. I've not seen this issue for quite some time until a few days ago.