[ https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845847#comment-13845847 ]
Timothy Potter commented on SOLR-4260:
--------------------------------------

I don't have a fix yet, but I wanted to post an update here to get some feedback on what I'm seeing ...

I have a simple SolrCloud configuration set up locally: one collection named "cloud" with 1 shard and replicationFactor=2. Here's what I use to create it:

    curl "http://localhost:8984/solr/admin/collections?action=CREATE&name=cloud&replicationFactor=$REPFACT&numShards=1&collection.configName=cloud"

The collection gets distributed across two nodes, cloud84:8984 and cloud85:8985, with cloud84 assigned as the leader.

Here's an outline of the process I used to get my collection out-of-sync during indexing:

1) start indexing docs using CloudSolrServer in SolrJ (a minimal sketch of this indexing client follows this list) - direct updates go to the leader and the replica remains in sync for as long as I let this process run
2) kill -9 the process for the replica cloud85
3) let indexing continue against cloud84 for a few seconds (just to get the leader and replica out-of-sync once I bring the replica back online)
4) kill -9 the process for the leader cloud84 ... indexing halts, of course, as there are no running servers
5) start the replica cloud85, but do not start the previous leader cloud84
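For reference, here is a minimal sketch of the kind of SolrJ indexing client used in step 1. The ZooKeeper address, field names, and document count are assumptions for illustration, not the actual test harness:

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CloudIndexer {
      public static void main(String[] args) throws Exception {
        // zkHost is an assumption - point it at the ZooKeeper ensemble backing the cluster
        CloudSolrServer server = new CloudSolrServer("localhost:2181");
        server.setDefaultCollection("cloud");
        try {
          for (int i = 0; i < 500000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("name_s", "test document " + i);
            server.add(doc);       // CloudSolrServer sends the update to the cluster, where the shard leader applies it
            if (i % 1000 == 0) {
              server.commit();     // periodic commits so doc counts are visible when polling the replicas
            }
          }
          server.commit();
        } finally {
          server.shutdown();
        }
      }
    }

Because every update goes through the leader, which forwards it to the replica, the two cores stay in sync for as long as both nodes are up.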
Here are some key log messages as cloud85 - the replica - fires up ... my annotations of the log messages are prefixed by [TJP >>

    2013-12-11 11:43:22,076 [main-EventThread] INFO common.cloud.ZkStateReader - A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 1)
    2013-12-11 11:43:23,370 [coreLoadExecutor-3-thread-1] INFO solr.cloud.ShardLeaderElectionContext - Waiting until we see more replicas up for shard shard1: total=2 found=1 timeoutin=139841

[TJP >> This looks good and is expected, because cloud85 was not the leader before it died, so it should not immediately assume it is the leader until it sees more replicas.

6) now start the previous leader cloud84 ...

Here are some key log messages from cloud85 as the previous leader cloud84 is coming up ...

    2013-12-11 11:43:24,085 [main-EventThread] INFO common.cloud.ZkStateReader - Updating live nodes... (2)
    2013-12-11 11:43:24,136 [main-EventThread] INFO solr.cloud.DistributedQueue - LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged
    2013-12-11 11:43:24,137 [Thread-13] INFO common.cloud.ZkStateReader - Updating cloud state from ZooKeeper...
    2013-12-11 11:43:24,138 [Thread-13] INFO solr.cloud.Overseer - Update state numShards=1 message={
      "operation":"state",
      "state":"down",
      "base_url":"http://cloud84:8984/solr",
      "core":"cloud_shard1_replica2",
      "roles":null,
      "node_name":"cloud84:8984_solr",
      "shard":"shard1",
      "shard_range":null,
      "shard_state":"active",
      "shard_parent":null,
      "collection":"cloud",
      "numShards":"1",
      "core_node_name":"core_node1"}

[TJP >> state of cloud84 looks correct, as it is still initializing ...

    2013-12-11 11:43:24,140 [main-EventThread] INFO solr.cloud.DistributedQueue - LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged
    2013-12-11 11:43:24,141 [main-EventThread] INFO common.cloud.ZkStateReader - A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 2)
    2013-12-11 11:43:25,878 [coreLoadExecutor-3-thread-1] INFO solr.cloud.ShardLeaderElectionContext - Enough replicas found to continue.

[TJP >> hmmmm ... cloud84 is listed in /live_nodes, but it isn't "active" yet or even recovering (see the state above - it's currently "down") ... My thinking here is that ShardLeaderElectionContext needs to take the state of the replica into account before deciding it should continue.

    2013-12-11 11:43:25,878 [coreLoadExecutor-3-thread-1] INFO solr.cloud.ShardLeaderElectionContext - I may be the new leader - try and sync
    2013-12-11 11:43:25,878 [coreLoadExecutor-3-thread-1] INFO solr.cloud.SyncStrategy - Sync replicas to http://cloud85:8985/solr/cloud_shard1_replica1/
    2013-12-11 11:43:25,880 [coreLoadExecutor-3-thread-1] INFO solr.update.PeerSync - PeerSync: core=cloud_shard1_replica1 url=http://cloud85:8985/solr START replicas=[http://cloud84:8984/solr/cloud_shard1_replica2/] nUpdates=100
    2013-12-11 11:43:25,936 [coreLoadExecutor-3-thread-1] WARN solr.update.PeerSync - PeerSync: core=cloud_shard1_replica1 url=http://cloud85:8985/solr couldn't connect to http://cloud84:8984/solr/cloud_shard1_replica2/, counting as success

[TJP >> whoops! of course it couldn't connect to cloud84, as it's still initializing ...

    2013-12-11 11:43:25,936 [coreLoadExecutor-3-thread-1] INFO solr.update.PeerSync - PeerSync: core=cloud_shard1_replica1 url=http://cloud85:8985/solr DONE. sync succeeded
    2013-12-11 11:43:25,937 [coreLoadExecutor-3-thread-1] INFO solr.cloud.SyncStrategy - Sync Success - now sync replicas to me
    2013-12-11 11:43:25,937 [coreLoadExecutor-3-thread-1] INFO solr.cloud.SyncStrategy - http://cloud85:8985/solr/cloud_shard1_replica1/: try and ask http://cloud84:8984/solr/cloud_shard1_replica2/ to sync
    2013-12-11 11:43:25,938 [coreLoadExecutor-3-thread-1] ERROR solr.cloud.SyncStrategy - Sync request error: org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://cloud84:8984/solr/cloud_shard1_replica2

[TJP >> ayep, cloud84 is still initializing, so it can't respond to you, Mr. Impatient cloud85!

    2013-12-11 11:43:25,939 [coreLoadExecutor-3-thread-1] INFO solr.cloud.SyncStrategy - http://cloud85:8985/solr/cloud_shard1_replica1/: Sync failed - asking replica (http://cloud84:8984/solr/cloud_shard1_replica2/) to recover.
    2013-12-11 11:43:25,940 [coreLoadExecutor-3-thread-1] INFO solr.cloud.ShardLeaderElectionContext - I am the new leader: http://cloud85:8985/solr/cloud_shard1_replica1/ shard1

[TJP >> oh no! the collection is now out-of-sync ... my test harness periodically polls the replicas for their doc counts, and at this point we ended up with:

    shard1: {
      http://cloud85:8985/solr/cloud_shard1_replica1/ = 300800 LEADER
      http://cloud84:8984/solr/cloud_shard1_replica2/ = 447600 diff:-146800 <--- this should be the real leader!
    }

Which of course is expected, because cloud85 should *NOT* be the leader.

So all that is interesting, but how to fix??? My first idea was to tackle the decision-making process ShardLeaderElectionContext uses to decide whether it has enough replicas to continue. It's easy enough to do something like the following:

    // count the replicas in this slice whose last published state is not "down"
    int notDownCount = 0;
    Map<String,Replica> replicasMap = slices.getReplicasMap();
    for (Replica replica : replicasMap.values()) {
      ZkCoreNodeProps replicaCoreProps = new ZkCoreNodeProps(replica);
      String replicaState = replicaCoreProps.getState();
      log.warn(">>>> State of replica "+replica.getName()+" is "+replicaState+" <<<<");
      if ("active".equals(replicaState) || "recovering".equals(replicaState)) {
        ++notDownCount;
      }
    }

I was thinking I could use the notDownCount to make a better decision (see the sketch just below), but then I ran into another issue, related to replica state being stale.
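To make that idea concrete, here is a rough sketch of the kind of check I have in mind. This is hypothetical helper code, not the actual ShardLeaderElectionContext logic; the class name, method names, and the expectedReplicas parameter are made up for illustration:

    import java.util.Map;

    import org.apache.solr.common.cloud.Replica;
    import org.apache.solr.common.cloud.Slice;
    import org.apache.solr.common.cloud.ZkCoreNodeProps;

    public class ReplicaStateHelper {

      // Count the replicas in a slice whose last published state is
      // "active" or "recovering", i.e. anything other than "down".
      public static int countNotDown(Slice slice) {
        int notDownCount = 0;
        Map<String,Replica> replicasMap = slice.getReplicasMap();
        for (Replica replica : replicasMap.values()) {
          String state = new ZkCoreNodeProps(replica).getState();
          if ("active".equals(state) || "recovering".equals(state)) {
            ++notDownCount;
          }
        }
        return notDownCount;
      }

      // Hypothetical replacement for the "enough replicas found to continue" check:
      // keep waiting until every expected replica is at least recovering, rather
      // than only counting how many replicas are registered in the slice.
      public static boolean enoughLiveReplicas(Slice slice, int expectedReplicas) {
        return countNotDown(slice) >= expectedReplicas;
      }
    }

Of course, this only helps if the replica states published in /clusterstate.json can be trusted, and that is exactly the problem described next.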
In my cluster, if I have /clusterstate.json:

    {"cloud":{
        "shards":{"shard1":{
            "range":"80000000-7fffffff",
            "state":"active",
            "replicas":{
              "core_node1":{
                "state":"active",
                "base_url":"http://cloud84:8984/solr",
                "core":"cloud_shard1_replica2",
                "node_name":"cloud84:8984_solr",
                "leader":"true"},
              "core_node2":{
                "state":"active",
                "base_url":"http://cloud85:8985/solr",
                "core":"cloud_shard1_replica1",
                "node_name":"cloud85:8985_solr"}}}},
        "maxShardsPerNode":"1",
        "router":{"name":"compositeId"},
        "replicationFactor":"2"}}

and I kill the Solr process running on 8985 (the replica) with kill -9 PID, core_node2's state remains "active" in /clusterstate.json. When tailing the log on core_node1, I do see one notification from ZooKeeper coming in on the watcher set up by ZkStateReader about live nodes having changed:

    2013-12-11 15:42:46,010 [main-EventThread] INFO common.cloud.ZkStateReader - Updating live nodes... (1)

So after killing the process, /live_nodes is updated to only have one node, but /clusterstate.json still thinks there are 2 healthy replicas for shard1 instead of just 1. Of course, if I restart 8985, it goes through a series of state changes until it is marked active again, which looks correct.

Bottom line ... it seems there is something in SolrCloud that does not update a replica's state when its node is killed. If a change to /live_nodes doesn't trigger a refresh of replica state, what does? I'm seeing this stale replica state issue in Solr 4.6.0 and in revision 1550300 of branch_4x - the latest from svn. Not having fresh state for a replica blocks my idea for fixing ShardLeaderElectionContext's decision-making process. I'm also curious about the decision to register a node under /live_nodes before it is fully initialized, but maybe that is a discussion for another time.

In any case, I wanted to get some feedback on my findings before moving forward with a solution.

> Inconsistent numDocs between leader and replica
> -----------------------------------------------
>
>                 Key: SOLR-4260
>                 URL: https://issues.apache.org/jira/browse/SOLR-4260
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>         Environment: 5.0.0.2013.01.04.15.31.51
>            Reporter: Markus Jelsma
>            Assignee: Mark Miller
>            Priority: Critical
>             Fix For: 5.0, 4.7
>
>         Attachments: 192.168.20.102-replica1.png, 192.168.20.104-replica2.png, clusterstate.png
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using CloudSolrServer, we see inconsistencies between the leader and replica for some shards.
> Each core holds about 3.3k documents. For some reason 5 out of 10 shards have a small deviation in the number of documents. The leader and slave deviate by roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my attention; there were small IDF differences for exactly the same record, causing a record to shift positions in the result set. During those tests no records were indexed. Consecutive catch-all queries also return different values for numDocs.
> We're running a 10-node test cluster with 10 shards and a replication factor of two, and we frequently reindex using a fresh build from trunk. I've not seen this issue for quite some time until a few days ago.