Timothy Potter created SOLR-7338:
------------------------------------

             Summary: A reloaded core will never register itself as active after a ZK session expiration
                 Key: SOLR-7338
                 URL: https://issues.apache.org/jira/browse/SOLR-7338
             Project: Solr
          Issue Type: Bug
          Components: SolrCloud
            Reporter: Timothy Potter
            Assignee: Timothy Potter


If a collection gets reloaded, a core's isReloaded flag is set to true and stays true for the rest of that core's life. If the core then experiences a ZK session expiration, it will never be able to set itself back to active because of this check in {{ZkController#register}}:

{code}
        UpdateLog ulog = core.getUpdateHandler().getUpdateLog();
        if (!core.isReloaded() && ulog != null) {
          // disable recovery in case shard is in construction state (for shard splits)
          Slice slice = getClusterState().getSlice(collection, shardId);
          if (slice.getState() != Slice.State.CONSTRUCTION || !isLeader) {
            Future<UpdateLog.RecoveryInfo> recoveryFuture = core.getUpdateHandler().getUpdateLog().recoverFromLog();
            if (recoveryFuture != null) {
              log.info("Replaying tlog for " + ourUrl + " during startup... NOTE: This can take a while.");
              recoveryFuture.get(); // NOTE: this could potentially block for
              // minutes or more!
              // TODO: public as recovering in the mean time?
              // TODO: in the future we could do peersync in parallel with recoverFromLog
            } else {
              log.info("No LogReplay needed for core=" + core.getName() + " baseURL=" + baseUrl);
            }
          }
          boolean didRecovery = checkRecovery(coreName, desc, recoverReloadedCores, isLeader, cloudDesc,
              collection, coreZkNodeName, shardId, leaderProps, core, cc);
          if (!didRecovery) {
            publish(desc, ZkStateReader.ACTIVE);
          }
        }
{code}
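
One possible restructuring (just a sketch, not a tested patch) would be to scope the isReloaded check to the tlog replay only, so that checkRecovery / publish still run when a previously reloaded core re-registers after a session expiration; the names below are all taken from the snippet above:

{code}
// Sketch only: the isReloaded check now skips just the tlog replay; the
// recovery check and the ACTIVE publish run for reloaded cores as well.
UpdateLog ulog = core.getUpdateHandler().getUpdateLog();
if (!core.isReloaded() && ulog != null) {
  // ... existing tlog replay logic, guarded exactly as before ...
}

// moved out of the isReloaded guard so it always runs when the core (re)registers
boolean didRecovery = checkRecovery(coreName, desc, recoverReloadedCores, isLeader, cloudDesc,
    collection, coreZkNodeName, shardId, leaderProps, core, cc);
if (!didRecovery) {
  publish(desc, ZkStateReader.ACTIVE);
}
{code}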

I can easily simulate this on trunk by doing:

{code}
bin/solr -c -z localhost:2181
bin/solr create -c foo
bin/post -c foo example/exampledocs/*.xml
curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=foo"
kill -STOP <PID> && sleep <PAUSE_SECONDS> && kill -CONT <PID>
{code}

Where <PID> is the process ID of the Solr node and <PAUSE_SECONDS> is long enough for the ZK session to expire. The logs after the CONT command are included below; as you can see, the core never gets around to setting itself back to active. I think the fix is that the isReloaded flag needs to be set back to false once the reload has completed successfully, though I don't fully understand what this flag is needed for in the first place.
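
If clearing the flag is the right direction, the change I have in mind is roughly the following. The markReloadComplete method is hypothetical (SolrCore does not currently expose a way to clear the flag behind {{isReloaded()}}), so treat this purely as a sketch of the idea; it would presumably be invoked from wherever the reload finishes swapping in the new core:

{code}
// Sketch only, inside SolrCore: make the reloaded flag clearable instead of
// staying true for the rest of the reloaded core's life.
private volatile boolean isReloaded;

public boolean isReloaded() {
  return isReloaded;
}

// Hypothetical method: called once the reload has finished and the new core has
// registered, so that a later ZK session expiration takes the normal register
// path and can publish the core as active again.
public void markReloadComplete() {
  this.isReloaded = false;
}
{code}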

{code}
INFO  - 2015-04-01 17:28:50.962; 
org.apache.solr.common.cloud.ConnectionManager; Watcher 
org.apache.solr.common.cloud.ConnectionManager@5519dba0 
name:ZooKeeperConnection Watcher:localhost:2181 got event WatchedEvent 
state:Disconnected type:None path:null path:null type:None
INFO  - 2015-04-01 17:28:50.963; 
org.apache.solr.common.cloud.ConnectionManager; zkClient has disconnected
INFO  - 2015-04-01 17:28:51.107; 
org.apache.solr.common.cloud.ConnectionManager; Watcher 
org.apache.solr.common.cloud.ConnectionManager@5519dba0 
name:ZooKeeperConnection Watcher:localhost:2181 got event WatchedEvent 
state:Expired type:None path:null path:null type:None
INFO  - 2015-04-01 17:28:51.107; 
org.apache.solr.common.cloud.ConnectionManager; Our previous ZooKeeper session 
was expired. Attempting to reconnect to recover relationship with ZooKeeper...
INFO  - 2015-04-01 17:28:51.108; org.apache.solr.cloud.Overseer; Overseer 
(id=93579450724974592-192.168.1.2:8983_solr-n_0000000000) closing
INFO  - 2015-04-01 17:28:51.108; 
org.apache.solr.cloud.ZkController$WatcherImpl; A node got unwatched for 
/configs/foo
INFO  - 2015-04-01 17:28:51.108; 
org.apache.solr.cloud.Overseer$ClusterStateUpdater; Overseer Loop exiting : 
192.168.1.2:8983_solr
INFO  - 2015-04-01 17:28:51.109; 
org.apache.solr.cloud.OverseerCollectionProcessor; According to ZK I 
(id=93579450724974592-192.168.1.2:8983_solr-n_0000000000) am no longer a leader.
INFO  - 2015-04-01 17:28:51.108; org.apache.solr.cloud.ZkController$4; Running 
listeners for /configs/foo
INFO  - 2015-04-01 17:28:51.109; 
org.apache.solr.common.cloud.DefaultConnectionStrategy; Connection expired - 
starting a new one...
INFO  - 2015-04-01 17:28:51.109; org.apache.solr.core.SolrCore$11; config 
update listener called for core foo_shard1_replica1
INFO  - 2015-04-01 17:28:51.110; 
org.apache.solr.common.cloud.ConnectionManager; Waiting for client to connect 
to ZooKeeper
ERROR - 2015-04-01 17:28:51.110; org.apache.solr.common.SolrException; 
OverseerAutoReplicaFailoverThread had an error in its thread work 
loop.:org.apache.solr.common.SolrException: Error reading cluster properties
        at 
org.apache.solr.common.cloud.ZkStateReader.getClusterProps(ZkStateReader.java:772)
        at 
org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.doWork(OverseerAutoReplicaFailoverThread.java:150)
        at 
org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.run(OverseerAutoReplicaFailoverThread.java:129)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException: sleep interrupted
        at java.lang.Thread.sleep(Native Method)
        at 
org.apache.solr.common.cloud.ZkCmdExecutor.retryDelay(ZkCmdExecutor.java:108)
        at 
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:76)
        at 
org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:308)
        at 
org.apache.solr.common.cloud.ZkStateReader.getClusterProps(ZkStateReader.java:765)
        ... 3 more

WARN  - 2015-04-01 17:28:51.110; org.apache.solr.cloud.ZkController$4; listener 
throws error
org.apache.solr.common.SolrException: 
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired for /configs/foo/params.json
        at 
org.apache.solr.core.RequestParams.getFreshRequestParams(RequestParams.java:160)
        at 
org.apache.solr.core.SolrConfig.refreshRequestParams(SolrConfig.java:907)
        at org.apache.solr.core.SolrCore$11.run(SolrCore.java:2503)
        at org.apache.solr.cloud.ZkController$4.run(ZkController.java:2351)
Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: 
KeeperErrorCode = Session expired for /configs/foo/params.json
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
        at 
org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:294)
        at 
org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:291)
        at 
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
        at 
org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:291)
        at 
org.apache.solr.core.RequestParams.getFreshRequestParams(RequestParams.java:150)
        ... 3 more
ERROR - 2015-04-01 17:28:51.110; 
org.apache.solr.cloud.Overseer$ClusterStateUpdater; could not read the data
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired for /overseer_elect/leader
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
        at 
org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
        at 
org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
        at 
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
        at 
org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
        at 
org.apache.solr.cloud.Overseer$ClusterStateUpdater.checkIfIamStillLeader(Overseer.java:298)
        at 
org.apache.solr.cloud.Overseer$ClusterStateUpdater.access$300(Overseer.java:83)
        at 
org.apache.solr.cloud.Overseer$ClusterStateUpdater$2.run(Overseer.java:259)
INFO  - 2015-04-01 17:28:51.114; 
org.apache.solr.common.cloud.ConnectionManager; Watcher 
org.apache.solr.common.cloud.ConnectionManager@5519dba0 
name:ZooKeeperConnection Watcher:localhost:2181 got event WatchedEvent 
state:SyncConnected type:None path:null path:null type:None
INFO  - 2015-04-01 17:28:51.115; 
org.apache.solr.common.cloud.ConnectionManager; Client is connected to ZooKeeper
INFO  - 2015-04-01 17:28:51.115; 
org.apache.solr.common.cloud.ConnectionManager$1; Connection with ZooKeeper 
reestablished.
INFO  - 2015-04-01 17:28:51.115; org.apache.solr.cloud.ZkController$1; 
ZooKeeper session re-connected ... refreshing core states after session 
expiration.
INFO  - 2015-04-01 17:28:51.115; org.apache.solr.cloud.ZkController; publishing 
core=foo_shard1_replica1 state=down collection=foo
INFO  - 2015-04-01 17:28:51.118; org.apache.solr.cloud.ElectionContext; 
canceling election 
/overseer_elect/election/93579450724974592-192.168.1.2:8983_solr-n_0000000000
WARN  - 2015-04-01 17:28:51.119; org.apache.solr.cloud.ElectionContext; 
cancelElection did not find election node to remove 
/overseer_elect/election/93579450724974592-192.168.1.2:8983_solr-n_0000000000
INFO  - 2015-04-01 17:28:51.121; org.apache.solr.cloud.OverseerElectionContext; 
I am going to be the leader 192.168.1.2:8983_solr
INFO  - 2015-04-01 17:28:51.121; org.apache.solr.common.cloud.SolrZkClient; 
makePath: /overseer_elect/leader
INFO  - 2015-04-01 17:28:51.122; org.apache.solr.cloud.Overseer; Overseer 
(id=93579450724974594-192.168.1.2:8983_solr-n_0000000001) starting
INFO  - 2015-04-01 17:28:51.128; 
org.apache.solr.cloud.OverseerAutoReplicaFailoverThread; Starting 
OverseerAutoReplicaFailoverThread autoReplicaFailoverWorkLoopDelay=10000 
autoReplicaFailoverWaitAfterExpiration=30000 
autoReplicaFailoverBadNodeExpiration=60000
INFO  - 2015-04-01 17:28:51.128; 
org.apache.solr.cloud.OverseerCollectionProcessor; Process current queue of 
collection creations
INFO  - 2015-04-01 17:28:51.128; org.apache.solr.common.cloud.ZkStateReader; 
Updating cluster state from ZooKeeper... 
INFO  - 2015-04-01 17:28:51.129; 
org.apache.solr.cloud.Overseer$ClusterStateUpdater; Starting to work on the 
main queue
INFO  - 2015-04-01 17:28:51.130; org.apache.solr.common.cloud.ZkStateReader; 
addZkWatch foo
INFO  - 2015-04-01 17:28:51.131; org.apache.solr.common.cloud.ZkStateReader; 
Updating collection state at /collections/foo/state.json from ZooKeeper... 
INFO  - 2015-04-01 17:28:51.131; org.apache.solr.common.cloud.ZkStateReader; 
Updating data for foo to ver 4 
INFO  - 2015-04-01 17:28:51.131; org.apache.solr.cloud.ZkController; Register 
node as live in ZooKeeper:/live_nodes/192.168.1.2:8983_solr
INFO  - 2015-04-01 17:28:51.132; org.apache.solr.common.cloud.SolrZkClient; 
makePath: /live_nodes/192.168.1.2:8983_solr
INFO  - 2015-04-01 17:28:51.133; org.apache.solr.cloud.ZkController; Register 
replica - core:foo_shard1_replica1 address:http://192.168.1.2:8983/solr 
collection:foo shard:shard1
INFO  - 2015-04-01 17:28:51.133; org.apache.solr.common.cloud.ZkStateReader; 
Updating data for foo to ver 4 
INFO  - 2015-04-01 17:28:51.133; org.apache.solr.cloud.ElectionContext; 
canceling election 
/collections/foo/leader_elect/shard1/election/93579450724974592-core_node1-n_0000000000
WARN  - 2015-04-01 17:28:51.134; org.apache.solr.cloud.ElectionContext; 
cancelElection did not find election node to remove 
/collections/foo/leader_elect/shard1/election/93579450724974592-core_node1-n_0000000000
INFO  - 2015-04-01 17:28:51.134; 
org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage: queueSize: 
1, message = {
  "core":"foo_shard1_replica1",
  "core_node_name":"core_node1",
  "roles":null,
  "base_url":"http://192.168.1.2:8983/solr",
  "node_name":"192.168.1.2:8983_solr",
  "numShards":"1",
  "state":"down",
  "shard":"shard1",
  "collection":"foo",
  "operation":"state"} current state version: 1
INFO  - 2015-04-01 17:28:51.135; org.apache.solr.cloud.overseer.ReplicaMutator; 
Update state numShards=1 message={
  "core":"foo_shard1_replica1",
  "core_node_name":"core_node1",
  "roles":null,
  "base_url":"http://192.168.1.2:8983/solr",
  "node_name":"192.168.1.2:8983_solr",
  "numShards":"1",
  "state":"down",
  "shard":"shard1",
  "collection":"foo",
  "operation":"state"}
INFO  - 2015-04-01 17:28:51.136; 
org.apache.solr.cloud.ShardLeaderElectionContext; Running the leader process 
for shard shard1
INFO  - 2015-04-01 17:28:51.136; org.apache.solr.cloud.overseer.ZkStateWriter; 
going to update_collection /collections/foo/state.json version: 4
INFO  - 2015-04-01 17:28:51.136; org.apache.solr.cloud.ActionThrottle; The last 
leader attempt started 198060ms ago.
INFO  - 2015-04-01 17:28:51.137; org.apache.solr.common.cloud.ZkStateReader$7; 
A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged 
path:/collections/foo/state.json for collection foo has occurred - updating... 
(live nodes size: 1)
INFO  - 2015-04-01 17:28:51.137; org.apache.solr.common.cloud.ZkStateReader; 
Updating data for foo to ver 5 
INFO  - 2015-04-01 17:28:51.138; 
org.apache.solr.cloud.DistributedQueue$LatchWatcher; NodeChildrenChanged fired 
on path /overseer/queue state SyncConnected
INFO  - 2015-04-01 17:28:51.138; 
org.apache.solr.cloud.ShardLeaderElectionContext; Enough replicas found to 
continue.
INFO  - 2015-04-01 17:28:51.138; 
org.apache.solr.cloud.ShardLeaderElectionContext; I may be the new leader - try 
and sync
INFO  - 2015-04-01 17:28:51.138; org.apache.solr.cloud.SyncStrategy; Sync 
replicas to http://192.168.1.2:8983/solr/foo_shard1_replica1/
INFO  - 2015-04-01 17:28:51.138; org.apache.solr.cloud.SyncStrategy; Sync 
Success - now sync replicas to me
INFO  - 2015-04-01 17:28:51.138; org.apache.solr.cloud.SyncStrategy; 
http://192.168.1.2:8983/solr/foo_shard1_replica1/ has no replicas
INFO  - 2015-04-01 17:28:51.139; 
org.apache.solr.cloud.ShardLeaderElectionContext; I am the new leader: 
http://192.168.1.2:8983/solr/foo_shard1_replica1/ shard1
INFO  - 2015-04-01 17:28:51.139; org.apache.solr.common.cloud.SolrZkClient; 
makePath: /collections/foo/leaders/shard1
INFO  - 2015-04-01 17:28:51.139; 
org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage: queueSize: 
1, message = {
  "operation":"leader",
  "shard":"shard1",
  "collection":"foo"} current state version: 1
INFO  - 2015-04-01 17:28:51.140; org.apache.solr.cloud.overseer.ZkStateWriter; 
going to update_collection /collections/foo/state.json version: 5
INFO  - 2015-04-01 17:28:51.141; org.apache.solr.common.cloud.ZkStateReader$7; 
A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged 
path:/collections/foo/state.json for collection foo has occurred - updating... 
(live nodes size: 1)
INFO  - 2015-04-01 17:28:51.142; 
org.apache.solr.cloud.DistributedQueue$LatchWatcher; NodeChildrenChanged fired 
on path /overseer/queue state SyncConnected
INFO  - 2015-04-01 17:28:51.142; org.apache.solr.common.cloud.ZkStateReader; 
Updating data for foo to ver 6 
INFO  - 2015-04-01 17:28:51.144; 
org.apache.solr.cloud.Overseer$ClusterStateUpdater; processMessage: queueSize: 
1, message = {
  "operation":"leader",
  "shard":"shard1",
  "collection":"foo",
  "base_url":"http://192.168.1.2:8983/solr",
  "core":"foo_shard1_replica1",
  "state":"active"} current state version: 1
INFO  - 2015-04-01 17:28:51.145; org.apache.solr.cloud.overseer.ZkStateWriter; 
going to update_collection /collections/foo/state.json version: 6
INFO  - 2015-04-01 17:28:51.145; org.apache.solr.common.cloud.ZkStateReader$7; 
A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged 
path:/collections/foo/state.json for collection foo has occurred - updating... 
(live nodes size: 1)
INFO  - 2015-04-01 17:28:51.146; org.apache.solr.common.cloud.ZkStateReader; 
Updating data for foo to ver 7 
INFO  - 2015-04-01 17:28:51.147; 
org.apache.solr.cloud.DistributedQueue$LatchWatcher; NodeChildrenChanged fired 
on path /overseer/queue state SyncConnected
INFO  - 2015-04-01 17:28:51.196; org.apache.solr.cloud.ZkController; We are 
http://192.168.1.2:8983/solr/foo_shard1_replica1/ and leader is 
http://192.168.1.2:8983/solr/foo_shard1_replica1/
INFO  - 2015-04-01 17:28:51.196; org.apache.solr.cloud.ZkController; 

 In register, core.isReloaded? true 


INFO  - 2015-04-01 17:28:51.199; org.apache.solr.common.cloud.ZkStateReader; 
Updating data for foo to ver 7 
INFO  - 2015-04-01 17:28:51.199; org.apache.solr.cloud.ZkController; watch 
zkdir /configs/foo
INFO  - 2015-04-01 17:28:51.199; org.apache.solr.schema.ZkIndexSchemaReader; 
Creating ZooKeeper watch for the managed schema at /configs/foo/managed-schema
INFO  - 2015-04-01 17:28:51.199; org.apache.solr.cloud.ZkController$4; Running 
listeners for /configs/foo
INFO  - 2015-04-01 17:28:51.200; org.apache.solr.core.SolrCore$11; config 
update listener called for core foo_shard1_replica1
INFO  - 2015-04-01 17:28:51.200; org.apache.solr.schema.ZkIndexSchemaReader; 
Current schema version 5 is already the latest
INFO  - 2015-04-01 17:28:51.200; org.apache.solr.schema.ZkIndexSchemaReader; 
Creating ZooKeeper watch for the managed schema at /configs/foo/managed-schema
INFO  - 2015-04-01 17:28:51.200; org.apache.solr.core.SolrConfig; current 
version of requestparams : 0
INFO  - 2015-04-01 17:28:51.201; org.apache.solr.schema.ZkIndexSchemaReader; 
Current schema version 5 is already the latest
INFO  - 2015-04-01 17:28:51.201; 
org.apache.solr.common.cloud.DefaultConnectionStrategy; Reconnected to ZooKeeper
INFO  - 2015-04-01 17:28:51.202; 
org.apache.solr.common.cloud.ConnectionManager; Connected:true
INFO  - 2015-04-01 17:28:51.211; 
org.apache.solr.cloud.ZkController$WatcherImpl; A node got unwatched for 
/configs/foo
INFO  - 2015-04-01 17:28:51.211; org.apache.solr.cloud.ZkController$4; Running 
listeners for /configs/foo
INFO  - 2015-04-01 17:28:51.211; org.apache.solr.core.SolrCore$11; config 
update listener called for core foo_shard1_replica1
INFO  - 2015-04-01 17:28:51.212; org.apache.solr.core.SolrConfig; current 
version of requestparams : 0
INFO  - 2015-04-01 17:29:07.080; org.apache.solr.servlet.SolrDispatchFilter; 
[admin] webapp=null path=/admin/cores 
params={indexInfo=false&wt=json&_=1427909347078} status=0 QTime=0 
INFO  - 2015-04-01 17:29:07.112; org.apache.solr.servlet.SolrDispatchFilter; 
[admin] webapp=null path=/admin/info/system params={wt=json&_=1427909347099} 
status=0 QTime=11 
{code}



