[ 
https://issues.apache.org/jira/browse/SOLR-7198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15007731#comment-15007731
 ] 

Gregory Chanan commented on SOLR-7198:
--------------------------------------

I was able to reproduce this by adding some timeouts in various places.  It 
looks like in 4.10.3 this prevents a collection with that name from being 
created; on trunk you can still create a collection, although this isn't great 
because it leaves extra unnecessary data in ZK.

> Deleting a collection during leader election results in left over znodes in 
> ZK 
> -------------------------------------------------------------------------------
>
>                 Key: SOLR-7198
>                 URL: https://issues.apache.org/jira/browse/SOLR-7198
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.10.3
>            Reporter: Vamsee Yarlagadda
>            Assignee: Gregory Chanan
>
> I am seeing this issue while trying to run a collection delete operation in 
> the middle of leader election process.
> Contents of ZK (after issuing collection delete and waiting for some time)
> {code}
>   /
>     /aliases.json
>     /clusterstate.json
>     /collections
>       SolrUpgrade_collection
>         leaders
>            shard2_1
>     /configs
>     /live_nodes
>     /overseer
>     /overseer_elect
>     /solr
>     /solr.xml
> Contents of znode shard2_1:
> version0
> aversion0
> children_count0
> ctimeThu Mar 05 22:05:28 PST 2015 (1425621928169)
> cversion0
> czxid22815
> ephemeralOwner93427899815755800
> mtimeThu Mar 05 22:05:28 PST 2015 (1425621928169)
> mzxid22815
> pzxid22815
> dataLength194
> {
>   "core":"SolrUpgrade_collection_shard2_1_replica2",
>   "node_name":"search-testing-c5-ha-1.vpc.cloudera.com:8983_solr",
>   "base_url":"http://search-testing-c5-ha-1.vpc.cloudera.com:8983/solr";
> }
> {code}
> Clusterstate.json before running a delete collection
> {code}
> {
>   "shards":{
>     "shard1":{
>       "range":"80000000-ffffffff",
>       "state":"active",
>       "replicas":{
>         "core_node1":{
>           "state":"active",
>           "core":"SolrUpgrade_collection_shard1_replica2",
>           "node_name":"search-testing-c5-ha-4.vpc.cloudera.com:8983_solr",
>           
> "base_url":"http://search-testing-c5-ha-4.vpc.cloudera.com:8983/solr";,
>           "leader":"true"},
>         "core_node2":{
>           "state":"active",
>           "core":"SolrUpgrade_collection_shard1_replica1",
>           "node_name":"search-testing-c5-ha-2.vpc.cloudera.com:8983_solr",
>           
> "base_url":"http://search-testing-c5-ha-2.vpc.cloudera.com:8983/solr"}}},
>     "shard2_0":{
>       "range":"0-3fffffff",
>       "state":"active",
>       "replicas":{
>         "core_node5":{
>           "state":"active",
>           "core":"SolrUpgrade_collection_shard2_0_replica1",
>           "node_name":"search-testing-c5-ha-3.vpc.cloudera.com:8983_solr",
>           
> "base_url":"http://search-testing-c5-ha-3.vpc.cloudera.com:8983/solr";,
>           "leader":"true"},
>         "core_node7":{
>           "state":"active",
>           "core":"SolrUpgrade_collection_shard2_0_replica2",
>           "node_name":"search-testing-c5-ha-4.vpc.cloudera.com:8983_solr",
>           
> "base_url":"http://search-testing-c5-ha-4.vpc.cloudera.com:8983/solr"}}},
>     "shard2_1":{
>       "range":"40000000-7fffffff",
>       "state":"active",
>       "replicas":{
>         "core_node8":{
>           "state":"active",
>           "core":"SolrUpgrade_collection_shard2_1_replica2",
>           "node_name":"search-testing-c5-ha-1.vpc.cloudera.com:8983_solr",
>           
> "base_url":"http://search-testing-c5-ha-1.vpc.cloudera.com:8983/solr"},
>         "core_node9":{
>           "state":"active",
>           "core":"SolrUpgrade_collection_shard2_1_replica3",
>           "node_name":"search-testing-c5-ha-1.vpc.cloudera.com:8983_solr",
>           
> "base_url":"http://search-testing-c5-ha-1.vpc.cloudera.com:8983/solr"}}}},
>   "maxShardsPerNode":"10",
>   "router":"compositeId",
>   "replicationFactor":"2",
>   "autoAddReplicas":"false",
>   "routerSpec":{"name":"compositeId"}}}
> {code}
> As we can notice, shard (*shard2_1*) doesn't have any leader and i can see 
> from the logs that the replicas of the shard just started the leader election 
> process. And here are the Solr logs from one of the above replicas which 
> eventually becomes the leader and registers in ZK even though the collection 
> was deleted.
> {code}
> 2015-03-05 22:05:25,383 INFO 
> org.apache.solr.cloud.ShardLeaderElectionContext: Running the leader process 
> for shard shard2_1
> 2015-03-05 22:05:25,387 INFO 
> org.apache.solr.cloud.ShardLeaderElectionContext: Checking if I 
> (core=SolrUpgrade_collection_shard2_1_replica2,coreNodeName=core_node8) 
> should try and be the leader.
> 2015-03-05 22:05:25,387 INFO 
> org.apache.solr.cloud.ShardLeaderElectionContext: My last published State was 
> Active, it's okay to be the leader.
> 2015-03-05 22:05:25,387 INFO 
> org.apache.solr.cloud.ShardLeaderElectionContext: I may be the new leader - 
> try and sync
> 2015-03-05 22:05:25,506 INFO org.apache.solr.common.cloud.ZkStateReader: A 
> cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged 
> path:/clusterstate.json, has occurred - updating... (live nodes size: 4)
> 2015-03-05 22:05:25,620 INFO org.apache.solr.core.SolrCore.Request: 
> [SolrUpgrade_collection_shard2_1_replica2] webapp=/solr path=/select 
> params={q=*:*&distrib=false&wt=javabin&version=2} hits=118 status=0 QTime=1 
> 2015-03-05 22:05:25,623 INFO org.apache.solr.core.SolrCore.Request: 
> [SolrUpgrade_collection_shard2_1_replica3] webapp=/solr path=/select 
> params={q=*:*&distrib=false&wt=javabin&version=2} hits=118 status=0 QTime=0 
> 2015-03-05 22:05:27,392 INFO org.apache.solr.cloud.ElectionContext: canceling 
> election 
> /collections/SolrUpgrade_collection/leader_elect/shard2_1/election/93427899815755803-core_node8-n_0000000001
> 2015-03-05 22:05:27,393 INFO org.apache.solr.core.SolrCore: 
> [SolrUpgrade_collection_shard2_1_replica3]  CLOSING SolrCore 
> org.apache.solr.core.SolrCore@559b52d3
> 2015-03-05 22:05:27,393 INFO org.apache.solr.update.UpdateHandler: closing 
> DirectUpdateHandler2{commits=0,autocommit maxTime=60000ms,autocommits=0,soft 
> autocommit maxTime=1000ms,soft 
> autocommits=0,optimizes=0,rollbacks=0,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=0,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0,transaction_logs_total_size=0,transaction_logs_total_number=0}
> 2015-03-05 22:05:27,394 INFO org.apache.solr.update.SolrCoreState: Closing 
> SolrCoreState
> 2015-03-05 22:05:27,394 INFO org.apache.solr.update.DefaultSolrCoreState: 
> SolrCoreState ref count has reached 0 - closing IndexWriter
> 2015-03-05 22:05:27,394 INFO org.apache.solr.update.DefaultSolrCoreState: 
> closing IndexWriter with IndexWriterCloser
> 2015-03-05 22:05:27,397 INFO org.apache.solr.servlet.SolrDispatchFilter: 
> [admin] webapp=null path=/admin/cores 
> params={deleteInstanceDir=true&action=UNLOAD&core=SolrUpgrade_collection_shard2_1_replica2&wt=javabin&qt=/admin/cores&deleteDataDir=true&version=2}
>  status=0 QTime=5 
> 2015-03-05 22:05:27,399 INFO 
> org.apache.solr.cloud.ShardLeaderElectionContext: Running the leader process 
> for shard shard2_1
> 2015-03-05 22:05:27,399 INFO org.apache.solr.cloud.ElectionContext: canceling 
> election 
> /collections/SolrUpgrade_collection/leader_elect/shard2_1/election/93427899815755803-core_node9-n_0000000002
> 2015-03-05 22:05:27,400 WARN org.apache.solr.cloud.LeaderElector: 
> org.apache.solr.common.SolrException: SolrCore not 
> found:SolrUpgrade_collection_shard2_1_replica3 in []
>       at 
> org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:211)
>       at 
> org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:163)
>       at 
> org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:125)
>       at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:55)
>       at 
> org.apache.solr.cloud.LeaderElector$ElectionWatcher.process(LeaderElector.java:358)
>       at 
> org.apache.solr.common.cloud.SolrZkClient$3$1.run(SolrZkClient.java:273)
>       at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
> 2015-03-05 22:05:27,415 INFO org.apache.solr.core.SolrCore: 
> [SolrUpgrade_collection_shard2_1_replica3] Closing main searcher on request.
> 2015-03-05 22:05:27,438 INFO org.apache.solr.common.cloud.ZkStateReader: A 
> cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged 
> path:/clusterstate.json, has occurred - updating... (live nodes size: 4)
> 2015-03-05 22:05:27,456 INFO org.apache.solr.core.CachingDirectoryFactory: 
> Closing HdfsDirectoryFactory - 2 directories currently being tracked
> 2015-03-05 22:05:27,457 INFO org.apache.solr.core.CachingDirectoryFactory: 
> looking to close hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data/index 
> [CachedDir<<refCount=0;path=hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data/index;done=false>>]
> 2015-03-05 22:05:27,457 INFO org.apache.solr.core.CachingDirectoryFactory: 
> Closing directory: 
> hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data/index
> 2015-03-05 22:05:27,470 INFO org.apache.solr.store.hdfs.HdfsDirectory: 
> Closing hdfs directory 
> hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data/index
> 2015-03-05 22:05:27,470 INFO org.apache.solr.core.CachingDirectoryFactory: 
> looking to close hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data 
> [CachedDir<<refCount=0;path=hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data;done=false>>]
> 2015-03-05 22:05:27,470 INFO org.apache.solr.core.CachingDirectoryFactory: 
> Closing directory: hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data
> 2015-03-05 22:05:27,470 INFO org.apache.solr.store.hdfs.HdfsDirectory: 
> Closing hdfs directory hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data
> 2015-03-05 22:05:27,471 INFO org.apache.solr.core.CachingDirectoryFactory: 
> Removing directory after core close: 
> hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data
> 2015-03-05 22:05:27,490 INFO org.apache.solr.cloud.ElectionContext: canceling 
> election 
> /collections/SolrUpgrade_collection/leader_elect/shard2_1/election/93427899815755803-core_node9-n_0000000002
> 2015-03-05 22:05:27,491 WARN org.apache.solr.cloud.ElectionContext: 
> cancelElection did not find election node to remove 
> /collections/SolrUpgrade_collection/leader_elect/shard2_1/election/93427899815755803-core_node9-n_0000000002
> 2015-03-05 22:05:27,493 INFO org.apache.solr.servlet.SolrDispatchFilter: 
> [admin] webapp=null path=/admin/cores 
> params={deleteInstanceDir=true&action=UNLOAD&core=SolrUpgrade_collection_shard2_1_replica3&wt=javabin&qt=/admin/cores&deleteDataDir=true&version=2}
>  status=0 QTime=100 
> 2015-03-05 22:05:27,741 INFO org.apache.solr.common.cloud.ZkStateReader: A 
> cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged 
> path:/clusterstate.json, has occurred - updating... (live nodes size: 4)
> 2015-03-05 22:05:27,888 INFO org.apache.solr.cloud.SyncStrategy: Sync 
> replicas to 
> http://search-testing-c5-ha-1.vpc.cloudera.com:8983/solr/SolrUpgrade_collection_shard2_1_replica2/
> 2015-03-05 22:05:27,888 ERROR org.apache.solr.cloud.SyncStrategy: Sync 
> Failed:org.apache.solr.common.cloud.ZooKeeperException: Could not find 
> collection in zk: SolrUpgrade_collection []
>       at 
> org.apache.solr.common.cloud.ZkStateReader.getReplicaProps(ZkStateReader.java:600)
>       at 
> org.apache.solr.common.cloud.ZkStateReader.getReplicaProps(ZkStateReader.java:588)
>       at 
> org.apache.solr.common.cloud.ZkStateReader.getReplicaProps(ZkStateReader.java:583)
>       at 
> org.apache.solr.cloud.SyncStrategy.syncWithReplicas(SyncStrategy.java:152)
>       at 
> org.apache.solr.cloud.SyncStrategy.syncReplicas(SyncStrategy.java:122)
>       at org.apache.solr.cloud.SyncStrategy.sync(SyncStrategy.java:101)
>       at 
> org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:267)
>       at 
> org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:163)
>       at 
> org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:125)
>       at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:55)
>       at 
> org.apache.solr.cloud.LeaderElector$ElectionWatcher.process(LeaderElector.java:358)
>       at 
> org.apache.solr.common.cloud.SolrZkClient$3$1.run(SolrZkClient.java:273)
>       at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:745)
> 2015-03-05 22:05:27,888 INFO org.apache.solr.cloud.SyncStrategy: Leader's 
> attempt to sync with shard failed, moving to the next candidate
> 2015-03-05 22:05:27,888 INFO 
> org.apache.solr.cloud.ShardLeaderElectionContext: We failed sync, but we have 
> no versions - we can't sync in that case - we were active before, so become 
> leader anyway
> 2015-03-05 22:05:27,888 INFO 
> org.apache.solr.cloud.ShardLeaderElectionContext: I am the new leader: 
> http://search-testing-c5-ha-1.vpc.cloudera.com:8983/solr/SolrUpgrade_collection_shard2_1_replica2/
>  shard2_1
> 2015-03-05 22:05:27,888 INFO org.apache.solr.core.SolrCore: 
> [SolrUpgrade_collection_shard2_1_replica2]  CLOSING SolrCore 
> org.apache.solr.core.SolrCore@595167
> 2015-03-05 22:05:27,889 INFO org.apache.solr.update.UpdateHandler: closing 
> DirectUpdateHandler2{commits=1,autocommit maxTime=60000ms,autocommits=0,soft 
> autocommit maxTime=1000ms,soft 
> autocommits=0,optimizes=0,rollbacks=0,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=0,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0,transaction_logs_total_size=0,transaction_logs_total_number=0}
> 2015-03-05 22:05:27,890 INFO org.apache.solr.update.SolrCoreState: Closing 
> SolrCoreState
> 2015-03-05 22:05:27,890 INFO org.apache.solr.update.DefaultSolrCoreState: 
> SolrCoreState ref count has reached 0 - closing IndexWriter
> 2015-03-05 22:05:27,890 INFO org.apache.solr.update.DefaultSolrCoreState: 
> closing IndexWriter with IndexWriterCloser
> 2015-03-05 22:05:27,905 INFO org.apache.solr.core.SolrCore: 
> [SolrUpgrade_collection_shard2_1_replica2] Closing main searcher on request.
> 2015-03-05 22:05:27,911 INFO org.apache.solr.core.CachingDirectoryFactory: 
> Closing HdfsDirectoryFactory - 2 directories currently being tracked
> 2015-03-05 22:05:27,912 INFO org.apache.solr.core.CachingDirectoryFactory: 
> looking to close hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data/index 
> [CachedDir<<refCount=0;path=hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data/index;done=false>>]
> 2015-03-05 22:05:27,912 INFO org.apache.solr.core.CachingDirectoryFactory: 
> Closing directory: 
> hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data/index
> 2015-03-05 22:05:27,924 INFO org.apache.solr.store.hdfs.HdfsDirectory: 
> Closing hdfs directory 
> hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data/index
> 2015-03-05 22:05:27,924 INFO org.apache.solr.core.CachingDirectoryFactory: 
> looking to close hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data 
> [CachedDir<<refCount=0;path=hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data;done=false>>]
> 2015-03-05 22:05:27,924 INFO org.apache.solr.core.CachingDirectoryFactory: 
> Closing directory: hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data
> 2015-03-05 22:05:27,925 INFO org.apache.solr.store.hdfs.HdfsDirectory: 
> Closing hdfs directory hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data
> 2015-03-05 22:05:28,026 INFO org.apache.solr.core.CachingDirectoryFactory: 
> Removing directory after core close: 
> hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data
> 2015-03-05 22:05:28,155 INFO org.apache.solr.common.cloud.SolrZkClient: 
> makePath: /collections/SolrUpgrade_collection/leaders/shard2_1
> 2015-03-05 22:05:28,287 INFO org.apache.solr.common.cloud.ZkStateReader: A 
> cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged 
> path:/clusterstate.json, has occurred - updating... (live nodes size: 4)
> {code}
> This might be because of a missing check at line
>  
> https://github.com/apache/lucene-solr/blob/372e8448021d00d3466b45da8a6e57736354eee8/solr/core/src/java/org/apache/solr/cloud/ElectionContext.java#L138
> As the makePath creates all the missing directories recursively.
> May be we can check whether the collection znode exists before we create the 
> leader znode? Thoughts? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to