[
https://issues.apache.org/jira/browse/SOLR-7198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15007731#comment-15007731
]
Gregory Chanan commented on SOLR-7198:
--------------------------------------
I was able to reproduce this by adding some timeouts in various places. It
looks like in 4.10.3 this prevents a collection with that name from being
created; on trunk you can still create a collection, although this isn't great
because it leaves extra unnecessary data in ZK.
> Deleting a collection during leader election results in left over znodes in
> ZK
> -------------------------------------------------------------------------------
>
> Key: SOLR-7198
> URL: https://issues.apache.org/jira/browse/SOLR-7198
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 4.10.3
> Reporter: Vamsee Yarlagadda
> Assignee: Gregory Chanan
>
> I am seeing this issue while trying to run a collection delete operation in
> the middle of leader election process.
> Contents of ZK (after issuing collection delete and waiting for some time)
> {code}
> /
> /aliases.json
> /clusterstate.json
> /collections
> SolrUpgrade_collection
> leaders
> shard2_1
> /configs
> /live_nodes
> /overseer
> /overseer_elect
> /solr
> /solr.xml
> Contents of znode shard2_1:
> version0
> aversion0
> children_count0
> ctimeThu Mar 05 22:05:28 PST 2015 (1425621928169)
> cversion0
> czxid22815
> ephemeralOwner93427899815755800
> mtimeThu Mar 05 22:05:28 PST 2015 (1425621928169)
> mzxid22815
> pzxid22815
> dataLength194
> {
> "core":"SolrUpgrade_collection_shard2_1_replica2",
> "node_name":"search-testing-c5-ha-1.vpc.cloudera.com:8983_solr",
> "base_url":"http://search-testing-c5-ha-1.vpc.cloudera.com:8983/solr"
> }
> {code}
> Clusterstate.json before running a delete collection
> {code}
> {
> "shards":{
> "shard1":{
> "range":"80000000-ffffffff",
> "state":"active",
> "replicas":{
> "core_node1":{
> "state":"active",
> "core":"SolrUpgrade_collection_shard1_replica2",
> "node_name":"search-testing-c5-ha-4.vpc.cloudera.com:8983_solr",
>
> "base_url":"http://search-testing-c5-ha-4.vpc.cloudera.com:8983/solr",
> "leader":"true"},
> "core_node2":{
> "state":"active",
> "core":"SolrUpgrade_collection_shard1_replica1",
> "node_name":"search-testing-c5-ha-2.vpc.cloudera.com:8983_solr",
>
> "base_url":"http://search-testing-c5-ha-2.vpc.cloudera.com:8983/solr"}}},
> "shard2_0":{
> "range":"0-3fffffff",
> "state":"active",
> "replicas":{
> "core_node5":{
> "state":"active",
> "core":"SolrUpgrade_collection_shard2_0_replica1",
> "node_name":"search-testing-c5-ha-3.vpc.cloudera.com:8983_solr",
>
> "base_url":"http://search-testing-c5-ha-3.vpc.cloudera.com:8983/solr",
> "leader":"true"},
> "core_node7":{
> "state":"active",
> "core":"SolrUpgrade_collection_shard2_0_replica2",
> "node_name":"search-testing-c5-ha-4.vpc.cloudera.com:8983_solr",
>
> "base_url":"http://search-testing-c5-ha-4.vpc.cloudera.com:8983/solr"}}},
> "shard2_1":{
> "range":"40000000-7fffffff",
> "state":"active",
> "replicas":{
> "core_node8":{
> "state":"active",
> "core":"SolrUpgrade_collection_shard2_1_replica2",
> "node_name":"search-testing-c5-ha-1.vpc.cloudera.com:8983_solr",
>
> "base_url":"http://search-testing-c5-ha-1.vpc.cloudera.com:8983/solr"},
> "core_node9":{
> "state":"active",
> "core":"SolrUpgrade_collection_shard2_1_replica3",
> "node_name":"search-testing-c5-ha-1.vpc.cloudera.com:8983_solr",
>
> "base_url":"http://search-testing-c5-ha-1.vpc.cloudera.com:8983/solr"}}}},
> "maxShardsPerNode":"10",
> "router":"compositeId",
> "replicationFactor":"2",
> "autoAddReplicas":"false",
> "routerSpec":{"name":"compositeId"}}}
> {code}
> As we can notice, shard (*shard2_1*) doesn't have any leader and i can see
> from the logs that the replicas of the shard just started the leader election
> process. And here are the Solr logs from one of the above replicas which
> eventually becomes the leader and registers in ZK even though the collection
> was deleted.
> {code}
> 2015-03-05 22:05:25,383 INFO
> org.apache.solr.cloud.ShardLeaderElectionContext: Running the leader process
> for shard shard2_1
> 2015-03-05 22:05:25,387 INFO
> org.apache.solr.cloud.ShardLeaderElectionContext: Checking if I
> (core=SolrUpgrade_collection_shard2_1_replica2,coreNodeName=core_node8)
> should try and be the leader.
> 2015-03-05 22:05:25,387 INFO
> org.apache.solr.cloud.ShardLeaderElectionContext: My last published State was
> Active, it's okay to be the leader.
> 2015-03-05 22:05:25,387 INFO
> org.apache.solr.cloud.ShardLeaderElectionContext: I may be the new leader -
> try and sync
> 2015-03-05 22:05:25,506 INFO org.apache.solr.common.cloud.ZkStateReader: A
> cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged
> path:/clusterstate.json, has occurred - updating... (live nodes size: 4)
> 2015-03-05 22:05:25,620 INFO org.apache.solr.core.SolrCore.Request:
> [SolrUpgrade_collection_shard2_1_replica2] webapp=/solr path=/select
> params={q=*:*&distrib=false&wt=javabin&version=2} hits=118 status=0 QTime=1
> 2015-03-05 22:05:25,623 INFO org.apache.solr.core.SolrCore.Request:
> [SolrUpgrade_collection_shard2_1_replica3] webapp=/solr path=/select
> params={q=*:*&distrib=false&wt=javabin&version=2} hits=118 status=0 QTime=0
> 2015-03-05 22:05:27,392 INFO org.apache.solr.cloud.ElectionContext: canceling
> election
> /collections/SolrUpgrade_collection/leader_elect/shard2_1/election/93427899815755803-core_node8-n_0000000001
> 2015-03-05 22:05:27,393 INFO org.apache.solr.core.SolrCore:
> [SolrUpgrade_collection_shard2_1_replica3] CLOSING SolrCore
> org.apache.solr.core.SolrCore@559b52d3
> 2015-03-05 22:05:27,393 INFO org.apache.solr.update.UpdateHandler: closing
> DirectUpdateHandler2{commits=0,autocommit maxTime=60000ms,autocommits=0,soft
> autocommit maxTime=1000ms,soft
> autocommits=0,optimizes=0,rollbacks=0,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=0,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0,transaction_logs_total_size=0,transaction_logs_total_number=0}
> 2015-03-05 22:05:27,394 INFO org.apache.solr.update.SolrCoreState: Closing
> SolrCoreState
> 2015-03-05 22:05:27,394 INFO org.apache.solr.update.DefaultSolrCoreState:
> SolrCoreState ref count has reached 0 - closing IndexWriter
> 2015-03-05 22:05:27,394 INFO org.apache.solr.update.DefaultSolrCoreState:
> closing IndexWriter with IndexWriterCloser
> 2015-03-05 22:05:27,397 INFO org.apache.solr.servlet.SolrDispatchFilter:
> [admin] webapp=null path=/admin/cores
> params={deleteInstanceDir=true&action=UNLOAD&core=SolrUpgrade_collection_shard2_1_replica2&wt=javabin&qt=/admin/cores&deleteDataDir=true&version=2}
> status=0 QTime=5
> 2015-03-05 22:05:27,399 INFO
> org.apache.solr.cloud.ShardLeaderElectionContext: Running the leader process
> for shard shard2_1
> 2015-03-05 22:05:27,399 INFO org.apache.solr.cloud.ElectionContext: canceling
> election
> /collections/SolrUpgrade_collection/leader_elect/shard2_1/election/93427899815755803-core_node9-n_0000000002
> 2015-03-05 22:05:27,400 WARN org.apache.solr.cloud.LeaderElector:
> org.apache.solr.common.SolrException: SolrCore not
> found:SolrUpgrade_collection_shard2_1_replica3 in []
> at
> org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:211)
> at
> org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:163)
> at
> org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:125)
> at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:55)
> at
> org.apache.solr.cloud.LeaderElector$ElectionWatcher.process(LeaderElector.java:358)
> at
> org.apache.solr.common.cloud.SolrZkClient$3$1.run(SolrZkClient.java:273)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> 2015-03-05 22:05:27,415 INFO org.apache.solr.core.SolrCore:
> [SolrUpgrade_collection_shard2_1_replica3] Closing main searcher on request.
> 2015-03-05 22:05:27,438 INFO org.apache.solr.common.cloud.ZkStateReader: A
> cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged
> path:/clusterstate.json, has occurred - updating... (live nodes size: 4)
> 2015-03-05 22:05:27,456 INFO org.apache.solr.core.CachingDirectoryFactory:
> Closing HdfsDirectoryFactory - 2 directories currently being tracked
> 2015-03-05 22:05:27,457 INFO org.apache.solr.core.CachingDirectoryFactory:
> looking to close hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data/index
> [CachedDir<<refCount=0;path=hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data/index;done=false>>]
> 2015-03-05 22:05:27,457 INFO org.apache.solr.core.CachingDirectoryFactory:
> Closing directory:
> hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data/index
> 2015-03-05 22:05:27,470 INFO org.apache.solr.store.hdfs.HdfsDirectory:
> Closing hdfs directory
> hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data/index
> 2015-03-05 22:05:27,470 INFO org.apache.solr.core.CachingDirectoryFactory:
> looking to close hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data
> [CachedDir<<refCount=0;path=hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data;done=false>>]
> 2015-03-05 22:05:27,470 INFO org.apache.solr.core.CachingDirectoryFactory:
> Closing directory: hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data
> 2015-03-05 22:05:27,470 INFO org.apache.solr.store.hdfs.HdfsDirectory:
> Closing hdfs directory hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data
> 2015-03-05 22:05:27,471 INFO org.apache.solr.core.CachingDirectoryFactory:
> Removing directory after core close:
> hdfs://ns1/solr/SolrUpgrade_collection/core_node9/data
> 2015-03-05 22:05:27,490 INFO org.apache.solr.cloud.ElectionContext: canceling
> election
> /collections/SolrUpgrade_collection/leader_elect/shard2_1/election/93427899815755803-core_node9-n_0000000002
> 2015-03-05 22:05:27,491 WARN org.apache.solr.cloud.ElectionContext:
> cancelElection did not find election node to remove
> /collections/SolrUpgrade_collection/leader_elect/shard2_1/election/93427899815755803-core_node9-n_0000000002
> 2015-03-05 22:05:27,493 INFO org.apache.solr.servlet.SolrDispatchFilter:
> [admin] webapp=null path=/admin/cores
> params={deleteInstanceDir=true&action=UNLOAD&core=SolrUpgrade_collection_shard2_1_replica3&wt=javabin&qt=/admin/cores&deleteDataDir=true&version=2}
> status=0 QTime=100
> 2015-03-05 22:05:27,741 INFO org.apache.solr.common.cloud.ZkStateReader: A
> cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged
> path:/clusterstate.json, has occurred - updating... (live nodes size: 4)
> 2015-03-05 22:05:27,888 INFO org.apache.solr.cloud.SyncStrategy: Sync
> replicas to
> http://search-testing-c5-ha-1.vpc.cloudera.com:8983/solr/SolrUpgrade_collection_shard2_1_replica2/
> 2015-03-05 22:05:27,888 ERROR org.apache.solr.cloud.SyncStrategy: Sync
> Failed:org.apache.solr.common.cloud.ZooKeeperException: Could not find
> collection in zk: SolrUpgrade_collection []
> at
> org.apache.solr.common.cloud.ZkStateReader.getReplicaProps(ZkStateReader.java:600)
> at
> org.apache.solr.common.cloud.ZkStateReader.getReplicaProps(ZkStateReader.java:588)
> at
> org.apache.solr.common.cloud.ZkStateReader.getReplicaProps(ZkStateReader.java:583)
> at
> org.apache.solr.cloud.SyncStrategy.syncWithReplicas(SyncStrategy.java:152)
> at
> org.apache.solr.cloud.SyncStrategy.syncReplicas(SyncStrategy.java:122)
> at org.apache.solr.cloud.SyncStrategy.sync(SyncStrategy.java:101)
> at
> org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:267)
> at
> org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:163)
> at
> org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:125)
> at org.apache.solr.cloud.LeaderElector.access$200(LeaderElector.java:55)
> at
> org.apache.solr.cloud.LeaderElector$ElectionWatcher.process(LeaderElector.java:358)
> at
> org.apache.solr.common.cloud.SolrZkClient$3$1.run(SolrZkClient.java:273)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> 2015-03-05 22:05:27,888 INFO org.apache.solr.cloud.SyncStrategy: Leader's
> attempt to sync with shard failed, moving to the next candidate
> 2015-03-05 22:05:27,888 INFO
> org.apache.solr.cloud.ShardLeaderElectionContext: We failed sync, but we have
> no versions - we can't sync in that case - we were active before, so become
> leader anyway
> 2015-03-05 22:05:27,888 INFO
> org.apache.solr.cloud.ShardLeaderElectionContext: I am the new leader:
> http://search-testing-c5-ha-1.vpc.cloudera.com:8983/solr/SolrUpgrade_collection_shard2_1_replica2/
> shard2_1
> 2015-03-05 22:05:27,888 INFO org.apache.solr.core.SolrCore:
> [SolrUpgrade_collection_shard2_1_replica2] CLOSING SolrCore
> org.apache.solr.core.SolrCore@595167
> 2015-03-05 22:05:27,889 INFO org.apache.solr.update.UpdateHandler: closing
> DirectUpdateHandler2{commits=1,autocommit maxTime=60000ms,autocommits=0,soft
> autocommit maxTime=1000ms,soft
> autocommits=0,optimizes=0,rollbacks=0,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=0,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0,transaction_logs_total_size=0,transaction_logs_total_number=0}
> 2015-03-05 22:05:27,890 INFO org.apache.solr.update.SolrCoreState: Closing
> SolrCoreState
> 2015-03-05 22:05:27,890 INFO org.apache.solr.update.DefaultSolrCoreState:
> SolrCoreState ref count has reached 0 - closing IndexWriter
> 2015-03-05 22:05:27,890 INFO org.apache.solr.update.DefaultSolrCoreState:
> closing IndexWriter with IndexWriterCloser
> 2015-03-05 22:05:27,905 INFO org.apache.solr.core.SolrCore:
> [SolrUpgrade_collection_shard2_1_replica2] Closing main searcher on request.
> 2015-03-05 22:05:27,911 INFO org.apache.solr.core.CachingDirectoryFactory:
> Closing HdfsDirectoryFactory - 2 directories currently being tracked
> 2015-03-05 22:05:27,912 INFO org.apache.solr.core.CachingDirectoryFactory:
> looking to close hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data/index
> [CachedDir<<refCount=0;path=hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data/index;done=false>>]
> 2015-03-05 22:05:27,912 INFO org.apache.solr.core.CachingDirectoryFactory:
> Closing directory:
> hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data/index
> 2015-03-05 22:05:27,924 INFO org.apache.solr.store.hdfs.HdfsDirectory:
> Closing hdfs directory
> hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data/index
> 2015-03-05 22:05:27,924 INFO org.apache.solr.core.CachingDirectoryFactory:
> looking to close hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data
> [CachedDir<<refCount=0;path=hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data;done=false>>]
> 2015-03-05 22:05:27,924 INFO org.apache.solr.core.CachingDirectoryFactory:
> Closing directory: hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data
> 2015-03-05 22:05:27,925 INFO org.apache.solr.store.hdfs.HdfsDirectory:
> Closing hdfs directory hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data
> 2015-03-05 22:05:28,026 INFO org.apache.solr.core.CachingDirectoryFactory:
> Removing directory after core close:
> hdfs://ns1/solr/SolrUpgrade_collection/core_node8/data
> 2015-03-05 22:05:28,155 INFO org.apache.solr.common.cloud.SolrZkClient:
> makePath: /collections/SolrUpgrade_collection/leaders/shard2_1
> 2015-03-05 22:05:28,287 INFO org.apache.solr.common.cloud.ZkStateReader: A
> cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged
> path:/clusterstate.json, has occurred - updating... (live nodes size: 4)
> {code}
> This might be because of a missing check at line
>
> https://github.com/apache/lucene-solr/blob/372e8448021d00d3466b45da8a6e57736354eee8/solr/core/src/java/org/apache/solr/cloud/ElectionContext.java#L138
> As the makePath creates all the missing directories recursively.
> May be we can check whether the collection znode exists before we create the
> leader znode? Thoughts?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]