Per Steffensen created SOLR-6133:
------------------------------------

             Summary: More robust collection-delete
                 Key: SOLR-6133
                 URL: https://issues.apache.org/jira/browse/SOLR-6133
             Project: Solr
          Issue Type: Bug
          Components: SolrCloud
    Affects Versions: 4.8.1, 4.7.2
            Reporter: Per Steffensen


If the Solrs are not "stable" (completely up and running etc.), a collection-delete request might result in a partly deleted collection. You might say it is fair that you cannot have a collection deleted while all of its shards are not actively running - even though I would like a mechanism that simply deletes them when/if they ever come up again. But even when all shards claim to be actively running, you can still end up with a partly deleted collection - that is not acceptable IMHO. At the very least, clusterstate should always reflect the actual state, so that you are able to detect that your collection-delete request was only partly carried out - which parts were successfully deleted and which were not (including information about data-folder deletion).

The text above sounds like an epic-sized task, with potentially numerous problems to fix, so in order not to leave this ticket "open forever" I will point out one particular scenario where I see problems. When this problem is corrected we can close this ticket. Other tickets will have to deal with other collection-delete issues.

Here is what I did and saw
* Logged into one of my Linux machines with IP 192.168.78.239
* Prepared for Solr install
{code}
mkdir -p /xXX/solr
cd /xXX/solr
{code}
* Downloaded solr-4.7.2.tgz
* Installed Solr 4.7.2 and prepared for three "nodes"
{code}
tar zxvf solr-4.7.2.tgz
cd solr-4.7.2/
cp -r example node1
cp -r example node2
cp -r example node3
{code}
* Bootstrapped the Solr config into ZooKeeper (as "myconf")
{code}
cd node1
java -DzkRun -Dhost=192.168.78.239 -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -jar start.jar
# CTRL-C to stop Solr (node1) again after it has started completely
{code}
* Started all three Solr nodes
{code}
nohup java -Djetty.port=8983 -Dhost=192.168.78.239 -DzkRun -jar start.jar &>> node1_stdouterr.log &
cd ../node2
nohup java -Djetty.port=8984 -Dhost=192.168.78.239 -DzkHost=localhost:9983 -jar start.jar &>> node2_stdouterr.log &
cd ../node3
nohup java -Djetty.port=8985 -Dhost=192.168.78.239 -DzkHost=localhost:9983 -jar start.jar &>> node3_stdouterr.log &
{code}
* Created a collection "mycoll"
{code}
curl 'http://192.168.78.239:8983/solr/admin/collections?action=CREATE&name=mycoll&numShards=6&replicationFactor=1&maxShardsPerNode=2&collection.configName=myconf'
{code}
* Collected "Cloud Graph" image, clusterstate.json and info about data folders 
(see attached coll_delete_problem.zip | after_create_all_solrs_still_running). 
You will see that everything is as it is supposed to be. Two shards per node, 
six all in all, it is all reflected in clusterstate and there is a data-folder 
for each shard
* Stopped all three Solr nodes
{code}
kill $(ps -ef | grep 8985 | grep -v grep | awk '{print $2}')
kill $(ps -ef | grep 8984 | grep -v grep | awk '{print $2}')
kill $(ps -ef | grep 8983 | grep -v grep | awk '{print $2}')
{code}
* Started Solr node1 only (wait for it to start completely)
{code}
cd ../node1
nohup java -Djetty.port=8983 -Dhost=192.168.78.239 -DzkRun -jar start.jar &>> node1_stdouterr.log &
# wait for it to start fully - might take a minute or so
{code}
* Collected the "Cloud Graph" image, clusterstate.json and info about the data folders (see attached coll_delete_problem.zip | after_create_solr1_restarted_solr2_and_3_not_started_yet). You will see that everything is as it is supposed to be: two shards per node, six in total, the four on node2 and node3 are down, it is all reflected in clusterstate, and there is still a data folder for each shard.
* Started CollDelete.java (see attached coll_delete_problem.zip) - it deletes collection "mycoll" once all three Solrs are live and all shards of the collection are "active" (a sketch of that logic is included below the scenario)
* Started the remaining two Solr nodes
{code}
cd ../node2
nohup java -Djetty.port=8984 -Dhost=192.168.78.239 -DzkHost=localhost:9983 -jar start.jar &>> node2_stdouterr.log &
cd ../node3
nohup java -Djetty.port=8985 -Dhost=192.168.78.239 -DzkHost=localhost:9983 -jar start.jar &>> node3_stdouterr.log &
{code}
* After CollDelete.java finished, collected the "Cloud Graph" image, clusterstate.json, info about the data folders and the output from CollDelete.java (see attached coll_delete_problem.zip | after_create_all_solrs_restarted_delete_coll_while_solr2_and_3_was_starting_up). You will see that not everything is as it is supposed to be. All info about "mycoll" has been deleted from clusterstate - ok. But the data folders remain for node2 and node3 - not ok.
** CollDelete output
{code}
All 3 solrs live
All (6) shards active. Now deleting
{responseHeader={status=0,QTime=1823},
 failure={
   192.168.78.239:8985_solr=org.apache.solr.client.solrj.SolrServerException:Server refused connection at: http://192.168.78.239:8985/solr,
   192.168.78.239:8984_solr=org.apache.solr.client.solrj.SolrServerException:Server refused connection at: http://192.168.78.239:8984/solr,
   192.168.78.239:8984_solr=org.apache.solr.client.solrj.SolrServerException:Server refused connection at: http://192.168.78.239:8984/solr,
   192.168.78.239:8985_solr=org.apache.solr.client.solrj.SolrServerException:Server refused connection at: http://192.168.78.239:8985/solr},
 success={
   192.168.78.239:8983_solr={responseHeader={status=0,QTime=305}},
   192.168.78.239:8983_solr={responseHeader={status=0,QTime=260}}}}
{code}
* Please note that subsequent attempts to delete collection "mycoll" will fail, because Solr claims that "mycoll" does not exist.
* Stopped the Solrs again
* Collected stdouterr files (see attached coll_delete_problem.zip)

In this scenario you see that, because you send the delete-collection request while some Solrs have not completely started yet, you end up in a situation where it seems like the collection has been deleted, but data folders are still left on disk taking up disk space. The most significant thing is that this happens even though the client (sending the delete request) waits until all Solrs are live and all shards of the collection to be deleted claim to be active. What more can a careful client do?
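
For reference, the wait-then-delete logic described above looks roughly like the following in SolrJ. This is only a minimal sketch of the idea, not the attached CollDelete.java verbatim; the class name, the zkHost and the exact waiting loop are assumptions based on the scenario above (Solr/SolrJ 4.7.x API):
{code}
import java.util.Collection;
import java.util.Set;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkStateReader;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.util.NamedList;

// Hypothetical stand-in for the attached CollDelete.java: wait until all three
// Solrs are live and all shards of "mycoll" report state=active, then delete it.
public class CollDeleteSketch {
  public static void main(String[] args) throws Exception {
    // Embedded ZK from the scenario above runs on jetty.port + 1000 = 9983
    CloudSolrServer server = new CloudSolrServer("192.168.78.239:9983");
    server.connect();
    ZkStateReader zkStateReader = server.getZkStateReader();

    // Wait for three live nodes and for every replica of "mycoll" to be active
    while (true) {
      ClusterState clusterState = zkStateReader.getClusterState();
      Set<String> liveNodes = clusterState.getLiveNodes();
      Collection<Slice> slices = clusterState.getSlices("mycoll");
      boolean allActive = (slices != null);
      if (allActive) {
        for (Slice slice : slices) {
          for (Replica replica : slice.getReplicas()) {
            if (!"active".equals(replica.getStr(ZkStateReader.STATE_PROP))) {
              allActive = false;
            }
          }
        }
      }
      if (liveNodes.size() >= 3 && allActive) break;
      Thread.sleep(1000);
    }
    System.out.println("All 3 solrs live");
    System.out.println("All shards active. Now deleting");

    // Issue the Collections API DELETE through SolrJ's generic request API
    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("action", "DELETE");
    params.set("name", "mycoll");
    QueryRequest request = new QueryRequest(params);
    request.setPath("/admin/collections");
    NamedList<Object> response = server.request(request);
    System.out.println(response);

    server.shutdown();
  }
}
{code}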

*In this particular case, where you specifically wait for the Solrs to be live and the shards to be active, I think we should make sure that everything is deleted correctly (including the data folders).*

I am not looking for a bullet-proof solution. I believe we can always come up with crazy scenarios where you end up with a half-deleted collection. But this particular scenario should work, I believe.
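
Regarding the point above that a client should at least be able to detect a partly carried out delete: the CollDelete output earlier does contain a per-node "failure" section, so a minimal client-side check could look like the fragment below. It continues the SolrJ sketch above; the "failure" key is taken from that output:
{code}
// Continues the sketch above: inspect the Collections API response for the
// "failure" section seen in the CollDelete output, so the client at least
// knows which nodes the delete did not reach.
NamedList<?> failure = (NamedList<?>) response.get("failure");
if (failure != null && failure.size() > 0) {
  System.err.println("Collection delete was only partly carried out; failures:");
  for (int i = 0; i < failure.size(); i++) {
    System.err.println("  " + failure.getName(i) + " -> " + failure.getVal(i));
  }
}
{code}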

Please note that I have seen other scenarios where only part of the information in clusterstate is deleted (try removing the wait-for-active-shards part of CollDelete.java, so that you only wait for live Solrs); they just seem to be harder to reproduce consistently. But the fact that such situations can also occur might help when designing the robust solution.

Please also note that I tested this on 4.7.2 because it is the latest Java 6-enabled release and I only had Java 6 on my machine. One of my colleagues has tested 4.8.1 on a machine with Java 7 - no difference.


