[
https://issues.apache.org/jira/browse/SOLR-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Per Steffensen updated SOLR-6133:
---------------------------------
Attachment: CollDelete.java
With this CollDelete.java (which waits not only for all shards to be active,
but also for all of their replicas to be active) I am not able to recreate the
problem. One of my colleagues claims that he has seen the problem even when
checking for active replicas, but I cannot recreate it. I will have to check
with him.
But anyway, if all shards/replicas must be active for a successful delete, we
should consider moving that check to the server side. With a timeout, of
course. Then the server can improve the chance that it either deletes nothing
or deletes everything.
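For reference, the check-then-delete flow is roughly the sketch below. It is a
minimal sketch, not the attached CollDelete.java: it assumes the SolrJ 4.x API
(CloudSolrServer/ZkStateReader), the embedded ZooKeeper from the scenario below
(192.168.78.239:9983), and that collection "mycoll" exists in clusterstate.
{code}
// Minimal sketch (not the attached CollDelete.java): wait until all nodes are
// live and every replica of "mycoll" is active, then send the collection-delete.
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkStateReader;
import org.apache.solr.common.params.ModifiableSolrParams;

public class CollDeleteSketch {
  public static void main(String[] args) throws Exception {
    // zkHost assumed to be the embedded ZK started by -DzkRun on node1
    CloudSolrServer server = new CloudSolrServer("192.168.78.239:9983");
    server.connect();
    ZkStateReader zkReader = server.getZkStateReader();

    boolean ready = false;
    while (!ready) {
      // ZkStateReader keeps the clusterstate up to date via ZooKeeper watches
      ClusterState state = zkReader.getClusterState();
      ready = state.getLiveNodes().size() == 3;        // all 3 Solrs live
      for (Slice slice : state.getSlices("mycoll")) {  // assumes "mycoll" exists
        for (Replica replica : slice.getReplicas()) {  // every replica active
          ready &= ZkStateReader.ACTIVE.equals(replica.getStr(ZkStateReader.STATE_PROP));
        }
      }
      if (!ready) Thread.sleep(500);
    }

    // Only now issue the Collections API DELETE
    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("action", "DELETE");
    params.set("name", "mycoll");
    QueryRequest request = new QueryRequest(params);
    request.setPath("/admin/collections");
    System.out.println(server.request(request));
    server.shutdown();
  }
}
{code}
Even with a client-side wait like this, my colleague has reportedly still seen
the problem, which is why moving the check to the server side (with a timeout)
seems preferable.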
> More robust collection-delete
> -----------------------------
>
> Key: SOLR-6133
> URL: https://issues.apache.org/jira/browse/SOLR-6133
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 4.7.2, 4.8.1
> Reporter: Per Steffensen
> Attachments: CollDelete.java, coll_delete_problem.zip
>
>
> If Solrs are not "stable" (completely up and running etc.) a collection-delete
> request might result in partly deleted collections. You might say that it is
> fair that you are not able to have a collection deleted if all of its shards
> are not actively running - even though I would like a mechanism that just
> deletes them when/if they ever come up again. But even when all shards
> claim to be actively running you can still end up with partly deleted
> collections - that is not acceptable IMHO. At least clusterstate should
> always reflect the actual state, so that you are able to detect that your
> collection-delete request was only partly carried out - which parts were
> successfully deleted and which were not (including information about
> data-folder-deletion).
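> Whether clusterstate still references the collection can, for example, be
> checked directly in ZooKeeper - a quick check, assuming the zkcli.sh script
> shipped in the 4.x example directory and the embedded ZooKeeper on port 9983:
> {code}
> example/cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd get /clusterstate.json
> {code}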
> The text above sounds like an epic-sized task, with potentially numerous
> problems to fix, so in order not to make this ticket "open forever" I will
> point out a particular scenario where I see problems. Once this problem is
> corrected we can close this ticket. Other tickets will have to deal with
> other collection-delete issues.
> Here is what I did and saw:
> * Logged into one of my Linux machines with IP 192.168.78.239
> * Prepared for Solr install
> {code}
> mkdir -p /xXX/solr
> cd /xXX/solr
> {code}
> * downloaded solr-4.7.2.tgz
> * Installed Solr 4.7.2 and prepared for three "nodes"
> {code}
> tar zxvf solr-4.7.2.tgz
> cd solr-4.7.2/
> cp -r example node1
> cp -r example node2
> cp -r example node3
> {code}
> * Bootstrapped the Solr config into ZooKeeper
> {code}
> cd node1
> java -DzkRun -Dhost=192.168.78.239 \
>   -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf \
>   -jar start.jar
> # CTRL-C to stop Solr (node1) again after it has started completely
> {code}
> * Started all three Solr nodes
> {code}
> nohup java -Djetty.port=8983 -Dhost=192.168.78.239 -DzkRun -jar start.jar &>> node1_stdouterr.log &
> cd ../node2
> nohup java -Djetty.port=8984 -Dhost=192.168.78.239 -DzkHost=localhost:9983 -jar start.jar &>> node2_stdouterr.log &
> cd ../node3
> nohup java -Djetty.port=8985 -Dhost=192.168.78.239 -DzkHost=localhost:9983 -jar start.jar &>> node3_stdouterr.log &
> {code}
> * Created a collection "mycoll"
> {code}
> curl 'http://192.168.78.239:8983/solr/admin/collections?action=CREATE&name=mycoll&numShards=6&replicationFactor=1&maxShardsPerNode=2&collection.configName=myconf'
> {code}
> * Collected "Cloud Graph" image, clusterstate.json and info about data
> folders (see attached coll_delete_problem.zip |
> after_create_all_solrs_still_running). You will see that everything is as it
> is supposed to be: two shards per node, six in total, it is all reflected
> in clusterstate and there is a data-folder for each shard.
> * Stopped all three Solr nodes
> {code}
> kill $(ps -ef | grep 8985 | grep -v grep | awk '{print $2}')
> kill $(ps -ef | grep 8984 | grep -v grep | awk '{print $2}')
> kill $(ps -ef | grep 8983 | grep -v grep | awk '{print $2}')
> {code}
> * Started Solr node1 only (wait for it to start completely)
> {code}
> cd ../node1
> nohup java -Djetty.port=8983 -Dhost=192.168.78.239 -DzkRun -jar start.jar &>> node1_stdouterr.log &
> # Wait for it to start fully - might take a minute or so
> {code}
> * Collected "Cloud Graph" image, clusterstate.json and info about data
> folders (see attached coll_delete_problem.zip |
> after_create_solr1_restarted_solr2_and_3_not_started_yet). You will see that
> everything is as it is supposed to be: two shards per node, six in total,
> the four on node2 and node3 are down, it is all reflected in clusterstate
> and there is still a data-folder for each shard.
> * Started CollDelete.java (see attached coll_delete_problem.zip) - it will
> delete collection "mycoll" once all three Solrs are live and all shards are
> "active" (the equivalent Collections API call is shown below)
> * Started the remaining two Solr nodes
> {code}
> cd ../node2
> nohup java -Djetty.port=8984 -Dhost=192.168.78.239 -DzkHost=localhost:9983 -jar start.jar &>> node2_stdouterr.log &
> cd ../node3
> nohup java -Djetty.port=8985 -Dhost=192.168.78.239 -DzkHost=localhost:9983 -jar start.jar &>> node3_stdouterr.log &
> {code}
> * After CollDelete.java finished, collected "Cloud Graph" image,
> clusterstate.json, info about data folders and output from CollDelete.java
> (see attached coll_delete_problem.zip |
> after_create_all_solrs_restarted_delete_coll_while_solr2_and_3_was_starting_up).
> You will see that not everything is as it is supposed to be. All info about
> "mycoll" has been deleted from clusterstate - ok. But data-folders remain for
> node2 and node3 - not ok.
> ** CollDelete output
> {code}
> All 3 solrs live
> All (6) shards active. Now deleting
> {responseHeader={status=0,QTime=1823},
>  failure={
>    192.168.78.239:8985_solr=org.apache.solr.client.solrj.SolrServerException:Server refused connection at: http://192.168.78.239:8985/solr,
>    192.168.78.239:8984_solr=org.apache.solr.client.solrj.SolrServerException:Server refused connection at: http://192.168.78.239:8984/solr,
>    192.168.78.239:8984_solr=org.apache.solr.client.solrj.SolrServerException:Server refused connection at: http://192.168.78.239:8984/solr,
>    192.168.78.239:8985_solr=org.apache.solr.client.solrj.SolrServerException:Server refused connection at: http://192.168.78.239:8985/solr},
>  success={
>    192.168.78.239:8983_solr={responseHeader={status=0,QTime=305}},
>    192.168.78.239:8983_solr={responseHeader={status=0,QTime=260}}}}
> {code}
> * Please note that subsequent attempts to collection-delete "mycoll" will
> fail, because Solr claims that "mycoll" does not exist.
> * I stopped the Solrs again
> * Collected stdouterr files (see attached coll_delete_problem.zip)
> In this scenario you see that, because you send the delete-collection request
> while some Solrs have not completely started yet, you end up in a situation
> where it seems like the collection has been deleted, but data-folders are
> still left on disk taking up disk space. The most significant thing is that
> this happens even though the client (sending the delete-request) waits until
> all Solrs are live and all shards of the collection to be deleted claim to be
> active. What more can a careful client do?
> *In this particular case, where you specifically wait for Solrs to be live
> and shards to be active, I think we should make sure that everything is
> deleted (including folders) correctly*
> I am not looking for a bullet-proof solution. I believe we can always come up
> with crazy scenarios where you end up with a half-deleted collection. But
> this particular scenario should work, I believe.
> Please note that I have seen other scenarios where only parts of the stuff
> in clusterstate are deleted (try removing the parts about waiting for active
> shards in CollDelete.java, so that you are only waiting for live Solrs);
> those scenarios just seem to be harder to reproduce consistently. But the
> fact that such situations can also occur might help when designing the
> robust solution.
> Please also note that I tested this on 4.7.2 because it is the latest
> Java 6-enabled release and I only had Java 6 on my machine. One of my
> colleagues has tested 4.8.1 on a machine with Java 7 - no difference.