[
https://issues.apache.org/jira/browse/SOLR-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017439#comment-14017439
]
Per Steffensen commented on SOLR-6133:
--------------------------------------
In general zk=truth sounds like a great idea :-) But shouldn't zk=truth be
implicit when either zkRun or zkHost is set?
I am not sure about the terminology, but I believe "unloaded" does not include
deleting the data from disk? My main concern with the scenario I show is that
data is not being deleted from disk. We would really like some way to make
(fairly) sure that data is deleted when we fire a collection-delete request
(and info disappears from ZK). We have enormous amounts of data and will run
out of disk-space if we do not have our data-folders deleted.
I am also a little bit concerned about the "on startup" part of "will be
unloaded on startup". In the scenario I show above, the shards that were
deleted from zk but not from disk will pop up in zk again on restart of the
Solrs (because the folders still contain core.properties, I believe), and then
we get a second chance to delete them because we can re-detect that an unwanted
collection (partly) exists. So if zk=truth means that data will not be deleted,
and that the shards will not re-appear in zk after restart of the Solrs, it is
actually a step back wrt my main concern. But back to my concern with "on
startup": we actually restart Solrs very rarely (because they run fairly
stably, which is a good thing), so I am concerned about a solution that only
"cleans up or recovers" on restart.
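A minimal sketch of how leftover core folders could be spotted on a node without waiting for a restart. It assumes that each core folder holds a core.properties file containing a `collection=<name>` line; the function name `list_leftover_cores` and its arguments are hypothetical and not part of Solr.

```shell
# list_leftover_cores SOLR_HOME COLLECTION
# Prints every directory under SOLR_HOME whose core.properties file has a
# line that is exactly "collection=COLLECTION" (assumed file layout).
list_leftover_cores() {
  # grep -l prints the names of matching files, -x matches whole lines,
  # -F treats the collection name as a fixed string rather than a regex.
  find "$1" -type f -name core.properties \
    -exec grep -lxF "collection=$2" {} + |
    sed 's|/core\.properties$||'   # keep only the core directory
}
```

Run after a collection-delete, for example as `list_leftover_cores node2/solr mycoll`, any output would point at folders that are still taking up disk-space.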
I am keen on improving collection-delete to do whatever it can to be "all or
nothing". Would you consider adding, on the Solr server side, the "check that
all nodes are live and that all shards/replicas are active before delete" logic
from CollDelete.java? This would be a step in the "all or nothing" direction,
and it is even more important for non-SolrJ clients, which really cannot do the
trick themselves on the client side (unless they do the zk-data juggling on the
client side in some other way).
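The kind of pre-delete check meant here can be sketched roughly as follows. This is not the code from CollDelete.java; it is an illustration that works on two hypothetical flat files (a list of live node names, and one `<node_name> <state>` line per replica of the collection), which would first have to be extracted from ZK by some other means. The function name `safe_to_delete` is made up as well.

```shell
# safe_to_delete LIVE_NODES_FILE REPLICAS_FILE
# LIVE_NODES_FILE: one live node name per line.
# REPLICAS_FILE:   one "<node_name> <state>" line per replica.
# Both file formats are assumptions for this sketch, not Solr formats.
# Succeeds only if there is at least one replica and every replica is
# "active" and sits on a live node.
safe_to_delete() {
  awk '
    FILENAME == ARGV[1] { live[$1] = 1; next }   # first file: record live nodes
    { n++ }                                       # second file: count replicas
    $2 != "active" || !($1 in live) { bad = 1 }   # not active, or node not live
    END { exit (bad || n == 0) ? 1 : 0 }
  ' "$1" "$2"
}
```

A caller would send the delete request only if `safe_to_delete live_nodes.txt mycoll_replicas.txt` succeeds. As the scenario in the ticket shows, passing such a check does not by itself guarantee that every node can carry out the delete, so it is only one step towards "all or nothing".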
> More robust collection-delete
> -----------------------------
>
> Key: SOLR-6133
> URL: https://issues.apache.org/jira/browse/SOLR-6133
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 4.7.2, 4.8.1
> Reporter: Per Steffensen
> Attachments: CollDelete.java, coll_delete_problem.zip
>
>
> If Solrs are not "stable" (completely up and running etc) a collection-delete
> request might result in partly deleted collections. You might say that it is
> fair that you are not able to have a collection deleted if all of its shards
> are not actively running - even though I would like a mechanism that just
> deleted them when/if they ever come up again. But even though all shards
> claim to be actively running you can still end up with partly deleted
> collections - that is not acceptable IMHO. At least clusterstate should
> always reflect the state, so that you are able to detect that your
> collection-delete request was only partly carried out - which parts were
> successfully deleted and which were not (including information about
> data-folder-deletion)
> The text above sounds like an epic-sized task, with potentially numerous
> problems to fix, so in order not to make this ticket "open forever" I will
> point out a particular scenario where I see problems. When this problem is
> corrected we can close this ticket. Other tickets will have to deal with
> other collection-delete issues.
> Here is what I did and saw
> * Logged into one of my Linux machines with IP 192.168.78.239
> * Prepared for Solr install
> {code}
> mkdir -p /xXX/solr
> cd /xXX/solr
> {code}
> * downloaded solr-4.7.2.tgz
> * Installed Solr 4.7.2 and prepared for three "nodes"
> {code}
> tar zxvf solr-4.7.2.tgz
> cd solr-4.7.2/
> cp -r example node1
> cp -r example node2
> cp -r example node3
> {code}
> * Initialized Solr config into Solr
> {code}
> cd node1
> java -DzkRun -Dhost=192.168.78.239 -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -jar start.jar
> # Press CTRL-C to stop Solr (node1) again after it has started completely
> {code}
> * Started all three Solr nodes
> {code}
> nohup java -Djetty.port=8983 -Dhost=192.168.78.239 -DzkRun -jar start.jar &>> node1_stdouterr.log &
> cd ../node2
> nohup java -Djetty.port=8984 -Dhost=192.168.78.239 -DzkHost=localhost:9983 -jar start.jar &>> node2_stdouterr.log &
> cd ../node3
> nohup java -Djetty.port=8985 -Dhost=192.168.78.239 -DzkHost=localhost:9983 -jar start.jar &>> node3_stdouterr.log &
> {code}
> * Created a collection "mycoll"
> {code}
> curl 'http://192.168.78.239:8983/solr/admin/collections?action=CREATE&name=mycoll&numShards=6&replicationFactor=1&maxShardsPerNode=2&collection.configName=myconf'
> {code}
> * Collected "Cloud Graph" image, clusterstate.json and info about data
> folders (see attached coll_delete_problem.zip |
> after_create_all_solrs_still_running). You will see that everything is as it
> is supposed to be. Two shards per node, six all in all, it is all reflected
> in clusterstate and there is a data-folder for each shard
> * Stopped all three Solr nodes
> {code}
> kill $(ps -ef | grep 8985 | grep -v grep | awk '{print $2}')
> kill $(ps -ef | grep 8984 | grep -v grep | awk '{print $2}')
> kill $(ps -ef | grep 8983 | grep -v grep | awk '{print $2}')
> {code}
> * Started Solr node1 only (wait for it to start completely)
> {code}
> cd ../node1
> nohup java -Djetty.port=8983 -Dhost=192.168.78.239 -DzkRun -jar start.jar &>> node1_stdouterr.log &
> # Wait for it to start fully - might take a minute or so
> {code}
> * Collected "Cloud Graph" image, clusterstate.json and info about data
> folders (see attached coll_delete_problem.zip |
> after_create_solr1_restarted_solr2_and_3_not_started_yet). You will see that
> everything is as it is supposed to be. Two shards per node, six all in all,
> the four on node2 and node3 are down, it is all reflected in clusterstate and
> there is still a data-folder for each shard
> * Started CollDelete.java (see attached coll_delete_problem.zip) - will
> delete collection "mycoll" when all three Solrs are live and all shards are
> "active"
> * Started the remaining two Solr nodes
> {code}
> cd ../node2
> nohup java -Djetty.port=8984 -Dhost=192.168.78.239 -DzkHost=localhost:9983 -jar start.jar &>> node2_stdouterr.log &
> cd ../node3
> nohup java -Djetty.port=8985 -Dhost=192.168.78.239 -DzkHost=localhost:9983 -jar start.jar &>> node3_stdouterr.log &
> {code}
> * After CollDelete.java finished, collected "Cloud Graph" image,
> clusterstate.json, info about data folders and output from CollDelete.java
> (see attached coll_delete_problem.zip |
> after_create_all_solrs_restarted_delete_coll_while_solr2_and_3_was_starting_up).
> You will see that not everything is as it is supposed to be. All info about
> "mycoll" is deleted from clusterstate - ok. But data-folders remain for node2
> and node3 - not ok.
> ** CollDelete output
> {code}
> All 3 solrs live
> All (6) shards active. Now deleting
> {responseHeader={status=0,QTime=1823},failure={192.168.78.239:8985_solr=org.apache.solr.client.solrj.SolrServerException:Server
> refused connection at:
> http://192.168.78.239:8985/solr,192.168.78.239:8984_solr=org.apache.solr.client.solrj.SolrServerException:Server
> refused connection at:
> http://192.168.78.239:8984/solr,192.168.78.239:8984_solr=org.apache.solr.client.solrj.SolrServerException:Server
> refused connection at:
> http://192.168.78.239:8984/solr,192.168.78.239:8985_solr=org.apache.solr.client.solrj.SolrServerException:Server
> refused connection at:
> http://192.168.78.239:8985/solr},success={192.168.78.239:8983_solr={responseHeader={status=0,QTime=305}},192.168.78.239:8983_solr={responseHeader={status=0,QTime=260}}}}
> {code}
> * Please note, that consecutive attempts to collection-delete "mycoll" will
> fail, because Solr claims that "mycoll" does not exist.
> * I stopped the Solrs again
> * Collected stdouterr files (see attached coll_delete_problem.zip)
> In this scenario you see that because you send the delete-collection request
> while some Solrs have not completely started yet, you will end up in a
> situation where it seems like the collection has been deleted, but
> data-folders are still left on disk taking up disk-space. The most
> significant thing is that this happens even though the client (sending the
> delete-request) waits until all Solrs are live and all shards of the
> collection to be deleted claim to be active. What more can a careful client
> do?
> *In this particular case, where you specifically wait for solrs to be live
> and shards active, I think we should make sure that everything is deleted
> (including folders) correctly*
> I am not looking for a bullet-proof solution. I believe we can always come up
> with crazy scenarios where you end up with a half deleted collection. But
> this particular scenario should work, I believe.
> Please note, that I have seen other scenarios where only parts of the stuff
> in clusterstate is deleted (try removing the parts about waiting for active
> shards in CollDelete.java - so that you are only waiting for live Solrs),
> they just seem to be harder to reproduce consistently. But the fact that you
> can have such situations also, might help when designing the robust solution.
> Please also note, that I tested this on 4.7.2 because it is the latest java6
> enabled release and I only had java6 on my machine. One of my colleagues has
> tested 4.8.1 on a machine with java7 - no difference.