[jira] [Commented] (SOLR-6133) More robust collection-delete

Per Steffensen (JIRA) Tue, 03 Jun 2014 23:58:12 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017439#comment-14017439
 ]


Per Steffensen commented on SOLR-6133:
--------------------------------------

In general zk=truth sounds like a great idea :-) But shouldn't zk=truth be 
implicit when either zkRun or zkHost is set?

I am not sure about the terminology, but I believe "unloaded" does not include 
deleting the data from disk? My main concern with the scenario I show is that 
data is not being deleted from disk. We would really like some way to make 
(fairly) sure that data is deleted when we fire a collection-delete request 
(and info disappears from ZK). We have enormous amounts of data and will run 
out of disk-space if we do not have our data-folders deleted.

I am also a little bit concerned about the "on startup" part of "will be 
unloaded on startup". In the scenario I show above, the shards that were 
deleted from zk but not from disk will pop up in zk again on restart of Solrs 
(because the folders still contain core.properties I believe), and then we get 
a second chance deleting them because we can re-detect that an unwanted 
collection (partly) exists. So if zk=truth means that data will not be deleted, 
and that the shards will not re-appear in zk after restart of Solrs, it is 
actually a step back wrt my main concern. But back to my concern with "on 
startup": Actually we very rarely restart Solrs (because they run fairly stable 
- that is a good thing), so I am concerned with a solution that only "cleans up 
or recovers" on restart.
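
A minimal sketch of the "re-detect" idea that would not need a restart, under 
stated assumptions: core directories sit directly under the Solr home, each 
core.properties carries a "collection" entry, and the set of collections still 
known to ZK has been read elsewhere. OrphanCoreScanner and findOrphans are 
made-up names for illustration, not Solr APIs.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.Set;
import java.util.stream.Stream;

/** Hypothetical helper: finds core folders whose collection is gone from ZK. */
final class OrphanCoreScanner {

    /**
     * Returns the direct sub-directories of solrHome that contain a
     * core.properties file naming a collection not in knownCollections.
     * Assumption: only one directory level is scanned.
     */
    static List<Path> findOrphans(Path solrHome, Set<String> knownCollections)
            throws IOException {
        List<Path> orphans = new ArrayList<>();
        try (Stream<Path> children = Files.list(solrHome)) {
            for (Path dir : (Iterable<Path>) children::iterator) {
                Path propsFile = dir.resolve("core.properties");
                if (!Files.isRegularFile(propsFile)) {
                    continue; // not a core directory
                }
                Properties props = new Properties();
                try (Reader in = Files.newBufferedReader(propsFile, StandardCharsets.UTF_8)) {
                    props.load(in);
                }
                // Assumed property name; cores without it are left alone.
                String collection = props.getProperty("collection");
                if (collection != null && !knownCollections.contains(collection)) {
                    orphans.add(dir);
                }
            }
        }
        return orphans;
    }
}
```

The sketch only reports candidates. Whether to delete the folders, and when, 
is a policy decision that is left open here.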

I am keen on improving collection-delete to do whatever it can to be "all or 
nothing". Would you consider adding to Solr, on the server side, the check from 
CollDelete.java that all nodes are live and all shards/replicas are active 
before deleting? That would be a step in the "all or nothing" direction, and it 
matters even more for non-SolrJ clients, which really cannot do the trick 
themselves on the client side (unless they do the zk-data juggling on the 
client side in some other way).
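
Roughly, the check could look like the sketch below. It is not the code from 
CollDelete.java. It assumes the collection state is available as nested maps 
shaped like clusterstate.json in 4.x ("shards" -> shard -> "replicas" -> 
replica -> "state"/"node_name"), and DeletePrecondition and safeToDelete are 
hypothetical names. Note that the scenario in the issue passed exactly this 
kind of check on the client side and still left data-folders behind, so it is 
a precondition, not a guarantee.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical "all or nothing" precondition for collection-delete. */
final class DeletePrecondition {

    /** Returns true only if every replica is "active" and hosted on a live node. */
    @SuppressWarnings("unchecked")
    static boolean safeToDelete(Map<String, Object> collectionState, Set<String> liveNodes) {
        Map<String, Object> shards = (Map<String, Object>) collectionState.get("shards");
        if (shards == null || shards.isEmpty()) {
            return false; // nothing known about the collection: refuse rather than guess
        }
        for (Object shard : shards.values()) {
            Map<String, Object> replicas =
                (Map<String, Object>) ((Map<String, Object>) shard).get("replicas");
            if (replicas == null || replicas.isEmpty()) {
                return false; // a shard without replicas cannot confirm its own deletion
            }
            for (Object replica : replicas.values()) {
                Map<String, Object> props = (Map<String, Object>) replica;
                boolean active = "active".equals(props.get("state"));
                boolean live = liveNodes.contains(props.get("node_name"));
                if (!active || !live) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

Refusing the delete when this returns false would at least turn a silent 
partial delete into an explicit error for clients that cannot inspect ZK 
themselves.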

> More robust collection-delete
> -----------------------------
>
>                 Key: SOLR-6133
>                 URL: https://issues.apache.org/jira/browse/SOLR-6133
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.7.2, 4.8.1
>            Reporter: Per Steffensen
>         Attachments: CollDelete.java, coll_delete_problem.zip
>
>
> If Solrs are not "stable" (completely up and running etc) a collection-delete 
> request might result in partly deleted collections. You might say that it is 
> fair that you are not able to have a collection deleted if all of its shards 
> are not actively running - even though I would like a mechanism that just 
> deleted them when/if they ever come up again. But even though all shards 
> claim to be actively running you can still end up with partly deleted 
> collections - that is not acceptable IMHO. At least clusterstate should 
> always reflect the state, so that you are able to detect that your 
> collection-delete request was only partly carried out - which parts were 
> successfully deleted and which were not (including information about 
> data-folder-deletion)
> The text above sounds like an epic-sized task, with potentially numerous 
> problems to fix, so in order not to make this ticket "open forever" I will 
> point out a particular scenario where I see problems. When this problem is 
> corrected we can close this ticket. Other tickets will have to deal with 
> other collection-delete issues.
> Here is what I did and saw
> * Logged into one of my Linux machines with IP 192.168.78.239
> * Prepared for Solr install
> {code}
> mkdir -p /xXX/solr
> cd /xXX/solr
> {code}
> * downloaded solr-4.7.2.tgz
> * Installed Solr 4.7.2 and prepared for three "nodes"
> {code}
> tar zxvf solr-4.7.2.tgz
> cd solr-4.7.2/
> cp -r example node1
> cp -r example node2
> cp -r example node3
> {code}
> * Initialized Solr config into Solr
> {code}
> cd node1
> java -DzkRun -Dhost=192.168.78.239 \
>   -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf \
>   -jar start.jar
> # Press CTRL-C to stop Solr (node1) again after it has started completely
> {code}
> * Started all three Solr nodes
> {code}
> nohup java -Djetty.port=8983 -Dhost=192.168.78.239 -DzkRun -jar start.jar \
>   &>> node1_stdouterr.log &
> cd ../node2
> nohup java -Djetty.port=8984 -Dhost=192.168.78.239 -DzkHost=localhost:9983 \
>   -jar start.jar &>> node2_stdouterr.log &
> cd ../node3
> nohup java -Djetty.port=8985 -Dhost=192.168.78.239 -DzkHost=localhost:9983 \
>   -jar start.jar &>> node3_stdouterr.log &
> {code}
> * Created a collection "mycoll"
> {code}
> curl 'http://192.168.78.239:8983/solr/admin/collections?action=CREATE&name=mycoll&numShards=6&replicationFactor=1&maxShardsPerNode=2&collection.configName=myconf'
> {code}
> * Collected "Cloud Graph" image, clusterstate.json and info about data 
> folders (see attached coll_delete_problem.zip | 
> after_create_all_solrs_still_running). You will see that everything is as it 
> is supposed to be. Two shards per node, six all in all, it is all reflected 
> in clusterstate and there is a data-folder for each shard
> * Stopped all three Solr nodes
> {code}
> kill $(ps -ef | grep 8985 | grep -v grep | awk '{print $2}')
> kill $(ps -ef | grep 8984 | grep -v grep | awk '{print $2}')
> kill $(ps -ef | grep 8983 | grep -v grep | awk '{print $2}')
> {code}
> * Started Solr node1 only (wait for it to start completely)
> {code}
> cd ../node1
> nohup java -Djetty.port=8983 -Dhost=192.168.78.239 -DzkRun -jar start.jar \
>   &>> node1_stdouterr.log &
> # Wait for it to start fully - might take a minute or so
> {code}
> * Collected "Cloud Graph" image, clusterstate.json and info about data 
> folders (see attached coll_delete_problem.zip | 
> after_create_solr1_restarted_solr2_and_3_not_started_yet). You will see that 
> everything is as it is supposed to be. Two shards per node, six all in all, 
> the four on node2 and node3 are down, it is all reflected in clusterstate and 
> there is still a data-folder for each shard
> * Started CollDelete.java (see attached coll_delete_problem.zip) - will 
> delete collection "mycoll" when all three Solrs are live and all shards are 
> "active"
> * Started the remaining two Solr nodes
> {code}
> cd ../node2
> nohup java -Djetty.port=8984 -Dhost=192.168.78.239 -DzkHost=localhost:9983 \
>   -jar start.jar &>> node2_stdouterr.log &
> cd ../node3
> nohup java -Djetty.port=8985 -Dhost=192.168.78.239 -DzkHost=localhost:9983 \
>   -jar start.jar &>> node3_stdouterr.log &
> {code}
> * After CollDelete.java finished, collected "Cloud Graph" image, 
> clusterstate.json, info about data folders and output from CollDelete.java 
> (see attached coll_delete_problem.zip | 
> after_create_all_solrs_restarted_delete_coll_while_solr2_and_3_was_starting_up).
> You will see that not everything is as it is supposed to be. All info about 
> "mycoll" is deleted from clusterstate - ok. But data-folders remain for node2 
> and node3 - not ok.
> ** CollDelete output
> {code}
> All 3 solrs live
> All (6) shards active. Now deleting
> {responseHeader={status=0,QTime=1823},failure={192.168.78.239:8985_solr=org.apache.solr.client.solrj.SolrServerException:Server
>  refused connection at: 
> http://192.168.78.239:8985/solr,192.168.78.239:8984_solr=org.apache.solr.client.solrj.SolrServerException:Server
>  refused connection at: 
> http://192.168.78.239:8984/solr,192.168.78.239:8984_solr=org.apache.solr.client.solrj.SolrServerException:Server
>  refused connection at: 
> http://192.168.78.239:8984/solr,192.168.78.239:8985_solr=org.apache.solr.client.solrj.SolrServerException:Server
>  refused connection at: 
> http://192.168.78.239:8985/solr},success={192.168.78.239:8983_solr={responseHeader={status=0,QTime=305}},192.168.78.239:8983_solr={responseHeader={status=0,QTime=260}}}}
> {code}
> * Please note, that consecutive attempts to collection-delete "mycoll" will 
> fail, because Solr claims that "mycoll" does not exist.
> * I stopped the Solrs again
> * Collected stdouterr files (see attached coll_delete_problem.zip)
> In this scenario you see that because you send the delete-collection request 
> while some Solrs have not completely started yet, you will end up in a 
> situation where it seems like the collection has been deleted, but 
> data-folders are still left on disk taking up disk-space. The most 
> significant thing is that this happens even though the client (sending the 
> delete-request) waits until all Solrs are live and all shards of the 
> collection to be deleted claim to be active. What more can a careful client 
> do?
> *In this particular case, where you specifically wait for solrs to be live 
> and shards active, I think we should make sure that everything is deleted 
> (including folders) correctly*
> I am not looking for a bullet-proof solution. I believe we can always come up 
> with crazy scenarios where you end up with a half deleted collection. But 
> this particular scenario should work, I believe.
> Please note, that I have seen other scenarios where only parts of the stuff 
> in clusterstate is deleted (try removing the parts about waiting for active 
> shards in CollDelete.java - so that you are only waiting for live Solrs), 
> they just seem to be harder to reproduce consistently. But the fact that you 
> can have such situations also, might help when designing the robust solution.
> Please also note, that I tested this on 4.7.2 because it is the latest java6 
> enabled release and I only had java6 on my machine. One of my colleagues has 
> tested 4.8.1 on a machine with java7 - no difference.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
