Re: Reestablishing a Solr node that ran on a completely crashed machine
On 6/18/13 2:15 PM, Mark Miller wrote:

I don't know what the best method to use now is, but the slightly longer-term plan is to:
* Have a new mode where you cannot preconfigure cores; you can only use the Collections API.
* ZK becomes the cluster-state truth.
* The Overseer takes actions to ensure cores live/die in different places based on the truth in ZK.
- Mark

Not that we have to decide on this now, but I guess in my scenario I do not see why the Overseer should be involved. The replica is already assigned to run on the replaced machine with a specific IP/hostname (actually a specific Solr node name), so I guess the Solr node itself on this new/replaced machine should just look in ZK when it starts up, realize that it ought to run this and that replica, and start loading them itself. I recognize that the Overseer should/could be involved in relocating replicas for other reasons - load balancing, rack awareness, etc. But in cases where a replica is already assigned to a certain node name according to ZK state, yet the node is not preconfigured (in solr.xml) to run this replica, the node itself should just realize that it ought to run it anyway and load it. But it probably has to be thought through well. Just my immediate thoughts.

Regards, Per Steffensen
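A minimal sketch of the startup check Per is proposing here - this is not how Solr 4.x actually behaves, only an illustration of the idea. It assumes the kazoo ZooKeeper client, the Solr 4.x single-file clusterstate.json layout, and a hypothetical node name and ZK address; all of these are assumptions, not things from the thread.

# Sketch of the proposed startup behavior (not actual Solr code):
# on boot, find the replicas in ZK assigned to this node's name.
import json
from kazoo.client import KazooClient

MY_NODE_NAME = "replaced-host:8983_solr"  # hypothetical node name

zk = KazooClient(hosts="zk1:2181")  # hypothetical ZK address
zk.start()
state, _ = zk.get("/clusterstate.json")
zk.stop()

assigned = []  # (collection, shard, core name) triples this node ought to load
for collection, coll in json.loads(state).items():
    for shard, shard_state in coll["shards"].items():
        for props in shard_state["replicas"].values():
            if props.get("node_name") == MY_NODE_NAME:
                assigned.append((collection, shard, props["core"]))

# A node implementing Per's idea would now load each core in `assigned`
# itself, instead of relying on a preconfigured core list in solr.xml.
print(assigned)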
Re: Reestablishing a Solr node that ran on a completely crashed machine
On Jun 19, 2013, at 2:20 AM, Per Steffensen st...@designware.dk wrote:
[...]

Specific node names have since been essentially deprecated - auto-assigned generic node names are what we have transitioned to. You should easily be able to host a shard on a machine that has a different address without confusion. By and large, the Overseer will be able to assume responsibility for assignments at a high level (though I'm sure how much it does will be configurable). It will be able to do things like look at maxShardsPerNode and replicationFactor and periodically follow rules to make adjustments. The Overseer being in charge is more a conceptual idea, though, not the implementation. When a core starts up, checks with ZK, and sees that the collection it belongs to no longer exists (or something similar), it is likely to just not load, rather than wait for the Overseer to spot it and remove it later.

- Mark
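As an illustration of the kind of rule Mark mentions, a hypothetical check - an editorial sketch, not Overseer code - that flags shards whose live replica count has fallen below the collection's replicationFactor. That replicationFactor appears in clusterstate.json at all is itself an assumption that depends on the Solr version.

# Hypothetical Overseer-style rule check (editorial sketch).
def shards_needing_adjustment(clusterstate, live_nodes):
    """clusterstate: parsed clusterstate.json; live_nodes: set of node names
    (e.g. read from the /live_nodes znode)."""
    actions = []
    for collection, coll in clusterstate.items():
        # Assumption: replicationFactor is stored on the collection state;
        # in some Solr versions it lives elsewhere or is absent.
        rf = int(coll.get("replicationFactor", 1))
        for shard, shard_state in coll["shards"].items():
            live = [r for r in shard_state["replicas"].values()
                    if r.get("node_name") in live_nodes]
            if len(live) < rf:
                actions.append((collection, shard, rf - len(live)))
    return actions  # each entry: (collection, shard, replicas to add)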
Reestablishing a Solr node that ran on a completely crashed machine
Hi

Scenario:
* 1) You have a SolrCloud cluster running - several Solr nodes across several machines - with many collections, many replicas, and documents indexed into them.
* 2) One of the machines running a Solr node completely crashes - totally gone, including the local disk with the Solr node's data/config etc.
* 3) You want to be able to insert a new, empty machine, install/configure Solr on this new machine, give it the same IP and hostname as the crashed machine had, and then start this new Solr node and have it take the place of the crashed Solr node, making the SolrCloud cluster work again.
* 4) No replication (only one replica per shard), so we accept that the data on the crashed machine is gone forever, but of course we want the SolrCloud cluster to continue running with the documents indexed on the other Solr nodes.

At my company we are establishing a procedure for what to do in 3) above. Basically we use our install script to install/configure the new Solr node on the new machine exactly as it was originally installed/configured on the crashed machine when the system was first set up - this includes an empty solr.xml file (no cores mentioned). Then we start all the Solr nodes (including the newly reestablished one) again. They all start successfully, but the SolrCloud cluster does not work - at least not when doing distributed searches touching replicas that used to run on the crashed Solr node, because those replicas are not loaded on the reestablished node. How do we make sure that a reestablished Solr node, on a machine with the same IP and hostname as the machine that crashed, will load all the replicas that the old Solr node used to run?

Potential solutions:
* We have tried making sure that solr.xml on the reestablished Solr node contains the same core list as on the crashed one. Then everything works as we want. But this is a little fragile, and it is a solution outside Solr - you need to figure out how to reestablish solr.xml yourself, probably by looking into clusterstate.json and generating solr.xml from that.
* Untested by us: maybe we would also succeed by just running Core API LOAD operations against the newly reestablished Solr node - one LOAD operation for each replica that used to run on it (a sketch follows below). But this is also a little fragile, and it is also (partly) a solution outside Solr - you need to figure out which cores to load yourself.

I have to say that we do not use the latest Solr version - we use a version of Solr based on 4.0.0. So there might be a solution in Solr already, but I would be surprised.

Any thoughts about how this ought to be done? Support in Solr? E.g. an operation to tell a Solr node to load all the replicas that used to run on a machine with the same IP and hostname? Or...?

Regards, Per Steffensen
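For the second, untested option, a sketch of what those per-replica Core Admin calls could look like - an editorial illustration. The base URL and core list are hypothetical, and whether the right action is LOAD or CREATE varies by Solr version, so verify the action and parameter names before relying on this.

import requests

node_base_url = "http://replaced-host:8983/solr"  # hypothetical node address
cores = [  # (core name, collection, shard), recovered from clusterstate.json
    ("collection1_shard1_replica1", "collection1", "shard1"),
]

for core, collection, shard in cores:
    # The thread mentions a LOAD operation; on other 4.x versions CREATE is
    # what registers a core against an existing collection/shard.
    resp = requests.get(node_base_url + "/admin/cores", params={
        "action": "CREATE",
        "name": core,
        "collection": collection,
        "shard": shard,
    })
    resp.raise_for_status()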
Re: Reestablishing a Solr node that ran on a completely crashed machine
I don't know what the best method to use now is, but the slightly longer-term plan is to:
* Have a new mode where you cannot preconfigure cores; you can only use the Collections API.
* ZK becomes the cluster-state truth.
* The Overseer takes actions to ensure cores live/die in different places based on the truth in ZK.

- Mark

On Jun 18, 2013, at 6:03 AM, Per Steffensen st...@designware.dk wrote:
[...]
Re: Reestablishing a Solr node that ran on a completely crashed machine
Ok, thanks. I think we will just reconstruct solr.xml (from clusterstate.json) ourselves for now.

On 6/18/13 2:15 PM, Mark Miller wrote:
[...]

Regards, Per Steffensen
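Since Per settles on regenerating solr.xml from clusterstate.json, here is one way that reconstruction could look - an editorial sketch assuming kazoo, the 4.x clusterstate.json layout, the legacy pre-5.0 <cores>-style solr.xml, and a hypothetical node name; check the attribute names against the actual 4.0-era solr.xml schema.

import json
from kazoo.client import KazooClient

NODE_NAME = "replaced-host:8983_solr"  # hypothetical node name

zk = KazooClient(hosts="zk1:2181")  # hypothetical ZK address
zk.start()
clusterstate = json.loads(zk.get("/clusterstate.json")[0])
zk.stop()

# Collect one <core> entry per replica that ZK assigns to this node.
entries = []
for collection, coll in clusterstate.items():
    for shard, shard_state in coll["shards"].items():
        for props in shard_state["replicas"].values():
            if props.get("node_name") == NODE_NAME:
                entries.append(
                    '    <core name="%s" collection="%s" shard="%s" instanceDir="%s"/>'
                    % (props["core"], collection, shard, props["core"]))

# Legacy (pre-5.0) solr.xml layout; verify attributes for your version.
solr_xml = ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<solr persistent="true">\n'
            '  <cores adminPath="/admin/cores">\n'
            + "\n".join(entries) +
            '\n  </cores>\n</solr>\n')
print(solr_xml)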
Re: Reestablishing a Solr node that ran on a completely crashed machine
Hi,

Re "ZK becomes the cluster state truth" - I thought that was already the case, no? Who/what else holds (which) bits of the total truth?

Thanks, Otis

On Tue, Jun 18, 2013 at 8:15 AM, Mark Miller markrmil...@gmail.com wrote:
[...]
Re: Reestablishing a Solr node that ran on a completely crashed machine
With preconfigurable cores, each node with cores also holds some truth. You might have a core registered in ZK that doesn't exist on a node. You might have a core that is not registered in ZK but does exist on a node. A core that comes up might be a really old node coming back, or it might be a user that preconfigured a new core. Without preconfigurable cores, the Overseer can adjust for these things and make ZK the truth by fiat.

- Mark

On Jun 18, 2013, at 8:50 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:
[...]
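To make Mark's two divergence cases concrete, a small editorial sketch that compares the cores ZK registers for a node against the core directories actually on that node's disk. The on-disk layout (one subdirectory per core under a core root) is an assumption, not something stated in the thread.

import os

def truth_divergence(clusterstate, node_name, core_root):
    """Return (registered in ZK but missing on disk, on disk but not in ZK)."""
    in_zk = set()
    for coll in clusterstate.values():
        for shard_state in coll["shards"].values():
            for props in shard_state["replicas"].values():
                if props.get("node_name") == node_name:
                    in_zk.add(props["core"])
    on_disk = {d for d in os.listdir(core_root)
               if os.path.isdir(os.path.join(core_root, d))}
    return in_zk - on_disk, on_disk - in_zk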
Re: Reestablishing a Solr node that ran on a completely crashed machine
I see. Thanks for the explanation. But yeah, ZK should be the one and only brain there, I think. And forget Fiat, go for Mercedes.

Otis

On Tue, Jun 18, 2013 at 10:24 AM, Mark Miller markrmil...@gmail.com wrote:
[...]