[ https://issues.apache.org/jira/browse/SOLR-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200194#comment-15200194 ]

Hoss Man edited comment on SOLR-8862 at 3/17/16 10:01 PM:
----------------------------------------------------------

Ok, so here's what I've found so far...

* Just adding a single line of logging to my test after {{configureCluster}} 
and before {{cluster.createCollection}} was enough to make the seed start 
passing fairly reliably.
** so clearly a finicky timing problem
* {{MiniSolrCloudCluster}}'s constructor has logic that waits for {{/live_nodes}} to have {{numServers}} children before returning
** this was added in SOLR-7146 precisely because of problems like the one I'm seeing
** if {{/live_nodes}} doesn't have the expected number of children the first time it checks, then it sleeps in 1 second increments until it does (see the sketch after this list)
* {{/live_nodes}} gets populated by {{ZkController.createEphemeralLiveNode}}
** -*THIS METHOD IS SUSPICIOUSLY CALLED IN TWO DIFF PLACES:*-
**# EDIT: this is actually part of an {{OnReconnect}} handler that I misconstrued as something that would be called on the initial connect. -fairly early in the {{ZkController}} constructor-...{code}
// we have to register as live first to pick up docs in the buffer
createEphemeralLiveNode();
{code}
**# again as the very last thing in {{ZkController.init}}...{code}
// Do this last to signal we're up.
createEphemeralLiveNode();
{code}...this line + comment was added recently in SOLR-8696 when it replaced another previously existing call to {{createEphemeralLiveNode}} that was earlier in the init method (see 
https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=commitdiff;h=8ac4fdd;hp=7d32456efa4ade0130c3ed0ae677aa47b29355a9
 )
* Even if {{/live_nodes}} were only populated as the very last line in 
{{ZkController.init}}, that's far from the last thing that happens when a solr 
node starts up. Things that happen after {{ZkController}} is initialized but 
before {{CoreContainer.createAndLoad}} returns and the {{SolrDispatchFilter}} 
starts accepting requests:
** {{ZkContainer.initZooKeeper}}...
*** whatever the hell this is supposed to do...{code}
if (zkRun != null && zkServer.getServers().size() > 1 && confDir == null && boostrapConf == false) {
  // we are part of an ensemble and we are not uploading the config - pause to give the config time
  // to get up
  Thread.sleep(10000);
}
{code}
*** any node that has a confDir uploads it to zk: 
{{configManager.uploadConfigDir(configPath, confName);}} (even if it's not 
bootstrapping???)
*** any node that *IS* doing bootstrap does that: 
{{ZkController.bootstrapConf(zkController.getZkClient(), cc, solrHome);}}
** {{CoreContainer.load()}}...
*** Authentication plugins are initialized
*** core & collection & configset & container handlers are initialized
*** *THE {{CoreDescriptor}} FOR EACH CORE DIR ON DISK IS LOADED*
**** which of course means opening transaction logs, opening IndexWriters, opening searchers, firing newSearcher event listeners, etc...
*** {{ZkController.checkOverseerDesignate()}} is called (no idea what that does)
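
For reference, here's a rough sketch (not the actual {{MiniSolrCloudCluster}} source; names and error handling are approximate) of the kind of wait loop described above -- poll {{/live_nodes}} until it has {{numServers}} children, sleeping in 1 second increments:
{code}
import org.apache.solr.common.cloud.SolrZkClient;

public class LiveNodesWait {
  // Hypothetical sketch: block until /live_nodes has the expected number of children,
  // or give up after timeoutSeconds. Assumes /live_nodes itself already exists.
  public static void waitForAllNodes(SolrZkClient zkClient, int numServers, int timeoutSeconds)
      throws Exception {
    long deadline = System.nanoTime() + timeoutSeconds * 1_000_000_000L;
    while (true) {
      int live = zkClient.getChildren("/live_nodes", null, true).size();
      if (live == numServers) {
        return; // every node has registered its ephemeral /live_nodes entry
      }
      if (System.nanoTime() > deadline) {
        throw new IllegalStateException("only " + live + " of " + numServers + " nodes are live");
      }
      Thread.sleep(1000); // the real code also sleeps in 1 second increments
    }
  }
}
{code}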


Which all leads me to the following conclusions...

# when using {{MiniSolrCloudCluster}}, if you are lucky, there will be at least one node not yet in {{/live_nodes}} when it does its first check, and then it will sleep 1 second giving those nodes time to _actually_ start up & load their cores, and hopefully at least one of them will be completely finished by the time you actually try to use a {{CloudSolrClient}} pointed at that ZK {{/live_nodes}} data.
# unless there is some other "I'm alive" data in ZK that {{MiniSolrCloudCluster}} should be consulting, it seems like it's doing the best it can to ensure that all the nodes are live before returning to the caller
# *This does not seem like a problem that only affects tests.*  This seems like a real world problem we should address -- {{CloudSolrClient}} should be able to consult some info in ZK to know when a node is _really_ alive and ready for requests.
#* if there is a reason why the {{/live_nodes}} entry needs to be created as early as it is (i.e.: {{// we have to register as live first to pick up docs in the buffer}}) then it should only be created that one time and some other ephemeral node should be used
#* whatever ephemeral node is used should be created by a very explicit, very special method call made as the very last thing in {{SolrDispatchFilter}} (rough sketch below)



> /live_nodes is populated too early to be very useful for clients -- 
> CloudSolrClient (and MiniSolrCloudCluster.createCollection) need some other 
> ephemeral zk node to know which servers are "ready"
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-8862
>                 URL: https://issues.apache.org/jira/browse/SOLR-8862
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Hoss Man
>
> {{/live_nodes}} is populated surprisingly early (and multiple times) in the 
> life cycle of a solr node startup, and as a result probably shouldn't be used 
> by {{CloudSolrClient}} (or other "smart" clients) for deciding what servers 
> are fair game for requests.
> we should either fix {{/live_nodes}} to be created later in the lifecycle, or 
> add some new ZK node for this purpose.
> {panel:title=original bug report}
> I haven't been able to make sense of this yet, but what I'm seeing in a new 
> SolrCloudTestCase subclass I'm writing is that the code below, which 
> (reasonably) attempts to create a collection immediately after configuring 
> the MiniSolrCloudCluster, gets a "SolrServerException: No live SolrServers 
> available to handle this request" -- in spite of the fact that (as far as I 
> can tell at first glance) MiniSolrCloudCluster's constructor is supposed to 
> block until all the servers are live...
> {code}
>     configureCluster(numServers)
>       .addConfig(configName, configDir.toPath())
>       .configure();
>     Map<String, String> collectionProperties = ...;
>     assertNotNull(cluster.createCollection(COLLECTION_NAME, numShards, 
> repFactor,
>                                            configName, null, null, 
> collectionProperties));
> {code}
> {panel}


