[
https://issues.apache.org/jira/browse/SOLR-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200194#comment-15200194
]
Hoss Man edited comment on SOLR-8862 at 3/17/16 10:01 PM:
----------------------------------------------------------
Ok, so here's what i've found so far...
* Just adding a single line of logging to my test after {{configureCluster}}
and before {{cluster.createCollection}} was enough to make the seed start
passing fairly reliably.
** so clearly a finicky timing problem
* {{MiniSolrCloudCluster}}'s constructor has logic that waits for
{{/live_nodes}} to have {{numServer}} children before returning
** this was added in SOLR-7146 precisely because of problems like the one i'm
seeing
** if there aren't the expected number of {{/live_nodes}} the first time it
checks, then it sleeps in 1 second increments until there are.
* {{/live_nodes}} gets populated by {{ZkController.createEphemeralLiveNode}}
** -*THIS METHOD IS SUSPICIOUSLY CALLED IN TWO DIFF PLACES:*-
**# EDIT: this is actually part of an {{OnReconnect}} handler that I
misconstrued as something that would be called on the initial connect. -fairly
early in the {{ZkController}} constructor-...{code}
// we have to register as live first to pick up docs in the buffer
createEphemeralLiveNode();
{code}
**# again as the very last thing in {{ZkController.init}}...{code}
// Do this last to signal we're up.
createEphemeralLiveNode();
{code}...this line+comment was added recently in SOLR-8696 when it replaced
another previously existing call to {{createEphemeralLiveNode}} that was
earlier in the init method (see
https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=commitdiff;h=8ac4fdd;hp=7d32456efa4ade0130c3ed0ae677aa47b29355a9
)
* Even if {{/live_nodes}} were only populated as the very last line in
{{ZkController.init}}, that's far from the last thing that happens when a solr
node starts up. Things that happen after {{ZkController}} is initialized but
before {{CoreContainer.createAndLoad}} returns and the {{SolrDispatchFilter}}
starts accepting requests:
** {{ZkContainer.initZooKeeper}}...
*** whatever the hell this is supposed to do...{code}
if (zkRun != null && zkServer.getServers().size() > 1 && confDir == null && boostrapConf == false) {
  // we are part of an ensemble and we are not uploading the config - pause to give the config time
  // to get up
  Thread.sleep(10000);
}
{code}
*** any node that has a confDir uploads it to zk:
{{configManager.uploadConfigDir(configPath, confName);}} (even if it's not
bootstrapping???)
*** any node that *IS* doing bootstrap does that:
{{ZkController.bootstrapConf(zkController.getZkClient(), cc, solrHome);}}
** {{CoreContainer.load()}}...
*** Authentication plugins are initialized
*** core & collection & configset & container handlers are initialized
*** *{{CoreDescriptor}} FOR EACH CORE DIR ON DISK ARE LOADED*
**** which of course means opening transaction logs, opening indexwriters, opening
searchers, newSearcher event listeners, etc...
*** {{ZkController.checkOverseerDesignate()}} is called (no idea what that does)
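To make the timing window concrete, here is a minimal sketch of the kind of wait loop described above: poll the child count and sleep between checks until the expected number is present. It does not talk to ZooKeeper and is not the actual {{MiniSolrCloudCluster}} code; the class name, method name, supplier, and timeout are all hypothetical stand-ins. The point it illustrates is that the loop returns as soon as the names are registered, regardless of whether the nodes behind them have finished loading cores.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical stand-in for the wait loop; NOT the real MiniSolrCloudCluster code. */
final class LiveNodesWaitSketch {

  /**
   * Polls until the supplier reports at least expectedNodes children,
   * sleeping pollMillis between checks. Returns the number of polls performed.
   */
  static int waitForLiveNodes(Supplier<List<String>> liveNodes, int expectedNodes,
                              long pollMillis, long timeoutMillis)
      throws InterruptedException, TimeoutException {
    long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    int polls = 0;
    while (true) {
      polls++;
      int seen = liveNodes.get().size();
      if (seen >= expectedNodes) {
        return polls; // names are registered; says nothing about readiness
      }
      if (System.nanoTime() >= deadline) {
        throw new TimeoutException("only " + seen + " of " + expectedNodes + " live nodes");
      }
      Thread.sleep(pollMillis);
    }
  }

  public static void main(String[] args) throws Exception {
    // In-memory stand-in for the children of /live_nodes (assumption, no ZK involved).
    List<String> liveNodes = new CopyOnWriteArrayList<>();
    // Both nodes register before any cores have been loaded.
    liveNodes.add("127.0.0.1:8983_solr");
    liveNodes.add("127.0.0.1:8984_solr");
    int polls = waitForLiveNodes(() -> liveNodes, 2, 10, 1000);
    System.out.println("returned after " + polls + " poll(s)");
  }
}
```

When every node is already registered on the first check, the loop never sleeps at all, which matches the observation that an unrelated delay (such as one extra log line) is enough to change the outcome.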
Which all leads me to the following conclusions...
# when using {{MiniSolrCloudCluster}}, if you are lucky, there will be at least
one node not yet in {{/live_nodes}} when it does its first check, and then it
will sleep 1 second giving those nodes time to _actually_ start up & load their
cores, and hopefully at least one of them will be completely finished by the
time you actually try to use a {{CloudSolrClient}} pointed at that ZK
{{/live_nodes}} data.
# unless there is some other "i'm alive" data in ZK that
{{MiniSolrCloudCluster}} should be consulting, it seems like it's doing the
best it can to ensure that all the nodes are live before returning to the caller
# *This does not seem like a problem that only affects tests.* This seems
like a real world problem we should address -- {{CloudSolrClient}} should be
able to consult some info in ZK to know when a node is _really_ alive and ready
for requests.
#* if there is a reason why the {{/live_nodes}} entry needs to be created as
early as it is (ie: {{// we have to register as live first to pick up docs in
the buffer}}) then it should only be created that one time and some other
ephemeral node should be used
#* whatever ephemeral node is used should be created by a very explicit very
special method call made as the very last thing in {{SolrDispatchFilter}}
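A rough sketch of the proposal in the last conclusion, assuming a second ephemeral marker that is separate from {{/live_nodes}}. Everything here is hypothetical: the class, the method names, and the in-memory sets that stand in for ZK children. No such "ready" node exists in Solr as described in this comment. The sketch only shows the selection rule a client could apply: route requests to nodes that are both live and explicitly marked ready.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration of a separate "ready" marker; not an existing Solr API. */
final class ReadyNodesSketch {
  // In-memory stand-ins for the children of /live_nodes and of an
  // assumed second ephemeral parent node. Not thread-safe; illustration only.
  private final Set<String> liveNodes = new TreeSet<>();
  private final Set<String> readyNodes = new TreeSet<>();

  /** Early registration, as happens during ZkController init. */
  void registerLive(String nodeName) {
    liveNodes.add(nodeName);
  }

  /** Would be the very last startup step, once the node can accept requests. */
  void markReady(String nodeName) {
    if (!liveNodes.contains(nodeName)) {
      throw new IllegalStateException(nodeName + " is not registered as live");
    }
    readyNodes.add(nodeName);
  }

  /** Nodes a client should consider: live AND ready. */
  List<String> candidatesForRequests() {
    List<String> candidates = new ArrayList<>(liveNodes);
    candidates.retainAll(readyNodes);
    return candidates;
  }

  public static void main(String[] args) {
    ReadyNodesSketch zk = new ReadyNodesSketch();
    zk.registerLive("127.0.0.1:8983_solr");
    zk.registerLive("127.0.0.1:8984_solr");
    // Both are "live" but neither has finished loading cores yet.
    System.out.println(zk.candidatesForRequests()); // prints []
    zk.markReady("127.0.0.1:8983_solr");
    System.out.println(zk.candidatesForRequests()); // prints [127.0.0.1:8983_solr]
  }
}
```

Under this rule, a wait loop that counted ready nodes instead of live nodes would not return until every server could actually handle the {{createCollection}} request.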
> /live_nodes is populated too early to be very useful for clients --
> CloudSolrClient (and MiniSolrCloudCluster.createCollection) need some other
> ephemeral zk node to know which servers are "ready"
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-8862
> URL: https://issues.apache.org/jira/browse/SOLR-8862
> Project: Solr
> Issue Type: Bug
> Reporter: Hoss Man
>
> {{/live_nodes}} is populated surprisingly early (and multiple times) in the
> life cycle of a solr node startup, and as a result probably shouldn't be used
> by {{CloudSolrClient}} (or other "smart" clients) for deciding what servers
> are fair game for requests.
> we should either fix {{/live_nodes}} to be created later in the lifecycle, or
> add some new ZK node for this purpose.
> {panel:title=original bug report}
> I haven't been able to make sense of this yet, but what i'm seeing in a new
> SolrCloudTestCase subclass i'm writing is that the code below, which
> (reasonably) attempts to create a collection immediately after configuring
> the MiniSolrCloudCluster gets a "SolrServerException: No live SolrServers
> available to handle this request" -- in spite of the fact that (as far as i
> can tell at first glance) MiniSolrCloudCluster's constructor is supposed to
> block until all the servers are live..
> {code}
> configureCluster(numServers)
> .addConfig(configName, configDir.toPath())
> .configure();
> Map<String, String> collectionProperties = ...;
> assertNotNull(cluster.createCollection(COLLECTION_NAME, numShards,
> repFactor,
> configName, null, null,
> collectionProperties));
> {code}
> {panel}