[
https://issues.apache.org/jira/browse/SOLR-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200194#comment-15200194
]
Hoss Man edited comment on SOLR-8862 at 3/17/16 10:01 PM:
--
Ok, so here's what i've found so far...
* Just adding a single line of logging to my test after {{configureCluster}}
and before {{cluster.createCollection}} was enough to make the seed start
passing fairly reliably.
** so clearly a finicky timing problem
* {{MiniSolrCloudCluster}}'s constructor has logic that waits for
{{/live_nodes}} to have {{numServer}} children before returning
** this was added in SOLR-7146 precisely because of problems like the one i'm
seeing
** if there aren't the expected number of {{/live_nodes}} the first time it
checks, then it sleeps in 1 second increments until there are.
* {{/live_nodes}} gets populated by {{ZkController.createEphemeralLiveNode}}
** -*THIS METHOD IS SUSPICIOUSLY CALLED IN TWO DIFF PLACES:*-
**# EDIT: this is actually part of an {{OnReconnect}} handler that I
misconstrued as something that would be called on the initial connect. -fairly
early in the {{ZkController}} constructor-...{code}
// we have to register as live first to pick up docs in the buffer
createEphemeralLiveNode();
{code}
**# again as the very last thing in {{ZkController.init}}...{code}
// Do this last to signal we're up.
createEphemeralLiveNode();
{code}...this line+comment was added recently in SOLR-8696 when it replaced
another previously existing call to {{createEphemeralLiveNode}} that was
earlier in the init method (see
https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=commitdiff;h=8ac4fdd;hp=7d32456efa4ade0130c3ed0ae677aa47b29355a9
)
* Even if {{/live_nodes}} were only populated as the very last line in
{{ZkController.init}}, that's far from the last thing that happens when a solr
node starts up. Things that happen after {{ZkController}} is initialized but
before {{CoreContainer.createAndLoad}} returns and the {{SolrDispatchFilter}}
starts accepting requests:
** {{ZkContainer.initZooKeeper}}...
*** whatever the hell this is supposed to do...{code}
if (zkRun != null && zkServer.getServers().size() > 1 && confDir == null
    && boostrapConf == false) {
  // we are part of an ensemble and we are not uploading the config - pause to
  // give the config time to get up
  Thread.sleep(1);
}
{code}
*** any node that has a confDir uploads it to zk:
{{configManager.uploadConfigDir(configPath, confName);}} (even if it's not
bootstrapping???)
*** any node that *IS* doing bootstrap does that:
{{ZkController.bootstrapConf(zkController.getZkClient(), cc, solrHome);}}
** {{CoreContainer.load()}}...
*** Authentication plugins are initialized
*** core & collection & configset & container handlers are initialized
*** *{{CoreDescriptor}} FOR EACH CORE DIR ON DISK ARE LOADED*
which of course means opening transaction logs, opening indexwriters, opening
searchers, newSearcher event listeners, etc...
*** {{ZkController.checkOverseerDesignate()}} is called (no idea what that does)
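
To make the timing gap concrete, here is a rough sketch of the kind of poll-and-sleep wait described above for {{MiniSolrCloudCluster}}. It is not the actual Solr code: {{LiveNodesWait}}, {{waitForChildren}} and the fake supplier are hypothetical names, and the supplier merely stands in for reading the children of {{/live_nodes}} from ZK. Note that the loop can only tell you how many nodes have registered, not whether any of them have finished the startup steps listed above.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.IntSupplier;

/** Hypothetical helper for illustration only; not part of Solr. */
final class LiveNodesWait {

  /**
   * Polls childCount until it reports at least `expected` children, sleeping
   * pollMillis between checks. Throws TimeoutException once timeoutMillis has
   * elapsed. Returns how many times it had to sleep.
   */
  static int waitForChildren(IntSupplier childCount, int expected,
                             long pollMillis, long timeoutMillis)
      throws InterruptedException, TimeoutException {
    long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    int sleeps = 0;
    while (childCount.getAsInt() < expected) {
      if (System.nanoTime() >= deadline) {
        throw new TimeoutException("fewer than " + expected + " children registered");
      }
      Thread.sleep(pollMillis);
      sleeps++;
    }
    return sleeps;
  }

  public static void main(String[] args) throws Exception {
    // Assumption: a fake counter that reports one more registered node on
    // every call, standing in for a real ZK getChildren() lookup.
    int[] registered = {0};
    IntSupplier fakeLiveNodes = () -> ++registered[0];
    int sleeps = waitForChildren(fakeLiveNodes, 3, 10, 5_000);
    System.out.println("slept " + sleeps + " time(s) before all nodes were listed");
  }
}
```

If every node is already listed on the first check, the method returns without sleeping at all, which is exactly the unlucky case: the caller proceeds immediately even though the nodes may still be loading cores.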
Which all leads me to the following conclusions...
# when using {{MiniSolrCloudCluster}}, if you are lucky, there will be at least
one node not yet in {{/live_nodes}} when it does its first check, and then it
will sleep 1 second giving those nodes time to _actually_ start up & load their
cores, and hopefully at least one of them will be completely finished by the
time you actually try to use a {{CloudSolrClient}} pointed at that ZK
{{/live_nodes}} data.
# unless there is some other "i'm alive" data in ZK that
{{MiniSolrCloudCluster}} should be consulting, it seems like it's doing the
best it can to ensure that all the nodes are live before returning to the caller
# *This does not seem like a problem that only affects tests.* This seems
like a real world problem we should address -- {{CloudSolrClient}} should be
able to consult some info in ZK to know when a node is _really_ alive and ready
for requests.
#* if there is a reason why the {{/live_nodes}} entry needs to be created as
early as it is (ie: {{// we have to register as live first to pick up docs in
the buffer}}) then it should only be created that one time and some other
ephemeral node should be used
#* whatever ephemeral node is used should be created by a very explicit very
special method call made as the very last thing in {{SolrDispatchFilter}}
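
A minimal in-memory sketch of the "some other ephemeral node" idea, purely to illustrate the intent. Everything here is an assumption: {{ClusterView}}, {{markReady}} and {{routableNodes}} are made-up names, the two sets only stand in for the children of {{/live_nodes}} and of a hypothetical second ephemeral parent, and nothing in it reflects an existing Solr or ZK API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentSkipListSet;

/** Hypothetical stand-in for two ZK parent nodes; not Solr's API. */
final class ClusterView {
  // Stand-in for the children of /live_nodes (registered early in startup).
  private final Set<String> liveNodes = new ConcurrentSkipListSet<>();
  // Stand-in for a second, hypothetical ephemeral parent written as the
  // very last step of startup.
  private final Set<String> readyNodes = new ConcurrentSkipListSet<>();

  void registerLive(String node) {
    liveNodes.add(node);
  }

  /** Marks a node as ready for requests; it must already be live. */
  void markReady(String node) {
    if (!liveNodes.contains(node)) {
      throw new IllegalStateException(node + " is not registered as live");
    }
    readyNodes.add(node);
  }

  /** Both entries are ephemeral, so both vanish with the session. */
  void sessionExpired(String node) {
    liveNodes.remove(node);
    readyNodes.remove(node);
  }

  /** Nodes a client could route requests to: live AND ready, sorted. */
  List<String> routableNodes() {
    Set<String> result = new TreeSet<>(liveNodes);
    result.retainAll(readyNodes);
    return new ArrayList<>(result);
  }

  public static void main(String[] args) {
    ClusterView view = new ClusterView();
    view.registerLive("node1");
    view.registerLive("node2");
    view.markReady("node1"); // node2 is live but still loading cores
    System.out.println("routable: " + view.routableNodes());
  }
}
```

With something like this, a client (or a test harness waiting for the cluster) would look at the intersection instead of {{/live_nodes}} alone, so a node that is registered but still loading cores would not be sent requests.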