[
https://issues.apache.org/jira/browse/SOLR-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hoss Man updated SOLR-8914:
---------------------------
Attachment: live_node_mentions_port56361_with_threadIds.log.txt
Having read more of {{ZkStateReader}}, and understanding a bit better when/why
these various log messages are written, it now seems certain to me that
ZkStateReader's callbacks for watching {{live_nodes}} and updating that
information in the (local) {{ClusterState}} are *absolutely not thread safe*.
Key things to note about the ZkStateReader code related to live nodes and the
live-nodes log messages...
* The following are examples of log message formats that occur when a ZK
{{Watcher}}'s process method is triggered - in all of these cases the # of live
nodes logged is the _current_ number, according to local state, prior to
processing the Watcher event...
** LegacyClusterStateWatcher (old multi-collections state.json)
*** {{"A cluster state change: \[\{\}\] for collection \[\{\}\] has occurred -
updating... (live nodes size: \[\{\}\])"}}
*** _NOTE: this is the current (local) {{liveNodes.size()}}, w/ no attempt to
refresh {{liveNodes}} from ZK_
** StateWatcher (per collection state.json)
*** {{"A cluster state change: \[\{\}\] for collection \[\{\}\] has occurred -
updating... (live nodes size: \[\{\}\])"}}
*** _NOTE: this is the current (local) {{liveNodes.size()}}, w/ no attempt to
refresh {{liveNodes}} from ZK_
** LiveNodeWatcher
*** {{"A live node change: \[\{\}\], has occurred - updating... (live nodes
size: \[\{\}\])"}}
*** *NOTE: this is the "old" (local) value when a LiveNodeWatcher is triggered,
prior to refreshing from ZK*
* The following is what gets logged _after_ the local cache of the live nodes
data has been updated (either by an explicit method call to fetch the data from
ZK on startup, or because a LiveNodeWatcher was triggered by ZK)
** {{"Updated live nodes from ZooKeeper... (\{\}) -> (\{\})"}}
* The full code for how the local cache of {{liveNodes}} is refreshed from ZK
(either by an explicit method call on startup or because a LiveNodeWatcher was
triggered) is...{code}
try {
  // NOTE: the fetch from ZK happens *outside* of the update lock...
  List<String> nodeList = zkClient.getChildren(LIVE_NODES_ZKNODE, watcher, true);
  newLiveNodes = new HashSet<>(nodeList);
} catch (KeeperException.NoNodeException e) {
  newLiveNodes = emptySet();
}
Set<String> oldLiveNodes;
synchronized (getUpdateLock()) {
  // ...so by the time a thread acquires the lock here, another thread may
  // have already installed the result of a *newer* fetch, which this
  // (older) fetch will silently overwrite.
  oldLiveNodes = this.liveNodes;
  this.liveNodes = newLiveNodes;
  if (clusterState != null) {
    clusterState.setLiveNodes(newLiveNodes);
  }
}
LOG.info("Updated live nodes from ZooKeeper... ({}) -> ({})",
    oldLiveNodes.size(), newLiveNodes.size());
{code}
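The race is visible right in that snippet: the {{zkClient.getChildren(...)}}
call happens _outside_ the update lock, so nothing stops two callback threads
from fetching in one order and installing their results inside the lock in the
opposite order - last writer wins, even if its data is older. Here's a minimal,
self-contained sketch of that lost-update pattern (simulated fetch latencies
standing in for real ZK calls; class and variable names are made up purely for
illustration):
{code}
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LostUpdateDemo {
  private static final Object updateLock = new Object();
  private static volatile Set<String> liveNodes = Collections.emptySet();

  // Mimics the refresh logic above: the "fetch" happens outside the lock,
  // only the install is synchronized.
  static void refresh(Set<String> fetched, long fetchLatencyMs) throws InterruptedException {
    Thread.sleep(fetchLatencyMs); // simulated zkClient.getChildren() round trip
    synchronized (updateLock) {
      Set<String> old = liveNodes;
      liveNodes = fetched; // last writer wins, regardless of freshness
      System.out.printf("Updated live nodes... (%d) -> (%d)%n", old.size(), fetched.size());
    }
  }

  static Set<String> nodes(int n) {
    Set<String> s = new HashSet<>();
    for (int i = 1; i <= n; i++) s.add("node" + i);
    return s;
  }

  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(3); // like zkCallback-*-thread-*
    pool.submit(() -> { refresh(nodes(1), 200); return null; }); // T1: stale snapshot, slow fetch
    pool.submit(() -> { refresh(nodes(3), 150); return null; }); // T2: stale snapshot, slow fetch
    pool.submit(() -> { refresh(nodes(5), 50);  return null; }); // T3: fresh snapshot, fast fetch
    pool.shutdown();
    pool.awaitTermination(5, TimeUnit.SECONDS);
    // prints (0) -> (5), (5) -> (3), (3) -> (1): the stalest fetch "wins"
    System.out.println("final live nodes size: " + liveNodes.size());
  }
}
{code}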
With all of that info in mind, if we revisit the log file and focus on the
threadIds that log each message for a single port we know is problematic
(56361)...
{code}
grep -i "live nodes" jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt |
grep 127.0.0.1:56361_solr | perl -ple 's/.*2>
(\d+).*?\((.*?-\d+)(?:-processing-n:127.*?)?\).*o.a.s.c.c.ZkStateReader(.*)/$1
$2 $3/' > live_node_mentions_port56361_with_threadIds.log.txt
{code}
...it becomes really easy to spot the problem (the columns below are:
timestamp, thread name, log message)...
{noformat}
1290004 jetty-launcher-1118-thread-5 Updated live nodes from ZooKeeper... (0) -> (0)
1290033 zkCallback-1135-thread-1 A live node change: [WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/live_nodes], has occurred - updating... (live nodes size: [0])
1290047 zkCallback-1135-thread-2 A live node change: [WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/live_nodes], has occurred - updating... (live nodes size: [0])
1290052 zkCallback-1135-thread-3 A live node change: [WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/live_nodes], has occurred - updating... (live nodes size: [0])
1290070 zkCallback-1135-thread-3 Updated live nodes from ZooKeeper... (0) -> (4)
1290071 zkCallback-1135-thread-3 A live node change: [WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/live_nodes], has occurred - updating... (live nodes size: [4])
1290071 zkCallback-1135-thread-3 Updated live nodes from ZooKeeper... (4) -> (5)
1290075 zkCallback-1135-thread-1 Updated live nodes from ZooKeeper... (5) -> (1)
1290076 zkCallback-1135-thread-2 Updated live nodes from ZooKeeper... (1) -> (3)
{noformat}
* timestamp #1290004 is when jetty is first starting up, and forcibly asks ZK
for the list of live nodes.
* we then see 3 threads being spun up in numeric sequence order, to (start)
processing LiveNodeWatcher events
* thread-3 finishes and updates the live nodes to a list of size "4"
* thread-3 is then re-used to (start) processing another LiveNodeWatcher event
* thread-3 finishes again and updates the live nodes to a list of size "5"
* thread-1 now finally finishes, and overwrites the live node list with a list
of size "1" *even though its data is almost certainly old*
* thread-2 now finally finishes, and also overwrites the live node list with a
list of size "3" *even though this data is almost certainly also old*
Filling in the blanks, a valid sequence of events that fits the code and
matches the logs could have been (colors matching the T# thread ids)...
* {color:blue}T1 {{process()}} called, logs "A live node change: ..."{color}
* {color:blue}T1 {{zkClient.getChildren(...)}} returns a list of 1 live
node{color}
* {color:red}T2 {{process()}} called, logs "A live node change: ..."{color}
* {color:green}T3 {{process()}} called, logs "A live node change: ..."{color}
* {color:red}T2 {{zkClient.getChildren(...)}} returns a list of 3 live
nodes{color}
* {color:green}T3 {{zkClient.getChildren(...)}} returns a list of 4 live
nodes{color}
* {color:green}T3 {{clusterState.setLiveNodes("4")}} called, logs "Updated live
nodes..."{color}
* {color:green}T3 {{process()}} is now finished, thread is free for
re-use{color}
* {color:green}T3 {{process()}} called, logs "A live node change: ..."{color}
* {color:green}T3 {{zkClient.getChildren(...)}} returns a list of 5 live
nodes{color}
* {color:green}T3 {{clusterState.setLiveNodes("5")}} called, logs "Updated live
nodes..."{color}
* {color:green}T3 {{process()}} is now finished, thread is free for
re-use{color}
* {color:blue}T1 {{clusterState.setLiveNodes("1")}} called, logs "Updated live
nodes..."{color}
* {color:blue}T1 {{process()}} is now finished, thread is free for re-use{color}
* {color:red}T2 {{clusterState.setLiveNodes("3")}} called, logs "Updated live
nodes..."{color}
* {color:red}T2 {{process()}} is now finished, thread is free for re-use{color}
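One way to close that window (just a sketch of the general idea under assumed
names - not a patch for this issue, and written against the raw ZooKeeper API
rather than {{SolrZkClient}}) would be to record the {{cversion}} of
{{/live_nodes}} returned with each fetch, and refuse to install a result
that's older than what's already cached:
{code}
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Sketch only: track the child-version of /live_nodes so a slow/stale
// fetch can never overwrite a newer one. All names here are hypothetical.
class VersionedLiveNodes {
  private final Object updateLock = new Object();
  private Set<String> liveNodes = new HashSet<>();
  private int appliedCversion = -1; // cversion of the last fetch we installed

  void refreshLiveNodes(ZooKeeper zk, Watcher watcher)
      throws KeeperException, InterruptedException {
    Stat stat = new Stat();
    // getChildren fills 'stat' with the znode's Stat, including cversion,
    // which ZK increments every time the child list changes.
    List<String> children = zk.getChildren("/live_nodes", watcher, stat);
    synchronized (updateLock) {
      if (stat.getCversion() > appliedCversion) {
        appliedCversion = stat.getCversion();
        liveNodes = new HashSet<>(children);
      }
      // else: some other callback thread already installed a fetch that saw
      // a newer child list - silently drop this (stale) one.
    }
  }
}
{code}
With a guard like that, the late T1/T2 installs in the sequence above would
become no-ops, since T3's fetches would already have recorded higher cversions.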
_(NOTE: one thing I'm not clear on is why there are only 4 {{live_nodes}}
NodeChildrenChanged WatchedEvents triggered - I would expect 5 total, unless ZK
doesn't guarantee a unique event for each new child node. ZK watches are
documented as one-shot triggers, so 2 children added very close together in
time - between a watch firing and the {{getChildren}} call that re-sets it -
would presumably be collapsed into a single event.)_
After the sequence of events above, port #56361 is hosed. In a real-world
situation, it will never update its local state to match the _real_ list of
live nodes in ZK unless some other node goes up/down, triggering the
LiveNodeWatcher.
The rest of the "live node" log messages in our log file are either passively
reporting the local cached data during other ZK events (specifically:
collection creation triggering the state.json watcher) or they are the triggers
to the LiveNodeWatcher when MiniSolrCloudCluster starts shutting down nodes
after the test has failed.
> inexplicable "no servers hosting shard: shard2" using MiniSolrCloudCluster
> --------------------------------------------------------------------------
>
> Key: SOLR-8914
> URL: https://issues.apache.org/jira/browse/SOLR-8914
> Project: Solr
> Issue Type: Bug
> Reporter: Hoss Man
> Attachments: jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt,
> live_node_mentions_port56361_with_threadIds.log.txt,
> live_nodes_mentions.log.txt
>
>
> Jenkins encountered a failure in TestTolerantUpdateProcessorCloud over the
> weekend....
> {noformat}
> http://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Solaris/32/consoleText
> Checking out Revision c46d7686643e7503304cb35dfe546bce9c6684e7 (refs/remotes/origin/branch_6x)
> Using Java: 64bit/jdk1.8.0 -XX:+UseCompressedOops -XX:+UseG1GC
> {noformat}
> The failure happened during the static setup of the test, when a
> MiniSolrCloudCluster & several clients are initialized -- before any code
> related to TolerantUpdateProcessor is ever used.
> I can't reproduce this, or really make sense of what i'm (not) seeing here in
> the logs, so i'm filing this jira with my analysis in the hopes that someone
> else can help make sense of it.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]