[ https://issues.apache.org/jira/browse/SOLR-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-8914:
---------------------------
    Attachment: live_node_mentions_port56361_with_threadIds.log.txt


Having read more of {{ZkStateReader}}, and understanding a bit better when/why these various log messages are written, it now seems certain to me that ZkStateReader's callbacks for watching {{live_nodes}} and updating that information in the (local) {{ClusterState}} are *absolutely not thread safe*.

Key things to note about the ZkStateReader code related to live nodes and the live nodes log messages...

* The following are examples of log message formats that occur when a ZK {{Watcher}}'s process method is triggered - in all of these cases the # of live nodes logged is the _current_ number, according to local state, prior to processing the Watcher event...
** LegacyClusterStateWatcher (old shared clusterstate.json covering all collections)
*** {{"A cluster state change: \[\{\}\] for collection \[\{\}\] has occurred - updating... (live nodes size: \[\{\}\])"}}
*** _NOTE: this is the current (local) {{liveNodes.size()}}, w/ no attempt to refresh {{liveNodes}} from ZK_
** StateWatcher (per-collection state.json)
*** {{"A cluster state change: \[\{\}\] for collection \[\{\}\] has occurred - updating... (live nodes size: \[\{\}\])"}}
*** _NOTE: this is the current (local) {{liveNodes.size()}}, w/ no attempt to refresh {{liveNodes}} from ZK_
** LiveNodeWatcher
*** {{"A live node change: \[\{\}\], has occurred - updating... (live nodes size: \[\{\}\])"}}
*** *NOTE: this is the "old" value when a LiveNodeWatcher is triggered, prior to refreshing from ZK*
* The following is what gets logged _after_ the local cache of the live nodes data has been updated (either by an explicit method call to fetch the data from ZK on startup, or because a LiveNodeWatcher was triggered by ZK)
** {{"Updated live nodes from ZooKeeper... (\{\}) -> (\{\})"}}
* The full code for how the local cache of {{liveNodes}} is refreshed from ZK (either by an explicit method call on startup or because a LiveNodeWatcher was triggered) is...{code}
try {
  // NOTE: this fetch from ZK happens *outside* the update lock...
  List<String> nodeList = zkClient.getChildren(LIVE_NODES_ZKNODE, watcher, true);
  newLiveNodes = new HashSet<>(nodeList);
} catch (KeeperException.NoNodeException e) {
  newLiveNodes = emptySet();
}
Set<String> oldLiveNodes;
synchronized (getUpdateLock()) {
  // ...so by the time a thread gets here, its newLiveNodes snapshot may
  // already be older than what another thread has installed.
  oldLiveNodes = this.liveNodes;
  this.liveNodes = newLiveNodes;
  if (clusterState != null) {
    clusterState.setLiveNodes(newLiveNodes);
  }
}
LOG.info("Updated live nodes from ZooKeeper... ({}) -> ({})", oldLiveNodes.size(), newLiveNodes.size());
{code}
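
Note that in the code above, the {{zkClient.getChildren(...)}} call happens _outside_ the {{getUpdateLock()}} synchronized block - so two threads can fetch the children in one order and install them into {{this.liveNodes}} in the opposite order, and the last (stalest) writer wins. Purely to illustrate one way such a stale write could be detected (this is *not* a proposed patch - the class and {{lastCversion}} names are made up, and it uses the raw ZooKeeper client's {{getChildren(path, watcher, stat)}} overload rather than {{SolrZkClient}}): the parent znode's {{cversion}} increments on every child add/remove, so a thread holding an older snapshot can notice and drop it under the lock...

{code}
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Hypothetical sketch, NOT ZkStateReader: detect stale /live_nodes reads
// by remembering the cversion of the parent znode at fetch time.
class LiveNodesRefresher {
  private final ZooKeeper zk;
  private final Object updateLock = new Object();
  private volatile Set<String> liveNodes = new HashSet<>();
  private int lastCversion = -1; // guarded by updateLock

  LiveNodesRefresher(ZooKeeper zk) {
    this.zk = zk;
  }

  void refreshLiveNodes(Watcher watcher) throws KeeperException, InterruptedException {
    Set<String> newLiveNodes;
    Stat stat = new Stat(); // filled in by getChildren with /live_nodes' own Stat
    try {
      List<String> nodeList = zk.getChildren("/live_nodes", watcher, stat);
      newLiveNodes = new HashSet<>(nodeList);
    } catch (KeeperException.NoNodeException e) {
      newLiveNodes = new HashSet<>(); // simplified: /live_nodes doesn't exist (yet)
    }
    synchronized (updateLock) {
      // cversion increments on every child create/delete, so a thread that
      // fetched an older snapshot carries an older cversion -- drop it
      // instead of letting it clobber a newer update.
      if (stat.getCversion() <= lastCversion) {
        return;
      }
      lastCversion = stat.getCversion();
      liveNodes = newLiveNodes;
    }
  }
}
{code}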


With all of that info in mind, if we revisit the log file and focus on the threadIds that log each message for a single port we know is problematic (56361)...

{code}
grep -i "live nodes" jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt \
  | grep 127.0.0.1:56361_solr \
  | perl -ple 's/.*2> (\d+).*?\((.*?-\d+)(?:-processing-n:127.*?)?\).*o.a.s.c.c.ZkStateReader(.*)/$1 $2 $3/' \
  > live_node_mentions_port56361_with_threadIds.log.txt
{code}

...it becomes really easy to spot the problem...

{noformat}
1290004 jetty-launcher-1118-thread-5  Updated live nodes from ZooKeeper... (0) -> (0)
1290033 zkCallback-1135-thread-1  A live node change: [WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/live_nodes], has occurred - updating... (live nodes size: [0])
1290047 zkCallback-1135-thread-2  A live node change: [WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/live_nodes], has occurred - updating... (live nodes size: [0])
1290052 zkCallback-1135-thread-3  A live node change: [WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/live_nodes], has occurred - updating... (live nodes size: [0])
1290070 zkCallback-1135-thread-3  Updated live nodes from ZooKeeper... (0) -> (4)
1290071 zkCallback-1135-thread-3  A live node change: [WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/live_nodes], has occurred - updating... (live nodes size: [4])
1290071 zkCallback-1135-thread-3  Updated live nodes from ZooKeeper... (4) -> (5)
1290075 zkCallback-1135-thread-1  Updated live nodes from ZooKeeper... (5) -> (1)
1290076 zkCallback-1135-thread-2  Updated live nodes from ZooKeeper... (1) -> (3)
{noformat}

* timestamp #1290004 is when Jetty is first starting up, and forcibly asks ZK for the list of live nodes.
* we then see 3 threads being spun up, in numeric sequence order, to (start) processing LiveNodeWatcher events
* thread-3 finishes and updates the live nodes to a list of size "4"
* thread-3 is then re-used to (start) processing another LiveNodeWatcher event
* thread-3 finishes again and updates the live nodes to a list of size "5"
* thread-1 now finally finishes, and overwrites the live node list with a list of size "1" *even though its data is almost certainly old*
* thread-2 now finally finishes, and also overwrites the live node list with a list of size "3" *even though this data is almost certainly also old*

Filling in the blanks, a valid sequence of events that fits the code and matches the logs could have been (colors match the T# thread ids; a standalone toy program reproducing this kind of interleaving follows the list)...

* {color:blue}T1 {{process()}} called, logs "A live node change: ..."{color}
* {color:blue}T1 {{zkClient.getChildren(...)}} returns a list of 1 live node{color}
* {color:red}T2 {{process()}} called, logs "A live node change: ..."{color}
* {color:green}T3 {{process()}} called, logs "A live node change: ..."{color}
* {color:red}T2 {{zkClient.getChildren(...)}} returns a list of 3 live nodes{color}
* {color:green}T3 {{zkClient.getChildren(...)}} returns a list of 4 live nodes{color}
* {color:green}T3 {{clusterState.setLiveNodes("4")}} called, logs "Updated live nodes..."{color}
* {color:green}T3 {{process()}} is now finished, thread is free for re-use{color}
* {color:green}T3 {{process()}} called, logs "A live node change: ..."{color}
* {color:green}T3 {{zkClient.getChildren(...)}} returns a list of 5 live nodes{color}
* {color:green}T3 {{clusterState.setLiveNodes("5")}} called, logs "Updated live nodes..."{color}
* {color:green}T3 {{process()}} is now finished, thread is free for re-use{color}
* {color:blue}T1 {{clusterState.setLiveNodes("1")}} called, logs "Updated live nodes..."{color}
* {color:blue}T1 {{process()}} is now finished, thread is free for re-use{color}
* {color:red}T2 {{clusterState.setLiveNodes("3")}} called, logs "Updated live nodes..."{color}
* {color:red}T2 {{process()}} is now finished, thread is free for re-use{color}
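
To make that window concrete, here is a tiny self-contained toy program (no Solr or ZK code involved; all names are made up) that deterministically forces the same kind of interleaving - the thread that took the _oldest_ snapshot is the _last_ one to install it, exactly like T1 above:

{code}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Toy demonstration of the lost-update race: the "fetch" happens outside
// the lock, so the thread holding the OLDEST snapshot can write LAST.
public class StaleLiveNodesDemo {
  private static final Object updateLock = new Object();
  private static volatile int liveNodesSize = 0; // stands in for this.liveNodes

  public static void main(String[] args) throws InterruptedException {
    AtomicInteger zkSize = new AtomicInteger(0); // stands in for /live_nodes in ZK
    CountDownLatch t1Fetched = new CountDownLatch(1);
    CountDownLatch t3Done = new CountDownLatch(1);

    Thread t1 = new Thread(() -> {
      int snapshot = zkSize.incrementAndGet(); // T1 "fetches" 1 live node
      t1Fetched.countDown();
      try { t3Done.await(); } catch (InterruptedException e) { return; } // T1 stalls
      synchronized (updateLock) { liveNodesSize = snapshot; } // installs stale "1"
      System.out.println("T1 installed " + snapshot);
    });

    Thread t3 = new Thread(() -> {
      try { t1Fetched.await(); } catch (InterruptedException e) { return; }
      zkSize.set(5);
      int snapshot = zkSize.get(); // T3 "fetches" 5 live nodes
      synchronized (updateLock) { liveNodesSize = snapshot; } // installs fresh "5"
      System.out.println("T3 installed " + snapshot);
      t3Done.countDown();
    });

    t1.start(); t3.start();
    t1.join(); t3.join();
    // Always prints 1: the newest data (5) was silently overwritten.
    System.out.println("final liveNodesSize = " + liveNodesSize);
  }
}
{code}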

_(NOTE: one thing I'm not clear on is why there are only 4 {{live_nodes}} NodeChildrenChanged WatchedEvents triggered - I would expect 5 total, unless ZK doesn't guarantee a unique event for each new child node; does it send a single event if 2 children are added very close together in time?)_
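
(For what it's worth, that coalescing would be consistent with ZooKeeper's one-shot watch semantics: a watch fires at most once and is only re-armed when the watcher re-reads the znode, so any children added or removed between the fire and the re-read show up in that read but generate no additional events. A minimal sketch of the standard pattern, with made-up names:)

{code}
import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Illustrative only: the usual one-shot child-watch pattern.
class OneShotChildWatcher implements Watcher {
  private final ZooKeeper zk;

  OneShotChildWatcher(ZooKeeper zk) throws KeeperException, InterruptedException {
    this.zk = zk;
    zk.getChildren("/live_nodes", this); // initial registration
  }

  @Override
  public void process(WatchedEvent event) {
    try {
      // Re-reading re-arms the watch; everything that changed since the
      // event fired is reflected here, but produced no extra events.
      List<String> children = zk.getChildren("/live_nodes", this);
      System.out.println("live nodes now: " + children.size());
    } catch (KeeperException | InterruptedException e) {
      e.printStackTrace(); // simplified error handling for the sketch
    }
  }
}
{code}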

After the sequence of events above, port #56361 is hosed. In a real world situation, it will never update its local state to match the _real_ list of live nodes in ZK unless some other node goes up/down, triggering the LiveNodeWatcher.

The rest of the "live node" log messages in our log file are either passively reporting the local cached data during other ZK events (specifically: collection creation triggering the state.json watcher), or they are the triggers to the LiveNodeWatcher when MiniSolrCloudCluster starts shutting down nodes after the test has failed.


> inexplicable "no servers hosting shard: shard2" using MiniSolrCloudCluster
> --------------------------------------------------------------------------
>
>                 Key: SOLR-8914
>                 URL: https://issues.apache.org/jira/browse/SOLR-8914
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Hoss Man
>         Attachments: jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt, 
> live_node_mentions_port56361_with_threadIds.log.txt, 
> live_nodes_mentions.log.txt
>
>
> Jenkins encountered a failure in TestTolerantUpdateProcessorCloud over the weekend....
> {noformat}
> http://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Solaris/32/consoleText
> Checking out Revision c46d7686643e7503304cb35dfe546bce9c6684e7 
> (refs/remotes/origin/branch_6x)
> Using Java: 64bit/jdk1.8.0 -XX:+UseCompressedOops -XX:+UseG1GC
> {noformat}
> The failure happened during the static setup of the test, when a 
> MiniSolrCloudCluster & several clients are initialized -- before any code 
> related to TolerantUpdateProcessor is ever used.
> I can't reproduce this, or really make sense of what I'm (not) seeing here in 
> the logs, so I'm filing this jira with my analysis in the hopes that someone 
> else can help make sense of it.