[
https://issues.apache.org/jira/browse/SOLR-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217011#comment-15217011
]
Scott Blum commented on SOLR-8914:
----------------------------------
Hoss-- great analysis! I was just looking at this code myself. The part I
don't understand is why this watcher is getting fired multiple times on
different threads. I (re)wrote some of this code, and one of my implicit
assumptions was that any given watcher would not get re-fired until the
previous watcher invocation had returned. But maybe that was a really bad
assumption I carried over from Curator, or perhaps the thread model in ZK
has changed?
Either way, I completely agree that this is broken if we can get multiple
concurrent ZK callbacks on the same watcher.
getUpdateLock() is an overly broad lock; we should not try to hold it while
calling out to ZK.
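To make the lock-scope point concrete, here's roughly the shape I have in
mind -- sketch only; zk, updateLock, and liveNodes are stand-in names, not
the actual ZkStateReader fields:
{code:java}
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

class NarrowLockSketch {
  private final ZooKeeper zk;                 // stand-in for the reader's ZK handle
  private final Object updateLock = new Object();
  private volatile Set<String> liveNodes = Collections.emptySet();

  NarrowLockSketch(ZooKeeper zk) { this.zk = zk; }

  void refreshLiveNodes(Watcher watcher) throws KeeperException, InterruptedException {
    // The blocking ZK read happens with no lock held...
    List<String> fresh = zk.getChildren("/live_nodes", watcher);
    synchronized (updateLock) {
      // ...and only the cheap in-memory swap is guarded.
      liveNodes = Collections.unmodifiableSet(new TreeSet<>(fresh));
    }
  }
}
{code}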
I have a few proposals on how to fix:
1) Add a getChildren() variant that returns a Stat object, and use version
tracking (first sketch below). I'd argue this is the sanest way to break
versioning ties, since it completely eliminates client-side concurrency
issues in the ZK framework code. If we don't do this, we're always implicitly
depending on our ability to linearly sequence getChildren() calls.
2) Make the various Watcher objects lock themselves for the duration of
process(WatchedEvent event), so that subsequent callbacks have to linearize
(second sketch below).
3) Similar idea, but put that same set of locks as top-level fields in
ZkStateReader. This is like #2, but also protects against a multiple-watchers
scenario (the second sketch shows this flavor). I've tried to avoid
multiple-watcher situations, but there's no real guard against multiple calls
to createClusterStateWatchersAndUpdate().
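For #1, the shape would be something like this (sketch only -- class and
field names are illustrative, not the actual ZkStateReader code):
{code:java}
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

class VersionTrackingSketch {
  private final ZooKeeper zk;             // stand-in for the reader's ZK handle
  private final Object liveNodesLock = new Object();
  private int lastCversion = -1;          // highest /live_nodes cversion applied
  private volatile Set<String> liveNodes = Collections.emptySet();

  VersionTrackingSketch(ZooKeeper zk) { this.zk = zk; }

  void refreshLiveNodes(Watcher watcher) throws KeeperException, InterruptedException {
    Stat stat = new Stat();
    // getChildren(path, watcher, stat) re-registers the watch AND fills in
    // the znode's Stat; its cversion totally orders the child lists we read.
    List<String> children = zk.getChildren("/live_nodes", watcher, stat);
    synchronized (liveNodesLock) {
      if (stat.getCversion() > lastCversion) {   // drop stale, out-of-order reads
        lastCversion = stat.getCversion();
        liveNodes = Collections.unmodifiableSet(new TreeSet<>(children));
      }
    }
  }
}
{code}
That way, even if two callbacks race, an older child list can never overwrite
a newer one.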
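And for #2/#3, just serialize the callbacks themselves (again a sketch, not
actual code):
{code:java}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

class SerializedWatcher implements Watcher {
  // #3 flavor: the lock is owned by the enclosing reader and passed in, so
  // even two watcher instances registered on the same path linearize.
  // For the #2 flavor, make the lock a private field of the watcher itself.
  private final Object refreshLock;
  private final Runnable refreshAction;   // e.g. a refreshLiveNodes() call

  SerializedWatcher(Object refreshLock, Runnable refreshAction) {
    this.refreshLock = refreshLock;
    this.refreshAction = refreshAction;
  }

  @Override
  public void process(WatchedEvent event) {
    synchronized (refreshLock) {
      // A second callback blocks here until the previous one returns,
      // restoring the "no concurrent invocations of this watcher" assumption.
      refreshAction.run();
    }
  }
}
{code}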
> ZkStateReader's refreshLiveNodes(Watcher) is not thread safe
> ------------------------------------------------------------
>
> Key: SOLR-8914
> URL: https://issues.apache.org/jira/browse/SOLR-8914
> Project: Solr
> Issue Type: Bug
> Reporter: Hoss Man
> Attachments: jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt,
> live_node_mentions_port56361_with_threadIds.log.txt,
> live_nodes_mentions.log.txt
>
>
> Jenkins encountered a failure in TestTolerantUpdateProcessorCloud over the
> weekend....
> {noformat}
> http://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Solaris/32/consoleText
> Checking out Revision c46d7686643e7503304cb35dfe546bce9c6684e7
> (refs/remotes/origin/branch_6x)
> Using Java: 64bit/jdk1.8.0 -XX:+UseCompressedOops -XX:+UseG1GC
> {noformat}
> The failure happened during the static setup of the test, when a
> MiniSolrCloudCluster & several clients are initialized -- before any code
> related to TolerantUpdateProcessor is ever used.
> I can't reproduce this, or really make sense of what I'm (not) seeing here in
> the logs, so I'm filing this jira with my analysis in the hopes that someone
> else can help make sense of it.