[ https://issues.apache.org/jira/browse/SOLR-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217011#comment-15217011 ]

Scott Blum commented on SOLR-8914:
----------------------------------

Hoss-- great analysis!  I was just looking at this code myself.  The part I 
don't understand is why this watcher is getting fired multiple times on 
different threads.  I (re)wrote some of this code, and one of my implicit 
assumptions was that any given watcher would not get re-fired until the 
previous watcher invocation had returned.  But maybe that was a really bad 
assumption that I carried over from Curator, or perhaps the thread model in ZK 
has changed?

Either way, I completely agree that this is broken if we can get multiple 
concurrent ZK callbacks on the same watcher.

getUpdateLock() is an overly broad lock; we should not try to hold it while 
making calls into ZK.

I have a few proposals for how to fix this:

1) Add a getChildren() variant that returns a Stat object, and use version 
tracking.  I could make an argument that this is the most sane way to break any 
versioning ties, since it completely eliminates client-side concurrency issues 
in the ZK framework code.  If we don't do this, we're always implicitly 
depending on our ability to linearly sequence getChildren calls.
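
To make #1 concrete, here's a rough sketch of the shape I have in mind -- the 
class and field names are made up for illustration (this is not the actual 
ZkStateReader code), and re-registering the watcher is omitted:

{code:java}
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

class LiveNodesRefresher {
  private final ZooKeeper zk;
  private final Object updateLock = new Object();
  private int lastCversion = -1;  // children version of the last snapshot we applied
  private volatile Set<String> liveNodes = new TreeSet<>();

  LiveNodesRefresher(ZooKeeper zk) {
    this.zk = zk;
  }

  void refreshLiveNodes() throws KeeperException, InterruptedException {
    // do the ZK call outside of any lock; the Stat tells us how fresh this read is
    Stat stat = new Stat();
    List<String> children = zk.getChildren("/live_nodes", null, stat);
    synchronized (updateLock) {
      // a stale callback can never clobber a newer snapshot: only apply the
      // result if its children version is newer than the last one we applied
      if (stat.getCversion() > lastCversion) {
        lastCversion = stat.getCversion();
        liveNodes = new TreeSet<>(children);
      }
    }
  }
}
{code}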

2) Make the various Watcher objects lock themselves for the duration of 
process(WatchedEvent event).  That way, subsequent callbacks will have to 
linearize.
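
A minimal sketch of #2 -- the watcher class here is illustrative, not the exact 
one in ZkStateReader:

{code:java}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

class LiveNodesWatcher implements Watcher {
  @Override
  public synchronized void process(WatchedEvent event) {
    // while one callback runs, any concurrent callback on this same watcher
    // instance blocks here instead of interleaving with it
    refreshAndRewatch();
  }

  private void refreshAndRewatch() {
    // re-read /live_nodes and re-register this watcher (omitted)
  }
}
{code}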

3) Similar idea, but put that same set of locks as top-level fields in 
ZkStateReader.  This is like #2, but it also protects against a multiple-watchers 
scenario.  I've tried to avoid creating multiple watchers, but there's no real 
guard against multiple calls to createClusterStateWatchersAndUpdate().
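
And a sketch of #3, assuming a lock field hoisted onto the reader itself (the 
class and field names are made up); even two distinct watcher instances -- say, 
from two calls to createClusterStateWatchersAndUpdate() -- would serialize on 
the same monitor:

{code:java}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

class StateReader {
  // shared by every live_nodes watcher this reader ever creates
  private final Object liveNodesUpdateLock = new Object();

  class LiveNodesWatcher implements Watcher {
    @Override
    public void process(WatchedEvent event) {
      synchronized (liveNodesUpdateLock) {
        refreshLiveNodes();
      }
    }
  }

  void refreshLiveNodes() {
    // re-read /live_nodes from ZK and update the cached set (omitted)
  }
}
{code}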


> ZkStateReader's refreshLiveNodes(Watcher) is not thread safe
> ------------------------------------------------------------
>
>                 Key: SOLR-8914
>                 URL: https://issues.apache.org/jira/browse/SOLR-8914
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Hoss Man
>         Attachments: jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt, 
> live_node_mentions_port56361_with_threadIds.log.txt, 
> live_nodes_mentions.log.txt
>
>
> Jenkins encountered a failure in TestTolerantUpdateProcessorCloud over the 
> weekend....
> {noformat}
> http://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Solaris/32/consoleText
> Checking out Revision c46d7686643e7503304cb35dfe546bce9c6684e7 
> (refs/remotes/origin/branch_6x)
> Using Java: 64bit/jdk1.8.0 -XX:+UseCompressedOops -XX:+UseG1GC
> {noformat}
> The failure happened during the static setup of the test, when a 
> MiniSolrCloudCluster & several clients are initialized -- before any code 
> related to TolerantUpdateProcessor is ever used.
> I can't reproduce this, or really make sense of what I'm (not) seeing here in 
> the logs, so I'm filing this JIRA with my analysis in the hopes that someone 
> else can help make sense of it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
