[jira] [Commented] (SOLR-6336) DistributedQueue (and it's use in OCP) leaks ZK Watches
[ https://issues.apache.org/jira/browse/SOLR-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14262144#comment-14262144 ]

ASF GitHub Bot commented on SOLR-6336:
--
Github user andyetitmoves closed the pull request at: https://github.com/apache/lucene-solr/pull/81

DistributedQueue (and it's use in OCP) leaks ZK Watches
---
Key: SOLR-6336
URL: https://issues.apache.org/jira/browse/SOLR-6336
Project: Solr
Issue Type: Bug
Components: SolrCloud
Reporter: Ramkumar Aiyengar
Assignee: Mark Miller
Fix For: 4.10, Trunk

The current {{DistributedQueue}} implementation leaks ZK watches whenever it finds children or times out on finding one. OCP uses this in its event loop and can loop tight in some conditions (when exclusivity checks fail), leading to lots of watches which get triggered together on the next event (could be a while for some activities like shard splitting). This gets exposed by SOLR-6261 which spawns a new thread for every parallel watch event.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-6336) DistributedQueue (and it's use in OCP) leaks ZK Watches
[ https://issues.apache.org/jira/browse/SOLR-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174420#comment-14174420 ]

Jessica Cheng Mallet commented on SOLR-6336:

Please let me know if I'm supposed to open a new issue (not sure what the policy is). I'm encountering a bug from this patch where I'm now stuck in a loop making getChildren() requests to ZooKeeper, with this thread dump:

{quote}
Thread-51 [WAITING] CPU time: 1d 15h 0m 57s
java.lang.Object.wait()
org.apache.zookeeper.ClientCnxn.submitRequest(RequestHeader, Record, Record, ZooKeeper$WatchRegistration)
org.apache.zookeeper.ZooKeeper.getChildren(String, Watcher)
org.apache.solr.common.cloud.SolrZkClient$6.execute() (2 recursive calls)
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkOperation)
org.apache.solr.common.cloud.SolrZkClient.getChildren(String, Watcher, boolean)
org.apache.solr.cloud.DistributedQueue.orderedChildren(Watcher)
org.apache.solr.cloud.DistributedQueue.getChildren(long)
org.apache.solr.cloud.DistributedQueue.peek(long)
org.apache.solr.cloud.DistributedQueue.peek(boolean)
org.apache.solr.cloud.Overseer$ClusterStateUpdater.run()
java.lang.Thread.run()
{quote}

Looking at the code, I think the issue is that LatchChildWatcher#process always stores the event in its member field, regardless of the event's type. Once an event is set, await no longer waits. In this state, the while loop in getChildren(long), when called with wait being Long.MAX_VALUE, comes back around, does NOT wait at await because event != null, and still does not get any children.

{quote}
while (true) {
  if (!children.isEmpty()) break;
  watcher.await(wait == Long.MAX_VALUE ? DEFAULT_TIMEOUT : wait);
  if (watcher.getWatchedEvent() != null) {
    children = orderedChildren(null);
  }
  if (wait != Long.MAX_VALUE) break;
}
{quote}

I think the fix would be to only set the event in the watcher if the type is NodeChildrenChanged.
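The proposed fix above can be sketched roughly as follows. This is a minimal illustration, not the code that was committed to Solr: `EventType` here is a hypothetical stand-in for ZooKeeper's `Watcher.Event.EventType`, and `FilteringLatchWatcher` is an invented name, used so the sketch compiles without the ZooKeeper client on the classpath.

```java
// Hypothetical stand-in for ZooKeeper's Watcher.Event.EventType (assumption).
enum EventType { None, NodeDataChanged, NodeChildrenChanged }

// Sketch of a latch-style watcher that records only children-changed events.
class FilteringLatchWatcher {
    private final Object lock = new Object();
    private EventType event; // stays null until a relevant event arrives

    // Called for every notification; only a children change releases waiters.
    void process(EventType type) {
        if (type != EventType.NodeChildrenChanged) {
            return; // other events (e.g. None) must not make await() return early
        }
        synchronized (lock) {
            event = type;
            lock.notifyAll();
        }
    }

    // Waits up to timeoutMs (assumed > 0) unless a relevant event is already recorded.
    void await(long timeoutMs) throws InterruptedException {
        synchronized (lock) {
            if (event == null) {
                lock.wait(timeoutMs);
            }
        }
    }

    EventType getWatchedEvent() {
        synchronized (lock) {
            return event;
        }
    }
}
```

With this filter, an unrelated notification leaves `getWatchedEvent()` at null, so a polling loop like the one quoted above would still block in `await` instead of spinning on `getChildren()`.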
[jira] [Commented] (SOLR-6336) DistributedQueue (and it's use in OCP) leaks ZK Watches
[ https://issues.apache.org/jira/browse/SOLR-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174514#comment-14174514 ]

Shawn Heisey commented on SOLR-6336:

[~mewmewball], because this issue is resolved and you're having what looks to you like a related bug, it's standard practice to open a new issue. Discussion sometimes continues on an issue after it's resolved, but typically only for clarification, for example to decide whether a new issue should be filed. You can link the new issue to this one as a related issue after you create it. I do not understand the cloud/ZooKeeper internals well enough to follow what you described above. One day I hope to.
[jira] [Commented] (SOLR-6336) DistributedQueue (and it's use in OCP) leaks ZK Watches
[ https://issues.apache.org/jira/browse/SOLR-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174519#comment-14174519 ]

Jessica Cheng Mallet commented on SOLR-6336:

Thanks for the clarification [~elyograg]! I'll open a new issue. Thanks!
[jira] [Commented] (SOLR-6336) DistributedQueue (and it's use in OCP) leaks ZK Watches
[ https://issues.apache.org/jira/browse/SOLR-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174541#comment-14174541 ]

Mark Miller commented on SOLR-6336:
---
Best rule of thumb: if the issue is released, new issue and link it to the related one. If it's not released, reopen the issue.
[jira] [Commented] (SOLR-6336) DistributedQueue (and it's use in OCP) leaks ZK Watches
[ https://issues.apache.org/jira/browse/SOLR-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174556#comment-14174556 ]

Jessica Cheng Mallet commented on SOLR-6336:

Got it! Thanks!
[jira] [Commented] (SOLR-6336) DistributedQueue (and it's use in OCP) leaks ZK Watches
[ https://issues.apache.org/jira/browse/SOLR-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14095461#comment-14095461 ]

Ramkumar Aiyengar commented on SOLR-6336:
-
I raised SOLR-6370 to catch such issues proactively.
[jira] [Commented] (SOLR-6336) DistributedQueue (and it's use in OCP) leaks ZK Watches
[ https://issues.apache.org/jira/browse/SOLR-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090630#comment-14090630 ]

ASF GitHub Bot commented on SOLR-6336:
--
GitHub user andyetitmoves opened a pull request: https://github.com/apache/lucene-solr/pull/81

Cache children as well so that they can be returned when the watcher is reused

Fixes an issue with apache/lucene-solr#80 pulled to SOLR-6336. If the first `getChildren` actually returns nodes, and the second request happens before the watch is fired, currently it will return no children. The rest of the patch is just minor code cleanup.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/bloomberg/lucene-solr trunk-reuse-latch-fix

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/lucene-solr/pull/81.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #81

commit f603e7506d1fe5d956a75cdb13897b1b7af7ac70
Author: Ramkumar Aiyengar andyetitmo...@gmail.com
Date: 2014-08-08T07:44:10Z
Cache children as well so that they can be returned when the watcher is reused
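The behaviour this pull request describes might look roughly like the sketch below. It is an illustration under stated assumptions, not the patch itself: `CachedChildren`, `watchFired`, and the `Supplier` standing in for a watch-setting `getChildren` call are all hypothetical names, chosen so the example runs without ZooKeeper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Sketch: keep the children seen when the watch was set, and return them
// again until that watch fires, instead of returning nothing.
class CachedChildren {
    // Hypothetical stand-in for a ZooKeeper getChildren call that also sets a watch.
    private final Supplier<List<String>> fetchAndWatch;
    private List<String> cached = new ArrayList<>();
    private boolean watchPending; // true while the registered watch has not fired
    int fetchCount;               // number of fetches (and watches set) so far

    CachedChildren(Supplier<List<String>> fetchAndWatch) {
        this.fetchAndWatch = fetchAndWatch;
    }

    // Returns the cached list while a watch is pending; otherwise fetches again.
    synchronized List<String> get() {
        if (!watchPending) {
            cached = new ArrayList<>(fetchAndWatch.get());
            watchPending = true;
            fetchCount++;
        }
        return cached;
    }

    // Called when the watch fires; the next get() fetches a fresh list.
    synchronized void watchFired() {
        watchPending = false;
    }
}
```

In this model a second `get()` issued before the watch fires returns the same children as the first, which is the case the pull request says previously returned no children.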
[jira] [Commented] (SOLR-6336) DistributedQueue (and it's use in OCP) leaks ZK Watches
[ https://issues.apache.org/jira/browse/SOLR-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090634#comment-14090634 ]

Ramkumar Aiyengar commented on SOLR-6336:
-
Fixed a bug with children not being returned when the watcher is reused.
[jira] [Commented] (SOLR-6336) DistributedQueue (and it's use in OCP) leaks ZK Watches
[ https://issues.apache.org/jira/browse/SOLR-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090795#comment-14090795 ]

ASF subversion and git services commented on SOLR-6336:
---
Commit 1616771 from [~markrmil...@gmail.com] in branch 'dev/trunk' [ https://svn.apache.org/r1616771 ]

SOLR-6336: Cache children as well so that they can be returned when the watcher is reused.
[jira] [Commented] (SOLR-6336) DistributedQueue (and it's use in OCP) leaks ZK Watches
[ https://issues.apache.org/jira/browse/SOLR-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090797#comment-14090797 ]

ASF subversion and git services commented on SOLR-6336:
---
Commit 1616772 from [~markrmil...@gmail.com] in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1616772 ]

SOLR-6336: Cache children as well so that they can be returned when the watcher is reused.
[jira] [Commented] (SOLR-6336) DistributedQueue (and it's use in OCP) leaks ZK Watches
[ https://issues.apache.org/jira/browse/SOLR-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14089230#comment-14089230 ]

ASF GitHub Bot commented on SOLR-6336:
--
GitHub user andyetitmoves opened a pull request: https://github.com/apache/lucene-solr/pull/80

Reuse watcher in DistributedQueue across peek/take

Initial patch for SOLR-6336. Some more work can probably be done (other functions in the queue still do this, and tests that check for this in general would be good), but here's an initial fix which passes tests and fixes the Jenkins failures currently happening.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/bloomberg/lucene-solr fix-watch-leak

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/lucene-solr/pull/80.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #80

commit d61f5de2056a4ea841c2d637b9def1c2cb8597b0
Author: Ramkumar Aiyengar raiyen...@bloomberg.net
Date: 2014-08-07T12:30:22Z
Reuse watcher in DistributedQueue across peek/take
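To make the difference between the two polling styles concrete, here is a toy model of one-shot watches. It uses no ZooKeeper classes; `WatchCountModel` and its methods are hypothetical names, and the model only counts pending watches rather than reproducing the actual `DistributedQueue` code.

```java
// Toy model (assumption): counts one-shot watches that are set but not yet fired.
class WatchCountModel {
    int pendingLeaky;  // watches pending when every poll sets a new watch
    int pendingReused; // watches pending when an unfired watch is reused (0 or 1)

    // Leaky style: each poll registers a fresh watch, even if one is pending.
    void pollLeaky() {
        pendingLeaky++;
    }

    // Reusing style: register a watch only when none is pending.
    void pollReusing() {
        if (pendingReused == 0) {
            pendingReused = 1;
        }
    }

    // The next queue event fires all pending watches and clears them.
    // Returns how many callbacks the leaky style triggers at once.
    int fireAll() {
        int fired = pendingLeaky;
        pendingLeaky = 0;
        pendingReused = 0;
        return fired;
    }
}
```

Polling a thousand times in a tight loop leaves a thousand pending watches in the leaky style and a single one in the reusing style; when the next event arrives, the leaky style fires them all together.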
[jira] [Commented] (SOLR-6336) DistributedQueue (and it's use in OCP) leaks ZK Watches
[ https://issues.apache.org/jira/browse/SOLR-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14089527#comment-14089527 ]

Ramkumar Aiyengar commented on SOLR-6336:
-
bq. other functions in the queue still do this

That's only {{offer}}, which is best managed by the caller as the watcher depends on the arguments passed in. In any case, that's handled reasonably well by CollectionsHandler which uses it.
[jira] [Commented] (SOLR-6336) DistributedQueue (and it's use in OCP) leaks ZK Watches
[ https://issues.apache.org/jira/browse/SOLR-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090140#comment-14090140 ]

ASF subversion and git services commented on SOLR-6336:
---
Commit 1616655 from [~markrmil...@gmail.com] in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1616655 ]

SOLR-6336: DistributedQueue can easily create too many ZooKeeper Watches. (closes #80)
[jira] [Commented] (SOLR-6336) DistributedQueue (and it's use in OCP) leaks ZK Watches
[ https://issues.apache.org/jira/browse/SOLR-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090139#comment-14090139 ]

ASF subversion and git services commented on SOLR-6336:
---
Commit 1616654 from [~markrmil...@gmail.com] in branch 'dev/trunk' [ https://svn.apache.org/r1616654 ]

SOLR-6336: DistributedQueue can easily create too many ZooKeeper Watches. (closes #80)
[jira] [Commented] (SOLR-6336) DistributedQueue (and it's use in OCP) leaks ZK Watches
[ https://issues.apache.org/jira/browse/SOLR-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14090164#comment-14090164 ]

ASF GitHub Bot commented on SOLR-6336:
--
Github user asfgit closed the pull request at: https://github.com/apache/lucene-solr/pull/80