[ https://issues.apache.org/jira/browse/SOLR-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955911#comment-14955911 ]
ASF subversion and git services commented on SOLR-8152:
-------------------------------------------------------
Commit 1708539 from [email protected] in branch 'dev/trunk'
[ https://svn.apache.org/r1708539 ]
SOLR-8152: Overseer Task Processor/Queue can miss responses, leading to timeouts
> Overseer Task Processor/Queue can miss responses, leading to timeouts
> ---------------------------------------------------------------------
>
> Key: SOLR-8152
> URL: https://issues.apache.org/jira/browse/SOLR-8152
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Reporter: Gregory Chanan
> Assignee: Gregory Chanan
> Fix For: 5.4, Trunk
>
> Attachments: SOLR-8152.patch
>
>
> I noticed some jenkins reports of timeouts in the
> TestConfigSetsAPIExclusivityTest, which seemed strange given the amount of
> work to be done is small and the timeout generous at 300 seconds.
> I added some statistics gathering and started beasting the test and sure
> enough, some tests reported tasks taking slightly more than 300 seconds,
> while most tests ran with a maximum task run of less than a second. This
> suggested something was hanging until the timeout.
> Some investigation led to this code:
> https://github.com/apache/lucene-solr/blob/80a73535b20debb1717c6f7f11e08fc311833c88/solr/core/src/java/org/apache/solr/cloud/OverseerTaskQueue.java#L179-L194
> There appear to be a few issues here:
> {code}
> String path = createData(dir + "/" + PREFIX, data,
>     CreateMode.PERSISTENT_SEQUENTIAL);
> String watchID = createData(
>     dir + "/" + response_prefix + path.substring(path.lastIndexOf("-") + 1),
>     null, CreateMode.EPHEMERAL);
> Object lock = new Object();
> LatchWatcher watcher = new LatchWatcher(lock);
> synchronized (lock) {
>   if (zookeeper.exists(watchID, watcher, true) != null) {
>     watcher.await(timeout);
>   }
> }
> {code}
> For one, the request object is created before the response object. If the
> request is quickly picked up and processed, two things can happen:
> 1) The response is written before the watch is set, which means we wait until
> the timeout even though the response is ready. The test still passes
> because the response is available; the client just waits needlessly.
> 2) The response write is attempted before the response node is even
> created. The fact that the response node doesn't exist is ignored:
> https://github.com/apache/lucene-solr/blob/80a73535b20debb1717c6f7f11e08fc311833c88/solr/core/src/java/org/apache/solr/cloud/OverseerTaskQueue.java#L92-L94
> In this case, the task is processed but the client will actually see a
> failure because there is no response.
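
To make the ordering problem in case 2 concrete, here is a minimal, single-threaded
sketch. It does not use ZooKeeper or any Solr class and is not the attached patch:
FakeStore, requestFirst, responseFirst and the "/response-..." paths are all
hypothetical names invented for illustration. It assumes the worst case, where the
task completes the instant the request becomes visible, and compares creating the
request first with creating the response node and its watch first.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class ResponseOrderingSketch {

  /** Hypothetical in-memory stand-in for the response nodes (not a real API). */
  static final class FakeStore {
    private final Map<String, String> nodes = new ConcurrentHashMap<>();
    private final Map<String, CountDownLatch> watches = new ConcurrentHashMap<>();

    void create(String path) {
      nodes.put(path, "");
    }

    /** Registers a one-shot watch; returns false if the node does not exist. */
    boolean watch(String path, CountDownLatch latch) {
      if (!nodes.containsKey(path)) {
        return false;
      }
      watches.put(path, latch);
      return true;
    }

    /** Writes the response; a missing node is silently ignored, as in the issue. */
    void writeResponse(String path, String data) {
      if (nodes.computeIfPresent(path, (k, v) -> data) == null) {
        return;
      }
      CountDownLatch latch = watches.remove(path);
      if (latch != null) {
        latch.countDown();
      }
    }

    String read(String path) {
      return nodes.get(path);
    }
  }

  /** Waits on the latch; returns true if it fired before the timeout. */
  static boolean await(CountDownLatch latch, long timeoutMs) {
    try {
      return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  /** Request first: an instantly processed task writes to a node that is not there yet. */
  static String requestFirst(FakeStore store, String responsePath, long timeoutMs) {
    store.writeResponse(responsePath, "done"); // simulates the task finishing immediately
    store.create(responsePath);
    CountDownLatch latch = new CountDownLatch(1);
    if (store.watch(responsePath, latch)) {
      await(latch, timeoutMs); // nothing will ever fire this watch
    }
    return store.read(responsePath);
  }

  /** Response node and watch first: the same instant completion is observed. */
  static String responseFirst(FakeStore store, String responsePath, long timeoutMs) {
    store.create(responsePath);
    CountDownLatch latch = new CountDownLatch(1);
    boolean watching = store.watch(responsePath, latch);
    store.writeResponse(responsePath, "done"); // simulates the task finishing immediately
    if (watching) {
      await(latch, timeoutMs); // returns at once, the latch has already fired
    }
    return store.read(responsePath);
  }

  public static void main(String[] args) {
    String lost = requestFirst(new FakeStore(), "/response-0000000001", 100);
    String seen = responseFirst(new FakeStore(), "/response-0000000002", 100);
    System.out.println("request first:  response = \"" + lost + "\"");
    System.out.println("response first: response = \"" + seen + "\"");
  }
}
```

In the sketch the latch remembers that it has fired, so waiting after the event has
already arrived returns immediately. A real watcher would need an equivalent check
before waiting; whether and how LatchWatcher does that is not shown in the excerpt
above, so treat that part as an assumption.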