[ 
https://issues.apache.org/jira/browse/SOLR-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14952153#comment-14952153
 ] 

Shalin Shekhar Mangar commented on SOLR-8152:
---------------------------------------------

Okay, I understand now. So we first create the response node as an 
EPHEMERAL_SEQUENTIAL and then use its sequence ID to create the persistent 
request node. Sounds good to me. Thanks for explaining.
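
For readers following along, the ordering described above can be sketched roughly as follows. This is only an illustration against a hypothetical in-memory stand-in for a ZooKeeper directory, not the actual patch or the ZooKeeper client API; the class name, the `qn-`/`qnr-` prefixes and the `/demo/queue` path are assumptions made for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical in-memory stand-in for a ZooKeeper directory; not the Solr or ZooKeeper API.
class ResponseFirstSketch {
    static final String REQUEST_PREFIX = "qn-";   // assumed prefix, for illustration only
    static final String RESPONSE_PREFIX = "qnr-"; // assumed prefix, for illustration only

    // node path -> data
    final Map<String, String> nodes = new ConcurrentSkipListMap<>();
    final AtomicInteger sequence = new AtomicInteger();

    // Mimics a *_SEQUENTIAL create: appends a zero-padded counter to the prefix.
    String createSequential(String prefix, String data) {
        String path = prefix + String.format("%010d", sequence.getAndIncrement());
        nodes.put(path, data);
        return path;
    }

    // Mimics a non-sequential create at an exact path.
    String create(String path, String data) {
        if (nodes.putIfAbsent(path, data) != null) {
            throw new IllegalStateException("node exists: " + path);
        }
        return path;
    }

    // Response node first, then a request node that reuses the same sequence ID.
    String[] submit(String dir, String requestData) {
        String responsePath = createSequential(dir + "/" + RESPONSE_PREFIX, "");
        String seq = responsePath.substring(responsePath.lastIndexOf('-') + 1);
        // A real client would set its watch on responsePath here,
        // before the request becomes visible to the processor.
        String requestPath = create(dir + "/" + REQUEST_PREFIX + seq, requestData);
        return new String[] { responsePath, requestPath };
    }

    public static void main(String[] args) {
        ResponseFirstSketch zk = new ResponseFirstSketch();
        String[] paths = zk.submit("/demo/queue", "task");
        System.out.println(paths[0]);
        System.out.println(paths[1]);
    }
}
```

Because the response node exists before the request node is created, the processor can never pick up a request whose response node is missing.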

> Overseer Task Processor/Queue can miss responses, leading to timeouts
> ---------------------------------------------------------------------
>
>                 Key: SOLR-8152
>                 URL: https://issues.apache.org/jira/browse/SOLR-8152
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Gregory Chanan
>            Assignee: Gregory Chanan
>         Attachments: SOLR-8152.patch
>
>
> I noticed some Jenkins reports of timeouts in the 
> TestConfigSetsAPIExclusivityTest, which seemed strange given the amount of 
> work to be done is small and the timeout generous at 300 seconds.
> I added some statistics gathering and started beasting the test and sure 
> enough, some tests reported tasks taking slightly more than 300 seconds, 
> while most tests ran with a maximum task run of less than a second.  This 
> suggested something was hanging until the timeout.
> Some investigation led to this code:
> https://github.com/apache/lucene-solr/blob/80a73535b20debb1717c6f7f11e08fc311833c88/solr/core/src/java/org/apache/solr/cloud/OverseerTaskQueue.java#L179-L194
> There appear to be a few issues here:
> {code}
>       String path = createData(dir + "/" + PREFIX, data,
>           CreateMode.PERSISTENT_SEQUENTIAL);
>       String watchID = createData(
>           dir + "/" + response_prefix + path.substring(path.lastIndexOf("-") + 1),
>           null, CreateMode.EPHEMERAL);
>       Object lock = new Object();
>       LatchWatcher watcher = new LatchWatcher(lock);
>       synchronized (lock) {
>         if (zookeeper.exists(watchID, watcher, true) != null) {
>           watcher.await(timeout);
>         }
>       }
> {code}
> For one, the request object is created before the response object.  If the 
> request is quickly picked up and processed, two things can happen:
> 1) The response is written before the watch is set, which means we wait until 
> the timeout even though the response is ready.  The test still passes in this 
> case because the response is available; the client just waits needlessly.
> 2) The processor attempts to write the response before the response node is 
> even created.  The fact that the response node doesn't exist is ignored:
> https://github.com/apache/lucene-solr/blob/80a73535b20debb1717c6f7f11e08fc311833c88/solr/core/src/java/org/apache/solr/cloud/OverseerTaskQueue.java#L92-L94
> In this case, the task is processed but the client will actually see a 
> failure because there is no response.
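
Case 1 above can be reproduced in isolation. The sketch below is a hypothetical in-memory model of a single response node, not the ZooKeeper API or the Solr code; `naiveWait` mirrors the quoted logic (set the watch, then wait), while `carefulWait` shows one assumed way to avoid the needless wait by checking for data that is already present once the watch is set.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical in-memory model of one response node; not the ZooKeeper API.
class MissedResponseSketch {
    private volatile byte[] data;          // null until the response is written
    private volatile CountDownLatch watch; // null until a watch is registered

    // The processor writes the response and fires the watch, if one is set.
    synchronized void writeResponse(byte[] bytes) {
        data = bytes;
        if (watch != null) {
            watch.countDown();
        }
    }

    // Registers a one-shot watch that only fires on future writes.
    synchronized CountDownLatch setWatch() {
        watch = new CountDownLatch(1);
        return watch;
    }

    // Waits on the latch; returns false on timeout or interruption.
    private static boolean await(CountDownLatch latch, long timeoutMs) {
        try {
            return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Mirrors the quoted code: set the watch, then wait unconditionally.
    boolean naiveWait(long timeoutMs) {
        return await(setWatch(), timeoutMs);
    }

    // Assumed fix: after setting the watch, check whether data already exists.
    boolean carefulWait(long timeoutMs) {
        CountDownLatch latch = setWatch();
        if (data != null) {
            return true;
        }
        return await(latch, timeoutMs);
    }

    public static void main(String[] args) {
        MissedResponseSketch node = new MissedResponseSketch();
        node.writeResponse(new byte[] {1});        // response arrives before any watch
        System.out.println(node.naiveWait(200));   // blocks for the full timeout
        System.out.println(node.carefulWait(200)); // returns without waiting
    }
}
```

In this model the naive wait only ends at the timeout because the write happened before the watch existed, which matches the roughly 300-second task times described in the report.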
