[ 
https://issues.apache.org/jira/browse/SOLR-4099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raintung Li updated SOLR-4099:
------------------------------

    Description: 
In our test environment, the ZooKeeper version is older than the required
version; we are not using the 3.3.6 version that Solr uses by default.

The Overseer collection processor stopped working. A thread dump shows the
Overseer waiting in LatchChildWatcher.await.
Checking /overseer/collection-queue-work in ZooKeeper shows that many
collection operations are blocked in the queue.

From reading the code, we suspect that the ZooKeeper client never called back
the watcher registered on the path "/overseer/collection-queue-work".
Unfortunately, the relevant messages are logged only at debug level.

This happens very rarely, but when it does it is fatal: we have to stop the
leader server.

Suggested workaround: do not wait indefinitely for the notification. Instead,
wait for a bounded time (ten minutes, half an hour, or some other value) and
then recheck the queue. If the notification does arrive, processing continues
normally right away.
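
A minimal Java sketch of the suggested workaround, assuming a simplified
in-memory queue in place of /overseer/collection-queue-work. The names
TimedLatchWatcher and QueuePoller are hypothetical and only illustrate the
idea; they are not Solr's actual LatchChildWatcher or queue code.

```java
import java.util.Queue;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a latch-style child watcher (not Solr's class). */
class TimedLatchWatcher {
    private final Object lock = new Object();
    private boolean notified = false;

    /** Called when a watch event arrives; may never run if the callback is lost. */
    void process() {
        synchronized (lock) {
            notified = true;
            lock.notifyAll();
        }
    }

    /** Waits at most timeoutMs; returns true if notified, false on timeout. */
    boolean await(long timeoutMs) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        synchronized (lock) {
            while (!notified) {
                long remainingMs =
                    TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remainingMs <= 0) {
                    return false; // timed out; the caller should recheck the queue
                }
                lock.wait(remainingMs);
            }
            notified = false; // reset so the watcher can be reused
            return true;
        }
    }
}

/** Hypothetical consumer loop: never blocks forever on a single notification. */
class QueuePoller {
    static String take(Queue<String> queue, TimedLatchWatcher watcher, long recheckMs)
            throws InterruptedException {
        while (true) {
            String head = queue.poll();
            if (head != null) {
                return head;
            }
            // Wake up on notification, or after recheckMs even if none arrives.
            watcher.await(recheckMs);
        }
    }
}
```

With this shape, a lost watch event delays processing by at most the recheck
interval instead of blocking the processor until the leader is stopped.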



  was:
In our test environment, the ZooKeeper version is older than the required
version; we are not using the 3.3.6 version that Solr uses by default.

The Overseer collection processor stopped working. A thread dump shows the
Overseer waiting in LatchChildWatcher.await.
Checking /overseer/collection-queue-work in ZooKeeper shows that many
collection operations are blocked in the queue.

From reading the code, we suspect that the ZooKeeper client never called back
the watcher registered on the path "/overseer/collection-queue-work".
Unfortunately, the relevant messages are logged only at debug level.

This happens very rarely, but when it does it is fatal: we have to stop the
leader server.

We suggest a workaround: do not wait indefinitely for the notification.
Instead, wait for a bounded time (ten minutes, half an hour, or some other
value) and then recheck the queue. If the notification does arrive, processing
continues normally right away.



    
> Suspect zookeeper client thread doesn't call back the watcher, that occur the 
> overseer collection can't work normal.
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-4099
>                 URL: https://issues.apache.org/jira/browse/SOLR-4099
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 4.0-ALPHA, 4.0-BETA, 4.0
>         Environment: Zookeeper version: 3.2
>            Reporter: Raintung Li
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> In our test environment, the ZooKeeper version is older than the required
> version; we are not using the 3.3.6 version that Solr uses by default.
> The Overseer collection processor stopped working. A thread dump shows the
> Overseer waiting in LatchChildWatcher.await.
> Checking /overseer/collection-queue-work in ZooKeeper shows that many
> collection operations are blocked in the queue.
> From reading the code, we suspect that the ZooKeeper client never called
> back the watcher registered on the path "/overseer/collection-queue-work".
> Unfortunately, the relevant messages are logged only at debug level.
> This happens very rarely, but when it does it is fatal: we have to stop the
> leader server.
> Suggested workaround: do not wait indefinitely for the notification.
> Instead, wait for a bounded time (ten minutes, half an hour, or some other
> value) and then recheck the queue. If the notification does arrive,
> processing continues normally right away.
