[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110960#comment-17110960
 ] 

Mate Szalay-Beko edited comment on ZOOKEEPER-3829 at 5/19/20, 12:02 PM:
------------------------------------------------------------------------

I'll push a PR with some proposed fixes and also some tests that reproduce the 
steps we discussed before (these tests fail without the changes, but pass with 
them). 

see: https://github.com/apache/zookeeper/pull/1356


was (Author: symat):
I'll push a PR with some proposed fixes and also some tests that reproduce the 
steps we discussed before (these tests fail without the changes, but pass with 
them). 

Still, I don't consider this a final solution yet. For example, in the rolling 
restart case described in ZOOKEEPER-3814, these changes cause an 'infinite loop' 
in which quorum members keep sending notifications between a server with the old 
config and a server with the new config. This leads to a large number of 
notifications being exchanged between these servers, and the loop only breaks 
once all the servers have the new config.

> Zookeeper refuses request after node expansion
> ----------------------------------------------
>
>                 Key: ZOOKEEPER-3829
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.5.6
>            Reporter: benwang li
>            Assignee: Mate Szalay-Beko
>            Priority: Major
>         Attachments: d.log, screenshot-1.png
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> It's easy to reproduce this bug.
> {code:java}
> Step 1. Deploy 3 nodes A, B, C with configuration A,B,C.
> Step 2. Deploy node D with configuration A,B,C,D; the cluster state is OK now.
> Step 3. Restart nodes A, B, C with configuration A,B,C,D. The leader will then 
> be D and the cluster hangs: it still accepts the `mntr` command, but other 
> commands like `ls /` are blocked.
> Step 4. Restart node D; the cluster state is back to normal now.
> {code}
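> 
> For illustration, a minimal sketch of the static membership lines every node 
> could carry after the expansion in step 3 (server IDs, hostnames and ports are 
> placeholders, not taken from this report):
> {code}
> # zoo.cfg fragment on nodes A, B, C and D after the expansion
> server.1=hostA:2888:3888
> server.2=hostB:2888:3888
> server.3=hostC:2888:3888
> server.4=hostD:2888:3888
> {code}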
>  
> We have looked into the code of version 3.5.6, and we found it may be an issue 
> with the `workerPool`.
> The `CommitProcessor` shuts down and makes the `workerPool` shut down, but the 
> `workerPool` reference still exists. It will never process requests again, yet 
> the cluster still thinks it's OK.
>  
> I think the bug may still exist in the master branch.
> We have tested it on our machines by resetting the `workerPool` to null. If 
> that's OK, please assign this issue to me, and then I'll create a PR. 
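> 
> For illustration, a minimal, hypothetical sketch of the lifecycle pattern 
> described above (not the actual org.apache.zookeeper.server.quorum.CommitProcessor 
> code): the worker pool is created lazily in start(), shutdown() stops it but 
> keeps the reference, so a later start() skips re-creation and the processor can 
> never run requests again. The commented-out line shows the workaround mentioned 
> above (resetting the reference to null).
> {code:java}
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> 
> // Simplified, hypothetical sketch of the lifecycle issue described above
> // (not the actual CommitProcessor code).
> public class CommitProcessorSketch {
> 
>     private ExecutorService workerPool; // created lazily, never reset to null
> 
>     public void start() {
>         // After a shutdown the old, terminated pool is still non-null, so a
>         // restarted processor keeps the dead pool and never runs tasks again.
>         if (workerPool == null) {
>             workerPool = Executors.newFixedThreadPool(4);
>         }
>     }
> 
>     public void shutdown() {
>         if (workerPool != null) {
>             workerPool.shutdown();
>             // Workaround tested by the reporter: drop the reference so that
>             // the next start() creates a fresh pool.
>             // workerPool = null;
>         }
>     }
> 
>     public void commit(Runnable request) {
>         // Throws RejectedExecutionException once the pool has been shut down.
>         workerPool.execute(request);
>     }
> }
> {code}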
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
