[ 
https://issues.apache.org/jira/browse/BOOKKEEPER-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147562#comment-13147562
 ] 

Sijie Guo commented on BOOKKEEPER-69:
-------------------------------------

Ivan, thanks for your comments.

After checking the code again, I found that my earlier analysis was flawed: the 
acquire/release operations in TopicManager are executed one at a time. 

After reading your patch, I found that the real cause of the race condition is 
between acquire and release. If we follow these two patterns, we can avoid 
this race condition. 
1) A topic is put in the topic list only after the persistence manager and 
subscription manager have acquired the topic successfully.
2) A topic is removed from the topic list only after the persistence manager 
and subscription manager have released the topic.

Currently, 1) is guaranteed but 2) is not (your patch makes sure the callback 
is triggered after all managers have released the topic).
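
To make the two patterns concrete, here is a minimal sketch of the ordering I 
have in mind. It is not the actual TopicManager code: the class name, the 
listener interface and the method names are all hypothetical, and the calls 
are synchronous, whereas the real managers work through asynchronous 
callbacks. The only point is where the topic list is updated relative to the 
managers' acquire and release steps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; names do not match the real Hedwig classes.
class TopicRegistrySketch {
    /** Stand-in for the persistence manager and the subscription manager. */
    interface OwnershipListener {
        void acquiredTopic(String topic);
        void lostTopic(String topic);
    }

    private final List<OwnershipListener> listeners = new ArrayList<>();
    private final Set<String> topics = ConcurrentHashMap.newKeySet();

    void addListener(OwnershipListener listener) {
        listeners.add(listener);
    }

    // synchronized: acquire/release ops are executed one at a time.
    synchronized void acquire(String topic) {
        // Every manager acquires the topic first ...
        for (OwnershipListener listener : listeners) {
            listener.acquiredTopic(topic);
        }
        // ... and only then is the topic put in the topic list (pattern 1).
        topics.add(topic);
    }

    synchronized void release(String topic) {
        // Every manager releases the topic first ...
        for (OwnershipListener listener : listeners) {
            listener.lostTopic(topic);
        }
        // ... and only then is the topic removed from the list (pattern 2).
        topics.remove(topic);
    }

    boolean owns(String topic) {
        return topics.contains(topic);
    }
}
```

With this ordering, a topic never appears in the list before the managers 
are ready for it, and it stays in the list until they have all let go of it.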

I will read the code again to confirm my thoughts, and comment later.
                
> ServerRedirectLoopException when a machine (hosts bookie server & hub server) 
> reboot, which is caused by race condition of topic manager
> ----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: BOOKKEEPER-69
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-69
>             Project: Bookkeeper
>          Issue Type: Bug
>          Components: hedwig-client, hedwig-server
>    Affects Versions: 4.0.0
>         Environment: 3 machines (perf8, perf9, perf10), each machine hosts a 
> bookie server & a hub server.
> perf8 is used as default server for client 1. perf9 is used as default server 
> for client 2.
> bookkeeper is configured as below:
> ensemble size is 3, quorum size is 2.
>            Reporter: Sijie Guo
>            Assignee: Sijie Guo
>            Priority: Critical
>             Fix For: 4.0.0
>
>         Attachments: BOOKKEEPER-69.possiblefix.diff, 
> bookkeeper-69-testcase.patch, bookkeeper-69.patch, bookkeeper-69.patch
>
>
> 1) Machine perf10 was rebooted. The bookie server & hub server were not 
> restarted automatically after the reboot.
> 2) Client 1 & client 2 were still running. The topics owned by perf10 would 
> be re-assigned to perf8/perf9, but the re-assignment would fail because not 
> enough bookie servers were available.
> 3) After 2 hours, we found that perf10 had been rebooted. We restarted the 
> bookie server & hub server on perf10.
> 4) Then we got ServerRedirectLoopException in the client.


        
