[jira] [Issue Comment Edited] (BOOKKEEPER-69) ServerRedirectLoopException when a machine (hosts bookie server & hub server) reboot, which is caused by race condition of topic manager

Sijie Guo (Issue Comment Edited) (JIRA) Sun, 02 Oct 2011 05:27:00 -0700

    [ https://issues.apache.org/jira/browse/BOOKKEEPER-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118999#comment-13118999 ]


Sijie Guo edited comment on BOOKKEEPER-69 at 10/2/11 12:26 PM:
---------------------------------------------------------------

A detailed proposal:


1) Introduce TopicStatus to record a topic's status changes during topic 
acquisition and release.
{quote}
/** Acquire Topic with shouldClaim == true **/
CLAIMING,
CLAIMING_ENQUEUE_CALLBACK,
CLAIMING_GET_ENQUEUED_CALLBACKS,

/** Acquire topic with shouldClaim == false **/
CHOOSING,
CHOOSING_ENQUEUE_CALLBACK,
CHOOSING_GET_ENQUEUED_CALLBACKS,

/** Topic is acquired **/
ACQUIRED,

/** Topic acquisition failed **/
ACQUIRE_FAIL_RELEASE,

/** Release Topic **/
RELEASING,
RELEASE_ENQUEUE_CALLBACK,
RELEASE_GET_ENQUEUED_CALLBACKS
{quote}

[CLAIMING/CHOOSING/RELEASING] : the topic is being claimed/chosen/released; 
one operation has won the right to do the actual work.
[CLAIMING/CHOOSING/RELEASE]_ENQUEUE_CALLBACK : another operation is already 
doing the claim/choose/release work, so this operation queues its callback.
[CLAIMING/CHOOSING/RELEASE]_GET_ENQUEUED_CALLBACKS : the claim/choose/release 
work is done; fetch all the queued callbacks and trigger them.
ACQUIRED : the topic is acquired by the hub server.
ACQUIRE_FAIL_RELEASE : topic acquisition failed for some reason (such as 
NotEnoughBookiesException); enter the topic-releasing phase.

2) Change the topic set to a concurrent map *ConcurrentMap<ByteString, 
TopicStatus>*:
this map tracks topic status transitions to ensure that only one 
acquisition/release of a specific topic executes at a time.
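As a minimal sketch of how such a map can guarantee a single winner, the example below relies on ConcurrentMap.putIfAbsent. The class name, method names and the String key (standing in for ByteString) are assumptions made for illustration, not part of the patch.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

final class TopicClaimSketch {
    // Subset of the proposed TopicStatus values, enough for this illustration.
    enum TopicStatus { CLAIMING, CHOOSING, ACQUIRED, RELEASING }

    // Keyed by String to stay self-contained; the proposal keys by ByteString.
    private final ConcurrentMap<String, TopicStatus> topics = new ConcurrentHashMap<>();

    /** Returns true only for the one caller that moves the topic from absent to the given status. */
    boolean tryStart(String topic, TopicStatus initial) {
        return topics.putIfAbsent(topic, initial) == null;
    }

    TopicStatus statusOf(String topic) {
        return topics.get(topic);
    }

    /** Lets several threads race to claim the same topic and counts how many of them won. */
    static int countWinners(int callers) {
        TopicClaimSketch sketch = new TopicClaimSketch();
        AtomicInteger winners = new AtomicInteger();
        Thread[] threads = new Thread[callers];
        for (int i = 0; i < callers; i++) {
            threads[i] = new Thread(() -> {
                if (sketch.tryStart("topic-a", TopicStatus.CLAIMING)) {
                    winners.incrementAndGet();
                }
            });
            threads[i].start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return winners.get();
    }
}
```

Every caller that loses the race sees a non-null previous status and can queue its callback instead of starting a second acquisition.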

3) Add HashMap<ByteString, List<Callback<HedwigSocketAddress>>> to queue get 
owner callbacks, and HashMap<ByteString, List<Callback<Void>>> to queue 
release op callbacks.
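A rough sketch of such a callback queue follows. The Callback interface here is a simplified, hypothetical stand-in for Hedwig's Callback, the key is a String instead of ByteString, and a plain synchronized block replaces the *_ENQUEUE_CALLBACK / *_GET_ENQUEUED_CALLBACKS guard states that the proposal uses to protect the HashMap.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CallbackQueueSketch<T> {
    /** Hypothetical, simplified stand-in for Hedwig's Callback<T>. */
    interface Callback<R> {
        void onResult(R result);
    }

    // Plain HashMap as in the proposal; access is serialized by the synchronized methods.
    private final Map<String, List<Callback<T>>> queued = new HashMap<>();

    /** Queues a callback behind the operation that is already working on the topic. */
    synchronized void enqueue(String topic, Callback<T> callback) {
        queued.computeIfAbsent(topic, k -> new ArrayList<>()).add(callback);
    }

    /** Removes and returns everything queued for the topic (empty list if nothing is queued). */
    synchronized List<Callback<T>> drain(String topic) {
        List<Callback<T>> callbacks = queued.remove(topic);
        return callbacks == null ? new ArrayList<>() : callbacks;
    }

    /** Triggers all queued callbacks outside the lock, in enqueue order. */
    void completeAll(String topic, T result) {
        for (Callback<T> callback : drain(topic)) {
            callback.onResult(result);
        }
    }
}
```

Draining removes the list from the map, so each queued callback is triggered at most once.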

4) Topic status transitions in the get owner & release op flows:
{noformat}
*             CLAIMING_ENQUEUE_CALLBACK
*                  ^ |
*      enqueue     | |
*      callback    | |
*                  | >
*      claim             claim topic
* null -------> CLAIMING -------------> CLAIMING_GET_ENQUEUED_CALLBACKS ---| (trigger queued callbacks)
*  ^   |                                                                   |-----> ACQUIRED
*  |   -------> CHOOSING -------------> CHOOSING_GET_ENQUEUED_CALLBACKS ---|  |        |
*  |   choose      | ^   choose topic                                         |        |
*  |               | |                                                        |        |
*  |               | |   enqueue callback                                     |        |  Release
*  |               > |                                                        >        |   topic
*  |          CHOOSING_ENQUEUE_CALLBACK                    ACQUIRE_FAIL_RELEASE        |
*  |                                                                          |        |
*  |                                                                          >        >
*  ---------------------------- RELEASE_GET_ENQUEUED_CALLBACKS <------------------ RELEASING
*                                                                                    ^ |
*                                                                  enqueue callback  | |
*                                                                                    | >
*                                                                          RELEASE_ENQUEUE_CALLBACK
{noformat}
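One way to enforce these transitions is to make each step a compare-and-set on the status map, so that an operation holding a stale view of the topic cannot move it. The sketch below assumes String keys and hypothetical method names (enter, move, leave); it only illustrates the idea and is not the patch itself.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class TopicTransitionSketch {
    // Subset of the proposed TopicStatus values covering one claim/release cycle.
    enum TopicStatus {
        CLAIMING, CLAIMING_GET_ENQUEUED_CALLBACKS, ACQUIRED,
        RELEASING, RELEASE_GET_ENQUEUED_CALLBACKS
    }

    private final ConcurrentMap<String, TopicStatus> topics = new ConcurrentHashMap<>();

    /** null -> first: succeeds only if the topic has no status yet. */
    boolean enter(String topic, TopicStatus first) {
        return topics.putIfAbsent(topic, first) == null;
    }

    /** from -> to: succeeds only if the topic is still in the expected status. */
    boolean move(String topic, TopicStatus from, TopicStatus to) {
        return topics.replace(topic, from, to);
    }

    /** from -> null: removes the entry only if the topic is still in the expected status. */
    boolean leave(String topic, TopicStatus from) {
        return topics.remove(topic, from);
    }

    TopicStatus statusOf(String topic) {
        return topics.get(topic);
    }
}
```

A failed move tells the caller that another operation changed the topic in the meantime, so it should re-read the status and decide again.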

get owner:
# check topic status
## if the topic does not exist, go to 2)
## if the topic is CLAIMING/CHOOSING, go to 3)
## if the topic is ACQUIRED, trigger the callback immediately
## else go to 1)
# claim/choose topic: set topic status to CLAIMING/CHOOSING
## if it succeeds, go to 4)
## if it fails, go to 1) to check topic status again
# enqueue callback
# do get owner
## do the real get owner work
## if get owner succeeds
### get enqueued callback list
### mark topic as ACQUIRED
### trigger all the queued callbacks
## if get owner fails
### get enqueued callback list
### mark topic as ACQUIRE_FAIL_RELEASE
### enter releasing phase; the queued callback list will be triggered at the 
end of topic-releasing
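The get owner steps above can be sketched roughly as follows. This is a simplification under stated assumptions: a single lock replaces the *_ENQUEUE_CALLBACK and *_GET_ENQUEUED_CALLBACKS guard states, the release phase is reduced to clearing the topic entry, keys are Strings, and OwnerCallback and the realAcquire predicate are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

final class GetOwnerSketch {
    enum TopicStatus { CLAIMING, ACQUIRED, ACQUIRE_FAIL_RELEASE }

    /** Hypothetical callback; a null owner signals that acquisition failed. */
    interface OwnerCallback {
        void done(String owner);
    }

    private final Map<String, TopicStatus> status = new HashMap<>();
    private final Map<String, List<OwnerCallback>> queued = new HashMap<>();
    private final String self;                   // this hub's address (assumed)
    private final Predicate<String> realAcquire; // stand-in for the real acquisition work

    GetOwnerSketch(String self, Predicate<String> realAcquire) {
        this.self = self;
        this.realAcquire = realAcquire;
    }

    void getOwner(String topic, OwnerCallback callback) {
        TopicStatus seen;
        synchronized (this) {
            seen = status.get(topic);                       // step 1: check topic status
            if (seen == null) {
                status.put(topic, TopicStatus.CLAIMING);    // step 2: claim the topic
            }
            if (seen != TopicStatus.ACQUIRED) {             // step 3: enqueue callback
                queued.computeIfAbsent(topic, k -> new ArrayList<>()).add(callback);
            }
        }
        if (seen == TopicStatus.ACQUIRED) {
            callback.done(self);                            // already owned: call back immediately
            return;
        }
        if (seen != null) {
            return;                                         // another caller is working; we are queued
        }
        boolean ok = realAcquire.test(topic);               // step 4: do the real get owner work
        List<OwnerCallback> toRun;
        synchronized (this) {
            toRun = queued.remove(topic);                   // get enqueued callback list
            if (ok) {
                status.put(topic, TopicStatus.ACQUIRED);
            } else {
                status.put(topic, TopicStatus.ACQUIRE_FAIL_RELEASE);
                // Real release work would happen here; this sketch only clears the entry.
                status.remove(topic);
            }
        }
        for (OwnerCallback queuedCallback : toRun) {        // trigger all the queued callbacks
            queuedCallback.done(ok ? self : null);
        }
    }

    synchronized TopicStatus statusOf(String topic) {
        return status.get(topic);
    }
}
```

Because the claiming caller queues its own callback before doing the work, the drained list is never empty and all waiters observe the same outcome.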

release: mostly the same flow as get owner.

                
                  
> ServerRedirectLoopException when a machine (hosts bookie server & hub server) 
> reboot, which is caused by race condition of topic manager
> ----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: BOOKKEEPER-69
>                 URL: https://issues.apache.org/jira/browse/BOOKKEEPER-69
>             Project: Bookkeeper
>          Issue Type: Bug
>          Components: hedwig-client, hedwig-server
>    Affects Versions: 3.4.0
>         Environment: 3 machines (perf8, perf9, perf10), each machine hosts a 
> bookie server & a hub server.
> perf8 is used as default server for client 1. perf9 is used as default server 
> for client 2.
> bookkeeper is configured as below:
> ensemble size is 3, quorum size is 2.
>            Reporter: Sijie Guo
>            Priority: Critical
>         Attachments: bookkeeper-69-testcase.patch
>
>
> 1) Machine perf10 is rebooted. The bookie server & hub server are not 
> restarted automatically after the reboot.
> 2) Client 1 & client 2 are still running. The topics owned by perf10 are 
> re-assigned to perf8/perf9, but the re-assignment fails because not enough 
> bookie servers are available.
> 3) After 2 hours, we found that perf10 had been rebooted. We restarted the 
> bookie server & hub server on perf10.
> 4) Then we got ServerRedirectLoopException in the client.

