zhaodongzhi opened a new pull request, #1774:
URL: https://github.com/apache/incubator-brpc/pull/1774

   Issue #1773: Fix brpc client probabilistically failing to reconnect to the
   server after the server restarts
   
   Problem
   -------
   
   On the brpc client side, a server restart triggers the following chain:
   Socket::SetFailed -> Socket health check -> Socket::WaitAndReset
   -> reconnect to server -> Socket::Revive
   
   Socket::Revive
   -> _versioned_ref.cas(vref, <id_ver, nref + 1>) // step 1
      -> _recycle_flag = false // step 2
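
   To make step 1 concrete, here is a minimal sketch. It assumes that
   _versioned_ref packs a 32-bit version into the high half and a 32-bit
   ref count into the low half of a single 64-bit atomic. The helper names
   and ToyReviveStep1 are illustrative only, and the toy skips whatever
   state validation the real Socket::Revive performs.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Assumed layout: [ version : high 32 bits | nref : low 32 bits ].
inline uint64_t MakeVRef(uint32_t version, int32_t nref) {
    assert(nref >= 0);  // the toy only models non-negative counts
    return (static_cast<uint64_t>(version) << 32) | static_cast<uint32_t>(nref);
}

inline uint32_t VersionOfVRef(uint64_t vref) {
    return static_cast<uint32_t>(vref >> 32);
}

inline int32_t NRefOfVRef(uint64_t vref) {
    return static_cast<int32_t>(vref & 0xFFFFFFFFu);
}

// Toy version of step 1: atomically set the version back to id_ver and add
// one reference. From this point on, an Address-style lookup that compares
// versions would succeed, even though step 2 has not run yet.
inline void ToyReviveStep1(std::atomic<uint64_t>& versioned_ref, uint32_t id_ver) {
    uint64_t vref = versioned_ref.load(std::memory_order_relaxed);
    // compare_exchange_weak reloads `vref` on failure, so each retry
    // recomputes the desired value from the freshest observation.
    while (!versioned_ref.compare_exchange_weak(
               vref, MakeVRef(id_ver, NRefOfVRef(vref) + 1),
               std::memory_order_release, std::memory_order_relaxed)) {
    }
}
```

   Because the version and the count change in one CAS, there is no way to
   also flip _recycle_flag in the same atomic operation, which is where the
   window between step 1 and step 2 comes from.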
   
   Once step 1 succeeds, the Socket can be acquired through Socket::Address.
   If the Socket fails again in the gap between steps 1 and 2,
   Socket::ReleaseAdditionalReference does not decrease the ref count in
   _versioned_ref, because _recycle_flag is still true.
   As a result, Socket::WaitAndReset never sees the expected ref count: it
   mistakenly concludes that other requests still hold references to the
   Socket and falls into an infinite loop.
   
   Example:
   
   Session1(Thread1)                                   Session2(Thread2)
   
   Socket::Revive
   T1: _versioned_ref.cas(vref, <id_ver, nref + 1>)
                                                       T2: Socket::Address success
                                                       T3: Server restart or network error happened
                                                       T4: Socket::SetFailed
                                                       T5: Socket::ReleaseAdditionalReference
   // Now _recycle_flag is true                        T6: _recycle_flag.cas(false, true)
                                                           -> failed
                                                              -> Socket::Dereference
   T7: _recycle_flag = false
                                                       T8: Socket::WaitAndReset
                                                           -> nref is always three, not the expected two
                                                              -> dead loop, cannot reconnect to server
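
   The interleaving can be replayed deterministically on a toy model. This
   is a sketch under stated assumptions, not brpc code: ToySocket, its
   baseline of one reference held elsewhere, and the reading of the
   Socket::Dereference step as Thread2 dropping the reference it got from
   Socket::Address are all hypothetical. The absolute counts therefore
   differ from the real two and three, but the one-reference leak is the
   same.

```cpp
#include <cassert>

// Single-threaded toy: the steps are called in a fixed order to replay one
// specific interleaving, so plain (non-atomic) fields are sufficient here.
struct ToySocket {
    int nref = 1;              // hypothetical baseline: one ref held elsewhere
    bool recycle_flag = true;  // the socket has already failed once

    void ReviveStep1() { ++nref; }               // re-add the additional ref
    void ReviveStep2() { recycle_flag = false; } // allow it to be released again
    void Address() { ++nref; }
    void Dereference() { --nref; }

    // Models cas(false, true): drop the additional ref only if the flag flips.
    bool ReleaseAdditionalReference() {
        if (recycle_flag) {
            return false;  // CAS fails, the additional ref is NOT dropped
        }
        recycle_flag = true;
        Dereference();
        return true;
    }
};

// Order from the timeline: the failure lands between step 1 and step 2.
inline int ReplayRacyOrder() {
    ToySocket s;
    s.ReviveStep1();                 // T1
    s.Address();                     // T2
    s.ReleaseAdditionalReference();  // T4-T6: flag still true, nothing released
    s.Dereference();                 // Thread2 drops its Address ref (assumed)
    s.ReviveStep2();                 // T7
    return s.nref;
}

// Same operations, but Revive finishes both steps before the failure.
inline int ReplaySafeOrder() {
    ToySocket s;
    s.ReviveStep1();
    s.ReviveStep2();
    s.Address();
    s.ReleaseAdditionalReference();  // flag is false, additional ref released
    s.Dereference();
    return s.nref;
}
```

   In this toy, the racy order always ends with one more reference than the
   safe order. That extra reference is never released, which is why a
   waiter polling for a fixed expected count cannot make progress.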
   
   Solution
   --------
   Add _recycle_mutex to avoid this race condition.

   Socket::Revive and Socket::ReleaseAdditionalReference are called only
   when an error occurs on the socket, so this mutex will not cause
   performance degradation.
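
   One way such a mutex could close the window is sketched below. The
   actual patch is not shown in this message, so the class
   GuardedToySocket, its members, and the choice to hold the lock across
   both Revive steps and across the flag check are assumptions made for
   illustration.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

class GuardedToySocket {
public:
    // Steps 1 and 2 run under the same lock, so a concurrent
    // ReleaseAdditionalReference sees either neither step or both.
    void Revive() {
        std::lock_guard<std::mutex> lock(recycle_mutex_);
        nref_.fetch_add(1, std::memory_order_relaxed);  // step 1
        recycle_flag_ = false;                          // step 2
    }

    // Returns true if this call released the additional reference.
    bool ReleaseAdditionalReference() {
        std::lock_guard<std::mutex> lock(recycle_mutex_);
        if (recycle_flag_) {
            return false;  // already released by an earlier failure
        }
        recycle_flag_ = true;
        nref_.fetch_sub(1, std::memory_order_release);
        return true;
    }

    // Hot-path operations stay lock-free in this sketch.
    void Address() { nref_.fetch_add(1, std::memory_order_acquire); }
    void Dereference() { nref_.fetch_sub(1, std::memory_order_release); }
    int nref() const { return nref_.load(std::memory_order_acquire); }

private:
    std::mutex recycle_mutex_;   // guards recycle_flag_ and the Revive steps
    std::atomic<int> nref_{1};   // hypothetical baseline: one ref held elsewhere
    bool recycle_flag_ = true;   // the socket starts in the failed state
};

// Revive, then a failure on another caller that had addressed the socket.
inline int RunGuardedSequence() {
    GuardedToySocket s;
    s.Revive();
    s.Address();
    const bool released = s.ReleaseAdditionalReference();
    assert(released);  // the flag was reset under the lock, so this succeeds
    s.Dereference();
    return s.nref();
}
```

   Only the two rarely executed functions take the lock in this sketch,
   while Address and Dereference remain plain atomic operations, which
   matches the claim that the common path is unaffected.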


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

