[ 
https://issues.apache.org/jira/browse/HBASE-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Purtell updated HBASE-17341:
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 0.98.24
                   1.1.8
                   1.2.5
                   1.3.0
           Status: Resolved  (was: Patch Available)

> Add a timeout during replication endpoint termination
> -----------------------------------------------------
>
>                 Key: HBASE-17341
>                 URL: https://issues.apache.org/jira/browse/HBASE-17341
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.0, 1.3.0, 1.4.0, 1.1.7, 0.98.23, 1.2.4
>            Reporter: Vincent Poon
>            Assignee: Vincent Poon
>            Priority: Critical
>             Fix For: 2.0.0, 1.3.0, 1.4.0, 1.2.5, 1.1.8, 0.98.24
>
>         Attachments: HBASE-17341.branch-1.1.v1.patch, 
> HBASE-17341.branch-1.1.v2.patch, HBASE-17341.master.v1.patch, 
> HBASE-17341.master.v2.patch
>
>
> In ReplicationSource#terminate(), a Future is obtained from 
> ReplicationEndpoint#stop().  Future.get() is then called with no timeout, so 
> it can hang indefinitely if something goes wrong in the endpoint's stop().
> A hang there has serious implications, because the blocked thread may be 
> the ZK event thread (e.g. watcher calls 
> ReplicationSourceManager#removePeer() -> ReplicationSource#terminate() -> 
> blocked).  No other events in the ZK event queue are then processed, which 
> for HBase means that other ZK watches, such as replication watch 
> notifications, snapshot watch notifications, and even RegionServer shutdown, 
> are all blocked.
> The short-term fix addressed here is simply to add a timeout to 
> Future.get().  But the severity of the consequences suggests that a 
> broader refactoring of the ZKWatcher usage in HBase may be in order, to 
> protect against situations like this.
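
The short-term fix described above can be sketched roughly as follows. This is only an illustration of a bounded Future.get() using java.util.concurrent, not the code from the attached patches: the class name TimedStop, the method awaitStop, and its timeout parameter are hypothetical, and the actual patch may name, configure, and log things differently.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of HBase: bounds the wait on a stop() future.
class TimedStop {

    // Waits at most timeoutMs for stopFuture to complete.
    // Returns true only if it completed normally within the timeout.
    static boolean awaitStop(Future<?> stopFuture, long timeoutMs) {
        try {
            // Bounded wait: unlike the no-argument get(), this cannot block forever.
            stopFuture.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // stop() did not finish in time; give up so the caller
            // (possibly the ZK event thread) can continue.
            return false;
        } catch (InterruptedException e) {
            // Restore the interrupt flag for callers further up the stack.
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            // stop() itself failed; treat it as an unsuccessful stop.
            return false;
        }
    }
}
```

With a helper like this, a caller in the position of terminate() would log a warning when awaitStop returns false and carry on, so that the calling thread is released after at most the timeout even if the endpoint never finishes stopping.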



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
