[
https://issues.apache.org/jira/browse/HBASE-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15767723#comment-15767723
]
Andrew Purtell edited comment on HBASE-17341 at 12/21/16 6:13 PM:
------------------------------------------------------------------
Since Ted committed this, I will cherry-pick it to 0.98 now.
I missed it if there was an announcement that branch-1.3 is closed. I committed
another of Vincent's replication fixes there yesterday. We should probably
commit this one too now that the deed has been done.
was (Author: apurtell):
Since Ted committed this I will pick to 0.98 now and resolve.
I missed it if there was an announcement that branch-1.3 is closed. I committed
another of Vincent's replication fixes there yesterday. We should probably
commit this one too now that the deed has been done.
> Add a timeout during replication endpoint termination
> -----------------------------------------------------
>
> Key: HBASE-17341
> URL: https://issues.apache.org/jira/browse/HBASE-17341
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.0.0, 1.3.0, 1.4.0, 1.1.7, 0.98.23, 1.2.4
> Reporter: Vincent Poon
> Assignee: Vincent Poon
> Priority: Critical
> Fix For: 2.0.0, 1.4.0
>
> Attachments: HBASE-17341.branch-1.1.v1.patch,
> HBASE-17341.branch-1.1.v2.patch, HBASE-17341.master.v1.patch,
> HBASE-17341.master.v2.patch
>
>
> In ReplicationSource#terminate(), a Future is obtained from
> ReplicationEndpoint#stop(). Future.get() is then called with no timeout, so
> it can block indefinitely if something goes wrong in the endpoint's stop().
> Hanging there has serious implications, because the thread could potentially
> be the ZK event thread (e.g. watcher calls
> ReplicationSourceManager#removePeer() -> ReplicationSource#terminate() ->
> blocked). This means no other events in the ZK event queue will get
> processed, which for HBase means that other ZK watch notifications, such as
> replication and snapshot notifications, and even RegionServer shutdown will
> all be blocked.
> The short-term fix addressed here is simply to add a timeout to
> Future.get(). But the severe consequences seen here suggest that a
> broader refactoring of the ZKWatcher usage in HBase may be in order, to
> protect against situations like this.
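
For illustration only, the short-term fix described above (bounding the wait
on the Future returned by the endpoint's stop()) might look roughly like the
sketch below. This is not the attached patch: the class name
BoundedTermination, the helper awaitStop, and the 100 ms timeout are
hypothetical, and a CompletableFuture that never completes stands in for a
hung endpoint.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; names and timeout values are assumptions, not HBase code.
public class BoundedTermination {

    /**
     * Waits at most timeoutMs for the stop future to finish.
     * Returns true only if the stop completed normally within the timeout.
     */
    public static boolean awaitStop(Future<?> stopFuture, long timeoutMs) {
        try {
            // Bounded wait: unlike the no-argument get(), this cannot block forever.
            stopFuture.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // The stop did not finish in time; give up so the caller can proceed.
            return false;
        } catch (ExecutionException e) {
            // The stop finished, but with a failure.
            return false;
        } catch (InterruptedException e) {
            // Preserve the interrupt status for the calling thread.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Stand-in for a hung endpoint: this future is never completed.
        Future<?> hung = new CompletableFuture<Void>();
        System.out.println(awaitStop(hung, 100));

        // Stand-in for an endpoint that stopped cleanly.
        Future<?> done = CompletableFuture.completedFuture(null);
        System.out.println(awaitStop(done, 100));
    }
}
```

With a bounded wait, a calling thread such as the ZK event thread resumes
after at most the timeout even if the endpoint never finishes stopping. What
the caller should do after a timeout (log, retry, or abort) is a separate
decision that this sketch leaves open.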
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)