[
https://issues.apache.org/jira/browse/HBASE-16138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353625#comment-15353625
]
churro morales commented on HBASE-16138:
----------------------------------------
The cluster shutdown situation currently has problems as well. When you
shut down a cluster today, you need to make sure that the replication table is
the last to go down. I think both the cluster startup and the cluster shutdown
situations need more discussion and design. Maybe now that we have more system
tables in HBase land, we should have a larger discussion about their design and
what these tables mean for the reliability of HBase.
Side note: we already have a SystemTableWALEntryFilter that removes
system-table entries from replication. Maybe #2 might not be so bad after all.
But with more and more system tables popping up, maybe this particular problem
casts a wider net than just this feature.
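To make the side note concrete, here is a rough sketch of what filtering system-table entries out of replication amounts to. It is not the actual SystemTableWALEntryFilter code: WalEntry and SystemTableFilterSketch are hypothetical stand-ins, and the only HBase fact assumed is that system tables live in the "hbase" namespace.

```java
/** Illustrative only; these types are hypothetical and are not HBase classes. */
public class SystemTableFilterSketch {

    /** Minimal stand-in for a WAL entry: just the table it belongs to. */
    public static final class WalEntry {
        final String namespace;
        final String table;

        public WalEntry(String namespace, String table) {
            this.namespace = namespace;
            this.table = table;
        }
    }

    /** Assumption: system tables are the ones in the "hbase" namespace. */
    static final String SYSTEM_NAMESPACE = "hbase";

    /**
     * Returns null for entries that belong to a system table, so they are
     * never shipped to a peer; every other entry passes through unchanged.
     */
    public static WalEntry filter(WalEntry entry) {
        if (SYSTEM_NAMESPACE.equals(entry.namespace)) {
            return null;
        }
        return entry;
    }
}
```

Since system-table edits are already dropped at this point, a separate non-replicated WAL for system tables (option #2) would not change what peers receive.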
> Cannot open regions after non-graceful shutdown due to deadlock with
> Replication Table
> --------------------------------------------------------------------------------------
>
> Key: HBASE-16138
> URL: https://issues.apache.org/jira/browse/HBASE-16138
> Project: HBase
> Issue Type: Sub-task
> Components: Replication
> Reporter: Joseph
> Assignee: Joseph
> Priority: Critical
>
> If we shut down an entire HBase cluster and attempt to start it back up, we
> have to run the WAL pre-log roll that occurs before opening up a region. Yet
> this pre-log roll must record the new WAL inside of ReplicationQueues. This
> method call ends up blocking on
> TableBasedReplicationQueues.getOrBlockOnReplicationTable(), because the
> Replication Table is not up yet. And we cannot assign the Replication Table
> because we cannot open any regions. This ends up deadlocking the entire
> cluster whenever we lose Replication Table availability.
> There are a few options, but none of them seems very good:
> 1. Depend on ZooKeeper-based Replication until the Replication Table becomes
> available
> 2. Have a separate WAL for System Tables that does not perform any replication
> 3. Record the WAL log in the ReplicationQueue asynchronously (don't block
> opening a region on this event), which could lead to inconsistent Replication
> state
> Does anyone have any suggestions/ideas/feedback?
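For discussion, option 3 could look roughly like the sketch below. All names here (DeferredWalRecorder, ReplicationTableStub, onLogRoll, drain) are hypothetical and not part of HBase. The idea is that the pre-log roll only queues the WAL name in memory and returns, and the queue is written out once the Replication Table is online.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical sketch of option 3; none of these names are HBase APIs. */
public class DeferredWalRecorder {

    /** Stand-in for the Replication Table: unusable until its region is open. */
    public static class ReplicationTableStub {
        private boolean available = false;
        private final List<String> recordedWals = new ArrayList<>();

        public synchronized void markAvailable() { available = true; }

        public synchronized boolean isAvailable() { return available; }

        public synchronized void record(String walName) {
            if (!available) {
                throw new IllegalStateException("replication table not online");
            }
            recordedWals.add(walName);
        }

        public synchronized List<String> recorded() {
            return new ArrayList<>(recordedWals);
        }
    }

    private final ReplicationTableStub table;
    private final Queue<String> pending = new ArrayDeque<>();

    public DeferredWalRecorder(ReplicationTableStub table) {
        this.table = table;
    }

    /** Called from the pre-log roll; never blocks, so region open can proceed. */
    public synchronized void onLogRoll(String walName) {
        pending.add(walName);
        drain();
    }

    /** Writes queued WAL names if the table is online; returns how many remain queued. */
    public synchronized int drain() {
        while (table.isAvailable() && !pending.isEmpty()) {
            table.record(pending.remove());
        }
        return pending.size();
    }
}
```

The window in which a WAL name sits only in the in-memory queue is the inconsistent Replication state mentioned in option 3. If the region server dies before drain() succeeds, the queued names are lost unless something else recovers them.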