[ 
https://issues.apache.org/jira/browse/HBASE-16138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353625#comment-15353625
 ] 

churro morales commented on HBASE-16138:
----------------------------------------

The cluster shutdown situation currently has problems as well.  When you 
shut down, you currently need to make sure that the replication table is the 
last to go down.  I think both the cluster startup and the cluster shutdown 
situations need more discussion and design.  Maybe now that we have more system 
tables in HBase land, we should have a larger discussion about design and what 
these tables mean to the reliability of HBase.  

Side note: we already have a SystemTableWALEntryFilter that removes system 
table entries from replication, so option #2 might not be so bad after all.  
But with more and more system tables popping up, maybe this particular problem 
casts a wider net than just this feature.  
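
To make the side note concrete, here is a minimal sketch of what filtering 
system table entries out of replication amounts to.  It is not the actual 
SystemTableWALEntryFilter code: WalEntrySketch and SystemTableFilterSketch are 
made-up names, and the only HBase fact it relies on is that system tables live 
in the "hbase" namespace (e.g. hbase:meta).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a WAL entry; only the owning table matters here.
final class WalEntrySketch {
    final String namespace;
    final String table;

    WalEntrySketch(String namespace, String table) {
        this.namespace = namespace;
        this.table = table;
    }
}

// Hypothetical filter, written for illustration only (not an HBase class).
final class SystemTableFilterSketch {
    // HBase keeps its system tables in the "hbase" namespace (e.g. hbase:meta).
    static final String SYSTEM_NAMESPACE = "hbase";

    static boolean isSystemTable(WalEntrySketch entry) {
        return SYSTEM_NAMESPACE.equals(entry.namespace);
    }

    // Returns only the entries that would be shipped to replication peers.
    static List<WalEntrySketch> filterForReplication(List<WalEntrySketch> entries) {
        List<WalEntrySketch> kept = new ArrayList<>();
        for (WalEntrySketch entry : entries) {
            if (!isSystemTable(entry)) {
                kept.add(entry);
            }
        }
        return kept;
    }
}
```

Since such entries are already dropped before shipping, a separate 
non-replicated WAL for system tables would mostly move the same decision 
earlier, to the point where the WAL is registered.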



> Cannot open regions after non-graceful shutdown due to deadlock with 
> Replication Table
> --------------------------------------------------------------------------------------
>
>                 Key: HBASE-16138
>                 URL: https://issues.apache.org/jira/browse/HBASE-16138
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Replication
>            Reporter: Joseph
>            Assignee: Joseph
>            Priority: Critical
>
> If we shut down an entire HBase cluster and attempt to start it back up, we 
> have to run the WAL pre-log roll that occurs before opening up a region. Yet 
> this pre-log roll must record the new WAL inside of ReplicationQueues. This 
> method call ends up blocking on 
> TableBasedReplicationQueues.getOrBlockOnReplicationTable(), because the 
> Replication Table is not up yet. And we cannot assign the Replication Table 
> because we cannot open any regions. This ends up deadlocking the entire 
> cluster whenever we lose Replication Table availability. 
> There are a few options, but none of them seem very good:
> 1. Depend on ZooKeeper-based Replication until the Replication Table becomes 
> available
> 2. Have a separate WAL for System Tables that does not perform any replication
> 3. Record the WAL log in the ReplicationQueue asynchronously (don't block 
> opening a region on this event), which could lead to inconsistent Replication 
> state
> Does anyone have any suggestions/ideas/feedback?
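
For illustration, option 3 above could look roughly like the sketch below.  
This is only a sketch under stated assumptions, not HBase code: 
AsyncWalRecordingSketch and all of its methods are hypothetical, a 
CountDownLatch stands in for "the Replication Table is available", and an 
in-memory list stands in for the ReplicationQueue.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical model of option 3; none of these names exist in HBase.
final class AsyncWalRecordingSketch {
    // Released once the (simulated) Replication Table has been assigned.
    private final CountDownLatch replicationTableUp = new CountDownLatch(1);
    private final List<String> recordedWals = new CopyOnWriteArrayList<>();
    private final List<String> openRegions = new CopyOnWriteArrayList<>();

    // Dedicated daemon thread, so a blocked recorder never holds up the caller.
    private final ExecutorService recorder = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "wal-recorder-sketch");
        t.setDaemon(true);
        return t;
    });

    void markReplicationTableAvailable() {
        replicationTableUp.countDown();
    }

    // Records the new WAL in the background once the table is available.
    private CompletableFuture<Void> recordWalAsync(String walName) {
        return CompletableFuture.runAsync(() -> {
            try {
                replicationTableUp.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            recordedWals.add(walName);
        }, recorder);
    }

    // Opens the region right away; the returned future tracks the pending record.
    CompletableFuture<Void> rollWalAndOpenRegion(String regionName, String walName) {
        CompletableFuture<Void> pending = recordWalAsync(walName);
        openRegions.add(regionName);
        return pending;
    }

    List<String> recordedWals() {
        return recordedWals;
    }

    List<String> openRegions() {
        return openRegions;
    }
}
```

Because the region opens before the record exists, the deadlock goes away.  
The cost is that until the returned future completes, the queue does not know 
about a WAL that is already taking writes, which is the inconsistent 
Replication state mentioned in option 3.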



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
