Yu Li created HBASE-20156:

             Summary: Allow regionserver to live during HDFS failure
                 Key: HBASE-20156
                 URL: https://issues.apache.org/jira/browse/HBASE-20156
             Project: HBase
          Issue Type: New Feature
            Reporter: Yu Li

Currently, if something goes wrong with HDFS, for example NN fencing or the NN 
entering safe mode, the RS aborts itself as soon as it detects the problem 
(for example when a log roll or flush fails). On a large cluster with a heavy 
write workload, there is then a huge amount of WAL to split and replay once 
HDFS is back, and recovery can take tens of minutes or even hours. We have 
experienced this more than once in production; there is always some surprise, 
such as an unstable power supply for the NN, that we never expected.

Here we propose adding an option that lets the RS stay alive during an HDFS 
failure instead of aborting. While HDFS is down, the RS throws exceptions to 
clients indicating that it is out of service, and it can recover right after 
HDFS is back.
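
To make the proposed behavior concrete, here is a minimal sketch of the idea, 
not an actual HBase implementation. Every name in it (HdfsOutageGate, 
OutOfServiceException, FsOperation, the surviveHdfsFailure flag) is 
hypothetical, and treating any IOException as an HDFS failure is a 
simplifying assumption.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch only; none of these names exist in HBase. */
public class HdfsOutageGate {

    /** Hypothetical exception telling clients the server is out of service. */
    public static class OutOfServiceException extends IOException {
        public OutOfServiceException(String msg) {
            super(msg);
        }
    }

    /** Stand-in for any operation that touches HDFS, such as a log roll or flush. */
    public interface FsOperation<T> {
        T run() throws IOException;
    }

    // Stand-in for the proposed option: true means "do not abort on HDFS failure".
    private final boolean surviveHdfsFailure;
    private final AtomicBoolean outOfService = new AtomicBoolean(false);
    private volatile boolean aborted = false;

    public HdfsOutageGate(boolean surviveHdfsFailure) {
        this.surviveHdfsFailure = surviveHdfsFailure;
    }

    public <T> T execute(FsOperation<T> op) throws IOException {
        if (aborted) {
            throw new IOException("server already aborted");
        }
        try {
            T result = op.run();
            // A successful operation means HDFS is reachable again.
            outOfService.set(false);
            return result;
        } catch (IOException e) {
            if (!surviveHdfsFailure) {
                // Current behavior: abort on the first failure.
                aborted = true;
                throw e;
            }
            // Proposed behavior: stay alive and report out of service.
            outOfService.set(true);
            throw new OutOfServiceException(
                "HDFS unavailable, retry later: " + e.getMessage());
        }
    }

    public boolean isOutOfService() {
        return outOfService.get();
    }

    public boolean isAborted() {
        return aborted;
    }
}
```

In this sketch the out-of-service flag clears on the next successful 
operation. A real implementation would more likely probe HDFS in the 
background and would need to decide which failures count as an HDFS outage.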

This would also make it possible to restart HDFS in some extreme cases, and 
would allow us to survive if anything goes wrong during an HDFS upgrade.
